Summary
The market for business intelligence has been going through an evolutionary shift in recent years. One of the driving forces for that change has been the rise of analytics engineering powered by dbt. Lightdash has fully embraced that shift by building an entire open source business intelligence framework that is powered by dbt models. In this episode Oliver Laslett describes why dashboards aren’t sufficient for business analytics, how Lightdash promotes the work that you are already doing in your data warehouse modeling with dbt, and how they are focusing on bridging the divide between data teams and business teams and the requirements that they have for data workflows.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Your host is Tobias Macey and today I’m interviewing Oliver Laslett about Lightdash, an open source business intelligence system powered by your dbt models
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Lightdash is and the story behind it?
- What are the main goals of the project?
- Who are the target users, and how has that profile informed your feature priorities?
- Business intelligence is a market that has gone through several generational shifts, with products targeting numerous personas and purposes. What are the capabilities that make Lightdash stand out from the other options?
- Can you describe how Lightdash is architected?
- How have the design and goals of the system changed or evolved since you first began working on it?
- What have been the most challenging engineering problems that you have dealt with?
- How does the approach that you are taking with Lightdash compare to systems such as Transform and Metriql that aim to provide a dedicated metrics layer?
- Can you describe the workflow for someone building an analysis in Lightdash?
- What are the points of collaboration around Lightdash for different roles in the organization?
- What are the methods that you use to expose information about the state of the underlying dbt models to the end users?
- How do they use that information in their exploration and decision making?
- What was your motivation for releasing Lightdash as open source?
- How are you handling the governance and long-term viability of the project?
- What are the most interesting, innovative, or unexpected ways that you have seen Lightdash used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Lightdash?
- When is Lightdash the wrong choice?
- What do you have planned for the future of Lightdash?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Lightdash
- Looker
- Power BI
- Redash
- Metabase
- dbt
- Superset
- Streamlit
- Kubernetes
- JDBC
- SQLAlchemy
- SQLPad
- Singer
- Airbyte
- Meltano
- Transform
- Metriql
- Cube.js
- OpenLineage
- dbt Packages
- Rudderstack
- PostHog
- Firebolt
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E. And get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads?
Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey. And today, I'm interviewing Oliver Laslett about Lightdash, an open source business intelligence system powered by your dbt models. So, Oliver, can you start by introducing yourself? Yeah. Sure. So I'm the cofounder and CTO at Lightdash.
[00:01:45] Unknown:
Previously, did a stint in Fintech startups in London here in the UK. Before that, a PhD in Theoretical Physics. It's funny how many times people come from physics into computing, particularly in the data space. Yeah. I mean, physics was just too hard, you know. That's the problem.
[00:02:01] Unknown:
Yeah. I actually went into university initially thinking I wanted to get my PhD in theoretical physics and was quickly disabused of that notion.
[00:02:11] Unknown:
Yeah. Exactly. Luckily, my work was mostly computational rather than textbook. So I ended up leaving with, yeah, a lot of, like, interesting data experience. Very different experience from what I'm doing now, but, yeah, I guess I kinda got the itch at that point.
[00:02:22] Unknown:
And so do you remember how you first got involved in data management?
[00:02:26] Unknown:
Probably would have been doing my PhD. So I was, like, running big simulations. I was actually working in magnetism, which, I still don't really know how magnets work, but I studied them for about 4 years. And a lot of the data management there was, I guess, a bit more old school. So you have these really big clusters. A lot of the universities and colleges, you know, have these big, like, supercomputing clusters, and your jobs sit in a big queue, and then they fire off for, like, 10 hours. And the output of these jobs is always, like, a huge dump of data, either CSVs, which you see all the time, or maybe, like, something in an HDFS store. But that was probably my first experience of data, and it was my first experience of programming.
And so I guess from the start, my programming experience has always been very data heavy. And the start of my career was very much on the research side of things. Whereas, I guess, what we're doing now with Lightdash is really really different, and we're really focused on the business use case and how data can power, you know, kind of better outcomes. Yeah. Very different to studying magnets.
[00:03:23] Unknown:
And so digging more into Lightdash, can you give a bit of an overview about what it is that you're building there and some of the story behind how you ended up at this particular problem and the approach that you've taken to solve it? Yeah. So Lightdash is an open source tool that companies use to basically explore and visualize their internal company data. So the general idea is that
[00:03:43] Unknown:
a data professional kind of helps pull together all of your business logic and all your analytics data into one place. That kind of empowers everyone in the business to answer their own questions using data. You know, it's becoming easier and easier to bring in a lot of data from things like SaaS tools, so your CRM, or your internal databases. The tooling has just accelerated massively over the last, like, 5 to 10 years. And BI, well, okay, so this is the first time I've mentioned BI. So, you know, if you're familiar with the term business intelligence, Lightdash is a business intelligence tool, and that just means you have a kind of visual layer over all of your internal company data. And the goal of that is to enable anybody within the company to explore and kind of answer their own questions.
[00:04:23] Unknown:
As far as the focus of Lightdash, you know, business intelligence is something that has been around for decades now, you know, probably even predating computers. But the way that we understand it is, you know, from at least the mid nineties, if not sort of the early 2000s. And so I'm wondering, what are the primary goals of Lightdash, given the fact that business intelligence is such a legacy market? Yeah. Absolutely. And
[00:04:49] Unknown:
sometimes even the phrase BI, I'm hesitant to say because of the legacy. I'm actually, I guess, a relative newcomer to the space, you know, only having done data for, you know, over 10 years, but again, more in a research context, and analytics maybe only 3 years. So definitely not long in the tooth when it comes to BI. And I think, earlier on in my career, I kind of did a lot of data consulting with my cofounder Hamza and with Katie as well. And we kind of saw a lot of these BI tools in the wild. So all of the names, you know, Looker, Power BI, Google Data Studio, Redash, Metabase. All of these, we kinda got our hands dirty with and got them deployed in companies. And one thing that we saw work really well was this semantic layer. So, I guess, there's a couple of ways that you can do BI.
You can give everybody access to all of the raw data and say, take it from here, there's the internal company data. And, actually, if everyone in the organization is comfortable with SQL, which sometimes in, like, startups where there's, like, a strong technical founding team, this is often the case, it can work really, really well. Like, everyone knows SQL, everyone has, like, the right context on the data, and everyone can kind of just get the answers they need. And another alternative is you have data analysts who work full time, and they answer questions on behalf of everybody else. I guess the problem that people have is it just tends to, like, not scale so well, and sometimes isn't a really enjoyable experience for the analyst. So the idea here is that you have, you know, completed reports, maybe in the form of dashboards, or just, like, you know, a quick message over Slack: hey, can you tell me, like, what the number of active users was over the last 7 days? And nobody has the ability to, like, tweak the questions and everything, so it kind of ends up making the data team a bottleneck.
And one thing we kind of found deploying BI tools like Looker was that this semantic layer, which is where you give just enough self-serve to the user, and we could talk a bit more about that, works really, really well. So you kind of define some metrics in advance, for example, total number of active users and total revenue and things like that. And you centralize them all into one place, and you allow people to kind of pick and choose metrics and break them down in, I guess, a way that works for them.
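The semantic-layer pattern described here, metrics defined once and centrally, can be sketched as a YAML fragment in the style of dbt model properties. The model, column, and metric names below are made up for illustration, and this is the general shape of the idea rather than Lightdash's exact syntax:

```yaml
# Hypothetical example: metrics declared once, so business users can
# pick, combine, and break them down without writing SQL themselves.
models:
  - name: users
    columns:
      - name: user_id
        meta:
          metrics:
            total_active_users:
              type: count_distinct   # distinct users over the chosen period
      - name: order_total
        meta:
          metrics:
            total_revenue:
              type: sum              # summed over whatever dimension the user picks
```

With definitions like these in one place, "total revenue by week" and "total revenue by country" are the same metric sliced differently, rather than two hand-written queries that can drift apart.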
[00:06:55] Unknown:
Given the focus on leveraging the sort of newest evolution of the data stack, you know, what a lot of people are terming the modern data stack, what are the sort of areas of focus that you're targeting, and who are the sort of target users and the target scale of organization that you're focusing on? And how has that particular area of focus informed the prioritization of features and the overall user experience that you're building into Lightdash?
[00:07:23] Unknown:
Yes. So as the modern data stack has evolved and there's been sort of a rise of SQL-based analytics, we've also seen, you know, new roles appear, new jobs such as the analytics engineer. And the job of the data analyst has changed. There's definitely been a shift away from this big data engineering overhead that there was, managing these complex, you know, kind of streaming data pipelines and things like that. So just having this centralized data warehouse that has a SQL interface, I think, like, the amount of engineering overhead has really been reduced, and that has freed up these new positions for analytics engineers to go in and clean up data. And, you know, a lot of these data teams now have a huge amount of context on what's inside that data warehouse and how best to access it. And their job is to kind of, like, release that knowledge onto the rest of the organization and enable the rest of their company to query that data and start putting it to use. So they're really our target profile. These analytics engineers really felt like the BI tools that were out there were a struggle for kind of modern data analysts, who were copying a lot of business logic that existed in dbt, which they loved. It's actually probably one of the most loved tools I've ever seen. I don't know if you've been in the dbt Slack. It's almost like a cult. I don't know. Can you say that?
But it's an awesome tool and an awesome community. That community has been really boosted by, I guess, how new the role is, and there's been, like, a real sense of camaraderie between data analysts, who I think previously were quite kind of shut away in a broom closet in companies. And now they're, like, really stepping into center stage, and they're becoming, like, huge productivity powerhouses. But we just kinda felt like the tools to support data analysts, like, at the BI layer were kinda failing. So they were copying a lot of stuff from dbt into their BI tools.
So a lot of the amazing experiences that data analysts have been getting in the transform layer and in modeling, like, version control and things like that, were all kind of missing in the BI layer. So we wanted to bring some of that, I guess, engineering rigor and some of that analytics engineering approach to the BI tool. So they're really our target audience. The problem there is that the job of the BI tool is really to bring two very different user profiles together. It's basically an interface for the data analyst and the rest of the business. In a way, our end user is actually a secondary user. And, you know, for Lightdash to really be successful, it has to enable less technical folks to access the data in the data warehouse and get insights without too much assistance from the data analyst. But the starting point for us, because we felt like they were underserved, is we're trying to create a great experience for data analysts.
[00:09:56] Unknown:
And I feel like once we have that, then we're gonna double down on helping the rest of the organization to interact with that data. In terms of the overall market for business intelligence, you mentioned a few of what I would term maybe second- or third-generation business intelligence tools with Looker and Power BI and Google's Data Studio. And, you know, going further back, we've got the sort of JasperSoft suite and, you know, SQL Server Integration Services and the, you know, previous generation of Microsoft business intelligence, and, you know, a whole suite of other sort of older legacy tools, you know, Pentaho. But then in the sort of more current set, there's also the evolving uses for things like Superset as another tool that's gaining a lot of attention.
I've also seen tools coming out recently, and I apologize to whoever created it, I've forgotten the name, but there's one that I saw where, essentially, you write some markdown and some SQL, and then it will just create a static site that will use JavaScript to query your data warehouse to keep the charts up to date. And I'm wondering what you see as kind of the breakdown of sort of general use cases for business intelligence and some of the ways that these different styles of BI are sort of targeting different scales of organization, or some of the different sort of philosophies that go into
[00:11:17] Unknown:
how the BI should be designed and the ways that it should be employed? Yeah. It seems like such a fertile ground at the moment in the BI space. And you're right. There's a whole bunch of new tools that are coming in. Actually, I was trying to remember the name of this markdown tool, but I was just chatting to the founder the other day and took it for a spin. It's really, really cool. And it feels like data science kind of progressed that way in the last 5 years. So notebooks got popular, and actually, these took off in academia, I guess, simultaneously with how they did in a lot of tech companies, which is rare. Right? So, like, usually, the universities and colleges are, like, years behind. But notebooks really, really took off there. Some of my earliest programming experience was in Python notebooks. And I was fortunate enough to do a stint with some of the IPython devs as well, which was cool. And they really changed the way that people thought about kind of, like, what-you-see-is-what-you-get interactive programming.
And I think bringing that back to SQL makes so much sense. It's gonna be a huge productivity boost. And so, yeah, I'm really excited about a lot of those more notebook-y style tools on the one hand. Those also brought a lot of downsides, though. We've also been burned by notebooks, I think, in the data science and data engineering community. We know there's problems with version control and things like that. So they're definitely not a one-size-fits-all. But there's a huge need for those tools. Yeah. I thought, you know, the question was a bit around, like, how do we bisect the modern BI market? And I think there's, you know, a whole load of new data visualization tools that enable new workflows. And I think it's gonna be a huge boost of productivity.
Right now, I'm just trying to choose which one to use. I'm trying them all because they're all a lot of fun. But, yeah, I think this is gonna be really cool. And, actually, some of the newer players in the BI space, thinking back over the last 5 years, so things like Redash and Metabase, they became quite cool SQL developer tools as well as BI tools, because they just simplified the interface down. You got, like, a SQL box and you got a visualization. And I feel like that was the first step towards a good kind of SQL developer environment. And I think these notebooks are gonna be another big step up. So I think that's exciting. And then data products, I'd say, are something where you have much more engineering firepower on kind of the front end. So you have someone who's able to maybe build a specific app. And again, this is something that happened in data science as people moved from notebooks to, okay, I have a specific machine learning model, and now I wanna create, like, a kind of an interactive experience so people can, like, slide a slider and see how the predictions change. Streamlit comes to mind. I don't know if you've seen this also. It's geared towards data science people, so it doesn't require too much front end development. That enables you to build a pretty customized UI on top of some of your, like, data science models. And we're seeing that come to BI as well. You know, a lot of products are looking to do embedded analytics. So how do you not only serve your own business with all of the data in your data warehouse, but how do you then pass that on to your customers? Do you ever get these weekly update emails? So it's like, here's your week with X product. And, you know, it tells you how many queries you ran or how much you spent or, you know, how many times you logged on. I actually find those, like, quite fun. The ones from Sentry are really good. They got this right a long time ago.
So it's like, here's how many errors you had, how it compares to the last month. And also, CSS in email is absolutely horrible. So, like, kudos to whoever wrote that, because they're really beautiful emails.
But all of that kind of stuff is about, like, passing analytics back to the user. And we're seeing a lot of new tools that are enabling the modern data stack to power those kinds of workflows. And those were also really, really underserved. And then coming back to Lightdash, there's a third, I guess, more, like, traditional BI vector, which is that you have all your data in your data warehouse, and you wanna increase visibility across the whole org and kind of enable people within the organization to explore the data and visualize it. And I think that there's still a really big market there for companies to not only create dashboards. I mean, dashboards get a lot of hate.
But ultimately, it's about letting people kind of, like, probe all of the internal data without too much flexibility. And we can come back a bit, I guess, to the metrics layer and the semantic layer. But it's about, like, enabling all of that internal data to be shared, like, throughout the business. And I think the quickest way to do that is often through some of these more traditional BI tools, which is like: I know what SQL query I need to write, and then from there, I can create a visualization that helps me kind of tell a certain data story. And sometimes, having the flexibility to build an app from scratch is not completely necessary. Maybe even changing the colors on the chart isn't that necessary. I don't know. That's not something you can do in Lightdash now.
So, yeah, I think I would kinda split it that way. It's the notebook tools, the kind of data experiences, like really customized data experiences. And then these traditional BI products, which are about kind of getting from 0 to a data insight, like, very quickly and kind of on rails, if that makes sense.
[00:16:00] Unknown:
Bringing us into Lightdash itself, can you talk through some of the ways that it's architected and some of the sort of design goals that you have oriented around as you have iterated on this product?
[00:16:12] Unknown:
Yeah. So I guess the first big interesting part of the architecture is that you need to be on dbt to use it. So right now, it's a really awesome experience if you already have a dbt project in your organization. So just again, if you're not familiar with it, a dbt project is basically, you know, a big repo, a big GitHub repo full of SQL files. And in those files, you basically are transforming, cleaning up all the data in your data warehouse. The logo of dbt is like a big crank handle, and I always think of it as like a crank. You turn this crank and you take all the dirty data from your data warehouse, transform and clean it up, and then you stick it back in the warehouse. You usually have a bunch of raw tables and then a bunch of, like, analysis-ready tables.
And those are where you've kind of injected some business logic as well. So you usually take something called, like, timestamp that comes from your CRM, and you'll call that, like, account created at or something like that, and start introducing that kind of business logic. And a lot of the transformations would also inject business logic. So one great thing about that is, like, you have this one repository where a lot of how you define data objects around the company has become centralized in one place. And one thing that kinda sucked about not having your BI tool connected to that repository of truth is you were having to, like, rewrite it again in another tool. So you would have dbt create all of these tables in your warehouse with lots of documentation and lots of lineage about how they were created. And it was, like, a great data experience. And then you connected your BI tool, and the connection was just: give me the credentials for the data warehouse. And all it really knew was the column names, and from there, there was nothing else. So you would end up redocumenting this stuff all over again in the BI tool. And, yeah, so it's about bringing all this context forward.
So what we did was, there's a meta tag in dbt, so a lot of it is also kind of parameterized as YAML. And what we did is we just piggybacked off that meta tag. And, by the way, this meta tag is, like, a huge hotbed of applications jumping in. It's really exciting. It kinda reminds me a little of Kubernetes in some ways, where you have these annotations. So again, like, Kubernetes is a way to provision compute resources to deploy applications. And that has also been a real kind of, like, hotbed for the open source community, because anyone can jump in and kind of integrate their application into that spec. Yeah. What we were able to do with dbt is, like, use that metadata to put in all the configuration we needed for Lightdash. And also, a lot of that config would otherwise be duplicated from dbt, so we could just read it straight from dbt. So, like, one very simple example is column descriptions.
So for most tools at the moment, you have all of your descriptions for every column in dbt, maybe you're very well disciplined, and then in your BI tool, you're having to redocument them. Whereas with Lightdash, we basically pull all of the columns and their documentation straight from dbt and surface them in the BI tool right away. And they stay in sync too, which is cool. So if you delete something or if you rename something, it just gets updated straight away. So it's about treating that repo as a source of truth for mere mortals in the business
[00:19:04] Unknown:
and not just as the transform layer. Because of the fact that you have layered Lightdash on top of dbt, in a lot of ways, you've kind of hitched your fortunes to that of dbt and its community. And I'm wondering, what are some of the sort of benefits that you're seeing from that, and what are some of the risks that you anticipate as a result of that choice?
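The column-descriptions example above can be sketched as a standard dbt schema.yml entry; the model and column names here are hypothetical, but the shape follows dbt's usual model-properties convention:

```yaml
# Hypothetical dbt schema.yml entry. Descriptions written once here
# can be read by the BI layer, so renaming or deleting a column in
# dbt propagates automatically instead of the two tools drifting apart.
version: 2
models:
  - name: customers
    description: "One row per customer account, cleaned and deduplicated."
    columns:
      - name: account_created_at
        description: "When the account was created; renamed from the CRM's raw timestamp."
```

Because this file already lives in the dbt repo under version control, the BI layer inherits both the documentation and its change history for free.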
[00:19:23] Unknown:
So the upside is much bigger than the downside. dbt is a great tool that, I think, has already really proven itself to data analysts. And like I said, the community is really, really strong. Before we launched Lightdash, you know, we were already a part of the dbt community. So it's quite nice. It's kind of like you're at home and you're among friends and colleagues. It's a really fun place to kind of, like, build a new product. And we've segmented the BI market, you know, and our target audience is really dbt users that wanna visualize the tables and data they're creating with dbt.
And that gives us, like, a very clear mission, I suppose. And so when we design features and when we make product trade-offs, we're able to optimize, like, very easily for that market. Some of the downsides, of course, are that we're tied to another technology. Because it's open source, we're actually able to contribute to it, which is fantastic. And that's a big part of the reason that we made Lightdash open source. But the downside is that there's a big part of your stack that you don't have full control over. An example of that is we rely on a kind of server that's bundled with dbt. That code recently got, like, completely ripped out of the codebase and moved to kind of, like, an archived state, which is fine. We're adapting. It's, like, supported until the end of next year, and we're gonna be migrating to new infrastructure in the next few days. But, you know, we had a bit of a heads up, but it was a bit of an unexpected change. And I guess there could be more problems like that. I'm not sure. But it's working really well. And with everything being based on a kind of YAML-based syntax, you can have a lot of shared tooling around that. It's very easy for everyone to, like, read those YAML files and to interpret the configuration in their own way. And I think, like, we've got a pretty lightweight handoff between us and dbt. The integration goes deep, but, essentially, the layer that sticks us together is pretty flexible. It's mostly configuration.
So I don't foresee too many problems there. So, yeah, definitely pros and cons.
[00:21:12] Unknown:
As far as the engineering that's going into Lightdash, what have been some of the most challenging problems that you've had to tackle? And what have been some of the most sort of interesting or thorny logical complexities that you've had to deal with as far as being able to build out this product and meet the needs of these two different user bases that you mentioned earlier, of the data analyst and the business user?
[00:21:38] Unknown:
So for some context, we're full stack JavaScript, TypeScript, and that enables us to move really fast. So, you know, we had a React front end, and so, of course, like, you have to pick up JavaScript for the front end. And because we were a really small team, well, we're still a really small team, it was just easier for us to stick to, like, the same language for the front end and back end. We have a lot of shared definitions between the front end and the back end. It kind of simplified all the tooling. So we picked up Node for the server. In the team, we had, like, kind of mixed experience there. But an interesting challenge with that is, in the Node.js community, there isn't a great SQL abstraction. The best example I can think of is Java.
So there's this shared Java driver, where basically, as long as you configure it, you can connect to any database with the exact same interface. The JDBC driver, or jbdc, I always mix up the two middle letters. And in Python, there's SQLAlchemy, which often gets used as an ORM. But underneath, that is powered by a kind of SQL abstraction, which means if you run all your SQL queries with SQLAlchemy in Python, as long as somebody adds, like, a new connector to SQLAlchemy for that database, then you're kind of automatically compatible with it. Which is very useful, especially as we're seeing, like, a lot of innovation in the database technology space, which is cool, because that hasn't happened for a long time. And so, yeah, being tied to something like SQLAlchemy, you asked about Superset earlier, I think they are, and I heard them talking about this on a previous podcast.
That's a huge benefit they get. In Node, we don't have anything like that. Like, NPM is full of a lot of wacky dependencies, the long tail. If you want to build something in Node, someone's definitely already done it and published it. But there's not much standardization of SQL. And I think that is a challenge for us. And it means that we're kind of having to build it a bit. We're not dedicating time to it at the moment. But there are a few other open source projects we're talking to, where we're trying to say, can we help each other out a bit and build this SQL querying layer? And just standardize a little bit the configuration and the queries. So that is a big challenge, which is as yet unsolved. We actually started by piggybacking off dbt to run the queries. They already have some of this logic inside dbt. But it's faster for us to basically run those ourselves. And with the change in the dbt architecture I mentioned earlier, it's kind of a necessity for us. So, yeah, I would say that that's an unexpected challenge of Node. So, yeah, if anyone out there is developing a SQL based abstraction
[00:24:00] Unknown:
for Node, I'm very interested. You don't wanna just figure out how to, you know, make some bindings to the ODBC drivers? Because that's easy and fun. Right? Yeah. Exactly.
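For listeners who haven't used the SQLAlchemy abstraction being described here, a minimal sketch of what the Node ecosystem is missing: the same query code runs against any backend the library has a dialect for, and swapping the connection URL is the only change needed. This uses SQLAlchemy against an in-memory SQLite database; the table and values are made up for illustration.

```python
# Sketch of SQLAlchemy's dialect-agnostic querying: the same code runs
# against SQLite, Postgres, BigQuery, etc. just by changing the URL.
from sqlalchemy import create_engine, text

# Swapping this URL for e.g. "postgresql://user:pass@host/db" is the only
# change needed to target a different database, assuming a dialect exists.
engine = create_engine("sqlite:///:memory:")

with engine.connect() as conn:
    conn.execute(text("CREATE TABLE orders (amount INTEGER)"))
    conn.execute(text("INSERT INTO orders VALUES (10), (32)"))
    total = conn.execute(text("SELECT SUM(amount) FROM orders")).scalar()

print(total)  # 42
```

Because every query goes through the same interface, a BI tool built this way inherits new warehouse support for free whenever someone contributes a dialect.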
[00:24:12] Unknown:
So, like, I think connecting to those drivers is a bit of a pain. But, funnily enough, there's a tool out there called SQLPad. It's a BI tool, but quite lightweight. It's not pitched as a BI tool but more as a developer tool for a SQL analyst. And I was chatting to Rick, who runs it. He does this completely in his spare time, which is amazing. It's also full stack JS. And he has exactly what you described, the bindings to the ODBC drivers. And we had a long conversation about this. And he was like, do not do it. Whatever you do, don't do this. Which saved me, actually, because we were just putting it on the roadmap to attempt to do this. So, yeah, in chatting with Rick, we're trying to get to something a little simpler. But for now, we're basically calling all the native libraries. That's definitely a big challenge. I would say another challenge we've had on the engineering side: dbt is written in Python, and we're written in JavaScript.
And instead of trying to marry those on any technical level, we just run it as a subprocess. And I would say it is absolutely horrible to do that. That can go wrong in a lot of ways. The tooling we get in Node to run subprocesses is pretty nice, actually. But there's some pretty flaky code in there that we're just ripping out this week, actually. It's checking the logs coming from dbt, and if a certain message string appears in the logs, then we take an action based on that. That kind of never felt great. We've pulled that out now. It was great for an MVP, though. So, yeah, there's definitely a challenge there about crossing language boundaries. And I think that's 1 frustrating thing. Actually, maybe if I can make a bigger point around that: SQL is mostly very portable.
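The flaky string-grepping being described can be made less fragile with dbt's structured logging: dbt accepts a `--log-format json` flag that emits each log line as a JSON object. A hedged sketch (in Python for brevity; the same idea applies from Node's `child_process`). The exact log fields vary by dbt version, so the `info.level` / `info.msg` shape shown here is an assumption, and the sample line is invented for illustration:

```python
# Sketch: drive dbt as a subprocess and parse structured JSON logs instead
# of grepping free-text messages for magic strings.
import json
import subprocess

def parse_dbt_log_line(line: str) -> dict:
    """Parse one JSON-formatted dbt log line into a dict. The exact field
    layout varies by dbt version, so treat the schema as an assumption."""
    return json.loads(line)

def run_dbt(args: list) -> list:
    """Run dbt with JSON logging and return its log events (requires dbt)."""
    proc = subprocess.run(
        ["dbt", "--log-format", "json", *args],
        capture_output=True, text=True,
    )
    return [parse_dbt_log_line(l) for l in proc.stdout.splitlines() if l.strip()]

# A hypothetical log line, for illustration only:
sample = '{"info": {"level": "error", "msg": "Compilation Error in model orders"}}'
event = parse_dbt_log_line(sample)
print(event["info"]["level"])  # error
```

Branching on a parsed `level` field survives wording changes in dbt's messages, which is exactly where the string-matching approach breaks.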
And 1 thing that has been a frustration, I think, in the data space is that as people build better and better tools, you end up reaching for, you know, your general purpose programming language to implement them. And it means that it becomes harder to integrate stuff. So, for example, in the Singer ecosystem, if the listeners are new to Singer, though maybe I'm not giving enough credit, it is the Data Engineering Podcast, it's a spec for defining, you know, how you scrape a lot of data from an API and then load that into some kind of database. We use it, for example, for pulling all of our GitHub data. So who's starring the repo and who's opening issues, and we put that all into a database. And a lot of implementations of that are in Python. And it makes it hard to build tooling around it. The ecosystem is, I guess, polyglot. So can we use a Python tap with a Java loader? And, you know, the ELT space as well is kind of blowing up a bit, I suppose. And you see Airbyte's approach to this is to kind of containerize everything, which has a lot of pros and cons. Meltano, on the other hand, is all Python based. And so they are kind of making a lot of design decisions now about, you know, what can I do if I've written my loader in Java or in Go? So, yeah, I guess this is a problem in general. And although I said SQL is portable, of course, the biggest engineering headache is that it pretends to be portable.
[00:27:01] Unknown:
But every database has something slightly different, usually around dates. Dates and, like, datetimes are probably the major headache. And even just between Postgres and MySQL, there are differences in the rules around how quotes are used and which style of quotes.
[00:27:15] Unknown:
Yeah. Definitely. These are the most common early issues that got raised on the repository, with someone trying a new database. They're like, I get this error back. And we're just like, great, it's not backticks, it's double quotes. You can do some really wacky stuff. But there's a convergence between the big tools, which is kinda nice. But, yeah, we have to manage all of those small discrepancies.
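The backticks-versus-double-quotes discrepancy just described can be sketched concretely. The dialect groupings below are indicative (MySQL and BigQuery use backticks for identifiers; Postgres, Redshift, and Snowflake use double quotes), and the doubling convention for escaping is the common SQL one:

```python
# Sketch of per-dialect identifier quoting, the kind of small discrepancy
# a BI tool has to manage when generating SQL for different warehouses.
QUOTE_CHAR = {
    "postgres": '"',
    "redshift": '"',
    "snowflake": '"',
    "mysql": "`",
    "bigquery": "`",
}

def quote_identifier(dialect: str, name: str) -> str:
    q = QUOTE_CHAR[dialect]
    # Escape an embedded quote char by doubling it, the usual SQL convention.
    return q + name.replace(q, q * 2) + q

print(quote_identifier("postgres", "order"))  # "order"
print(quote_identifier("mysql", "order"))     # `order`
```

A real query generator also has to handle date functions, casts, and limit syntax per dialect, but quoting is where new-database bug reports tend to start.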
[00:27:32] Unknown:
Another interesting element of this problem space, and something that we've touched on already a few times, is the idea of the semantic layer, or what's being termed now the metrics layer. And there have been projects coming out, such as Transform and Metriql, that are focusing on providing that metrics, business semantics layer as a dedicated project, something that is a discrete element within your overall platform. And I'm wondering what your thoughts are on the approach that Lightdash is taking, where it's relying on dbt to provide some of that metrics metadata, doing those transforms and propagating that into Lightdash, versus having it as a completely distinct service within the overall platform and then consuming that through the business intelligence layer, and just some of the trade offs that brings about, and where you're focusing on the semantic layer and how it manifests in Lightdash?
[00:28:31] Unknown:
I love this question because someone asks every day about Lightdash and the metrics layer. Some interesting historical context, maybe history isn't even the right word. When we launched Lightdash, I guess about 6 months ago, the phrase metrics layer wasn't out in the wild yet. The initial vision for Lightdash wasn't necessarily around building a metrics layer. We had to do that out of necessity. But maybe it's helpful, I guess, since I've also mentioned metrics layer a few times and semantic layer, to talk a bit about what I mean by that. So I mentioned before, you can kind of give people access to the raw tables and they can write SQL themselves. What the semantic layer does is it enables an analyst, or somebody with technical skills that understands the data, to write, I guess, small Lego blocks of SQL code. So it's like, here's how we define revenue over time. This is a revenue over time calculation. And here's how we define an active user, and here's how you do active users per week. And there's always something weird in these things, like computing revenue but removing a certain week where something went wrong, you know, something like that. And that all gets encoded partially in the transform layer in dbt, but a lot of stuff you only kind of know at runtime.
And that's usually because you cut the data a certain way. So you could pre compute revenue per day. That would be quite normal, to have a revenue daily table. But then someone's gonna say, hey, but I want revenue daily split by x or by y, like by geography or something like that. And those interactive calculations, you can't preempt them without computing infinite configurations of a dashboard or chart someone might wanna create. So what metrics do is they enable you to build the small components that build up the bigger SQL query. It's basically a SQL template. It's a big old SQL template, which is really reductive. So apologies to anyone.
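The "big SQL template" view just described can be sketched in a few lines: a metric (an aggregate expression over a column) plus whatever dimensions the user picks at request time compile down to one `SELECT ... GROUP BY` query. This is a deliberately reductive toy, not Lightdash's actual compiler, and the table and column names are invented:

```python
# Toy sketch of a metrics layer as a SQL template: combine one metric with
# user-chosen dimensions into a query generated on the fly.
def compile_metric(table, metric_sql, metric_name, dimensions):
    select = list(dimensions) + [f"{metric_sql} AS {metric_name}"]
    sql = f"SELECT {', '.join(select)}\nFROM {table}"
    if dimensions:
        sql += f"\nGROUP BY {', '.join(dimensions)}"
    return sql

# "Revenue split by geography" becomes one on-the-fly query:
print(compile_metric("orders", "SUM(amount)", "revenue", ["geography"]))
```

This is why you can't precompute every cut of the data: the dimensions are only known when someone asks, so the template is expanded per request.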
But, yeah, just rewinding a little bit. The metrics layer is basically a big SQL template. That's what I mean when I talk about it, which is, how do you translate this idea of a metric, and I guess how you group it, and compile it down to SQL. And the history, again, I'm not old school BI, but my first experience of that was with Looker. So in Looker, you have measures and dimensions. Measures are computations, total revenue over time, and dimensions are how you split it. So show me revenue by geography, or show me revenue by day. Yeah. So that's pretty much how I define the metrics layer. And when we built Lightdash, like I mentioned, we kinda had to do this out of necessity. We actually tried to leverage Cube.js. I don't know if you know this project. So Cube.js is a JavaScript project that allows you to build some of these data experiences. It's now pitched more as headless BI, which is the idea that it gives you everything you need to get the data to compute the charts that you would use in your business intelligence tool, like total active users over time and things like that, but doesn't give you the UI or the front end for it. Which is really, really cool, because you get all of the infrastructure out of the box, which includes things like caching the queries, which is a, you know, really complex problem. And then you build the UI that you want on top. So as long as you have some front end developers who'd like to work on this and you want something really specific, that's awesome. And we tried to leverage Cube.js early on to power Lightdash. And the problem that we had there was that it uses JavaScript files to define the metrics and dimensions, to define the semantic layer.
And that was really hard for us to integrate with, because it was hard to build tooling on top of it. And we went back and forth a little bit with the founder, and it was like, you know, it's a design decision that we've made and it's hard to go back on, which is a shame. It's a really cool project, but it was hard for us to extend. And then 1 day we were like, hey, we're just gonna have to build this internal metrics layer. So, yeah, we have 1. But we didn't know it was called the metrics layer when we built it. We just thought it was a SQL template. And I think having it in a separate tool makes a lot of sense. And with Lightdash, we're already ready for the metrics layer. I think the future, when the metrics layer comes in, is uncertain. I'm not sure exactly how it's gonna fit into the space, and a lot of the tooling is still quite immature. Metriql has done an amazing job. So definitely check out their docs. They launched in earnest like a couple of months ago, but they've been developing it for a while. And I would describe Lightdash, I guess, as, we have a metrics layer, but we're also kind of ready for the external metrics layers. I don't know how similar the layers are gonna look. And truth be told, I couldn't say, does everybody need the same metrics layer, or do they need something slightly different? You know, when you pick up something like Looker or like Lightdash, they have an opinionated way of how a metric is computed, how filters are applied, and how they render as a SQL query in BigQuery versus Redshift. And people might wanna have control over that. So I don't know if there'll ever be 1 metrics layer to rule them all. But I think there'll be a bit of an open standard that evolves around the metrics layer. I think most use cases are served by 1 metrics layer and 1 kind of SQL template.
So I'm interested to see how this space evolves. But for Lightdash, we have a metrics layer. I guess we don't see it as a defensible part of the product. It feels like something that should be owned by the community. Is that too inflammatory to say? I don't know. I think there'll be an open standard. We've been talking to people about how we can align all of our metrics layers a little bit, to help people jump from tool to tool and to let us all read from the same configuration. I mentioned something like Kubernetes before. It'd be cool to see something like that for metrics. And so, yeah, we're definitely ready for it. I think what we're really passionate about is the kinda end to end experience for the analyst. And for the technical teams, it's an interesting discussion, you know, should the metrics layer be spun out as a separate part of the stack? Should we unbundle it? Is it better to be bundled? But for the end data consumers, it doesn't make a huge difference. And so sometimes I feel like there's been a lot of talk about the metrics layer recently.
I definitely feel like sometimes we've lost sight a little bit of the end use cases for some of the data, which is, does this help the end user out? And unbundling the metrics layer might mean we lose the ability to build user experiences on top of it, which is what the BI tools with metrics layers have done. So, for example, with Looker, the whole user experience is built around their definition of a metric, and filters, and things like that. If you unbundle that, and if there's a range of very different metrics layers, I wonder how difficult it will be to build experiences on top of those that are really smooth. And so, yeah, that's 1 thing we're trying to figure out. Do we need to hold on to our metrics layer, or are we all gonna be able to kind of merge a little bit and create a more holistic experience? I think 1 thing that's really exciting is the idea that, just like how dbt standardized transformations and centralized all that knowledge in 1 place, the metrics layer could help you centralize a lot of the business logic behind these on the fly calculations.
So, you know, how you wanna compute revenue or active users over time. And that's definitely pretty exciting.
[00:35:04] Unknown:
Another interesting element of the broader data ecosystem, and also how it ties into business intelligence, is the current strong focus on metadata and lineage tracking and data quality reporting, and being able to surface that information at the point of access. Being able to see, as I'm querying this set of data tables, or as I'm querying these metrics, when was it last computed, how up to date is this data, and what are some of the quality errors that have happened that might inform how much skepticism I might have about the answers that I'm getting? And I'm curious how you're factoring that into Lightdash, or what your current focus is for being able to take advantage of some of that activity that's happening in the ecosystem.
[00:35:54] Unknown:
I think the lineage is really exciting, because it felt like nobody cared too much about it outside of enterprise. Then, as the tooling improved, getting lineage became easy. We got it for free. I'll bring it back to the transform layer: once somebody decided, hey, if we add references between our tables in the transform layer, so, you know, I select from this table and this table, what if we just stored that reference and used it later to build a big graph of the dependencies? Then we started getting lineage for free, and we realized, wow, this is actually really, really useful for debugging and for documentation, especially for the technical teams.
I guess how all the transform layer fits together is not necessarily as useful for the end consumers, but it's been a huge benefit to technical teams. Then a lot of the tools were developing lineage, and it was dropping off between the individual tools, which just felt like a real pain. It's like, we're so close, we're almost there. And so, yeah, that was 1 thing I mentioned earlier. It was almost 1 of the earliest ideas that brought us to building Lightdash. It was like, oh, what if we could bring a lot of this rich metadata and lineage through the pipeline?
And, you know, there's also the OpenLineage project, which I'll plug, which is a proposal for a standard for a lot of these tools to have a shared API around lineage, which would do the same thing: let's get all of these tools talking the same language in terms of where resources come from and what depends on what. And the reason why lineage and data testing and freshness all matter is because, if you've worked in data for more than 6 months, this will definitely ring true: trust around data is extremely fragile in an organization, especially if there's not a rich history and already a lot of buy in. And when there are errors, or when numbers don't match up, or if someone's intuition is really different to how the data looks and that ends up being a bug, it's really, really hard to recover.
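To make the OpenLineage plug concrete, a hedged sketch of the kind of event the spec standardizes: a JSON document describing one run of a job and its input and output datasets. The field names below follow the general shape of the published spec, but check the current schema before relying on them; the namespaces, dataset names, and producer URI are all invented for illustration.

```python
# Sketch of an OpenLineage-style run event: a shared JSON vocabulary for
# "this job ran, read these datasets, and wrote those datasets".
import json
from datetime import datetime, timezone
from uuid import uuid4

event = {
    "eventType": "COMPLETE",                      # e.g. START / COMPLETE / FAIL
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid4())},
    "job": {"namespace": "dbt", "name": "orders_model"},
    "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
    "outputs": [{"namespace": "warehouse", "name": "analytics.orders"}],
    "producer": "https://example.com/my-pipeline",  # hypothetical producer URI
}
print(json.dumps(event, indent=2)[:60])
```

Because every tool emits the same shape, a consumer can stitch these events into one dependency graph instead of lineage dropping off at each tool boundary, which is the pain being described.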
And I think where things like tests and freshness and lineage come into play is that you can really prove that the results are correct in the face of someone saying, you know, the data is telling me 1 answer, but my gut, my intuition, tells me another. And often those are the more interesting places to be. And I would say it's quite common now that you'd go debug and say something must be wrong in the pipeline. That's the happy path for the engineers, by the way. The unhappy 1 is there is a bug, and someone says this looks great, and we take action on it, and then there's a huge problem. And usually people have eyes on your dashboards and charts long after the analysts or the engineers have set up those experiences, so you kind of lose monitoring over the quality of the data there. It's 1 thing we talked about before: the data can all be correct, but users can still shoot themselves in the foot and get the wrong answer. There are still errors even if all the data is correct. And I think propagating all of this context right the way to the end user instills confidence and also reduces the chance of errors. I think a lot of that metadata is gonna really improve trust in organizations.
And that's 1 thing we're really excited about with Lightdash. Before we built Lightdash, actually, we explored some of the tooling around data quality and things like that. And we've seen a huge number of companies in the last year that are monitoring data quality. And I'm hoping that this whole wave is gonna improve the adoption of data across the board.
[00:39:11] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. Digging into the use of Lightdash, can you talk through the overall workflow of getting set up and hooking it into your dbt models, maybe some of the data modeling that's useful to perform in order to take advantage of what Lightdash provides, and then just the workflow of setting up the system so that end users are able to ask and answer their questions?
[00:40:29] Unknown:
If you're a dbt user, you already have a dbt project with some SQL files and some YAML files that describe the state of your clean, ready to analyze data. With Lightdash, we just inject a few more tags into those files. So, for example, this column has my total sales in, and if you sum that up, that'll give you total revenue for something. And it's a very lightweight amount of syntax. And then what you do is you just load up Lightdash, just clone the repo off GitHub. You can run it locally, and you just point it at your dbt project. And out of the box, you'll see all of your models, and you'll see all of the metrics and dimensions that you can query, and you can start clicking through those and spinning up plots.
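As a sketch of the "few more tags" being described, here is roughly what annotating a dbt `schema.yml` column with a Lightdash metric looks like. The model, column, and metric names are invented, and the exact keys should be checked against the current Lightdash docs rather than taken as gospel:

```yaml
# schema.yml in an existing dbt project (keys are indicative, not exact)
models:
  - name: orders
    columns:
      - name: amount
        description: "Value of the order in USD"
        meta:
          metrics:
            total_revenue:      # becomes a queryable metric in Lightdash
              type: sum
```

Because the annotation lives in the same YAML file dbt already reads, the metric definition is versioned, reviewed, and deployed alongside the transformation it describes.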
Because everything's as code, there is an even simpler way to get started. So 1 thing about transformations being written as code, and a lot of the data sources being standardized, you know, everybody wants to pull Salesforce data, or HubSpot customers wanna pull HubSpot data, is that we have quite standardized schemas in databases and data warehouses that reflect the state of data in those tools. And a lot of the transformation tools now also have standardized packages to handle that. So dbt has dbt packages. So if I pull data out of, let's say, Salesforce, all of my sales data and CRM data, there's a package in dbt called Salesforce.
And you basically install that into a dbt project, and it already has opinionated cleaning and transformations to take that raw data from the Salesforce API and turn it into something really useful and queryable. And because Lightdash is also as code and also lives in dbt projects, what we've started doing is marking up those packages with metrics and dimensions as well. So what that means, and I'll use GitHub as an example because we were working on it recently, is you can take a GitHub repo, initialize an empty dbt project, and load Lightdash. You can basically just install this GitHub pipeline and download all your GitHub data. You can transform it into a clean format. And then when you load Lightdash, you'll already see a lot of recommended metrics out of the box. You can see, here's total stars over time, or here's the number of issues closed in the last 7 days, or here's the ratio of feature requests to bugs appearing on the repo over the last month. So you kind of get all of that out of the box. I guess that's 1 thing you get from having things as code and also having open standards: we all get to collaborate on these end to end pipelines.
So that's 1 very cool way to get started with Lightdash. It works for pretty niche use cases right now, but increasingly we're adding this syntax to more dbt packages. So definitely look out for those. But that's basically the getting started point. You add a few more tags to your dbt project, and from there, you get to visualization much faster. The way this happened traditionally was you would have your dbt project with a lot of rich metadata about what the columns are, how the tables relate to each other, and their types. And then in your BI tool, the starting point would be, tell me where your warehouse is. Okay, here's my warehouse. And then, hopefully, it shows all the tables. And you're like, okay, which of these are the raw ones that I need to ignore? Which ones are, you know, the dbt cleaned up ones that are ready to analyze? Okay, click on that. And then it's an empty project from there. So we're trying to bootstrap a lot of the stuff needed for visualization, and take a big step going from clean tables to visualization, and make that really fast. Because of the fact that the business intelligence system is this interface between the business team and the data team,
[00:43:47] Unknown:
what are the opportunities and interfaces for collaboration that Lightdash provides for completing the cycle of presenting the data to the business users, them being able to ask and answer questions, and then providing feedback on what is useful, what they need more of, what are some of the additional sources of data or metrics or semantics about the data that they need to have exposed to them, and just how that cycle continues?
[00:44:17] Unknown:
As we've designed Lightdash, we've tried to inject points of collaboration, or, you know, things that we really wish we'd had in the other BI tools that we've used. And that's kind of meant making early design decisions in preparation for those. So a lot of the collaboration features we're excited about, we haven't managed to release yet, but we've laid a lot of the groundwork. It's things like having versioning of a lot of the resources in Lightdash, even if they don't come from code. You can define a lot of resources in Lightdash as code, which is great, but, you know, often it's hard for less technical users to contribute that way. So that also enables people to come on and create their own chart and save that straight into Lightdash without worrying about a YAML or JSON representation of it. And having things like versioning there enables people to leave comments, like a Google Doc. That's something that we've wanted for a long time. We spoke a bit about flagging some of the more technical stuff to the end user.
But a lot of those collaboration features we're building out at this point for the end user. 1 thing that you mentioned before is how you track usage of the product. We already have those kinds of things in the database: which metrics are the most used, which dimensions and which tables are the most looked at, all the dashboards that aren't getting looked at. And I think this is a long way down the line, but, again, we're getting people from a state of, you know, here's all the data and here's all the metrics and dimensions, which can be truly overwhelming unless it's really well curated, almost as much as being shown the raw data. We wanna be able to make some recommendations out of the box. We have some work in progress where we want to update the Lightdash homepage so that when you load a model, it just recommends a few questions you could ask by combining some metrics and dimensions. Like, hey, would you like to see total revenue split by week? Or would you like to see total active users split by the first touch marketing campaign they were hit with? So I think having more automation there can really help the end consumers find what they need. And I think already, surfacing a lot of the documentation that the engineers are writing to the end users is really helpful. Analytics around that is something that's quite exciting.
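The recommendation idea just described, proposing starter questions by combining declared metrics with declared dimensions, can be sketched as a simple cross product. The metric and dimension names are invented, and a real implementation would presumably rank by usage stats rather than enumerate everything:

```python
# Sketch: generate starter questions by pairing each metric with each
# dimension declared in the project.
from itertools import product

metrics = ["total revenue", "active users"]
dimensions = ["week", "geography"]

suggestions = [
    f"Would you like to see {m} split by {d}?"
    for m, d in product(metrics, dimensions)
]
print(suggestions[0])  # Would you like to see total revenue split by week?
```

Even this naive version shows why curation matters: with dozens of metrics and dimensions the cross product explodes, which is where the usage analytics mentioned above could prioritize the suggestions.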
We wanna push more of that back into the user's warehouse. With the tool being open source, we have some tracking in there. So you can basically stick in your own tracking key and figure out how people are interacting with the app. And you can pipe that back into your warehouse and then reanalyze it in Lightdash as well, if you like. So, yeah, we're pretty big on collaboration. 1 thing I mentioned at the start was that we're now building out a lot of the experiences for the end consumers, interviewing a lot of the people that are spinning up Lightdash in their companies. And we feel like, you know, we're hitting a pain point for data analysts, and they're trying to create a BI tool they really love, because they spend a lot of time in there curating this kind of data stage for the rest of the business.
And the next step for us is to go to the rest of the business and say what's working
[00:47:07] Unknown:
for you with Lightdash and what isn't working. And you mentioned a few times that Lightdash is an open source project. I'm wondering if you can talk through the motivation and reasoning behind that decision and some of the ways that you're approaching the governance and long term viability of the project.
[00:47:24] Unknown:
Yes. I think 1 thing I mentioned before was that, with a lot of the new tools in the space being open source, it's just a real driver for innovation. It means that a lot of the people building these tools get insights into what's coming in the other tools. It gives you insights into what technologies they're using and what design decisions they're making. And I think it's helped us all. It's a very decentralized system, like ants all working together towards a common goal. We're seeing more alignment between the tools, so landing on specific YAML syntax, having shared drivers to communicate with the databases.
And I think different parts of the technical stack might continue to get unbundled and spun out, because all of the people building the components of the stack have a shared understanding of the other tools in the space. So we saw some of the success of open source tools, having deployed them for other companies. We saw the benefit of being able to extend them, being able to make suggestions and speak to the developers really easily. It, for me, feels like a modern way to build software. So we didn't think too much about whether we were gonna be open source or closed source. I mean, there's a billion counterexamples to show you can build closed software and be extremely profitable.
But for us, open source in the data space was really important, like I mentioned, because of this fertile ground. And because things are developing quickly, it enables these tools to, I hope, drive innovation and enable more plug and play. That's what we wanna do. And we talk to the developers of all these other tools often, to try to figure out how we can merge our road maps a bit and, you know, how we're gonna enable people to get value across the whole chain. And it comes back to the fact that the raw data is never useful before it gets to the end user, and that there are increasingly more steps between those 2 points. I mentioned the engineering experience and dev experience has got a lot better, which is true. But still, overall, you wanna optimize for getting from 0 to 100 quickly. And I think a lot of these open source tools are reducing that. For example, with Meltano, which is an open source project for extract, load, and transform, they are stitching together a lot of tools, like Airflow, dbt, and Singer taps for pulling data from APIs, and orchestrating that all together to give you, overall, a very quick 0 to something setup. And that kind of stuff's really cool. And I think that can only happen properly in open source. And I guess the important thing is more open standards than open source. A shared standard and a shared interface that a lot of people agree on kind of makes all of those tools together much greater than the sum of the individual components.
And, you know, if you kinda think about what APIs did for a lot of these SaaS tools, I think something similar is coming for data. So I'd say that's 1 part. I will say, more so than with other dev tools, in the data space there is a lot of sensitive data, and there are a lot of industries that need to self host. So it's a given that they need to be able to do that. Of course, you can have on prem deployments with closed source and proprietary software. But being open source really gives those companies the ability to test stuff out. They get signed off much faster if they can say, okay, we can run this in a sandbox internally, you know, make sure it doesn't touch any of the other systems. That's a very, very easy thing to do with open source. So there is a small subset of customers that will always run these tools in an open source way, not hosted. You know, like a lot of these open source tools, Lightdash too, we have Lightdash Cloud, which is in beta, and there's always gonna be a bunch of customers that aren't able to use that for regulatory reasons. So open source just means you get to serve them and add some value there, to a customer base that you usually might not be able to support, just by open sourcing the code. So that's good. The other part you asked about was governance, right, which is cool. So I would say, for Lightdash, we're a commercial entity, and we want to build a sustainable business around Lightdash. And that is tied extremely closely to the success of the open source product.
And so that's a 2 way street. Like, we're very motivated to support Lightdash and to keep it going. And, you know, as we build a sustainable business around it, it means that we have people working on it full time. And by the way, to be able to do that now, to work on open source full time, I feel very, very fortunate, and so does the rest of the team. It's really exciting. And I mean, I don't even need to name them, but there are huge figureheads in the software engineering community that have put aside huge amounts of time outside of their careers to develop products that all of us use. And now with this wave of open source and open core businesses, there are increasingly more people who are able to work on open source software, add value that people can reuse and extend, and also kind of get paid for the privilege.
And I think it's a really special time that we're able to do that. I think it's exciting. But, yeah, I would say in terms of governance, it's MIT licensed. So with the code, do with it as you will. And in terms of long term sustainability, it's out in the wild, which in a way gives people a little bit of insurance against the product ever dying, because you always have access to the code. It doesn't mean you'll have support forever. But, you know, Lightdash is tied to a commercial entity, which we'd like to be successful. And that is what we see as the future of sustaining the project.
[00:52:36] Unknown:
As you have been working with Lightdash and working with the end users and your customers, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:52:46] Unknown:
So I gave a shout out earlier to the Meltano team. In a nutshell, what they're trying to do is orchestrate a lot of the end to end extract, load, transform steps, which are kind of split among a few tools, and stitch those together, which I thought was an interesting and curious problem. I'd tried out Meltano and enjoyed it. And AJ at Meltano — we'd spoken a bit about Lightdash — basically opened up a merge request on Meltano to bring the visualization layer into this kind of ELT orchestration. So it's like, how can you extract the data from an API, load it into a database, transform it, and clean it up, and then also do the visualization? Is there a way that we can ship people that entire stack written as code in 1 go? And I would say for me, it was a light bulb moment. You could already configure Lightdash as code, which is how this integration came around. But I hadn't had that moment where I thought, okay, actually, we could stitch this together with the rest of the data stack and start shipping end to end experiences to people, you know, going from nothing to something really quickly. I would say for me, that was really unexpected. I didn't think it would come so early.
And so, you know, chatting to Taylor and AJ and the rest of the team at Meltano was really exciting, to hear some of their vision and kind of be part of that. So I would say that's probably the most surprising and exciting thing that somebody did with Lightdash that I really didn't expect.
[00:54:06] Unknown:
As far as your own experience of building the tool and working with the community, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:54:14] Unknown:
I would probably say my biggest learning, which comes more from the origin story of Lightdash, kind of before we started building it, was that in data organizations, everything falls back to the lowest technical denominator. So if you say that in order to access data you have to know SQL, then that works for everybody that knows SQL. But if there are 1 or 2 people that don't, you eventually have to roll everything back and fall back to the lowest coding denominator. It was in that dynamic that I first saw the semantic layer come out. So I was like, okay, well, what if somebody spent time putting data consumers on rails a little bit — just enough exploration? They'll pick and choose some metrics and cut them how they like. And I think the surprise for me — and I said my background was less in analytics and more in research — was that I was just really surprised to see the success of that in companies, and to see, okay, this is actually quite interesting.
Like, I don't know, I was surprised by the success of the abstraction. It wasn't immediately obvious to me. As someone who knows SQL, it actually doesn't click straight away — sometimes, knowing the technical implementation, you end up asking too many questions of the interface. But for less technical users, it seems to click really quickly. And for me, it was a really good cure for what I'd seen a lot of, which was low code SQL builders. My experience trying to onboard people onto those things was that you still need to know what a join is, you know. You still need to know what a group by is. And there's a huge number of people in your organization where it's not their job to learn that. They have a lot of other problems.
And understanding why a left join is different from an inner join is definitely not 1 of them. And so even if you can build a friendly UI so you don't have to write code, you ultimately run into a lot of problems. So, yeah, the first time I saw data modeling — not even just a semantic layer, but this idea of a data analyst that cleans up the data and pre joins data to maximize the chance that people with less technical skills in the organization can access what they need and self serve — I was really surprised. Before that, I thought self serve was a myth. I was like, nobody can have self serve. It's not gonna happen. Centralize your data team, and people are just gonna have to ask for what they need. And then as this kind of analytics engineer role grew, and I worked with more companies that had dedicated data analysts — and also, you know, played the role of data analyst for a couple of years — that was when I saw, oh wow, this really works. It's really cool. You know, to use Lightdash, you really need to have a data person in the company. So our target users are people that are invested in data already and have a full time data analyst.
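That curated, on-rails experience is what Lightdash layers on top of dbt: the analyst declares metrics and dimensions in the dbt project itself, via `meta` tags in a model's schema file, and end users explore only those. A rough sketch — the model and column names here are hypothetical, and the exact schema may differ between versions:

```yaml
# models/schema.yml — a cleaned, pre-joined model exposed with curated metrics
version: 2
models:
  - name: orders
    columns:
      - name: status            # surfaced to end users as a dimension to cut by
        description: "Current order status"
      - name: order_id
        meta:
          metrics:
            unique_order_count:
              type: count_distinct
      - name: amount
        meta:
          metrics:
            total_revenue:
              type: sum
```

A business user never writes a join or a group by: they pick `total_revenue`, cut it by `status`, and the tool generates the SQL.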
Our feeling is that most companies will have full time data analyst roles at this point. There's still a lot of off the shelf tooling, which is cool. So, you know, if you want product analytics out of the box, you have things like Amplitude or PostHog. If you need AB testing out of the box, you have Optimizely. But I would say my experience is that, like, 95% of the time, you hit the limits of those tools about 10 minutes into your business curiosity — okay, don't quote me on those stats. But usually you do this thing and you're like, oh, okay. Yes, usually if someone clicks this button, that should count as a click. But if it's somebody who did something before, that doesn't count as a click. And, yeah, there's some weird business logic like that. If you've wrangled data, you've seen it. So I guess, with all the tooling improving, I would say the success of the analyst role overall was kind of exciting. And we see more companies having analysts who are basically setting up these curated data experiences for the rest of the company. And so for people who are interested in being able to
[00:57:36] Unknown:
have this sort of self serve exploratory capability for their business data, what are the cases where Lightdash is the wrong choice and they might be better served by some other category of business intelligence system, or by building a sort of homegrown solution?
[00:57:54] Unknown:
So first of all, I mentioned this before: if you don't have a full time data analyst, Lightdash isn't a great choice, because the flip side of having metrics and dimensions is that you've introduced a new layer that needs to be maintained. So you've got all of your tables with your raw data in. Now you need someone to maintain the clean version of those tables. And, you know, as new business requirements come in and new data comes in, they have to clean it up. And then when you bring in the metrics layer and the semantic layer, that's a whole other section where somebody has to say, okay, the definition of revenue's changed — do I need to recalculate this metric and change the code? So Lightdash definitely requires all of that to be in place. And so I would say if you don't have a full time data person, if you need something quick and out of the box for product analytics, then, yeah, pick up something like RudderStack or Segment or GA. If you need something out of the box and you don't have someone full time to manage it, then I think Lightdash isn't a great choice.
[00:58:44] Unknown:
As you continue to build out the project and iterate on it, what are some of the things you have planned for the near to medium term, either for the open source or the hosted solution?
[00:58:53] Unknown:
So right now, those 2 things are exactly the same. If you get onto Lightdash Cloud, you just get 24/7 support from the team — mostly; we don't sleep — and you don't have to manage the servers, deployment, authentication, and things like that. But otherwise, feature wise, they're exactly the same. So if you use the open source product, you get all the Lightdash goodness, which is really cool. That won't always be the case, but for the foreseeable future, like the medium term, that will definitely be true. So I guess upcoming stuff is we wanna improve some of those collaboration features that we mentioned. So we're kind of entering this small pivot now where we take the focus off the analyst a bit, which was the starting point for Lightdash, and look more towards the end users, to understand what they're missing. We already have a lot of good ideas around that, like bringing more of the dbt metadata front and center. So 1 thing we don't have now — the dashboards in Lightdash kind of stand alone, but we technically can tie those all the way back to the raw data tables. What we'd like to do is start surfacing if there are failing tests, just to flag that on the dashboard. There are also some things in Lightdash which aren't yet configurable as code, so not all resources can be represented as YAML. And over the next few weeks, we're fixing that, so your dbt and your Lightdash project can have more and more of those mission critical charts and dashboards in version control, so they're locked down.
Some user testing we've done is showing a simple tick, kind of similar to GitHub. You know, you open a PR, and — for me, at least — you experience a lot of stress until you see the calming green tick on your pull request. And so the plan is to start introducing more of this verified content in Lightdash, just to help the end users build some trust around the data or the resource they're looking at. So, yeah, I would say bringing in even more of that context is kind of the main feature. And then I guess in the long term, all of the enterprise features you'd expect. This is part of the business model of Lightdash: we take kind of a GitLab approach, where some of the features that are only needed by much bigger teams are something that we can monetize. So we're not working on that in the medium term, but that's something you'll see coming. So SSO, granular roles and permissions, and everything that you'd expect from an enterprise ready product. Are there any other aspects of the work that you're doing at Lightdash
[01:01:01] Unknown:
or the overall space of business intelligence and the capabilities that the modern data stack brings to that space that we didn't discuss yet that you'd like to cover before we close out the show? I think 1 thing we touched on really quickly that I guess is kind of interesting is, you know, the number of like new data warehouses.
[01:01:16] Unknown:
This was somewhere where I thought, oh, we'd already made great progress. Like, we had these analytical column stores where we could start crunching huge amounts of numbers really, really quickly. BigQuery brought a serverless element to that, too. And for me, I kind of thought we were stopping there. And now we're seeing tools like Firebolt come out that are making analytical queries even faster. And what that's actually gonna enable a lot of the BI tools and downstream tools to do is kind of forget about caching and performance, which is really exciting. Like, a big job of BI tools until now was having to cache queries, or you had to pre aggregate stuff in advance, because the databases were quite slow to return things and you had a constraint on resources or cost. And I think what we're doing with Lightdash is pushing a lot of those performance problems to the data warehouse layer. So we're making things as interactive as possible, and we're trying to make it so that you just leverage a lot of the benefits that come from the data warehouse. Now, this is actually coming to all of the tools. So it's actually quite funny — I think there's a whole marketing thing from Firebolt which is like, make your Looker dashboards 10x faster, a 100x faster. And that's because the bottleneck wasn't in Looker's control per se. A lot of the slowness that people were feeling with that tool was coming from the data warehouse layer.
And, you know, caching could be 1 of the hardest problems in software engineering. Getting caching right is a really tricky problem. And a lot of BI tools have tried to solve that to an extent, but nothing is as good as fixing it at source. And it kind of feels like we're starting a new wave where that might happen. So, yeah, I'd say with Lightdash, we're really doubling down on interactive queries becoming cheaper and much, much faster. And that just gives the end user ultimate flexibility, which is, I think, yeah, kind of exciting.
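The core of that pushdown idea is that the BI layer compiles each metric and dimension selection into fresh SQL and lets the warehouse do the aggregation, rather than caching pre-aggregated results itself. A minimal sketch of that compilation step, with a hypothetical metric registry (this is not Lightdash's actual implementation):

```python
# Compile a metric/dimension selection into warehouse SQL, so all the
# heavy aggregation runs in the (fast) warehouse instead of the BI layer.
METRICS = {
    "total_revenue": ("SUM", "amount"),       # hypothetical metric definitions
    "order_count": ("COUNT", "order_id"),
}

def compile_query(table: str, dimensions: list[str], metrics: list[str]) -> str:
    """Build a GROUP BY query that aggregates entirely in the warehouse."""
    select_parts = list(dimensions)
    for m in metrics:
        agg, column = METRICS[m]
        select_parts.append(f"{agg}({column}) AS {m}")
    sql = f"SELECT {', '.join(select_parts)} FROM {table}"
    if dimensions:
        sql += f" GROUP BY {', '.join(dimensions)}"
    return sql

print(compile_query("analytics.orders", ["status"], ["total_revenue"]))
# SELECT status, SUM(amount) AS total_revenue FROM analytics.orders GROUP BY status
```

Every click in the explorer becomes one such query; if the warehouse answers in under a second, the caching layer largely disappears.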
[01:02:58] Unknown:
Alright. Well, for anybody who wants to follow along with the work that you're doing or get in touch or get involved with the open source project, I'll have you add your preferred contact information to the show notes. And so with that, I'd like to ask the final question of, from your perspective, what do you see as being the biggest gap in the tooling or technology that's available for data management today? I think we touched on it a few times, which shows that
[01:03:21] Unknown:
it's something that I feel is a problem. I'd like to see more open standards. It would be great to see a lot of these tools converge on standards. So OpenLineage is a great example, and I think we could build that for more things, like metrics, saved charts, and test metadata, and start seeing those passed around more of the tools in the space. Our integration with dbt is really deep, to pull that metadata through to Lightdash. And we had to make that choice: we're gonna build on top of dbt. I think there's a future where tools will be able to leverage the metadata that they're holding without having to have such deep integrations.
I'm pretty stoked about that. And I mentioned it a few times: okay, Kubernetes for data. It sounds like a bit of a buzzword, but I think it's true. I think provisioning data type resources as code has huge benefits. And I think it's sufficiently different to, you know, provisioning compute resources like Kubernetes does, but it has the same kind of approach. And I think that is quite exciting. But also, let's build a better UX around it. Like, managing a big repo of Kubernetes YAML files is still not a lovely experience. We'll get there.
[01:04:26] Unknown:
Yes. Managing a pile of YAML files for any tool is not a fun experience. Yeah. Absolutely. Which is ironic that it's become sort of the de facto programming language for most ecosystems.
[01:04:38] Unknown:
Yeah. Pretty much. It's like blurring the line between configuration and programming. It's definitely
[01:04:43] Unknown:
a tricky 1. I digress. But I imagine more auto generation being useful there. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Lightdash and your perspective on the business intelligence space and how it fits into the broader data ecosystem. I definitely appreciate the time and energy you're putting into Lightdash and look forward to seeing some of the ways that it grows and evolves in the future. And thank you again for your time, and I hope you enjoy the rest of your day. Thanks for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Oliver Laslett and Lightdash
Oliver's Journey into Data Management
Focus and Target Users of Lightdash
Modern BI Market and Tooling
Architecture and Design Goals of Lightdash
Challenges in Engineering Lightdash
Metrics Layer and Semantic Layer in Lightdash
Metadata, Lineage, and Data Quality
Getting Started with Lightdash
Collaboration and Feedback in Lightdash
Open Source Decision and Governance
Lessons Learned and Challenges Faced
When Lightdash is the Wrong Choice
Future Plans for Lightdash
Final Thoughts on BI and Data Management