Summary
Data lineage has grown from a convenient feature into a critical need as data systems have grown in scale, complexity, and centrality to the business. Alvin is a platform that aims to provide a low-effort solution for data lineage, focused on simplifying the work of data engineers. In this episode co-founder Martin Sahlen explains the impact that easy access to lineage information can have on the work of data engineers and analysts, and how he and his team have designed their platform to offer that information to engineers and stakeholders in the places where they interact with data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Martin Sahlen about his work on data lineage at Alvin and how it factors into the day-to-day work of data engineers
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Alvin is and the story behind it?
- What is the core problem that you are trying to solve at Alvin?
- Data lineage has quickly become an overloaded term. What are the elements of lineage that you are focused on addressing?
- What are some of the other sources/pieces of information that you integrate into the lineage graph?
- How does data lineage show up in the work of data engineers?
- In what ways does your focus on data engineers inform the way that you model the lineage information?
- As with every data asset/product, the lineage graph is only as useful as the data that it stores. What are some of the ways that you focus on establishing and ensuring a complete view of lineage?
- How do you account for assets (e.g. tables, dashboards, exports, etc.) that are created outside of the "officially supported" methods? (e.g. someone manually runs a SQL create statement, etc.)
- Can you describe how you have implemented the Alvin platform?
- How have the design and goals shifted from when you first started exploring the problem?
- What are the types of data systems/assets that you are focused on supporting? (e.g. data warehouses vs. lakes, structured vs. unstructured, which BI tools, etc.)
- How does Alvin fit into the workflow of data engineers and their downstream customers/collaborators?
- What are some of the design choices (both visual and functional) that you focused on to avoid friction in the data engineer’s workflow?
- What are some of the open questions/areas for investigation/improvement in the space of data lineage?
- What are the factors that contribute to the difficulty of a truly holistic and complete view of lineage across an organization?
- What are the most interesting, innovative, or unexpected ways that you have seen Alvin used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Alvin?
- When is Alvin the wrong choice?
- What do you have planned for the future of Alvin?
Contact Info
- @martinsahlen on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Alvin
- Unacast
- sqlparse Python library
- Cython
- Antlr
- Kotlin programming language
- PostgreSQL
- OpenSearch
- ElasticSearch
- Redis
- Kubernetes
- Airflow
- BigQuery
- Spark
- Looker
- Mode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
You wake up to a Slack message from your CEO who's upset because the company's revenue dashboard is broken. You're told to fix it before this morning's board meeting, which is just minutes away. Enter Metaplane, the industry's only self-serve data observability tool. In just a few clicks, you identify the issue's root cause, conduct an impact analysis, and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. And if you mention the podcast, you get a free "In Data We Trust World Tour" t-shirt. Your host is Tobias Macey, and today I'm interviewing Martin Sahlen about his work on data lineage at Alvin and how it factors into the day-to-day work of data engineers. So, Martin, can you start by introducing yourself?
[00:01:47] Unknown:
Yeah. Hey. Thanks for having me. My name is Martin, co-founder and CTO at Alvin. I've been involved in startups all my career, always with a very technical, hands-on focus. So I'm really excited to be here today and get into the weeds of data lineage and how we solve it at Alvin. And do you remember how you first got started working in data? I think the first real exposure I had to data was in my previous company, Unacast, which is still alive and kicking. There, we were very much dealing with large datasets around GPS and mobility data.
My main role there was to evaluate and work with data ingestion: evaluating data vendors, ingesting data, making sure it had the right quality. This was quite important for us because we paid the vendors based on certain metrics of the data. So I was sitting there trying to figure out how we could automate this data quality work and how I could equip the business. The company was based in New York and Oslo, so a lot of the work was around ad tech and media. Big volumes, and quite a lot of money involved too, so you got a sense of tangible ROI. One of the coolest things I did there was creating Jupyter Notebooks, automating templates in them, and generating these PDFs for the salespeople.
And I think it got me thinking: there's so much low-hanging fruit in this data space that can be automated. Everyone that's really involved knows that you can't automate everything; a lot comes down to people, process, and culture. But there's so much that can be automated, and I think we've just scratched the surface of it. So, going back to my background, I think that's where I cut my teeth in data. We were actually very early on Google Cloud, even before Spotify published that article about moving all of their stuff to GCP. We worked with App Engine, Pub/Sub, and BigQuery in the quite early days and had a lot of fun doing that.
That's the background I have in data, I would say, up until the time was right to take a leap and start my own company.
[00:04:08] Unknown:
So in terms of the Alvin project, can you describe a bit about what it is that you're building there and why you decided that data lineage is the area you wanted to spend your time and focus on?
[00:04:22] Unknown:
I think when you start a company, you always have a strong conviction, and in many cases that's based on your experiences in your own company and the pain points that you have. Immediately when we started, we had this very big focus on the analyst experience: how can we improve the lives of data analysts, data quality, and their understanding of what they have to work with. But we quite quickly found that we probably were a bit too biased. You talk to data people and you get a sense of respect for how different the domains are, how different the processes are. So you need to have a lot of humility going into these conversations, and really understand: what are the pains to solve here? What are the jobs to be done? This is pretty relevant to Alvin's story, because we then took a step back and looked at the bigger picture of the state of data in general. And what we found was that there's a lot more pain in data engineering, I think, than for the analysts.
Once we really went down that rabbit hole, still just working at the hypothesis level, talking to people, building prototypes, and iterating forwards. I think it was sometime in mid-2020 when we started working on analyzing query logs for usage. This was related to cost optimization and understanding the drivers of cost in the data warehouse at the table level, query level, and user level. And once we started really digging into SQL parsing, which I can talk about for a long, long time, we found that data lineage is a hugely interesting topic, from a technical perspective and a more conceptual one. We hypothesized that lineage is actually a dataset like any other dataset: it can have quality and all the other aspects and metrics of a dataset. But it can also be the foundation for very useful products that work on metadata, products that can solve real pains for data engineers and data analysts.
I think that was when Alvin started in its current form. That's when my co-founder and I just said: okay, this is super hard, we need to hunker down, go into build mode, and build an MVP for the lineage and, I guess, the SQL parser. We spent quite a lot of time in that phase, and now we're at the stage where we're coming out a bit more, and I think we feel that the product and the technology are good enough.
[00:07:12] Unknown:
In terms of the core utility and core value of data lineage, there are a lot of different ways that it's being used, represented, and generated. I'm wondering if you can talk to some of the elements of data lineage that you're specifically focused on addressing at Alvin, and the core workflows and use cases that you're aiming for?
[00:07:35] Unknown:
You know, I think it always makes sense to start with how we do it; I think that's interesting. There are other companies in the space that are focused on legacy tools and systems, and with that, the intrinsic focus becomes more about enterprise data migrations and how to move from Oracle or something into the cloud. We made a choice that we want the lineage to be really plug-and-play and automated. So you just connect your service accounts or your credentials or API keys, whatever applies, and we automatically generate this lineage graph and extend it; if you add another system, we automatically connect all the dots. How we do it varies from system to system, but if you look at the data warehouse, like I said, we don't want to focus too much on legacy tools. We focus on the ones that have documented query logs, where you can actually access all the statements that have executed.
So we use that to generate the lineage. And generating lineage is really about going through all the SQL, line by line: parse it, understand at the table level and the column level what is being done, what columns are feeding into what columns, what the transformations are, and essentially how these change over time. So, going back to what I said, it's a dataset. When we talk about lineage, our angle is that granularity and accuracy are super important. If you really want to drive high-impact, high-value use cases, it's important that the data you base them on is very accurate. As with any other dataset, if the data is bad, the actions you take on it are going to be bad as well.
The story of the parser, like I said, could go on for days, but I think it's an interesting journey to dive into a bit. When we started, it was very much: okay, table-level lineage is probably okay. So we started out with the sqlparse library in Python, which I'm sure most of the listeners have dabbled with in some hackathon or something like that. We came pretty far with that, but this, I think, is generally true for writing parsers in Python, and maybe it's an incendiary statement: the performance is simply not good enough if you want to run it at scale. We found that you get a 1,200 or 1,500 line SQL statement, which is not that uncommon now with these bigger companies and more complex data environments.
And then you'll spend 10 or 15 seconds just tokenizing the string into tokens, and then you still have to start processing, which is horrible. So we did all of the optimizations we could. We had Python code that would take the parts that were performance critical, compile them to C (with Cython), and import them; lots of these things. We got a 200% increase in performance, but it still wasn't good enough. If your company is going to be based on lineage as the core foundation of the product, you need to think a bit differently than a hackathon project. So that's when we said: we need to build our own grammar and our own parser from scratch. A tool that a lot of people probably have some familiarity with is ANTLR, which we are using. We actually use ANTLR with Kotlin, because Kotlin is a great language; it's what Java, or what Scala, should have been, as many people say.
So there, it's just defining the grammar, and you have all of this generated Kotlin or Java code. For certain parts of BigQuery, we use ZetaSQL as well. We combine all of this into a unified parsing framework: certain parsers for certain things, and then it all goes into the same big structure that we can process later. This is a pretty nice architecture, because it's almost like you generate an abstract syntax tree for any type of query, and then we can process that later for lineage and usage and statistics on the queries.
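For a flavor of the table-level starting point he describes, here is a minimal sketch using the sqlparse library from the links. It is a toy resolver that only handles plain CREATE TABLE / INSERT INTO / FROM / JOIN patterns; production parsers (including Alvin's) handle far more, such as CTEs, subqueries, aliases, and dialect quirks.

```python
import sqlparse

def _identifier_at(tokens, i):
    # Re-join dotted names like raw.orders that the token stream splits up
    parts = [tokens[i].value]
    i += 1
    while i + 1 < len(tokens) and tokens[i].value == ".":
        parts.append("." + tokens[i + 1].value)
        i += 2
    return "".join(parts)

def table_level_lineage(sql):
    target, sources = None, set()
    for statement in sqlparse.parse(sql):
        # Drop whitespace so neighbouring tokens sit next to each other
        tokens = [t for t in statement.flatten() if not t.is_whitespace]
        for i, tok in enumerate(tokens[:-1]):
            word = tok.value.upper()
            if word in ("TABLE", "INTO"):   # target of CREATE TABLE / INSERT INTO
                target = _identifier_at(tokens, i + 1)
            elif word in ("FROM", "JOIN"):  # source relations
                sources.add(_identifier_at(tokens, i + 1))
    return target, sources

print(table_level_lineage(
    "CREATE TABLE mart.daily_rev AS "
    "SELECT o.day, SUM(o.amt) FROM raw.orders o "
    "JOIN raw.fx f ON o.ccy = f.ccy GROUP BY 1"
))
# ('mart.daily_rev', {'raw.orders', 'raw.fx'})
```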
[00:11:55] Unknown:
One of the things that you brought up is the question of performance. Obviously, if you're providing this as a service, you want to make sure that you're able to process updates in a scalable fashion. But it also brings up the question of latency, where if I, as a data engineer, am building a new workflow, maybe adding a new dbt model or building a new Airflow task graph, I want to make sure that my lineage is getting updated. I'm wondering if there are any latency concerns around how fast people typically want to see those changes reflected in the lineage after that task graph is executed the first time.
And some of the ways that latency issues can crop up, particularly as the size of the team scales, where multiple people are working in the data infrastructure and they need to know in a timely fashion if there are changes to the dependency graph or the downstream assets that are being generated from the piece that you're working on?
[00:12:57] Unknown:
Generally, we would rely on just hourly fetching of the query logs. We haven't received a huge number of reports that that's not enough, I guess. It also depends a little bit on the systems you are talking about, which I think is an important distinction. For certain integrations, we define them as real-time or batch-based: you know, Snowflake and Redshift and BigQuery. Well, with BigQuery, you actually have Google Cloud audit logs, where you can use Pub/Sub and send in real time. But for tools like Snowflake and Looker and Tableau and all of those, you have to fetch the data at some cadence, because they don't support these use cases. For Airflow and Spark and Databricks, which we support, the integration is more real-time: we basically have listeners and instrumenting code that send changes in essentially real time. From our perspective, we obviously want things to be as real-time as possible, but there are also intrinsic limitations in the systems that make this hard at times.
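As a rough, hypothetical sketch of that listener pattern (not Alvin's actual instrumentation), an Airflow task callback can ship run metadata to a lineage API in near real time. The endpoint URL and payload here are invented for illustration.

```python
import requests

def report_lineage(context):
    # Airflow passes a context dict to task callbacks
    ti = context["task_instance"]
    requests.post(
        "https://lineage.example.com/api/task-runs",  # made-up endpoint
        json={
            "dag_id": ti.dag_id,
            "task_id": ti.task_id,
            "run_id": context["run_id"],
            "failed": context.get("exception") is not None,
        },
        timeout=5,
    )

# Attach to every task in a DAG via default_args
default_args = {
    "on_success_callback": report_lineage,
    "on_failure_callback": report_lineage,
}
```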
[00:14:03] Unknown:
In terms of the ways that data lineage is incorporated into how data engineers work, I'm wondering what are some of the stages of the data life cycle, or stages of development, and also the maintenance aspect, where data engineers, data platform engineers, and other people who are interacting with those data assets are looking to the lineage graph to inform or direct the work that they're doing.
[00:14:32] Unknown:
This is an interesting question where our take can be a bit controversial, I guess. But our take is really that lineage, again, is just a dataset and a technology. I think a lot of products and companies default to just providing people with huge graphs, like a scrolling diagram, that are very hard to process mentally, even for people that know the data stack. Our thinking and our approach is really to ask: what are the processes, what are the use cases where lineage can be useful, and where can it provide more actionable insights than having to look through an impossibly large graph? So something that we call testing, or regression testing (other people might call it impact analysis) is a use case that has really been resonating. We have a UI where you can look at a table or a column or something, click on it, and say: what if I made some change to this? What would happen?
And that's good, but we think most people write SQL. So we have this more SQL-based testing, where you can just write any statement. If it's a CREATE TABLE AS SELECT, we would work out the output schema of that SELECT statement and compare it to the current state of the table. Then you look at the lineage and the usage and basically say: well, this is a potentially destructive action that would affect these people who are continually running ad hoc queries on these tables, or who are connected through these dashboards. So it's really thinking about whether the data engineer would want to go into our UI (even though I think our UI is nice) or whether they would rather just write some SQL and understand what it will actually, semantically, do.
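As a concrete illustration of that check, here is a minimal, hypothetical sketch: diff the output schema of a proposed CREATE TABLE AS SELECT against the live table, then flag downstream readers of any dropped column. The schema inference and the lineage lookup are stubbed inputs; none of this is Alvin's actual code.

```python
def impact_report(proposed_cols, live_cols, downstream_readers):
    """Compare a proposed output schema against the live table and list
    downstream readers (from the lineage graph) of any dropped column."""
    dropped = sorted(set(live_cols) - set(proposed_cols))
    added = sorted(set(proposed_cols) - set(live_cols))
    breaks = []
    for col in dropped:
        # Anyone reading a column that disappears is a potential break
        breaks.extend(downstream_readers.get(col, []))
    return {"added": added, "dropped": dropped, "breaks": breaks}

live = ["order_id", "amount", "currency"]
proposed = ["order_id", "amount_usd"]           # schema of the new SELECT
readers = {"currency": ["looker:fx_dashboard"],
           "amount": ["mode:revenue_report", "jane@acme (ad hoc)"]}
print(impact_report(proposed, live, readers))
# {'added': ['amount_usd'], 'dropped': ['amount', 'currency'],
#  'breaks': ['mode:revenue_report', 'jane@acme (ad hoc)', 'looker:fx_dashboard']}
```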
[00:16:30] Unknown:
In terms of the ways that you envision the role of data lineage in the work of those data engineers, how does that factor into the ways that you consider how to store, represent, and expose the lineage graph that you're constructing in the Alvin system?
[00:16:49] Unknown:
Our main vision: I kind of skipped that a little bit, and maybe also the problem that we're solving, so maybe we can go back to that. One interesting thing that we have observed: we started around 2019, I guess, and in the data space that's an age ago, right? It's moving so fast and so many things are happening. What we see is this huge proliferation of dbt. It's been there for a long time, but it's really accelerated now. And with that, and with these changes, everyone is now able to write their models; everyone is able to be a data engineer. But a bit of the problem that we see is that there's so much happening. There are so many tables; what's useful, what's not useful?
I'm definitely not saying dbt is the problem, or even part of the problem, but it's one of the reasons that the problem exists, because you can't replace good process and good culture with just tools. It's a lot easier to create a lot of models and tables, but it's a lot harder to remove them or change them. It's a little bit like the dark side of the moon: you have all of this data being created, but you have no insight into how it's being used and consumed. And that is really the core pain of what we're solving. It sounds a bit unclear, maybe, and I think it is, because there are so many problems that data engineers are facing, so it's sometimes hard to pinpoint exactly one problem. Not one day is the same, right? It's not like you're solving the same problem every day; then it would be easy, and the problem would be solved. But it gets down to these classic questions: what can we change? What changed? How can we move forward without breaking everything?
Even more recently, cost has become quite a big topic. We talked to data teams a couple of years ago, and if we asked them about cost, they would say it's not even a topic. Now, if you look at certain people on LinkedIn talking about Snowflake cost and Databricks cost, it's really becoming a topic. And I think a good comparison is what happened in software engineering. Twenty years ago there were FTP servers, and people would say: hey, I'm working on this file now, don't touch it. There were all of these manual processes that you look at now and think: wow, this is crazy, this is the stone age.
I hope I don't offend anyone that still does FTP stuff. Actually, a fun fact is that in ad tech and media, most of the data exchange is through FTP; even though they have real-time bidding and these super fast exchanges, most of the data moves as daily batches over FTP. So it's quite funny. But anyway, coming back from that tangent: software engineering is now in a state where there's all this tooling. There are automated tests, there's CI/CD, and there's an amazing possibility to really understand how your code behaves and acts in the wild, which is hugely valuable because it allows you to move quickly and revert changes. If latency is going up: okay, we can revert the change and try to identify what the issue was.
We're not saying that we are going to bring all of that to data engineering, because that's a lot. But we definitely see our vision as being a company that can really help data engineers be more effective and more efficient, and have tooling that just integrates smoothly with whatever they're doing, the way a lot of these tools do. Datadog, for instance, I think is a very good example, and CI/CD is also a good example of this.
[00:20:58] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free, or just get the free t-shirt for being a listener of the Data Engineering Podcast, at dataengineeringpodcast.com/rudder. In terms of the modeling at the lineage graph layer, there's obviously the question of what the connections are and what the actual asset is that you're referring to. But what are some of the general sources and representations of information that you incorporate into that graph to facilitate use cases like debugging? What type of information do you need to manage in the graph to make it useful?
[00:22:01] Unknown:
Like I said, I keep reiterating this, but lineage is a dataset. An important thing to bring into this is the temporal aspect of lineage. You can't just put everything up there and assume it's going to be useful. We really want to highlight the operational and temporal aspect of lineage, because when something happens, something bad, for example, you want to understand more specifically what it was and when it happened, and resolve it that way. So in our lineage graph, the statement type, the time it ran, the user that executed it: these things at the job level are quite important. Identifying: is this SQL statement something that has executed many times before? We have this thing we call query fingerprinting, where you take a statement, remove all the literals, remove all the parameters, and find the underlying structure of the statement. And you look at this over time to surface information like: okay, there was a failed job here; is this something that should have happened based on the previous history? Or is this just some ad hoc lineage, like an analyst creating a table to do some analysis, and it's not a problem?
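A minimal sketch of that fingerprinting idea, again using sqlparse purely for illustration: mask every literal so recurring runs of the same statement shape hash to one fingerprint. Alvin's real implementation isn't public, so treat this as a toy under those assumptions.

```python
import hashlib
import sqlparse
from sqlparse.tokens import Literal

def fingerprint(sql):
    parts = []
    for tok in sqlparse.parse(sql)[0].flatten():
        if tok.ttype in Literal:
            parts.append("?")              # mask literals and parameters
        elif not tok.is_whitespace:
            parts.append(tok.value.upper())
    normalized = " ".join(parts)
    return hashlib.sha1(normalized.encode()).hexdigest()[:12]

# Same structure, different literals -> same fingerprint
assert fingerprint("SELECT * FROM t WHERE id = 1") == \
       fingerprint("select * from t where id = 42")
```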
We're definitely not all the way there yet; we have a lot of really interesting ideas for how to take the lineage forward. We do also layer in information from, let's say, Airflow and dbt. Our integrations with Airflow and dbt go quite deep: we fetch associations between dbt runs, Airflow DAG runs and task instances, and the underlying query IDs from the data warehouse on the source system. This means that on the lineage graph itself, you can look at how the lineage looks purely from the query history perspective: if it's Snowflake, just the Snowflake tables. But then you can also overlay what this lineage looks like from the dbt perspective.
And if you have Airflow, how would this lineage look from the Airflow perspective? It's the idea that multiple people have multiple ideas of lineage and how it should look to them, and that's what we want to cater to. I can't remember exactly the name of it, but there was a really interesting article the other day about the layers of data lineage. Sometimes you read an article and think: oh, I wish I wrote that. It was very aligned with our thinking around lineage: that there's a business layer, there's a technical layer.
Obviously, we are now most focused on the technical layer, and on providing something more than just a visual interface: a more operational aspect as well. When it comes to the other systems, we think the most value is in this cross-system lineage. You connect Tableau and Looker and your data warehouse; then what we do is go into the APIs and the metadata in those systems, look for the connections that are set up there, then look at the connections we already have from the warehouse, and match those up. This enables us to do cross-system lineage quite elegantly, because you connect these systems and, within some minutes or hours depending on the amount of metadata and queries, the lineage is automatically generated.
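To make the mechanics of that matching step concrete, here is a hypothetical sketch: take the table references a BI tool exposes through its API and match them against warehouse tables already in the lineage graph. All field names are assumptions for illustration, not Alvin's actual model.

```python
def stitch_bi_to_warehouse(bi_assets, warehouse_tables):
    # Index warehouse tables by lowercased fully qualified name
    known = {t.lower() for t in warehouse_tables}
    edges = []
    for asset in bi_assets:
        for ref in asset["table_refs"]:        # e.g. "SHOP.PUBLIC.ORDERS"
            if ref.lower() in known:
                edges.append((ref.lower(), asset["id"], "reads"))
    return edges

warehouse = ["shop.public.orders", "shop.public.customers"]
looker = [{"id": "looker:rev_dash", "table_refs": ["SHOP.PUBLIC.ORDERS"]}]
print(stitch_bi_to_warehouse(looker, warehouse))
# [('shop.public.orders', 'looker:rev_dash', 'reads')]
```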
[00:25:35] Unknown:
Given that you're trying to track lineage across different systems, you have to pull, generate, and process that lineage differently depending on what type of technology you're working with, and they might all have different syntaxes, particularly when you get into the BI layers. And then, even if you start at the data warehouse, there are potentially other upstream sources of lineage where processing is being done, or there might be manual steps that people are doing that don't actually take place in the context of an Airflow or a Dagster, or don't necessarily get driven by dbt in the warehouse. So I'm curious how you think about ensuring the completeness of the lineage graph, being able to account for any out-of-band operations, or even just communicating the importance of making sure that operations are executed in that context so that you do have the complete lineage graph?
[00:26:34] Unknown:
On the first one: there are different syntaxes and different dialects, even within SQL. I think our advantage there is that it's not like we built a data catalog and then said, okay, we need to whack data lineage in there. It's been a year-and-a-half, two-year project of building a very strong data lineage foundation, and then we built everything around that. There's a dedicated team working on just the parser, and a dedicated team working on integrations. That's kind of our DNA. That's what we really, really focus on.
When it comes to completeness: yes, there are definitely cases where there might be processes or things putting data into the warehouse that just can't be tracked, for different reasons. In those cases, we do of course have the option to do manual lineage, so you can extend the lineage graph with those things. That works quite well. What we have seen is that companies seem to be quite willing to invest there, because at the end of the day, nothing's perfect. But if you can get 90%, it's a lot better than having to do everything yourself.
[00:27:45] Unknown:
In terms of the implementation of the Alvin platform, I'm wondering if you can talk through some of the design and technical considerations, and the engineering questions that you had to iterate on, as you went from "this is a project that I'm interested in exploring" to where you are now, where you actually have a user-facing product that people are starting to onboard onto?
[00:28:09] Unknown:
That's a long, long story. Any company that goes through these phases of fundraising and growth has to deal with that. But like I said, coming back to the parser: it has been, and is, a continuous work in progress. We're always adding to it, restructuring, working on the data model, making sure that we stay on top of things: reading the Google Cloud blogs, the Snowflake blogs, all of these things. Are we staying on top of the latest features? So that hasn't changed so much; it's just continuous improvement.
On the platform side, I wholeheartedly believe that boring technology is great technology, because you understand it. So while a lot of people might think that we use some graph database or whatnot, we actually rely mainly on just Postgres and OpenSearch. You know, Elasticsearch, OpenSearch: there was this schism with AWS, and people probably know that story, where Elastic got a bit upset with Amazon taking advantage of their technology, and the project was forked. But basically, having a search index on top of a primary database, plus Redis for some of the caching, has been the core of the stack since the get-go. As we move along and see that we're getting to the limits of table sizes in Postgres, you start to add partitioning and these things that you'd only heard about, and now we have to do them. Doing that as a live migration has certainly been interesting too.
We've also really seen the value of instrumentation and observability. When you're doing this syncing, you're grabbing potentially huge amounts of data from the client's data infrastructure. In those cases, it's quite important to be mindful that you are calling their APIs, and you might cause those systems to be slower if you don't do things correctly. So that's also something we've had a very big focus on: being really mindful of minimizing the amount of time spent on client connections, getting the data, and then processing it on our side.
So the story is a little bit boring in that sense: these are things you have to scale as you grow an engineering organization. The stack itself runs on Kubernetes, and we use Airflow to handle all the syncing needs. I think we have built a pretty clever integration with Airflow and dynamic DAGs, so we're able to integrate Airflow quite nicely into our infrastructure, using the Kubernetes operator as well. Everything just runs in our cluster, and it also means that, based on different needs, even different client needs, we can assign the syncing jobs to run on specific nodes that might have more memory or CPU, depending on the specific needs or SLAs.
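As a rough, hypothetical sketch of that pattern (not Alvin's code): generate one sync DAG per client and pin its pods to a node pool via the Kubernetes pod operator. Client names, image, and pool labels are made up, and the exact import path varies with the version of the cncf.kubernetes provider.

```python
import pendulum
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# Made-up clients mapped to made-up node pools with different resources
CLIENTS = {"acme": "high-mem-pool", "globex": "default-pool"}

for client, pool in CLIENTS.items():
    with DAG(
        dag_id=f"sync_{client}",
        schedule_interval="@hourly",
        start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),
        catchup=False,
    ) as dag:
        KubernetesPodOperator(
            task_id="fetch_query_logs",
            name=f"sync-{client}",
            image="example/sync-worker:latest",   # assumed sync image
            arguments=["--client", client],
            node_selector={"pool": pool},         # pin to a node pool
            get_logs=True,
        )
    # Expose each generated DAG at module level so Airflow discovers it
    globals()[f"dag_{client}"] = dag
```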
[00:31:19] Unknown:
For data engineers who are onboarding onto Alvin, I know you mentioned focusing on data warehouses and some of the business intelligence layers, and I'm wondering how you think about the initial product focus, to keep the problem scoped and addressable, and how you're thinking about the potential for expanding into some of the other types of systems and architectures that data engineers are going to be working in. So, in terms of data warehouses versus data lakes, structured versus unstructured, which BI tools you want to integrate with, which orchestration engines: what are some of the ways that your implementation and architecture are designed to allow for this expansion of scope and capabilities?
[00:32:08] Unknown:
One of the bigger projects that we have on the roadmap for Q4 is a bit of a refactor of the data layer, to support more of these layers of data lineage, as I said. One of the really big advantages we have had from the early days is that we designed the data model to be super simple, almost stupidly simple: there are data entities and there are connections between entities. Almost a textbook example of a graph: nodes and edges. That has been a great advantage, because when we approach an integration, we almost have a methodology: okay, we look at the system, we try to understand the semantics of the system. What's the data model here? How can we map that into our system? So far we have integrated with Airflow, Databricks, BigQuery, Spark, Mode, Looker.
Quite different systems, I would say, with different metadata models and different concepts within them, but so far we've been able to model them all. An important aspect, also: you need to model the data so it's recognizable to the consumer. If it's Looker, there are folders and dashboards, so you should probably display the same folders and dashboards; in Tableau, it's workbooks and sheets. You need to give them that familiar experience, I guess. So far we haven't had huge challenges there. And again, if you just try to stay simple, it's almost like a canvas: constraints breed creativity.
It's almost easier to know: this is our data model, and this is what we need to fit things into. And so far that's been working pretty well.
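For illustration, here is a deliberately minimal sketch of that entities-plus-connections shape; all names and types are our own assumptions, not Alvin's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    id: str    # e.g. "bigquery:shop.public.orders"
    type: str  # "table", "column", "dashboard", "dag", "folder", ...

@dataclass(frozen=True)
class Edge:
    src: str   # source entity id
    dst: str   # destination entity id
    kind: str  # "lineage", "contains", "reads", ...

orders = Entity("bigquery:shop.public.orders", "table")
folder = Entity("looker:folder_a", "folder")
dash = Entity("looker:folder_a.rev_dash", "dashboard")

edges = [
    Edge(folder.id, dash.id, "contains"),  # Looker folders contain dashboards
    Edge(orders.id, dash.id, "reads"),     # the dashboard reads the table
]
```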
[00:34:05] Unknown:
As far as the way that Alvin integrates into the workflow of a data engineer, I'm wondering if it's something where the data engineer is going to the Alvin UI, exploring the data lineage, and then jumping back into other tools, or how you're thinking about being able to integrate more closely into the places where the work is being done, and surfacing that information in a more seamless and frictionless way.
[00:34:31] Unknown:
That's also a topic of great interest to me, and maybe it's a little tangent to go on, but I think there's too much inward focus in a lot of the companies that are trying to enter the space and trying to be category-defining in some way. A lot of that is probably fueled by investors pushing to be category-defining, and then everyone's fighting a little bit to do that. We're probably guilty of that as well; I'm probably throwing stones in a glass house. But still, I think it's important, like I said, to stay humble and focused on the use cases and the processes of the data engineers. Our approach is really to latch on to whatever they're using, and dbt is an obvious one there when it comes to market reach and the ability to provide value to many data engineers. So what we have there is this impact analysis, or regression testing as we also like to call it. The thinking is that a lot of companies already use GitHub or GitLab actions to run data quality tests on the changes that they're making.
What we see is that this is a perfect area to also integrate our tooling. You have data quality tests, which everyone would argue are super useful: you can look at column distribution changes and understand when you suddenly get zero values where you should expect an even distribution. But our real thinking is that the biggest problem we see for data engineers is not what happens within the data warehouse; it's more about who is going to ping them on Slack, or how to communicate these changes. And if you look at what we do, we do cross-system lineage and usage.
That means you understand who is looking at this dashboard, who is consuming or reading from this table. When you have a lineage graph that is accurate enough to say what's write usage and what's read usage, you can do pretty interesting stuff. So what we have on the dbt side is something that can comment back on the pull request and give you a report: these are all the downstream users of the tables that you're now changing, and whether it's a BI user or just an analyst. And then you can exclude all the pipeline queries, all the queries that are just creating tables based on this asset, and keep just the actual people in the company that are consuming it.
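A hypothetical sketch of that CI step: given the downstream readers of the models changed in a pull request (here a stubbed dict standing in for the lineage lookup), post a usage report back through GitHub's issue-comment API. The repo, token, and report format are assumptions for illustration only.

```python
import os
import requests

def comment_downstream_usage(repo, pr_number, changed_models):
    lines = ["### Downstream impact of this change"]
    for model, readers in changed_models.items():
        lines.append(f"- `{model}`:")
        for reader in readers:
            # e.g. "jane@acme (ad hoc, 12 queries/day)" or "Looker: rev_dash"
            lines.append(f"  - {reader}")
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        json={"body": "\n".join(lines)},
        timeout=10,
    )
    resp.raise_for_status()

comment_downstream_usage(
    "acme/analytics", 123,
    {"mart.daily_rev": ["Looker: revenue_dashboard",
                        "jane@acme (ad hoc, 12 queries/day)"]},
)
```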
This can of course also be viewed in Alvin's UI and all of that, but our real product vision is to be useful in that context. We also see that running this test is something you have to do as part of your engineering workflows. Then you can say: well, we're going to make these changes and we're going to break these things; but you also have a list of people that you can warn. And I think that's almost equally important, because when you talk to heads of data and data team leads especially, a lot of them are really passionate and motivated, and take a lot of pride in driving change and making sure that the company can be data-driven.
It's not great for them if someone asks: hey, what's wrong here? If something is wrong and they can communicate it first, that's a big win for them. And from a company perspective, it matters a lot to them.
[00:38:10] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP.
Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. The other interesting aspect of data lineage is that it is a collaborative element where it does allow data engineers to understand what is the impact I'm going to have if I change this table or, you know, why is this report not rendering properly? I can look upstream to see what happened or what are the nodes that feed into this and maybe see what are the changes that occurred in that. And I'm wondering from the data consumer perspective, how you also think about that kind of collaboration aspect and how you're thinking about building Alvin to be this cross cutting view of what is happening in the data system and being able to surface the information that each of those different stakeholders cares about at their point of access.
[00:39:58] Unknown:
We do have features for that, obviously; we have a data catalog as well. We have features there that allow you to apply basic governance: you can tag things with typed tags, schematic tagging, so you can say this tag should be a person, or this tag should be a number. You can set SLAs, set ownership, these kinds of things. And when you combine that with lineage, there are some really interesting things you can do, like saying that maybe someone should be warned if something happens, those types of things. That's a bit more bread and butter, I guess.
One thing that we recently put out in the product is what we call views. That's really where it gets interesting, because we have all this usage data that we basically mold in a way that lets you slice and dice it. As a business user, you can set up a search, basically, or a view into all the assets: all the dashboards that match a certain name, or all the dashboards in a certain Looker folder. As a data engineer, you can look at all the Airflow pipelines. Essentially, the thinking is that a tool that aims to be used widely, at least over time, really has to thread the line between being opinionated about how the tool should be used and allowing flexibility to the end user. I'm not saying we have nailed that at all, I don't think we have, but we've seen some really good feedback on this idea that you can create your own curated views into the metadata.
If you look at collaboration and onboarding, you can do basic things like: this is a view of all the tables that are tagged with the gold or silver standard; these are all the tables that belong to a certain team. A lot of companies don't even have these things, and onboarding is a little bit like: hey, look at the dbt graph. And there are 800 models there, and I have no idea what's used or not, or where I should start.
[00:42:04] Unknown:
And to that point of usage, and understanding what is active versus what is stale or what was created as a one-off: how do you capture some of that information, and how do you think about surfacing it at a glance when somebody's looking at the graph? So they can say: okay, here are the 800 dbt models; I can see that this model at the end of the graph is actually used by this BI dashboard that's accessed 5 times a day, versus this other dbt model that was generated and doesn't have any other connections, or maybe it does have a connection to a dashboard, but that dashboard was looked at one time 8 months ago. And being able to use that to understand: okay, I can safely prune these models because they're not being used or not generating value, versus if I prune this model, then everything is going to break.
[00:42:53] Unknown:
I think those exact questions are the core of why we have built and are building Alvin. We sat there and discussed: these are the questions that we want to be able to answer. On the UI side, it's probably a little more elbow grease to get there, but we basically have all this data. From a technical perspective, the UI part is just the final mile of the marathon, where you put the nice cherries on top; the really hard engineering work has all been done. And pruning, I think, is a really interesting concept, because if you look at a huge pipeline or a huge dbt project and only analyze it at a rudimentary level, you could say that all of these tables are being used every day, because the entire DAG is materialized every day. But if you look even closer, you can see that all the queries are just CREATE TABLE AS SELECT, or some derivative of that; they are actually all write queries, and there's nothing else happening. That can reveal huge opportunities for pruning data, and thus saving costs. For something like BigQuery, where the cost is $5 per terabyte scanned, you actually have a price on it. In those cases, you can say: this is the hard cash you would save by pruning these, and that's perfectly fine, because there's one dashboard connected to this model, but it was last viewed a month ago, so it's probably okay.
Equally, we can do these things with Airflow. By associating the task instances of a DAG run with the queries they're running, you can do something similar. You can say: hey, this Airflow DAG is costing you a lot of money, but these tasks don't really have an impact; you should just drop them.
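A small, hypothetical sketch of that pruning heuristic, with made-up data structures (Alvin's real analysis is far richer): a table is a prune candidate when it has no human reads and no recently viewed dashboards, and the saving is estimated from BigQuery's roughly $5 per TB scanned on-demand pricing mentioned above.

```python
from datetime import datetime, timedelta

ON_DEMAND_USD_PER_TB = 5          # approximate BigQuery on-demand scan price

def prune_candidates(tables, now=None):
    now = now or datetime.utcnow()
    stale_cutoff = now - timedelta(days=30)
    out = []
    for t in tables:
        # Reads that are not just pipeline writes (CTAS feeding other tables)
        human_reads = [q for q in t["reads"] if q["kind"] != "pipeline"]
        live_dashboards = [d for d in t["dashboards"]
                           if d["last_viewed"] > stale_cutoff]
        if not human_reads and not live_dashboards:
            monthly_saving = t["daily_tb_scanned"] * ON_DEMAND_USD_PER_TB * 30
            out.append((t["name"], round(monthly_saving, 2)))
    return out

tables = [{
    "name": "mart.abandoned_model",
    "reads": [{"kind": "pipeline"}],
    "dashboards": [{"last_viewed": datetime(2022, 1, 1)}],
    "daily_tb_scanned": 0.4,
}]
print(prune_candidates(tables))   # [('mart.abandoned_model', 60.0)]
```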
[00:44:49] Unknown:
As far as the overall space of data lineage, and the areas that you've been able to dig into and problems you've been able to surface in the process of building Alvin and working with some of your early customers: what do you see as some of the open questions and areas for investigation and further improvement in the space of data lineage? Whether that's the technical implementation of how you construct and represent the lineage graph, or the applications and use cases for lineage, or the opportunities for building on top of lineage to add extra automation capabilities. From this view of data lineage as a problem space, what are the things that are not being addressed or leveraged as fully as they could be?
[00:45:37] Unknown:
I think the general problem is a little bit, as you also wrote in one of the questions, that the term is overused; it's almost losing its meaning a little. As an industry, we need to give value back to it, and how we do that is no different from what I've been trying to preach, and we're still trying to preach: you need to show how it can be used and why it's useful. And that's where we as an industry have a long way to go. Lineage is unfortunately a little bit put in a corner, as a feature where someone ticks a checkbox: okay, there's a diagram there, great, we have lineage. So really focusing on how it can be used, how it is useful, and how it can drive value, save time, and increase quality: it's a pillar of data observability, after all.
I could talk so much about this topic, but I also think, at least from what we see, it's on the buyer side too. Because there has been so much funding into the space, so much investment, everyone is saying: yeah, we have column-level lineage, we have everything. It has set expectations so high that you read a proof of concept or an RFP from a company and they expect to have absolutely everything, which is probably not feasible from one single product. So in the state of the industry at the moment, you can definitely see a bit of an antithesis forming, with people talking about cost and the problematic sides of dbt. At the end of the day, the conclusion of this long ramble is: it's about showing value and pointing to the tangible benefits of having this, not just talking about it as a technology.
[00:47:43] Unknown:
In your work of building Alvin and working with some of your early customers and design partners, what are some of the most interesting or innovative or unexpected ways that you've seen Alvin used?
[00:47:54] Unknown:
It's really more, I would say, not to pat our own backs, but it's a little bit more about how well we have been able to integrate it: how much effort you put into making something that just runs on whatever Airflow version, something that just works. I think that's the thing. The use cases are usually pretty well scoped out: you want to achieve certain outcomes, and the features are scoped towards that. So maybe it's a boring answer, but to us, a day where a client doesn't have to do anything and achieves the results they want, that's the holy grail: that there is this plug-and-play magic to the lineage. So I'm sorry, it's hard to answer that in a very exciting way, I think.
[00:48:48] Unknown:
No. I mean, in a lot of ways, particularly when you're working at the foundational layers of a stack that is intended to drive potentially massive amounts of value, boring is what you're aiming for.
[00:49:00] Unknown:
Yeah. Like a boring day is a good day. Right? Exactly.
[00:49:03] Unknown:
Excitement usually means that something went wrong.
[00:49:06] Unknown:
Yeah. Yeah. Yeah. Exactly.
[00:49:08] Unknown:
And in your work of building the technology and building the business around Alva, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:49:17] Unknown:
I think, again, when you work on a really foundational layer, you are a lot more vulnerable to the basics of computer science, you know, big-O notation and those things. These are really the things you have to think about: memory usage, designing things that scale well. And this has been very challenging for us, because you do something that works in your own environment, you're running it and it works smoothly, and then you're onboarding some clients and everything just blows up, and you have to do some real firefighting. So when you work in an environment where what you get in is so unpredictable, you almost have to be ready for everything in terms of how companies are using it and what they're doing. At one point, we had one company that was, like, 90% of all the data that we had, because of the way they did things. So this unpredictability has been very challenging to solve in a good way when you want to provide a product that meets very strict requirements on timeliness and correctness of the data. It's not like you're just building a web app; there are quite a lot of components.
So again, a bit of a boring answer. But like you said, when you're working at this low level, it's usually these more low-level kinds of problems that you have to deal with.
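Martin's point about big-O behavior and memory usage is concrete enough to sketch. The snippet below is a minimal illustration, not Alvin's actual code: it contrasts loading an entire (hypothetical) query-log export into memory against streaming it line by line, so that one customer producing 90% of the volume grows a small counter rather than the resident set. The file format and the "referenced_tables" field are assumptions made up for the example.

```python
import json
from collections import Counter
from typing import Iterator

def iter_query_log(path: str) -> Iterator[dict]:
    """Stream one JSON record per line instead of loading the whole
    export with json.load(); memory stays bounded regardless of file size."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def tables_by_query_volume(path: str) -> Counter:
    """Count how often each table appears in the (hypothetical)
    'referenced_tables' field of a query-log record."""
    counts: Counter = Counter()
    for record in iter_query_log(path):
        for table in record.get("referenced_tables", []):
            counts[table] += 1
    return counts

if __name__ == "__main__":
    # Write a tiny hypothetical export so the example is self-contained.
    with open("query_log.ndjson", "w") as f:
        f.write('{"referenced_tables": ["raw.orders", "mart.daily_revenue"]}\n')
        f.write('{"referenced_tables": ["raw.orders"]}\n')
    for table, n in tables_by_query_volume("query_log.ndjson").most_common():
        print(f"{table}: {n} queries")
```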
[00:50:53] Unknown:
For people who are interested in being able to build and take advantage of the lineage graph for their own data platform and data assets, what are the cases where Alvin is the wrong choice?
[00:51:01] Unknown:
I think it depends a little bit on maturity, where you're at in your journey. We talk to a lot of companies where you realize that a lot of this is more about culture and people, and seeing that you can actually drive the change. So you have to be ready to embrace this from a company perspective. That being said, on the technical side, we also talk to companies that are investing heavily in their own tooling and have really talented people doing that, if that's a priority and an investment that the company is willing to make.
Some make their own internal dbt-like tooling, where they use annotation- and metadata-driven pipelines so that they already have all of this information in terms of lineage and whatnot. And that's a hard sell with what we do, so that's a specific case. But all in all, whether you're using dbt or not, if you still want to get the lineage, I think Alvin is the right choice.
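To make the "annotation- and metadata-driven pipelines" pattern concrete, here is a minimal sketch of how a team might declare lineage directly on their transformation code; every name and structure here is invented for illustration, not taken from any particular internal tool. Because the inputs and outputs are declared as metadata, the lineage graph is known before anything runs, which is exactly what makes an external lineage product a harder sell for such teams.

```python
from dataclasses import dataclass
from typing import Callable, List

# Global registry of declared edges; a real system would persist this.
LINEAGE_REGISTRY: List["Edge"] = []

@dataclass
class Edge:
    inputs: List[str]
    output: str
    task: str

def lineage(inputs: List[str], output: str) -> Callable:
    """Decorator that records declared lineage for a pipeline step,
    so the graph can be assembled statically from the annotations."""
    def wrap(fn: Callable) -> Callable:
        LINEAGE_REGISTRY.append(Edge(inputs, output, fn.__name__))
        return fn
    return wrap

@lineage(inputs=["raw.orders", "raw.customers"], output="mart.daily_revenue")
def build_daily_revenue():
    ...  # the actual SQL or transformation logic would live here

for edge in LINEAGE_REGISTRY:
    print(f"{edge.task}: {edge.inputs} -> {edge.output}")
```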
[00:52:08] Unknown:
As you continue to build and scale the Alvin product and start to onboard more customers and open it up for general availability, what are some of the things you have planned for the near to medium term, or projects or new capabilities you're excited to dig into?
[00:52:23] Unknown:
Yeah, like I alluded to a little bit, Q4 for us will be exciting. There are a couple of bigger roadmap items. One is layering the lineage more, in the sense of being able to show, based on the lineage data, what are the pipelines, what's the technical lineage, and what's the more business-level lineage, and really being able to present that in a good way. So that's an important one. Then another thing that we're just dipping our toes into is what I would call more semantic logging and alerting. Based on the query logs, you have a lot of information that's interesting. You can use it to, let's say, version metadata. You have these create-or-replace statements, so you can capture the temporal aspect of, let's say, a table. It's almost like you can create a git commit log for the data warehouse based on the statements. Again, these are foundational things that can be used to do really interesting stuff such as logging and alerting. And once you're talking about logging and alerting, you're also talking a bit about observability.
But I think it's a little bit in our DNA that maybe we don't always move the fastest, but we really believe in thinking about the foundations of what you're trying to achieve, and then that allows you to move really fast once you have them. That's something we have seen with the lineage: because we have a really good foundation for lineage, impact analysis, usage, all of those things, we can build and ship very quickly, because we don't have to hack or retrofit on top of it.
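The "git commit log for the data warehouse" idea Martin describes can be illustrated with a small sketch. This assumes access to a warehouse query log; the regex, the record fields, and the sample entries are simplifications invented for the example, not Alvin's implementation (real DDL handling would need a proper SQL parser). Each CREATE OR REPLACE statement becomes a "commit" in a per-table history:

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Simplified pattern; real DDL parsing needs a proper SQL parser.
CREATE_OR_REPLACE = re.compile(
    r"CREATE\s+OR\s+REPLACE\s+TABLE\s+([\w.]+)", re.IGNORECASE
)

@dataclass
class TableVersion:
    timestamp: str
    user: str
    statement: str

def build_commit_log(query_log: List[dict]) -> Dict[str, List[TableVersion]]:
    """Fold CREATE OR REPLACE statements from the query log into a
    per-table version history, ordered like a git log."""
    history: Dict[str, List[TableVersion]] = defaultdict(list)
    for entry in query_log:
        match = CREATE_OR_REPLACE.search(entry["sql"])
        if match:
            table = match.group(1).lower()
            history[table].append(
                TableVersion(entry["timestamp"], entry["user"], entry["sql"])
            )
    return history

# Hypothetical log entries as they might come from a warehouse API.
log = [
    {"timestamp": "2022-10-01T08:00:00Z", "user": "etl",
     "sql": "CREATE OR REPLACE TABLE mart.daily_revenue AS SELECT ..."},
    {"timestamp": "2022-10-02T08:00:00Z", "user": "etl",
     "sql": "CREATE OR REPLACE TABLE mart.daily_revenue AS SELECT ..."},
]
for table, versions in build_commit_log(log).items():
    for v in versions:
        print(f"{v.timestamp} {v.user} {table}")
```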
[00:54:03] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:54:18] Unknown:
The biggest gap, I think, is that it's still so nascent. There's a huge focus on data observability, for instance, but why aren't we talking about data actionability? There's this obsession with specific concepts or ideas, but there's a big gap in actually making this valuable and useful. To me, that's a huge chasm. There's a lot of talk and conversation, but a big gap in how we take this and actually make it valuable and useful for data engineers. That's what we at Alvin really hope to fix.
[00:55:01] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Alvin and your overall explorations and thoughts on the data lineage space. It's definitely a very interesting and important problem area, so I appreciate all of the time and energy that you and your team are putting into making it a more tractable and accessible problem for people who want to take advantage of its capabilities. I hope you enjoy the rest of your day. Thanks a lot. Have a great day too.
[00:55:35] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Guest Introduction: Martin Sahlen
Martin's Journey into Data Engineering
Overview of Alvin and Data Lineage
Technical Aspects of Data Lineage
Latency and Real-Time Data Lineage
Use Cases and Applications of Data Lineage
Challenges and Solutions in Data Lineage
Modeling and Representing Data Lineage
Technical Implementation of Alvin
Onboarding and Integration with Alvin
Workflow Integration and Collaboration
Capturing and Surfacing Usage Data
Future of Data Lineage and Open Questions
Lessons Learned and Challenges
When Alvin Might Not Be the Right Choice
Future Plans for Alvin
Biggest Gaps in Data Management Tools
Closing Remarks