Summary
There are very few tools that are equally useful for data engineers, data scientists, and machine learning engineers. whylogs is a powerful library for flexibly instrumenting all of your data systems to understand the entire lifecycle of your data, from source to productionized model. In this episode Andy Dang explains why the project was created, how you can apply it to your existing data systems, and how it provides the detailed context you need to gain insight into all of your data processes.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
- Your host is Tobias Macey and today I’m interviewing Andy Dang about powering observability of AI systems with the whylogs data logging library
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Whylabs is and the story behind it?
- How is "data logging" differentiated from logging for the purpose of debugging and observability of software logic?
- What are the use cases that you are aiming to support with Whylogs?
- How does it compare to libraries and services like Great Expectations/Monte Carlo/Soda Data/Datafold etc.
- Can you describe how Whylogs is implemented?
- How have the design and goals of the project changed or evolved since you started working on it?
- How do you maintain feature parity between the Python and Java integrations?
- How do you structure the log events and metadata to provide detail and context for data applications?
- How does that structure support aggregation and interpretation/analysis of the log information?
- What is the process for integrating Whylogs into an existing project?
- Once you have the code instrumented with log events, what is the workflow for using Whylogs to debug and maintain a data application?
- What have you found to be useful heuristics for identifying what to log?
- What are some of the strategies that teams can use to maintain a balance of signal vs. noise in the events that they are logging?
- How is the Whylogs governance set up and how are you approaching sustainability of the open source project?
- What are the additional utilities and services that you anticipate layering on top of/integrating with Whylogs?
- What are the most interesting, innovative, or unexpected ways that you have seen Whylogs used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Whylabs?
- When is Whylogs/Whylabs the wrong choice?
- What do you have planned for the future of Whylabs?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Whylogs
- Whylabs
- Spark
- Airflow
- Pandas
- Data Sketches
- Grafana
- Great Expectations
- Monte Carlo
- Soda Data
- Datafold
- Delta Lake
- HyperLogLog
- MLFlow
- Flyte
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies. Sign up for the SaaS today at dataengineeringpodcast.com/acryl. That's A-C-R-Y-L. Your host is Tobias Macey. And today, I'm interviewing Andy Dang about powering observability of AI systems with the whylogs data logging library and his work at WhyLabs to add some additional context on top of that. So, Andy, can you start by introducing yourself?
[00:01:40] Unknown:
Hi. My name is Andy Dang. I'm the head of engineering and a co-founder at WhyLabs. Before WhyLabs, I spent six years at Amazon working on big data, data engineering, real-time systems, and machine learning platforms, including AWS SageMaker. Currently, I'm in charge of a lot of the engineering decisions and architecture, and I think about how to make data observability as accessible as possible, not just to WhyLabs customers, but to general data workflows across both data engineering and machine learning. And do you remember how you first got involved working in the area of data? Yes. I actually started at Amazon building services first, so it was a lot of REST service building.
As part of that, I was working on advertising, and that's when I got introduced to, first, machine learning and then data science, because we had to deal with massive amounts of data. This was just at the cusp of Apache Spark becoming a popular tool, so I got to see the transition from a very legacy system, one that wasn't even on Hadoop and HDFS, which was the fancy thing back then, to a very modern system using Apache Spark. I really enjoyed that journey and ended up building a lot of pipelines and figuring out how to manage them. This was also before Airflow, so you had to figure out how to stitch them together, monitor them, and deal with failures. There were a lot of lessons there, getting to work with really smart people hands on.
[00:03:23] Unknown:
Now that has led you on this path that has brought you to where you're currently working at WhyLabs, building the whylogs library and some of the additional capabilities on top of that. And I'm wondering if you can talk through what it is that you're building there and some of the story behind how you ended up focusing on this particular problem domain and why that's where you want to spend your time and energy?
[00:03:45] Unknown:
Yeah. So as part of working with Spark and big data pipelines at Amazon, one of the things I learned was that there's a massive gap in how data scientists can access these big data systems. That's why I transitioned a bit toward machine learning. I wasn't working in machine learning per se, but trying to bridge the gap between machine learning and big data and building tools around that, like connectors to data sources, making it easier and lower friction for scientists to access Spark environments. On that team I met Alessya, our CEO, and Sam, our CPO. We had a lot of fun building that kind of end-to-end product, which involved front end, middle tier, user experience, and a back end at massive scale, dealing with the data problems where you have scaling issues and all the hardware and infrastructure.
And then machine learning, which has a different set of challenges: libraries, Python dependencies, workflow and dependency management. That combination of complex challenges really got us excited. When we left Amazon, we all thought about the kinds of problems we wanted to solve, and we wanted to be in the middle of the road between data and machine learning, figuring out how to solve problems not just from the technical angle but really driving it from the user experience of how humans interact with the system. We had the aspiration to build the interface between machine learning technologies, data technologies, and the humans using them, and to make it accessible not just to the technical person executing it but across different stakeholders, working it out from both the technology and the user experience sides.
That's how we got started, and then we narrowed down into the area of observability and monitoring for machine learning, which is where we ended up today. But we didn't land there right away; we explored a few other ideas initially as well.
[00:06:02] Unknown:
And there are a few different avenues that I'd like to explore here. One is the context of data observability, which is a very active space right now with a lot of different approaches people are taking. Another is the juxtaposition of data logging as a particular activity compared to logging for diagnostic and debugging purposes in software systems. And then there's the overall space of observability in the context of AI systems and how that differs from the analytical data observability that a lot of people are focused on right now, with questions like: what is the lineage, and how is that going to impact the dashboard that's powering decision making at the C-suite, that kind of thing. So, taking the shortest path first, maybe we can dig into the juxtaposition of data logging as a differentiated capability from regular software application-level logging and some of the context and semantics that brings with it.
[00:07:06] Unknown:
Yes. I think data logging is definitely something we looked at across the landscape when we started. We examined it by thinking through our own experience and then talking to hundreds of practitioners here in Seattle. One of the things that came out of it is that data logging is a second-class afterthought for a lot of data engineers and data scientists, and the reason is that it's extremely hard to log data. To be fair, it's not really a well-defined term at the moment because very few people are doing it. Typically, when you talk about logging in the context of data, people think about storing a subset or a replica of the data in some other system to do something like monitoring and observing it. That's the traditional thinking.
But one of the things I learned at Amazon is that you can't do that, first of all, when you deal with massive volumes of data. And the second thing is that we were not allowed to store individual data points because of contractual agreements. With Facebook ads data, for example, you cannot store it beyond a certain amount of time; after that you have to delete the data. And of course now you have GDPR and various other legal compliance requirements. So duplicating data just to log it for operational purposes is wrong on so many levels, even when you do sampling.
That's something we concluded from first principles. Then we thought about software logging and telemetry: software logging provides a high-level overview of the insides of the software. It depends on the user to log things in a useful way, and there's an art to how much you log and at what granularity. That's a DevOps workflow, and that's what we want to bring to the data world: the ability, as you stream through these massive datasets and real-time data streams, to collect certain information that gives you not necessarily the full picture, but some insight into the data.
An example is just the count, which is very common in many monitoring systems. Monitoring the count of the data is a great start for data logging because it gives you some sense of the stability of your data stream or data sources. Obviously there are a lot more techniques, and we expanded from there, building on that idea using some very old, well-founded concepts to perform these tasks. We're not reinventing the wheel here; we're applying existing concepts, from the information retrieval world actually, to make it lightweight and easy to log data without breaking the bank or causing performance issues. From my technical point of view, that's the path I headed down when trying to solve this data logging challenge.
[00:10:24] Unknown:
Another aspect of this is the difference between what you're doing with whylogs and this concept of data logging, and the overall approach of capturing summary statistics for being able to understand, you know, what is the count of columns, what are the distributions of values within this particular dataset, and some of the ways that you are maybe either augmenting or going in a different direction than this approach of summary statistics, for the purpose of understanding the context of the data and using that either for data quality checks, for just making sure that you don't all of a sudden have an order of magnitude more rows or fewer rows, or for the data exploration aspect of understanding the context of the data there?
[00:11:12] Unknown:
The outputs of whylogs are similar: it does produce summary statistics, and it has a lot of overlap with the output of other tools. An example I can think of is pandas profiling; it spits out very similar statistics about unique values, categorical features, et cetera. So the output is similar; it's the way whylogs executes that is uniquely different. We don't make any assumptions about the underlying technology. What that means is that we have specific technologies in the implementation, but whylogs can work with a wide range of technologies.
An example: starting with pandas, that's the small, local workflow a lot of people start with. Then you scale up a bit and you get Apache Spark. That's massive scale, but you still have a nice interface. Then you keep going and you end up with things like Apache Kafka, where data arrives in real time and you're dealing with partitions, consumers, and so on. whylogs has a Kafka integration, for example, because it abstracts away the underlying data storage; it treats the data as a stream as well as a batch, it has both modes. The trick, to be honest, is just modeling the streaming world in micro-batches, and that allows whylogs to still operate when dealing with streaming data sources. And then you go all the way back to the original source, where you run your services in an API world, and whylogs can also handle individual API calls by logging those. We're really trying to be the most common building block, the interface, for these operations.
Of course, there's a lot of optimization underneath the hood to make it work with different data sources, like if it's columnar. Spark is a common example: we have really specific optimizations for Spark that take advantage of Spark's power and things like UDFs. But if users don't have that, they're not blocked from using us. That's a unique differentiator of whylogs compared to other systems. Another thing is that, conceptually, whylogs separates out the problem of statistics collection. That's where you do this kind of fingerprinting.
We collect what we call data sketches, which you can imagine as a bunch of bitmaps put together so we can estimate things like the uniqueness of the data stream, frequent items, et cetera. whylogs collects these sketches, and they are all mergeable, by the way, so they're very map-reduce friendly, very Spark friendly, very distributed-system friendly. whylogs lets you collect these lightweight, mergeable objects and store them in your own storage, and then you can run the analysis after the fact. We really decoupled the problem, and this is similar to DevOps: you collect metrics, telemetry, and logs in one place, and then you have another system that monitors those changes, like Grafana or Datadog. That's the exact model whylogs took inspiration from. The data might have disappeared in a way; you might not have that data anymore, only the profiles left in your system, which you can keep because they don't contain individual data points. And then you can still run the summary statistics analysis you described on top of that. Because they're mergeable, you can slice and dice, by hour, by day, it's up to you. That's the power of whylogs, and it's really powerful because a lot of the time you don't know the exact analysis you want to run until you see a problem. It tends to work really well with interactive analysis workflows, and since it doesn't scan through the raw data, it doesn't require a massive processing back end; it's super fast to run through a whole year of analysis, for example.
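To make the mergeability idea concrete, here is a minimal sketch using the Apache DataSketches Python bindings that whylogs builds on. This is not the actual whylogs internals; the micro-batch framing, parameters, and random data are illustrative assumptions.

```python
# pip install datasketches -- mergeable distribution + cardinality sketches.
# Class/method names follow the datasketches Python API as I understand it;
# treat the exact signatures as assumptions and check the library docs.
import random
from datasketches import kll_floats_sketch, hll_sketch, hll_union

def profile_batch(values, kll_k=200, hll_lg_k=12):
    """Build a tiny 'profile' (distribution + cardinality sketch) for one micro-batch."""
    dist = kll_floats_sketch(kll_k)   # mergeable quantile sketch
    uniq = hll_sketch(hll_lg_k)       # mergeable cardinality sketch
    for v in values:
        dist.update(v)
        uniq.update(str(round(v, 2)))  # update with the string form of the value
    return dist, uniq

# Pretend these are 24 hourly micro-batches from a stream or pipeline.
hourly = [[random.gauss(0, 1) for _ in range(10_000)] for _ in range(24)]
profiles = [profile_batch(batch) for batch in hourly]

# Merge the hourly profiles into one daily profile without touching raw data.
day_dist = kll_floats_sketch(200)
day_uniq = hll_union(12)
for dist, uniq in profiles:
    day_dist.merge(dist)
    day_uniq.update(uniq)

print("daily p50 ~", day_dist.get_quantile(0.5))
print("daily distinct values ~", day_uniq.get_result().get_estimate())
```

The point is that the hourly sketches are tiny, contain no raw rows, and can be rolled up into any coarser window after the fact, which is what makes the "slice and dice later" workflow possible.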
[00:15:21] Unknown:
As far as the core use cases that you're looking to power with whylogs and the WhyLabs product on top of it, I'm wondering if you can dig into that and maybe compare and contrast with some of the other systems folks might be familiar with, such as Great Expectations in the open source space, or Monte Carlo, Soda Data, Datafold, et cetera, and just the overall space of data observability for the data engineering focus and what you're building?
[00:15:48] Unknown:
So with whylogs, our aim is to be the standard for data logging. I believe it's the only open source library at the moment that does this for both batch systems and real-time streaming systems. whylogs' goal is really to provide, first of all, the ability to log data and to collect and manage these log objects; we're building a lot of APIs and utilities on top of this. And then the tools to run analysis: building Jupyter notebooks, building reports, generating alerts, and generating what we call constraints.
That's the first part of whylogs, and it overlaps somewhat with Great Expectations, but only in the part where we handle constraints. whylogs can scan the data and check constraints against individual data points, or you can run constraint validation against the final whylogs output. With individual data points you need to know the kinds of constraints you want in advance, but once you have a profile you can run all sorts of constraints after the fact. And because we have summary statistics, we can also suggest constraints for you. This is the difference between whylogs and Great Expectations: Great Expectations requires you to have all the data available in some system that you can query under the hood, and that means it's limited to tabular data and pandas data.
whylogs is not limited to that, because you can hook in your own transformations as well, so it can work with more than just tabular data; it also works with things like images and video. One example, and we have a blog post about this, is extracting the RGB channels from images and using them to monitor distribution drift, because a very common pattern is that when you change the camera model, the lighting is just a little bit off even though the lighting conditions are the same, the model behaves a little bit differently, and people want to detect that. Or sometimes, if you take a different kind of picture, the distribution changes quite significantly as well. So that's one thing on the open source side: when we look at whylogs and how it's differentiated compared to Great Expectations, we also focus a lot more on real-time systems, which Great Expectations does not support because it requires the data to be fully available.
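As a concrete illustration of that RGB-channel idea (not code from the blog post itself), here is a small sketch that reduces an image to a handful of per-channel statistics which can then be logged and profiled like ordinary numeric columns; the feature names and the percentile choice are made up for the example.

```python
# pip install pillow numpy -- turn an image into a few numeric features per channel.
import numpy as np
from PIL import Image

def rgb_channel_features(path: str) -> dict:
    """Summarize an image as per-channel statistics suitable for data logging."""
    arr = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    features = {}
    for i, channel in enumerate(("red", "green", "blue")):
        values = arr[..., i]
        features[f"{channel}_mean"] = float(values.mean())
        features[f"{channel}_std"] = float(values.std())
        features[f"{channel}_p95"] = float(np.percentile(values, 95))
    return features

# Accumulate these dictionaries into a DataFrame (or log them directly) and
# profile them over time to catch camera or lighting drift.
```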
Our focus is also not just Apache Spark but other distributed systems, and complex data types. When it comes to the platform, that's where the comparison with Monte Carlo and Soda Data is more of a direct one-to-one, because those are platforms with solutions; they're not just a library, and they don't have an open source library. On the platform side, what WhyLabs is optimizing toward right now is building first-class monitoring for machine learning, first and foremost. We focus on machine-learning-specific problems like data drift, model performance challenges, and explainability rather than optimizing for the workflow of SQL data warehouse management.
Of course, we can do both; we have customers who use WhyLabs to monitor both, because the outcome of your data pipeline is your machine learning model, so it's not a decoupled problem. I won't say we compete, but we target a subset of problems that is different from what Monte Carlo or Soda Data target. What they target is really SQL-based systems that you can run SQL queries against to collect these summary statistics. What we target is more like Delta Lake and real-time systems with complex data types, typically the systems that affect machine learning models and business KPIs. So it's a bit of a market segmentation, I guess, but also a technology segmentation, because when people start running things in real time in production, these tools don't really monitor them unless the customer also has a system to send the data back to a SQL database.
And that pipeline tends to take much longer than monitoring it right there in the container, for example.
[00:20:15] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. In terms of the actual implementation of whylogs and how you have approached designing it to be able to work across these various contexts, I'm wondering if you can dig into the design and implementation.
[00:21:04] Unknown:
The first-class metrics that we have in whylogs are all mergeable, so they're meant to run in distributed environments. That's the first thing we aimed for, and therefore we identified various data structures and algorithms that allow you to do this. Specifically, we use HLL, HyperLogLog, which is a bunch of bitmaps, by the way; it's nothing fancy once you look underneath the hood, although the math is fancy, it's not the data structure that is fancy. This is what Google BigQuery, for example, or Postgres use to estimate the uniqueness, the cardinality, of a column. What that means is that we ended up going with an open source library called Apache DataSketches.
That library was built at Yahoo for the purpose of web log analysis, and it's available in both Java and Python, with C++ underneath specifically for Python, so it's very performant and very lightweight. It works in both the data engineering and data science worlds, because Java is really the language of big data pipelines these days, and Python is the interface for data science, so it sits in between very nicely. We add a lot on top of that: the ability to hook into various data sources, automatic schema detection, real-time workflows and APIs for real time versus batch, and thinking about how to extend it, supporting extensibility.
This is one thing to call out: we have learned a ton through the last two years of building whylogs and interacting with the open source community. We are rebuilding it now to make it even more extensible, because we have people who realize the value of whylogs and want to use it not only for machine learning monitoring; they also want to add things I never thought they would add, like application performance, into these sketches, because it can handle anything numerical just fine. So while they use things like Datadog to monitor their models or their systems, they also want some decoupling, keeping the data fingerprints alongside the application performance fingerprints. It's definitely an interesting thought; I didn't know about it before talking to various folks.
But, yeah, we really optimize by thinking about where the friction is in the data science workflow and reducing configuration. One thing I didn't mention is that in data monitoring you have a massive number of features and columns, and therefore you don't want to configure each of them, from both the data logging perspective and the data monitoring perspective. And once you collect these, you want to quickly run analysis, and we want to make sure you can run analysis across thousands of features without having to set 2,000 parameters in the monitor. So we build in smart defaults and really optimize for different workflows. And again, these are workflow specific.
If you have a data warehouse you want to monitor against a Kafka stream, that's a bit different from monitoring a machine learning model with NLP in it, and we really think about how to optimize different recipes for different use cases here. But the key thing is that you start off with a very fundamental, composable core.
[00:24:35] Unknown:
And given the fact that you do have implementations and bindings for both Python and Java runtimes, I'm wondering how you approach maintenance of feature parity across those environments as well as being idiomatic and keeping them in step with each other so that you don't end up tilting too far in favor of 1 or the other language ecosystem?
[00:24:56] Unknown:
The reality is that Python just has way more support for data types like images; you can do the same thing in Java, but on the other hand, Java has things like the Apache Spark integration. Even now, the Spark integration in Python is there, but once you look into the API it's actually backed by the Java implementation; we just call the Java UDF underneath. It takes a bit of a trick to make it feel natural like that. So the two ecosystems complement each other rather than compete. It's definitely a challenge for us to maintain feature parity, and full feature parity is one of the things we haven't really invested in. I guess the reality is also that we have very different personas in the two worlds when it comes to users.
The ML engineers and data science users in the Python world, where Python is very flexible and untyped, have this really maximizing-flexibility mindset. Whereas in the Java world, people want knobs and hooks to control performance; they want performance optimizations they can really fine-tune. Unfortunately, it's very hard to find a common API that supports both of these, and we want to minimize configuration as much as possible. So this is a journey we're still exploring and learning through various customers. We have learned a few things about schemas and about dealing with massive numbers of features and columns. I think there are still going to be a lot of differences between the ecosystems, because we want to make sure it makes sense for the scientists as well as for the data engineers, and sometimes what makes sense for one doesn't make sense for the other. We have to make sure our API is designed with that in mind.
[00:26:54] Unknown:
As far as the overall ideas that you had going into building whylogs and starting to explore this space, I'm wondering what are some of the ways that the goals and design of the system have changed or evolved since you started working on it and some of the ways that your ideas and approach have shifted?
[00:27:12] Unknown:
So one of the first things with WhyLabs, well, you know, every startup goes through crawl, walk, and run, right? We started with really no configuration: you just send in all sorts of profiles, we detect smart defaults for you, and we run analysis and generate anomalies, or alerts. Then we realized, wait a minute, we do need configuration, because each of these data problems has its own characteristics. It's different to monitor a seasonal retail dataset versus a model training pipeline that runs with relatively stable data day over day, versus a fraud detection model. So that's one thing: we crawled by building with no configuration, but added a lot of important algorithms on top of that to really solve different specific business problems.
First, we really optimized toward the data science flow, initially with detecting drift, which is a hard problem when you have data at massive scale. So drift analysis, we really optimized for that initially; we found a bunch of algorithms that work with whylogs, and a lot of them are in the open source implementation now. Then, optimizing for the data science workflow, how to make it easy to monitor various things, we realized we needed to add knobs, and we introduced a configuration language. But it's quite complex to write, so our interface only lets you tune five or six knobs, with parameters of course, and our main UI is optimized to make it easy so you don't have to deal with the complex configuration language behind the scenes.
Of course, you can go in through the API using this particular JSON schema and it will work. But the challenge is that it's complex and big, and the configuration language wasn't really designed with user experience in mind. So on that second part we're walking, but still kind of crawling in the sense of exposing the configuration language. It's very easy for users to come in and start monitoring; they don't even have to touch anything, because we enable all these smart defaults by default based on the use case. And finally, nowadays we are revisiting that decision: we're introducing a lot more knobs and tuning options and redesigning the language so that we can build not just an easy-to-use API on top of it, but also a user experience workflow that allows you to set up this configuration without having to deal with the language itself. And by the way, once you start monitoring both machine learning and data engineering, the workflow gets extremely complicated, because, as an example, the cadence of things is different in the two worlds. The data engineering pipelines tend to move a lot slower than the machine learning real-time prediction pipelines; they just have a different set of expectations. So we have to capture all these knobs in our configuration and our user interface and make sure we're not trying to solve everything at once, making sure we can at least solve some of the fundamental problems, like data lateness and latency, in the user workflow and give you that sort of control.
It's been a fun journey but very challenging. Even though I worked with data pipelines, you have this split mentality when you think about machine learning versus data pipelines, and you don't realize that once they cross paths, the number of problems just multiplies across those dimensions. It's a bit crazy.
[00:31:02] Unknown:
And as far as actually adopting and integrating whylogs, I'm wondering if you can talk through that process of starting with either a greenfield or an existing code base and starting to add the instrumentation and log capture and figuring out where to put it, you know, what systems you need to put whylogs into to be able to get the full visibility across the workflows, and just that overall process of getting started with it and starting to capture that information and do some debugging and alerting on it?
[00:31:34] Unknown:
So, typically, people start with whylogs with a pandas DataFrame. A lot of people are using Jupyter notebooks, and we really optimized for that flow because that's pretty much the 101 experience these days, with notebook experiences across the board from Databricks to SQL-based systems. What we ended up with is a few lines of code for you to get started with whylogs. The key thing is for you to have either a pandas DataFrame or a Spark DataFrame. With Spark, you do need a bit more configuration, like hooking the library into the Spark JAR system, because it's going to run in distributed mode. But the 101 experience is very optimized toward pandas DataFrames.
You need that, and you need to configure the whylogs session at the moment to point it to where you want to store the artifacts, and a few other things. Well, with pandas, at least, you only need to point it to where things get stored; actually, you don't even need to do that, because it will just store everything locally wherever you're running the script. But that's a configuration people often jump into first because they can. Then you just need to create a whylogs session object, and you can start logging. It's a very natural API: you don't need to configure a lot of things like schemas, expectations, or constraints; those are more advanced APIs. A few lines of code is what we're aiming for here. That's the very basic experience. And if people have things like Airflow, then we have examples to hook into those systems, for example as an Airflow task, because an Airflow task takes in a pandas DataFrame and spits out something.
So it can take in a pandas DataFrame, spit out whylogs profiles, and store them somewhere. With that mentality, we're working on a lot of examples so you can apply it to your existing system rather than having to do a mental mapping from a pandas DataFrame in a notebook to something else. And feel free to take that on if you see an integration that isn't there that you really want and feel passionate about, because there are a lot of new orchestration systems out there, by the way. Absolutely.
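For reference, here is a minimal sketch of that "few lines of code" pandas flow, based on the whylogs 0.x session API that was current around this conversation; the exact session and logger calls may differ in newer releases, so treat them as assumptions and check the docs for your version.

```python
# pip install whylogs -- sketch of the 0.x-style session workflow (assumed API).
import pandas as pd
from whylogs import get_or_create_session

df = pd.read_csv("training_data.csv")    # any pandas DataFrame; filename is illustrative

session = get_or_create_session()         # defaults to writing profiles locally
with session.logger(dataset_name="training_data") as logger:
    logger.log_dataframe(df)               # schema is inferred automatically
```

The same couple of lines can sit inside an Airflow task (or any other orchestrator step) that receives a DataFrame, so the profile gets written alongside the task's normal output.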
[00:33:39] Unknown:
And for people who are using whylogs purely open source and don't necessarily want to hook it into WhyLabs, where would they be aiming to store the outputs of those log events? Is it something that you would just ship to your existing log aggregation system, like your Elasticsearch or your Loki or Grafana? Or is it something where you would need to store it as objects in S3 and have a different analytical workflow for being able to process them?
[00:34:06] Unknown:
That's a good question. Unfortunately, because of data structures like HyperLogLog, you can't ship whylogs profiles directly to those systems. I mean, you can, they're just binary, so you can ship them to the storage system, but those systems can't understand whylogs. That's an unfortunate fact. What whylogs open source users do, and by the way, by default whylogs does not talk to WhyLabs; you have to wire in a bunch of things, like an API key, before it can talk to the platform, so by default whylogs just runs locally. You can store the profiles in, typically what we see often is, cloud storage like S3, just storing them in S3. We've also seen people doing a bit more advanced work, and this is something we're learning from: storing the profiles in a columnar data store using a Postgres database so that they can look up these log objects faster than S3. So that's what they're doing.
Either way, what ends up happening is that you typically have analysis notebooks and workflows around these log objects. The second part is that you can extract metrics into other monitoring systems. We're also working on how to integrate whylogs with Elasticsearch and other systems, but the information extracted there is pretty complex, so we're still iterating on a subset of the signal. An example is that whylogs collects a KLL sketch, which allows you to build dynamic histograms, and that means the user needs to decide in advance how they want to extract that histogram information.
You lose a lot of capability if you store it in a static form, because you can't do things like data drift without the histogram bins aligning between the baseline and the target, for example; if they don't align, the algorithm doesn't make sense. So something to call out there is that you do lose information as you go from whylogs to other, more JSON-based or purely numerical systems.
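For open source users going the S3 route, the storage side can be as simple as the sketch below. The boto3 call is standard; how you obtain the serialized profile bytes depends on your whylogs version (profiles are written as binary/protobuf artifacts), so that part is left as an input here, and the bucket name and key layout are made up.

```python
# pip install boto3 -- ship a serialized whylogs profile to S3 for later analysis.
import datetime
import boto3

def upload_profile(profile_bytes: bytes, dataset: str,
                   bucket: str = "my-whylogs-profiles") -> str:
    """Write one profile under a time-partitioned key and return its S3 URI."""
    ts = datetime.datetime.utcnow().strftime("%Y/%m/%d/%H")
    key = f"profiles/{dataset}/{ts}/profile.bin"
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=profile_bytes)
    return f"s3://{bucket}/{key}"
```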
[00:36:15] Unknown:
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it's open source so you can host it yourself or let them do it for you. You have full control over your data, and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog. The other question that always comes up when you're instrumenting any piece of code is where to put the instrumentation and how much to do so that you don't end up drowning out the signal with noise, and just being able to maintain that balance and understand what are the key points in the data flows or the logical flows where you need to capture this information to be able to actually use it effectively, and just any strategies or heuristics that you have come across as you've been working on building out this system and working with your customers?
[00:37:24] Unknown:
Typically, well, in an Airflow DAG you might have a lot of transformations in between, and people do put whylogs on each of these transformation steps, but like you say, it gets overwhelming. So the strategy here is, first of all, to really focus on the input and the output first, the beginning and the end of your pipeline; that's where you want to put logging in the first place, and really do correlation between the two. However, if you have a lot of feature transformations where new features show up and then disappear again, then you really want to put logging where those events happen, because those hidden features in the pipeline might also be a source of bugs and you want visibility, maybe even targeted specifically at those features rather than the ones that already show up at the beginning and the end. But that's more of a tuning process.
The very simple approach is to treat your pipeline as a black box, first and foremost, and monitor that black box. That's actually true for machine learning models as well, because if you think about a neural network, it has multiple layers, and you can definitely monitor each individual layer. Of course, that gets expensive because those metrics are massive; whylogs can handle it, but it doesn't change the fact that there are a lot of layers. But you can also just monitor the input of the model and the output of the model. So there's some fine-tuning; it's an art, unfortunately.
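One way to picture the "black box" heuristic is a thin wrapper that profiles only what enters and leaves a transformation step. The decorator and the log_profile callable below are hypothetical stand-ins for whatever logging call your whylogs setup uses, not a whylogs API.

```python
# Profile the input and output of a step and nothing in between (illustrative only).
import functools
import pandas as pd

def instrument(step_name: str, log_profile):
    """Wrap a DataFrame-in/DataFrame-out transform with input/output profiling."""
    def decorator(transform):
        @functools.wraps(transform)
        def wrapper(df: pd.DataFrame) -> pd.DataFrame:
            log_profile(f"{step_name}.input", df)    # profile what comes in
            out = transform(df)
            log_profile(f"{step_name}.output", out)  # profile what goes out
            return out
        return wrapper
    return decorator
```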
[00:39:05] Unknown:
Another component is the question of governance of the open source project, how you're approaching its long-term sustainability, and how you think about the dividing lines between what is part of the open source project and what is part of the commercial entity.
[00:39:21] Unknown:
That's definitely something we have been learning a lot about, based on feedback from the open source community as well as from other companies that have built on top of open source. Our goal is a bit different. If you look at things like Elasticsearch, or, I guess, GitLab, the source for the open source product is similar to the closed source one. Whereas for us, the open source is the beginning of the journey, and what we're building is the end of that journey. We're not replicating the open source code internally; we're doing something very different internally that's optimized for real-time storage, while the open source is really optimized for logging, for data collection. We are really incentivized to make the open source project work across the board regardless of where you are, so we want to make it friendly for developers as well as companies to really adopt and embrace it. That's why the end game is slightly different. We acknowledge that not everyone will convert to a WhyLabs customer; some people will definitely just use it for their own workflow. For example, Yahoo Japan has their own data centers, so they don't even talk to the internet, and obviously we love that they're using whylogs. That's the kind of success we love to talk to people about: whylogs is not dependent on WhyLabs.
As for why we're building the platform: we want to make it really easy to ingest these objects, organize them, and build complex analysis workflows that you could definitely build yourself, but at scale. Think about the scale of thousands of features; we have customers with thousands of features and hundreds of models, where it gets really expensive to run manual analysis or build custom analysis. We want to make that low-hanging fruit. From the open source perspective, at the moment we are still the sole contributors to the whylogs open source project, but we have seen people asking to contribute, and we take active feedback from the community; Stitch Fix, for example, works really closely with us to provide feedback. So that's the governance side. As for sustainability, whylogs is a big differentiator for us from the technology point of view, so there's a lot of incentive for the company to make it accessible everywhere. And to be honest, my vision is to make it interoperable with existing DevOps flows, so that you can use whylogs for collection but still use your existing DevOps tools for monitoring, for example.
That's a bit of a longer-term vision because, like I said, they don't map one-to-one, so you're going to lose something; those tools can only do a subset of the monitoring. But that's still going to be valuable to some less sophisticated users out there who only care about high-level monitoring of the pipeline, for example. And finally, we are rebuilding whylogs as we speak, based on feedback from the community, to make it even more accessible and to lower the friction of integrating into various ecosystems, and we will work with a lot of open source projects to hook into that. We want to be part of that ecosystem from a synergy perspective.
It's definitely a difficult challenge from the point of view of company resourcing, as in we have a small team and we're building a lot on both the Java and Python sides. Matching between the two is a challenge in terms of finding people who can work with both; we have those people, but they're not always available on the market. So we would love to figure out how to get more open source contributions and more community sharing, but that's a matter of building the velocity and building the adoption on top of that, and I don't know if I have the answer for that at the moment. My vision is to really make it an integral part of a lot of data and machine learning technologies.
[00:43:24] Unknown:
And then in terms of the business model and the functionality that you're building on top of whylogs with the WhyLabs platform, I'm wondering if you can talk through some of that feature matrix and the vision that you have for building an overall end-to-end visibility solution for the machine learning workflow, and just the capabilities that you're building there?
[00:43:46] Unknown:
We are building a time-series data store for these whylogs objects, because it gets interesting when you start collecting them over time, in real-time systems or in batch systems, and then organizing them into various views. So the first part is building that real-time storage. And because these objects are mergeable, we can also build dynamic views on top of them, so you can view hourly versus daily versus weekly, detect trends, and monitor these, what we call models or datasets, in various forms. That first part is really the technology. For scalability, we build on top of really scalable technologies like Apache Spark and Apache Druid to enable this, with very specific optimizations around the whylogs workflow, because it's a new data type that isn't available in those ecosystems, so we had to build a lot. That's the optimization side. More importantly, the workflow optimization is thinking about how we, as a platform, fit into existing customer flows by providing the right monitoring algorithms.
One example: drift detection is a very common pattern our customers need, and a lot of people don't actually have things like a baseline for their models. All they care about is taking a trailing window, for example, and monitoring today against it. We make that very easy: you don't need to think about building a baseline, you just come in and start sending profiles, and once we have enough signal, we start monitoring. So we remove that barrier to monitoring and let people focus first on the data science problem and just add a few lines of code. Hopefully we can make it really zero lines of code as we build out these integrations; you could just do it as configuration, for example. That's the part we solve: take away that configuration, because if you ask a DevOps person to configure monitoring for thousands of features, it's going to be challenging.
What we're building is holistic aggregate metrics on top of this, and those are not trivial; it's a complex workflow. And to be honest, that's why we are a SaaS platform: we need to iterate fast on these metrics, because it's an art, an art of tuning, an art of thinking about what metrics make sense for what problem. That's the value we provide, and as a platform we can really help customers optimize the workflow toward their business problem rather than taking a canned algorithm and applying it indiscriminately to their workflow.
Otherwise you would have to involve scientists to do that. Here, the scientist doesn't have to get involved; it's the engineer who has to type in some logging, and that's it. Then, based on the use case, they get all this out-of-the-box monitoring, and the scientist is the end consumer. So it frees up scientist effort and time. And finally, like I mentioned previously, these workflows are complex because they deal with batch and real-time systems, with latency, with data delays. People who operate in data engineering know how complex the workflows get, and adding machine learning and model training workflows makes it messy. So we're really trying to find the sweet spot by talking to a lot of customers, finding the common workflows, and optimizing for those so that, hopefully, we can solve 90% of the problems out there. We acknowledge that the remaining 10% we're probably not going to solve in the first few iterations, and we'll have to keep iterating to really hit the last mile there. But we want to optimize for the general workflow first and foremost. So it's a lot of thinking about optimization and removing friction, as a platform, from this kind of monitoring and analysis workflow when it comes to data quality and machine learning.
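To make the trailing-window idea concrete, here is a toy drift check using a two-sample Kolmogorov-Smirnov test from SciPy on raw values. Real drift monitoring in whylogs/WhyLabs operates on sketches and across many features at once; the data, window size, and threshold here are arbitrary assumptions for illustration.

```python
# pip install numpy scipy -- flag drift when today differs from the trailing window.
import numpy as np
from scipy.stats import ks_2samp

def drifted(today: np.ndarray, trailing_window: np.ndarray,
            p_threshold: float = 0.01) -> bool:
    """Return True if today's distribution differs from the trailing window."""
    stat, p_value = ks_2samp(today, trailing_window)
    return p_value < p_threshold

# Example: compare today's feature values against the previous 7 days.
history = [np.random.normal(0, 1, 5_000) for _ in range(7)]
today = np.random.normal(0.3, 1, 5_000)          # slight mean shift
print(drifted(today, np.concatenate(history)))   # likely True
```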
[00:47:41] Unknown:
As far as the usage of whylogs and the work that you've seen people doing with it and some of the applications, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen it employed?
[00:47:55] Unknown:
I've seen people use whylogs to monitor application performance. That is an interesting angle, because to them, application performance metrics are just metrics. You do lose out on some granularity, because whylogs assumes micro-batching underneath the hood; you're not emitting individual data points. But they actually find value in being able to build histograms over time, five-minutely versus hourly, so every five minutes they emit a whylogs profile for application performance.
Another thing is that it works well across a really large number of dimensions. Keep in mind, if you're thinking about monitoring in Datadog, you get charged per metric. So what they end up doing is dumping everything into whylogs and running the analysis from that. That's an interesting flow that I did not anticipate; I was really optimizing toward the data and machine learning world.
[00:48:51] Unknown:
In your own experience of building the whylogs technology and growing the WhyLabs business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:49:03] Unknown:
I guess the most unexpected lessons were a lot around assumptions about user personas. We assumed certain personas and optimized toward those, and then customers onboarded with very different personas. Specifically, if you look at our platform now, it's really geared toward the machine learning and data science persona, but we have a good number of data engineering personas who are actual users of the platform, and there's a bit of friction in how terms are communicated between the two worlds. For example, in data engineering you call things columns, and in data science you call them features. Yes, that might be a minor thing, but when you have a dataset, one side might treat it as a model versus just a traditional dataset; they have different hierarchies, different taxonomies.
So that's one thing to call out, and it runs from the API perspective through to the platform perspective: a lot of assumptions were built around the data science workflow. One thing I learned is to build escape hatches for the other use cases. This is not about trying not to be opinionated in the first-level workflow; you optimize for one workflow, because you're a small team at a startup and you really want to do well at what you're doing. But then you need to build escape hatches so other people can also hook in without having to retrofit. Initially, some of our APIs were not flexible enough, so people had to retrofit the data workflow into the ML workflow to model what they needed, and that gets a bit awkward in terms of communication, the friction of communication over time. The lesson I learned is to really think about the cases where we can allow escape hatches in general, from both the open source library perspective and the platform perspective.
[00:51:06] Unknown:
For people who are interested in being able to gain some of the visibility and unlock the capabilities that whylogs provides, what are some of the cases where maybe it's the wrong choice and they're better suited by just traditional logging, or some of the other data observability systems that exist, or a different sort of MLOps platform that they want to run on to gain some of that insight?
[00:51:30] Unknown:
For example, whylogs is not designed for experiment tracking, the kind of hyperparameter and iterative tracking you do while building models. That's not what we are. We're aiming to solve the data problem, not the model parameter problem. Yes, we can track numbers, but it's the wrong kind of number to track. That's one part. The second part is when you want to use the numerical data in the platform for running actual business analysis for your business. WhyLabs is not designed for that. We don't store your raw data, so you can't really pull the individual data points out of it for your own analysis. What we provide is just high-level statistics, and if you really want to do analysis or reporting, you need to go back to the data source, which means you need your own data store. It does get a bit confusing, because sometimes people assume at a glance that we have the data, and then they learn that we actually don't store the data points. They get a bit confused because they were thinking they could use us as a dual solution where we store the data as well, with all the complications of storing data and dealing with regulation. So that's an unfortunate fact, and we can definitely do a better job of communicating it.
[00:52:44] Unknown:
And as you continue to build and iterate on the whylogs project and the WhyLabs product, what are some of the things you have planned for the near to medium term?
[00:52:54] Unknown:
So we are releasing a new monitoring workflow in the WhyLabs platform so that users have more power. I mentioned the somewhat opinionated workflow that we have at the moment with the configuration. The new workflow was redesigned from scratch for both data science and data engineering workflows, and I think that will be a very interesting angle because it opens up a lot of interesting ways to monitor your data. For example, right now we don't allow duality in your monitoring: you can only monitor hourly, or daily, or weekly. With the new monitor you can do both; it's just a different kind of analysis that we run. The new monitor will also give users a lot more high-level overview signals about the data, rather than looking at individual feature anomalies. We allow you to run analysis across the whole dataset and detect changes as a whole rather than as individual features.
That's something we're building on top of the existing mindset. It's a shift towards making it more configurable, easier to configure, and then making it easier to really understand the health of your data, producing a data health score and a machine learning health score. So that's the platform side. On the whylogs side, we are working on redesigning the API. I mentioned some of the friction around the API. The new API is going to be two lines of code, really: import something and then log something. Underneath the hood we're hooking into the specific technologies I mentioned, but we're going to optimize performance to the extreme, using vectorization rather than relying on iteration or row-by-row pandas operations. We're going to hook into these vectorized operations as a whole, and we're also looking into integrating with Apache Arrow so that you can work with both distributed systems and local in-memory storage natively, without having to copy data between languages.
For example, if you pass a Python string into the whylogs library at the moment, it requires copying it from a Python string to a C++ string, and we are working on getting rid of that, because Arrow allows you to just hand over the whole string buffer. That will be very important for hooking into a lot more data storage systems, because Arrow seems to be the new standard for data transfer these days. From BigQuery to Snowflake, everyone is doing Arrow, and that'd be very exciting. And, of course, the API we've designed will make it easier for us to hook whylogs into various systems. We're also going to build extension modules to make it work with data flow as well, like data lineage and orchestration, like the Apache Airflow integration I mentioned before. We're also looking at Flyte, which is a library from Lyft that also does data orchestration.
It's spelled F-L-Y-T-E. Yeah. So a lot more innovation, a lot more API redesign, lowering the friction. And hopefully we'll also clean up the terminology so that you can grasp it a lot better.
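As a rough illustration of the "import something, then log something" workflow described above, here is a sketch assuming the whylogs v1-style API; the Arrow table at the end only shows the columnar format the project was planning to integrate with, not a confirmed whylogs input type.

```python
import pandas as pd
import pyarrow as pa
import whylogs as why

df = pd.DataFrame({"feature_a": [1, 2, 3], "feature_b": ["x", "y", "z"]})

# The two-line workflow: import, then log.
results = why.log(df)
print(results.view().to_pandas())  # per-column summary statistics

# Arrow keeps string data in contiguous buffers, so other runtimes (C++, Rust,
# JVM) can read it without per-value copies. Direct Arrow input to whylogs is
# the planned integration discussed above, not something demonstrated here.
table = pa.Table.from_pandas(df)
print(table.schema)
```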
[00:56:10] Unknown:
And are there any other aspects of the work that you're doing at WhyLabs, or the whylogs project itself, or the overall space of generating and analyzing observability information for machine learning and data engineering workflows, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:56:29] Unknown:
One thing to call out is that while whylogs is trying to solve a very specific problem around data logging, the general sentiment I see is that as we're moving to more real-time systems, there's not enough discussion about how to think about and operate these Kafka-based systems from a data quality perspective, from a lineage perspective. It's kind of a missing black box at the moment in the data management landscape, because it's a very hard problem to solve.
[00:57:00] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:15] Unknown:
The biggest gap is that, as machine learning becomes more widely adopted, there's a bit of a weird gap in the transition from the data science and data engineering world to the machine learning world. That transition is not well defined. There's a lot of friction at the moment, and everyone is building their own thing with a Frankenstein of technologies. I'd be interested to see who's going to solve that problem effectively, from managing upstream data sources like Kafka and batch all the way to the machine learning world. Absolutely.
[00:57:49] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at WhyLabs and on the whylogs project, and how it's able to help provide at least a portion of that bridge from data engineering to machine learning. I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Yeah. Perfect. Thank you. Thank you very much. Really enjoyed the chat. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Building WhyLabs and whylogs
Data Observability and Logging
Use Cases and Comparisons
Design and Implementation of whylogs
Evolution and Lessons Learned
Adoption and Integration
Governance and Sustainability
whylogs and WhyLabs Features
Interesting Applications and Lessons
When Not to Use whylogs
Future Plans and Developments
Final Thoughts and Closing