Summary
The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart, about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io, about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- Your host is Tobias Macey and this week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. First up you’ll hear from Alan Anders, the CTO of Applecart, about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io, about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.
Interview
Alan Anders from Applecart
- What are the challenges of gathering and processing data from multiple data sources and representing them in a unified manner for merging into single entities?
- What are the biggest technical hurdles at Applecart?
Contact Info
- @alanjanders on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Stepan Pushkarev from Hydrosphere.io
- What is Hydrosphere.io?
- What metrics do you track to determine when a machine learning model is not producing an appropriate output?
- How do you determine which data points to sample for retraining the model?
- How does the role of a machine learning engineer differ from data engineers and data scientists?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new t-shirt.
And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey, and this week, I attended the Open Data Science Conference in Boston, and I recorded a few brief interviews while I was there. First up, you'll hear from Alan Anders, the CTO of Applecart, about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next, I spoke with Stepan Pushkarev, the CEO, CTO, and cofounder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to retrain and redeploy those models for better accuracy and more robust operation.
So I'm here with Alan Anders, the CTO of Applecart. So could you start by introducing yourself?
[00:01:30] Unknown:
I'm Alan Anders, and I'm the CTO of Applecart.
[00:01:34] Unknown:
Fair enough. So, we were just talking a little bit, and you mentioned that your primary data engineering concern is being able to process and create knowledge graphs at large scale by sourcing data from multiple different layers. And so I'm wondering if you can talk a bit about some of the ways that you're sourcing that data and some of the challenges that you're working on overcoming.
[00:01:59] Unknown:
Sure. I mean, I come from an ad tech world where we used many, many servers and distributed computing in different ways. And when I came to Applecart, that's how it sort of was. I found that we did a lot of batch computing in strange ways, and I actually moved the entire company to Spark. We're big users of something called Databricks, which I'm sure you're familiar with. We have all of our data engineers and data scientists utilizing Databricks and Spark with shared libraries through GitHub. And it's actually been very effective at building these batch ETLs while still being able to do extensible object oriented programming with many, many different moving parts,
[00:02:51] Unknown:
if that makes sense. And as you're trying to create these entities from multiple different sources, are you having challenges with creating a unified representation of that data so it's easier to merge things together?
[00:03:04] Unknown:
Absolutely. The entity resolution problems are really, really tough. A lot of our datasets, the profile datasets, have attributes of many different flavors. And figuring out, you know, maybe you have n profiles and you want to do n choose 2 matchings, it's a nightmare to try to handle that at scale. You're simultaneously trying to use machine learning techniques to say profile A is the same thing as profile B across two different data sources, and at the same time, within each dataset, you have to dedupe them, which is also an entity resolution problem. The bookkeeping can be a real nightmare in addition to the machine learning aspects and the at-scale aspect. So if you're dealing with two datasets, 200,000,000 profiles, a hundred columns in each, how do you say these two are the same? And how do you process that data pretty quickly? Those are the sorts of challenges that we come up against.
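To make that pairwise-matching problem concrete, here is a minimal, hypothetical PySpark sketch of the standard blocking approach that avoids comparing all n choose 2 profile pairs; the column names, paths, similarity measure, and threshold are illustrative assumptions, not Applecart's actual pipeline.

```python
# Hypothetical sketch of blocking for entity resolution; names and paths are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("entity-resolution-sketch").getOrCreate()

# Assume a profiles table with an id, name fields, and an address-derived zip code.
profiles = spark.read.parquet("s3://example-bucket/profiles/")

# Block on a cheap key (phonetic last name + zip) so only profiles that share a
# block are ever compared, instead of all n choose 2 pairs.
blocked = profiles.withColumn(
    "block_key",
    F.concat_ws("|", F.soundex("last_name"), F.col("zip_code")),
)

a, b = blocked.alias("a"), blocked.alias("b")
candidates = a.join(
    b,
    (F.col("a.block_key") == F.col("b.block_key"))
    & (F.col("a.profile_id") < F.col("b.profile_id")),  # id ordering avoids duplicate pairs
)

# Score candidate pairs with a cheap string similarity; a real pipeline would use a
# learned matching model over many attributes.
scored = candidates.withColumn(
    "name_sim",
    1 - F.levenshtein(F.col("a.first_name"), F.col("b.first_name"))
    / F.greatest(F.length(F.col("a.first_name")), F.length(F.col("b.first_name"))),
)
matches = scored.filter(F.col("name_sim") > 0.85)
```

The point of the blocking key is that only profiles sharing a block are ever compared, which turns an intractable cross join into something Spark can shuffle and score.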
[00:04:02] Unknown:
And what are some of the ways that you identify the places to source data from and then actually build the connectors to consume that data at appropriate levels of scale? And also, particularly thinking about things like rate limiting from various sources, how do you make sure that things are flowing along at the proper rate from multiple sources so they can be merged together?
[00:04:27] Unknown:
That's a very interesting question. There is a lot of data analytics, data science analysis, and business analysis that goes into that. If you look at profile data, you have to ask yourself, you know, how often does that data change and how often do I want to refresh it? For things like demographic data, how often are people changing their ethnicity, their gender, their religion? Not very often, right? So if I have data sources that are fueling that, maybe I don't need to update that data very often. Now, last names change, addresses change.
You figure about 20% of the nation moves in a year, and you have to ask yourself, how often do I want to pick up that change, and how do I detect it online? Those things can be kind of tricky. For other datasets, like the more streaming ones with messaging, like Twitter, a lot of these problems have actually been solved. I would say scraping is like CSE 101, and you have to be very choosy about where you invest in that. We try to invest in the kind of scraping that requires, like, an extra barrier.
Right? Or it's very specific to what we're interested in, in terms of building real-world relationships. So I would say the challenges are definitely around scale, but more often, when you really ask the question of what you need, the scale comes down a lot.
[00:05:58] Unknown:
And a lot of the classical streaming technologies that we'd use really fit the bill for us. And do you find that you have any issues with the sources of data not being available when you're trying to consume them? And have you had to build in any extra engineering effort to manage fault tolerance when your sources are not able to produce data at the rates that you're trying to receive it?
[00:06:22] Unknown:
Yeah, that definitely comes up. We definitely have pauses in our data streams. You know, we're kind of a different beast, right? We're trying to build machine learning probabilities of how people might behave in certain situations. So trying to understand who's going to get off the couch to go vote, we want to build that probability in a campaign in real time. It's sort of strange because, you know, with these machine learning models and how they operate, a lot of those streaming data sources that we're talking about, we're not necessarily going to incorporate into those models. And so a stoppage, or if one of our collectors breaks, it ends up not being disastrous for our clients.
So that ends up being okay, and we can invest those engineering efforts to fix that over longer periods. But it's something that we do think about a lot, you know, as we scale: how are we going to fix that?
[00:07:23] Unknown:
And what have been some of the most challenging aspects of building and maintaining your data systems for creating these entities?
[00:07:33] Unknown:
Spark. The documentation for Spark is all over the place. I think you definitely have to have an intuition for distributed computing. How does a join really work? What is skew? Things like this. And we have people at different points of knowledge in their growth with Spark, and distributing that information is very hard. I think they're getting better at doing knowledge sharing at the Spark Summits and things like this, and we interact with the Spark contributors as much as we can, but it's an ongoing fight.
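As one concrete example of the distributed-computing intuition he mentions, a skewed join key is often handled by salting; the following is a hypothetical PySpark sketch with made-up table and column names, not Applecart's code.

```python
# Hypothetical sketch of "salting" a skewed join key in PySpark; names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting-sketch").getOrCreate()
SALT_BUCKETS = 16

facts = spark.read.parquet("s3://example-bucket/events/")  # large side, skewed on user_id
dims = spark.read.parquet("s3://example-bucket/users/")    # smaller dimension table

# Spread hot keys across several partitions by appending a random salt on the big side...
salted_facts = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("long"))

# ...and replicating each dimension row once per salt value on the small side,
# which keeps the join correct while avoiding one giant partition per hot key.
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
salted_dims = dims.crossJoin(salts)

joined = salted_facts.join(salted_dims, on=["user_id", "salt"])
```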
[00:08:16] Unknown:
Before I go to the last question, are there any other aspects of your data platform or some of the challenges you're facing that you think we should cover? Yeah. So
[00:08:25] Unknown:
what's very unusual about us is that we deploy a lot of our services from sort of a consulting perspective. When we work with campaigns or commercial clients, they're often not sophisticated enough to utilize the machine learning probabilities that we produce. So that gives us a lot of leeway in terms of not necessarily delivering immediately. On the other hand, we do have this spike infrastructure where we need to spin up 100 nodes, 150 nodes, 250 nodes, and then shut them down. And a lot of the different vendors we work with aren't used to this. They want to charge us for so many nodes over a month, and then, you know, you want to upgrade to 10 nodes? Fine, this will cost this much. That doesn't work for us. We need spike infrastructure, spin up and spin down. A lot of databases don't work for us. We generally use a lot of cloud storage with Spark so that we can have that spin up and spin down. We've had to be very choosy about the kinds of partnerships we enter into. There are companies like DataRobot that do machine learning, but their business isn't ready for that spin up, spin down. And we're in talks with companies like Datadog for instrumentation, and this is sort of new for them too. So we're trying to find databases that can handle that. Right now, we're looking at something like Databricks Delta, which actually does allow for that, and we're also trying to explore other technologies.
[00:09:52] Unknown:
And as one last question, what do you see as being the biggest gap in the available tooling or technology for data management as it stands right now?
[00:10:01] Unknown:
That is a really good question. Yeah. I mean, there are these little things that we find here and there that just aren't made for startups. And I kind of was mentioning this to you before we started talking: there are certain Spark operations that are just not performant, especially certain kinds of machine learning computations. The research is getting there in terms of, like, how do we scale graph computations? How do we scale machine learning, or, like, principal component analysis? And people want to integrate that with Spark because everyone loves using Spark, whether you're small or large, for doing these computations, but it just currently doesn't work. So I'm really happy to see the open source community trying to solve this, but it's still way behind. And, you know, companies like Google and Facebook, I believe, have these problems solved, but won't share it with the rest of us. So we don't have hundreds of data scientists or data engineers, you know, solving these problems.
[00:11:08] Unknown:
And for somebody who wants to find out what you guys are up to or follow along, what would be the best way for them to do that?
[00:11:14] Unknown:
Oh, yeah. Great. Definitely visit applecart.co. Reach out. We're very much hiring. We're very interested in strong data engineers, strong data scientists. We have some product roles opening up. And, you know, if you don't have those backgrounds, we would still love smart people: political scientists, mathematicians, physicists. We have very many applications that we're looking for right now. Well, thank you for your time. Enjoy the rest of the conference. Thank you so much, Tobias.
[00:11:45] Unknown:
So I'm here with Stepan from Hydrosphere.io. So could you just start by introducing yourself?
[00:11:49] Unknown:
Hey. I'm Stepan. I'm from Hydrosphere.io. I'm the founder and CTO. I worked as a software engineer for many years, and then I built and architected large-scale stream processing systems based on the Apache Spark and Apache Kafka ecosystem. And I also worked a lot with machine learning engineers and data scientists to make them happy, to make them successful in delivering their proofs of concept from notebook environments to real end-to-end, real-time production applications.
[00:12:20] Unknown:
And at Hydrosphere, you're working on sort of closing the loop of AI and machine learning models in production to make sure that they're doing what they're supposed to be doing. So, can you just talk a bit about how you do that and some of the types of metrics that you use to understand how well these things are running? Yeah. Well, so
[00:12:40] Unknown:
everybody is focused right now on training. It's very sexy. We are taking one more step ahead, and we are automating the serving and productionizing part of machine learning. So we do model deployment, serving, and monitoring. Monitoring is the most crucial part, because deployment is a boring thing; monitoring of the model performance, the input features to the model, and the output predictions is the most crucial part. So how do we do that? We monitor distributions of the input features. We apply different statistics like Kolmogorov-Smirnov, correlations, and clustering-based algorithms so that we basically know as much as possible about your production traffic, to be able to detect concept drift and model degradation.
And, also, we use these statistics to do resampling, to generate a clean and diverse dataset for your retraining pipeline. So we help your team maintain and retrain your models in production as you go along. Obviously, your model is only as good as your data. If your data changes in production, your model is not as good as it's supposed to be at training time. And this is a very, very challenging part of machine learning: to maintain the model quality over time and to scale that. The question of scale is probably the next most important question in the industry. If you, as a data scientist, own and build, I don't know, 10 different models, and every model has 10 different versions, how would you watch them in production?
It's not your day job to watch models and identify model drift, concept drift, and model degradation. So obviously, you need more tooling to augment your day-to-day operations, and this is where we come in. We apply AI and machine learning to monitor your machine learning models in production. This is basically the same concept that has been used throughout history: machines help build better machines, computers help build better computer programs, and now AI can help build more reliable and fault-tolerant machine learning models. So this is the high-level overview.
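As a rough, hypothetical illustration of that distribution-monitoring idea (not Hydrosphere's actual implementation), a two-sample Kolmogorov-Smirnov test can flag when a production feature's distribution no longer matches its training-time baseline; the feature names, sample sizes, and significance threshold below are assumptions.

```python
# Hypothetical sketch: flag input-feature drift with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(train: dict, production: dict, alpha: float = 0.01) -> list:
    """Return the names of features whose production distribution differs from training."""
    flagged = []
    for name, train_values in train.items():
        statistic, p_value = ks_2samp(train_values, production[name])
        if p_value < alpha:  # small p-value -> distributions likely differ
            flagged.append(name)
    return flagged

# Usage with synthetic data: the 'age' feature drifts, the 'salary' feature does not.
rng = np.random.default_rng(0)
train = {"age": rng.normal(40, 10, 5000), "salary": rng.normal(60000, 15000, 5000)}
production = {"age": rng.normal(48, 12, 1000), "salary": rng.normal(60000, 15000, 1000)}
print(drifted_features(train, production))  # expected: ['age']
```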
[00:15:21] Unknown:
And when you're monitoring these various metrics for the machine learning models, what are some of the things that you specifically look at to identify that the output of a model isn't within the desired bounds? Is that something that you automatically detect once the model gets put into production, or is that something that needs to be predefined when the model gets pushed into your system? So we can automatically profile your training data and build a basic data profile of what you have. It might be statistical metrics. It might be deep autoencoders.
[00:15:54] Unknown:
We're actually actively using GANs right now. Originally, GANs were used for fooling the predictor into outputting the wrong prediction, and we are probably the first to use them to generate not noise but drift, and we train our discriminator to identify model drift. So it's one of the techniques we use, and we actually combine different methods and different metrics. It might be simple statistics, it might be more advanced statistics, it might be some ad hoc rules, in addition to deep learning anomaly detection methods. If they all coincide, that's a great sign of model degradation. So there is no silver bullet for this. Every model is unique, and this is kind of a new discipline for the machine learning engineer to look at: how would I make my models more reliable?
What metrics should I use to monitor my model? As I mentioned, there is no silver bullet. There are different methods and different metrics. There are some unified approaches, like just watching the distribution of input features. We can do that for almost any input except images and maybe words, the NLP use cases. But for classical machine learning, you can monitor the age, the wage, the salary, or whatever features you have. The tooling for that has been in place for many years, so it's just a matter of applying it and automating it. The crucial thing is why traditional software monitoring methods do not work here. There is a lot of system monitoring of CPU and GPU, but these metrics are much more complicated than just a simple counter or histogram.
For example, Kolmogorov-Smirnov is a stateful metric. You need two samples to compare, you need to do that in a window, and you need to do that in real time. Internally, everything is based on Kafka Streams, so we do stateful aggregation and stateful calculation of these metrics in real time. So, yeah, that's how we work.
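A simplified, assumed stand-in for that windowed, stateful calculation might look like the following in plain Python; the real system computes these metrics on Kafka Streams, and the window size and threshold here are illustrative.

```python
# Hypothetical stand-in for a windowed, stateful drift metric (the real system runs on Kafka Streams).
from collections import deque
from scipy.stats import ks_2samp

class WindowedKSMonitor:
    """Compare a rolling window of production values against a fixed training reference."""

    def __init__(self, reference, window_size: int = 1000, alpha: float = 0.01):
        self.reference = list(reference)          # baseline sample captured at training time
        self.window = deque(maxlen=window_size)   # bounded state over the production stream
        self.alpha = alpha

    def observe(self, value: float):
        """Record one production value; return a drift verdict once the window is full."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return None  # not enough accumulated state yet
        _, p_value = ks_2samp(list(self.window), self.reference)
        return p_value < self.alpha
```

Each call to `observe` carries the accumulated window state forward, mirroring the stateful aggregation he describes.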
[00:18:32] Unknown:
You preempted the next question I was gonna ask. So thanks for that. And in terms of the data that you're sampling for being able to retrain the models, are you keeping track of the data that's coming in and the output that's going out to determine which inputs are producing valid outputs and which ones aren't, so that you can then determine which data points to sample for retraining the model and bringing it back in line with the expected outputs?
[00:18:56] Unknown:
Yeah. For resampling, we do not watch the outputs; it's mostly the inputs. If you do have outliers in production, that's fine. You need to include these outliers in your training pipeline to make it more reliable, because at the end of the day, you will get these outliers in production, and your model needs to be ready for that. So this is mostly about monitoring the inputs of the model.
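A hypothetical sketch of that input-driven resampling, using an off-the-shelf anomaly detector as a stand-in for whatever Hydrosphere uses internally: fit on the training inputs, score the production inputs, and keep the most unusual rows for the next retraining set.

```python
# Hypothetical sketch: pick the production inputs least like the training data so the
# next retraining set covers them (labels would still need to be obtained separately).
import numpy as np
from sklearn.ensemble import IsolationForest

def select_for_retraining(train_X: np.ndarray, prod_X: np.ndarray, keep_fraction: float = 0.05):
    """Return the indices of the most anomalous production rows."""
    detector = IsolationForest(random_state=0).fit(train_X)
    scores = detector.score_samples(prod_X)         # lower score = more anomalous
    n_keep = max(1, int(len(prod_X) * keep_fraction))
    return np.argsort(scores)[:n_keep]

# Usage: append prod_X[selected] (with fresh labels) to the training data before retraining.
```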
[00:19:25] Unknown:
And what have been some of the most challenging aspects of building Hydrosphere and determining how best to scale it and architect it to serve your clients appropriately?
[00:19:34] Unknown:
Yeah, it's a wide question, an open question. Besides just the technical challenges, which are, of course, all things Docker, microservices, and Kafka, the challenging part is probably socializing the idea and the evangelism around it. So a big part is education of the clients and education of the community. And another, kind of, not a showstopper, but very challenging part is that we are a little bit ahead of the progress compared to the average in the industry. Everybody is still building the models, trying to figure out how to build their training pipelines, figure out how to build their data pipelines.
And when we offer something advanced, we need to help clients do the basics. That's why we are trying to focus just on our product, without spreading our focus across the whole machine learning business and, yeah, machine learning development.
[00:20:49] Unknown:
So you're trying to make people aware of the pain that they're going to be suffering before they actually get to that point, and because they haven't experienced it, they're having a hard time seeing where they might be able to use
[00:20:59] Unknown:
you. Yeah. So, before even building a sophisticated model, what we try to emphasize is: can you build a very naive model, but build it overnight, deploy it into an end-to-end application, and get feedback from the business and from the users, whatever? And then improve it continuously. So you can be in production after one week of research and development. Everybody kind of wants to do that, but what we see is companies just hire data scientists.
They spend a year prototyping in machine learning notebooks, and they don't have a single model in production. Sometimes it requires changing the organizational structure to make data scientists more aware of the final goal and the final production. So it's a very, very tough topic.
[00:22:05] Unknown:
And do you think that that's why the whole idea of a machine learning engineer as a distinct role from data engineers and data scientists is starting to gain more
[00:22:14] Unknown:
prevalence? Yeah. So, we target machine learning engineers rather than data scientists. If a machine learning engineer is responsible for delivering the final value to production, to the end users, it's in his interest to drive that, to craft that, and not to throw it over the wall to IT people. That's another cultural aspect of it. The traditional IT operations people who do monitoring and support, especially in big companies, have no idea about machine learning. If we expose a Kolmogorov-Smirnov test metric to them, it means nothing to them, and they don't even know how to react to those alarms.
So that's why we need a machine learning engineer to watch these metrics, to watch these alarms, and probably at least to be in the loop. Because in the serving pipeline there are, like, a thousand reasons why machine learning may not work in production. The data pipeline might even be broken. Your upstream application might just start producing corrupted input features. So there is no single tool to monitor that and to fix that. That's why the whole team should be aware of any incidents in production. It's actually another interesting use case that we are working on right now: differentiating between expected and unexpected failure, expected concept drift and unexpected. So, for example, if your upstream pipeline failed to produce the right features, this is unexpected drift. You should not retrain your models on these bad features, because they will propagate through the training and serving pipelines to the end users. So you have to classify between that kind of expected drift and an unexpected system failure where you just started receiving something different.
So, that's another interesting aspect of using AI or machine learning to help operationalize
[00:24:50] Unknown:
machine learning in production. Yeah. I've definitely been seeing a lot of trends towards adopting some of the principles that the DevOps movement brought between developers and IT staff, moving into the realm of data engineering and data science and some of the more statistically oriented roles within a company. Yeah.
[00:25:09] Unknown:
I hope so. I hope so. This is where we are, and, like, following that idea, everybody wants to be an AI-first company right now. And one of the major steps towards being an AI-first company is to improve the organizational processes and educate people about production, about end value, about iterations, about all that DevOps stuff that is already in the past for most companies. But, yeah, I still see engineers who just make things work on their laptops, and that's it.
[00:25:47] Unknown:
And are there any other aspects of machine learning engineering or model serving or Hydrosphere in general that you think we should talk about? I don't know. Probably I covered almost everything that I have in my mind. Probably,
[00:26:00] Unknown:
yeah. The one thing I wanted to mention: as I said, I don't have a word, I didn't pin down any buzzword for the thing that we are focusing on. And probably one of the thoughts I have is that it's an extension of AutoML, which is already a little bit understood by the community. People at least understand it and use it heavily: automation around the training process, hyperparameter tuning, model selection, and all that stuff. And if we can extend that to production, so you have serving, model monitoring, resampling, and retraining in the same loop, that would be really beneficial for everybody.
[00:26:48] Unknown:
Maybe we should just start having everybody call it prod ML, and then you can leverage that.
[00:26:53] Unknown:
Prod AutoML as a service.
[00:26:57] Unknown:
So, as one last question, what do you see as being the biggest gap in the tooling or technology that's available for people working with data management today?
[00:27:06] Unknown:
Actually, there are a lot of gaps. Even when you start just experimenting, just having a Jupyter notebook up and running, just making your Spark application work perfectly with S3, it looks like a very basic thing, but when you start from scratch, you can find a lot of gaps. And if you're a data scientist or machine learning engineer, you just spend tons of time on that. So that's one gap, and there are amazing companies that are trying to build more tools for training. There is, like, the Weights & Biases company that just popped up, a new startup. Very, very cool. They do some tooling around training as well. So even with training, which is very popular and everybody's doing, with playbooks and notebooks about it, you still fight with reproducibility of your experiments and with versioning of your experiments. Even if you have, like, a dozen model versions, you forget about them; you can't properly track the versioning and their performance characteristics.
Yep. So, you name it. A lot of companies do data science collaboration tools, but I don't think it's perfect at this moment. And I don't think there are tools in open source to address all the issues in collaboration and training. And then the slogans of the companies and the tools: everybody says they're, like, doing DataOps. I have no idea what DataOps is. There are, like, ten different interpretations and definitions of DataOps. And everybody is saying they are production ready. But when you take a closer look, for some companies, production ready means having the ability to schedule a cron job.
Okay, is that production? It's just an offline batch job. It's not real production. So this terminology and marketing hype is a little bit challenging to break through. That's why I believe the podcasts that are really hardcore and go straight to the point are very helpful to the community. Great. Yeah. It
[00:29:55] Unknown:
seems like data management, data engineering, and productionizing of some of these more advanced workflows are going through the same growing pains that the DevOps movement did, as far as not having a cohesive definition of what it even means to be doing that. And so it's open to interpretation and opportunism. So I think we're in those same growing pain stages, but I think that we'll come out the other side the better for it. Yeah. Yeah. Definitely. Definitely. We'll see more
[00:30:20] Unknown:
from the public use cases you see. We've been talking about some theoretical stuff right now, but you see, for example, the Tay bot from Microsoft. You remember they launched that. This is a public lesson for everybody about what machine learning and AI in production might look like. Even at Microsoft, in their pretty cool research environment, I believe they tested the Tay bot on some Wikipedia-like datasets, and it worked pretty fine, pretty nice.
But, of course, the real world is much tougher than the lab environment at Microsoft, so this was the result: it turned racist, fascist, and all that. It's not fun, really. And, actually, the question that is a very open question for everybody: was Microsoft able to monitor and adjust their models as they went along and prevent that unexpected shutdown of their bot? Did they have the right tooling around that or not? It was just a research experiment, but it might cost the reputation of the company. It might cost, like, huge efforts from the PR team to minimize the impact on the business. So I believe everybody can make their own assumption about the impact of ML failures on their project or their business, and think through how we can improve the
[00:32:07] Unknown:
tooling and the situation around that? Well, thank you very much for taking the time, and for anybody who wants to see the work that you're up to, what would be the best way for them to find you?
[00:32:17] Unknown:
So hydrosphere.io, and follow the links, follow the social media. I'm pretty open to any comments, questions, and, yeah, thoughts and contributions. We have an open source version. Great. Thank you very much.
Introduction and Overview
Open Data Science Conference Highlights
Interview with Alan Anders: Scaling Spark for Entity Graphs
Challenges in Data Source Integration and Fault Tolerance
Gaps in Data Management Tooling
Interview with Stepan Pushkarev: Monitoring Machine Learning Models
Challenges in Building HydroSphere
Adopting DevOps Principles in Data Science
Gaps in Data Management Tooling and Technology
Lessons from AI Failures: The Tay Bot Incident