Summary
Exploratory data analysis works best when the feedback loop is fast and iterative. This is easy to achieve when you are working on small datasets, but as they scale up beyond what can fit on a single machine those short iterations quickly become long and tedious. The Arkouda project is a Python interface built on top of the Chapel compiler to bring back those interactive speeds for exploratory analysis on horizontally scalable compute that parallelizes operations on large volumes of data. In this episode David Bader explains how the framework operates, the algorithms that are built into it to support complex analyses, and how you can start using it today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing David Bader about Arkouda, a horizontally scalable parallel compute library for exploratory data analysis in Python
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Arkouda is and the story behind it?
- What are the main goals of the project?
- How does it address those goals?
- Who is the primary audience for Arkouda?
- What are some of the main points of friction that engineers and scientists encounter while conducting exploratory data analysis (EDA)?
- What kinds of behaviors are they engaging in during these exploration cycles?
- When data scientists run up against the limitations of their tools and environments how does that impact the work of data engineers/data platform owners?
- There have been a number of libraries/frameworks/utilities/etc. built to improve the experience and outcomes for EDA. What was missing that made Arkouda necessary/useful?
- Can you describe how Arkouda is implemented?
- What are some of the novel algorithms that you have had to design to support Arkouda’s objectives?
- How have the design/goals/scope of the project changed since you started working on it?
- How has the evolution of hardware capabilities impacted the set of processing algorithms that are viable for addressing considerations of scale?
- What are the relative factors of scale along space/time axes that you are optimizing for?
- What are some opportunities that are still unrealized for algorithmic optimizations to expand horizons for large-scale data manipulation?
- For teams/individuals who are working with Arkouda can you describe the implementation process and what the end-user workflow looks like?
- What are the most interesting, innovative, or unexpected ways that you have seen Arkouda used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Arkouda?
- When is Arkouda the wrong choice?
- What do you have planned for the future of Arkouda?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Arkouda
- NJIT == New Jersey Institute of Technology
- NumPy
- Pandas
- NetworkX
- Chapel
- Massive Graph Analytics Book
- Ray
- Dask
- Bodo
- Stinger Graph Analytics
- Bears-R-Us
- 0MQ
- Triangle Centrality
- Degree Centrality
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it's reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels, all thanks to over 50 quality checks, extensive column level lineage, and over 20 connectors across the data stack. In addition, data discovery is made easy through Sifflet's information rich data catalog with a powerful search engine and real time health statuses. Listeners of the podcast will get $2,000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2 week free trial. Find out more at dataengineeringpodcast.com/sifflet today. That's S-I-F-F-L-E-T. Your host is Tobias Macey. And today, I'm interviewing David Bader about Arkouda, a horizontally scalable parallel compute library for exploratory data analysis in Python. So, David, can you start by introducing yourself?
[00:02:03] Unknown:
Sure. It's nice to see you. My name is David Bader. I'm a distinguished professor and director of the Institute for Data Science at the New Jersey Institute of Technology, where this past fall I also founded the Department of Data Science, which has degree offerings in data science.
[00:02:22] Unknown:
And do you remember how you first got started working in the area of data?
[00:02:25] Unknown:
My work with data actually goes back decades upon decades. So I've always been fascinated by graph analytics and gaining understanding from large datasets, going back literally to the 1980s. So this has been a passion of mine.
[00:02:43] Unknown:
So in terms of the Arkouda project, I'm wondering if you can share a bit about what it is and some of the story behind how it came to be and why you decided to build this tool and some of the problems that it's aimed at solving.
[00:02:55] Unknown:
Sure. Great question. Arkouda, which is the Greek word for bear, is an open source framework for big data, and it's available from GitHub, so anyone can check it out. We noticed the tension in data science where we have productivity languages like Python, where on a desktop or laptop, many programmers are able to write code when the data fits on their machine. It's very easy to learn all of these great tools like NumPy and Pandas and NetworkX. But then when you have a large dataset, and by large, when your datasets overwhelm what you can fit onto your laptop or desktop, so those datasets tend to be multi terabyte in size, the number of tools quickly falls off, and you need to use supercomputers, where I have quite an extensive background.
And what we thought about was how do we make supercomputing for large data as productive as using Python? How do we combine that productivity with performance? So Arkouda was born about 2 years ago. And, again, as an open source project where the end user can write their analytics in a Jupyter notebook with constructs that look very similar to NumPy. But in reality, all of the secret sauce and development kicks in and they're actually using a back end supercomputer with an open source compiler called Chapel that is doing a lot of the heavy lifting.
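To make that concrete, here is a minimal sketch of the NumPy-like workflow, assuming a reachable arkouda_server instance and the connection and array-creation names (ak.connect, ak.randint) from the project's documentation; the host, port, and array sizes are placeholders rather than recommended settings:

```python
import arkouda as ak

# Connect to a running arkouda_server; host and port here are placeholders.
ak.connect(server="localhost", port=5555)

# These arrays live on the Chapel back end, not in the local Python process.
a = ak.randint(0, 100, 10**9)
b = ak.randint(0, 100, 10**9)

# NumPy-style expressions; each operation runs in parallel on the server.
c = a + b
print(c.sum(), c.min(), c.max())

ak.disconnect()
```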
[00:04:37] Unknown:
In terms of the core audience and the focus of what you're trying to solve for them, I'm curious if you can talk to how you thought about that formulation as you started building Arkouda.
[00:04:49] Unknown:
What we realized was that Python is really the lingua franca of data scientists. Everybody is able to work in Python. And we wanted to make big data analytics accessible to those that know how to use Python and to put their workflows in Jupyter Notebooks, for instance. This is really an area that is affecting all enterprises, all organizations, because we know data just keeps increasing in size. And when we put datasets together, for instance, we often get datasets that are larger than the space on our workstations. They tend to be multiple terabytes and more, even tens of terabytes.
So we're really addressing a need that the enterprise has for being able to rapidly and productively do analytics once those datasets are just so large. And we wanna have interactivity. So it isn't like supercomputing of decades ago where you come to it with a problem and it crunches on it for a while and gives you an answer back the next day. We want to have near time responsiveness, just like our Jupyter Notebook, where we hit return and we get the answer, even if our data set is, say, 25 terabytes in size.
[00:06:08] Unknown:
In terms of the exploratory data analysis aspect of the data life cycle, it's generally the domain of the data scientist. And when they do start to hit those points where their data doesn't fit on their laptop and they need to actually start turning to things like Arcuda and parallel compute, how does that generally manifest in terms of who they're asking for help with being able to solve that problem, how they start to think about approaching that problem, and just how that propagates into the broader team and the broader organization as far as the impact on their ability to be productive and who they're leaning on to help them solve those problems?
[00:06:45] Unknown:
Well, as you mentioned, once you get stuck at that point where your data grows larger than your resources, you hit a speed bump. You're looking to ask experts, how do I solve this? What tools are out there? How do I get access to resources? And what we're trying to do is democratize data science so that anyone who can program in Python will be able to use Arkouda and escape that barrier without the need for finding a parallel computing expert, such as myself, to show them how to do some programming for a supercomputer.
So we're trying to make it turnkey so that anyone can just turn the knob when their datasets get larger and be able to modify their code so that they can still do exploratory data analysis, meaning looking at datasets in real time, exploring what if questions on their data, and being able to use tools that seamlessly let them scale from their desktop to a supercomputer.
[00:07:46] Unknown:
And at the point where they stop being able to manage that interactivity, it's another aspect of the speed bump of, you know, they do get to the point where they have to submit their batch job to a supercompute cluster or, in the past decade, you know, put a MapReduce job into a Hadoop cluster to be able to figure out what comes out the other side. What are some of the ways that that impacts their ability to be productive and some of the problematic behaviors that that might encourage or lead to if they do have these time delays of being able to ask and answer questions?
[00:08:16] Unknown:
Often, we want to be productive, meaning that we want to be able to ask questions of our data and get the answers in the same time that we're thinking about it, so that we can explore these datasets. And if that turnaround time or those transactions take minutes, hours, days, then we lose that train of thought. And so it's very important for a data scientist to be able to operate in near real time. So as they pose questions, they see the answer right away, and that can steer them towards what's the next question to ask. So I should mention with Arkouda, we are really focused on the analyst, the end user, and giving them a new capability that is very similar to how they've used Python and NumPy, pandas, and other such constructs.
But with just a slight modification, Arkouda will drop in to replace NumPy and give you an incredible capability for basic data science constructs. But also, we've been building out a rich set of libraries on graph analytics. I'm very proud that one of the areas of expertise in my lab is graph analytics at a large scale. And in fact, last week, we had a brand new book come out that I edited on massive graph analytics. So I think this is one specialization of data science that is also really interesting, and Arkouda powers our graph analytics as well.
[00:09:42] Unknown:
That brings up an interesting point about the types of analysis and the types of data that you're working with and how you think about the capabilities to work into Arkouda, because there are definitely a number of other projects out there for being able to scale compute and data access beyond the bounds of a single computer. I'm thinking in terms of projects like Ray or Dask. There's also another one, I think, called Bodo. And I'm curious, what was missing in those solutions that made something like Arkouda necessary, and some of the ways that you think about the specific problem sets that Arkouda is well suited to and when you might actually want to lean on some of those other frameworks for different problems?
[00:10:24] Unknown:
Great question. So we were faced with trying to solve some real world grand challenges. For instance, in cybersecurity, where often we're collecting up information about network traffic, and we're trying to identify cyber threats and give attribution to those threats. And there, we have to operate in near real time, working with analysts, who are trained in basic data science using Python and so forth, to formulate their questions and problems. And these datasets are humongous. Often in a large enterprise, we're looking at datasets that could be tens of terabytes in size. So we wanted something very seamless.
Our analysts could just pick up a new tool, take the existing code, and with just slight modifications, be able to scale beyond, say, tens of gigabytes to tens of terabytes and beyond. So that was really the goal for Arkouda: to make this easy and productive so that analysts don't need to learn a new framework, a new language, and all of the challenges that come along with trying to figure out a new environment. And, again, one that scales for this extraordinary size of datasets that I think we're gonna see more and more as we collect more data and put datasets together.
[00:11:47] Unknown:
And so in terms of Arkouda itself, can you talk a bit about some of the implementation and the ways that you thought about approaching this problem and maybe some of the unique algorithms and capabilities that you've baked into it to be able to power these interactive use cases on larger than single machine datasets?
[00:12:08] Unknown:
Arkouda was started by a team at the US Department of Defense, and we have been contributing to Arkouda since its start. So I should first mention that there's a team behind Arkouda, and I've been responsible for building out the graph analytics, along with my very capable researchers at the New Jersey Institute of Technology, where we've been focusing on adding the capability to look at data as a graph with relationships and to build in capabilities for solving standard graph analytic questions. For instance, are there communities within the dataset? Are there paths between particular vertices?
And other sorts of features in the graph to understand and explore the dataset as a graph. Let me take a step back. When I think of data as a graph, it's really looking at data through a lens where we have objects in our data, which we think of as vertices in the graph. And when these objects interact with each other, we have an edge induced inside of the graph. So there are many problems, whether in cybersecurity, social media analysis, personalized health, and more, that we can represent as these types of graphs, move into this graph abstraction, and then solve with some very powerful graph analytics.
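To make the abstraction concrete, here is a small illustrative sketch of the vertices-and-edges view of interaction data. It uses NetworkX on a toy in-memory example rather than Arkouda's distributed graph structures, and the host names are made up:

```python
import networkx as nx

# Each record is an observed interaction between two objects,
# e.g. a network connection between two hosts.
interactions = [
    ("host_a", "host_b"),
    ("host_b", "host_c"),
    ("host_d", "host_e"),
]

G = nx.Graph()
G.add_edges_from(interactions)  # objects become vertices, interactions become edges

# "Are there communities within the dataset?" (here: connected components)
print(list(nx.connected_components(G)))

# "Are there paths between particular vertices?"
print(nx.has_path(G, "host_a", "host_c"))
```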
[00:13:34] Unknown:
In that graph analytics space, I know that there are a number of interesting capabilities, as well as an associated set of interesting challenges that factor into it. One of the things that I know is most common that people encounter is the question of supernodes and how to handle those, and then the question of how you have approached some of those problems in Arkouda and maybe some of the algorithmic aspects that you've had to develop to be able to support those use cases.
[00:14:08] Unknown:
Arkouda isn't the only open source framework that I've developed for graphs. Over the years, I've developed many frameworks, or I should say I've developed several frameworks for graphs. And there are some specialized frameworks for streaming graphs, and that's graphs where we get edges from a fire hose, and we wanna ask questions of that graph as it changes over time. So we were one of the earliest groups to build a streaming graph analytic framework that we called Stinger. And in many of these graph analytics, we're faced with challenges when we find what you called supernodes. Sometimes we call them high degree vertices.
And these often are problematic if we're trying to partition a graph or we're looking at information flow, and these nodes may bias the results that we see or really make it a challenge. So often when we're running a graph analytic, we can threshold. For instance, we're looking at vertices above a certain degree and below a certain degree. And in this way, we can find other methods for handling these, what you call supernode vertices, in the graph. Let me just give you an example to make it more concrete. We have a social network, and the degree of a person in the social network is the number of friends that they have.
And I may be trying to analyze how many friends are between me and you on the shortest path between friends. But if we're both friended with, say, a superstar, then it doesn't really make sense to count our connection through that superstar. We want to find connections just through our ordinary friends to connect me and you. And so often, we'll threshold out when there's a friend and they have a billion friends out there. Well, it isn't as interesting. And our algorithms let us filter off those vertices to try to answer these questions. The inverse of the 6 degrees of Kevin Bacon. That's right. In fact, my students were very interested to see, in the IMDB, the movie database on the Internet, was Kevin Bacon really center, I should say central, to the actors of all movies? And so we actually did that analysis.
Turned out he was just a common actor, unfortunately, but a great story nevertheless.
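As a toy illustration of that thresholding idea, again using NetworkX on a small synthetic graph rather than Arkouda, high degree vertices can simply be dropped before the path question is asked; the degree cutoff here is arbitrary, not a recommended value:

```python
import networkx as nx

# A synthetic graph with a handful of very high degree hub vertices.
G = nx.barabasi_albert_graph(1000, 3, seed=42)

DEGREE_CUTOFF = 50  # illustrative threshold only
hubs = [v for v, d in G.degree() if d > DEGREE_CUTOFF]

H = G.copy()
H.remove_nodes_from(hubs)  # filter out the "supernodes"

# Ask the path question between two surviving vertices on the filtered graph.
remaining = sorted(H.nodes())
u, v = remaining[0], remaining[-1]
if nx.has_path(H, u, v):
    print(u, v, nx.shortest_path_length(H, u, v))
```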
[00:16:39] Unknown:
As far as the Arkouda project itself, you mentioned it started off as a project in the Department of Defense. I'm curious if you can talk to some of the ways that it has evolved and grown in terms of the goals and scope of the project and any changes in the implementation details of how you've approached the problems that you're trying to solve with it?
[00:16:59] Unknown:
Arkouda is open source. So on GitHub, if you go to the repositories under the group Bears-R-Us, so bears hyphen r hyphen us, and look under Arkouda, you'll see talks and all of the source code in a repository and a lot of the discussion spaces and papers related to Arkouda. So it's a completely open and widely shared project. The goals were really to make a new framework that everyone around the world can use to do productive large scale data analysis. I'm quite excited by it because it's starting out as an open source project, and we'd love for people to contribute. And we have collaborators at this point around the world who are helping to make Arkouda a reality.
It's gaining maturity. The project started in 2019, and it's undergone a number of changes. For instance, there's a sub repository for all of our graph analytic contributions and ways to create new modules to build on top of the core Arkouda framework. Arkouda operates with a user on the front end interfacing through their Jupyter notebook or with Python, and they connect through ZeroMQ, a message queue, going to a supercomputer in the back end running the HPE Cray Chapel compiler that's also open source. And so we, as developers, are developing all of the plumbing to go from Python on the front end all the way through to the supercomputer in the back, so that analysts who wish to use Arkouda don't see all of that complexity. They just see a productive environment where they can scale up their datasets to tens of terabytes.
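For readers unfamiliar with that plumbing, the front end and the Chapel server exchange messages over ZeroMQ. The sketch below shows only the generic pyzmq request/reply pattern, not Arkouda's actual message format, and the URL and payload are placeholders:

```python
import zmq

# Generic ZeroMQ request/reply client; the URL and message body are
# placeholders, not Arkouda's real wire protocol.
context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://localhost:5555")

socket.send_string("hello server")   # the client sends a command...
reply = socket.recv_string()         # ...and blocks until the server replies
print(reply)
```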
[00:18:49] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. Given that it is relying on this supercompute framework, I'm wondering how that maps to the actual hardware requirements that are necessary to be able to run Arkouda.
[00:20:04] Unknown:
So Chapel should not be a barrier because it's open source, and it runs on most available compute platforms from desktops to clusters to dedicated supercomputers. And you can acquire your own resources to run Chapel and load your datasets, and then manipulate them through the Arkouda front end.
[00:20:28] Unknown:
Given the fact that there has been so much evolution in hardware capabilities and the ways that hardware is consumed, you know, with things like the cloud, but also with the evolution of different CPU architectures and the rise of ARM, I'm wondering how that evolution has impacted the types of processing algorithms and the appetite for space time trade offs and how you approach those algorithmic aspects of being able to work on these data processing problems at these scales?
[00:21:01] Unknown:
Great question. The Chapel compiler originally came out of a US DARPA (that's the Defense Advanced Research Projects Agency) project about 20 years ago called High Productivity Computing Systems. And the compiler, with 20 years of work in it now, was built to be able to take advantage of different processor generations and to do a lot of the transformations. So the performance engineering is built into the compiler. Originally, Cray, the supercomputing company, built this compiler, and then recently HPE acquired Cray. And HPE has a fantastic team, project managers and developers, who've been working with Chapel for quite a number of years and put a lot of sophistication into that compiler to be able to leverage the new processor architectures.
So much of that comes with Chapel, and that's one of the reasons why we decided to use Chapel as the back end compiler for the Arkouda framework.
[00:22:07] Unknown:
As far as the adoption of Arkouda and how it factors into the development and analysis capabilities of a team or an organization, I'm wondering if you can talk to some of the process of getting it set up and the end user workflow.
[00:22:25] Unknown:
Arkouda, again, is an early project. It's open source, so every organization can look at the source code and be able to import it quite easily. We've been working in my lab on a tutorial for getting users started with Arkouda. I teach with it, students are able to use it, and we have tutorials available for Arkouda. So anyone who's interested, I suggest they head over to the GitHub website for Arkouda and explore the resources that we have there. It's pretty simple to install, and it works well. So I encourage everyone here to try it out.
[00:23:04] Unknown:
As far as the collaboration process around being able to build these analyses with Arcuda, I'm wondering if there are any programming patterns or approaches to how to structure the analyses that helps to make it so that multiple people can be able to take advantage of maybe intermediate result sets that are generated by each other and just some of the aspects of being able to use this in a team environment?
[00:23:31] Unknown:
That's a great question. And, in fact, the large datasets are stored in the back end where Chapel has access to them. And there are constructs to be able to keep intermediate results in the back end for others to collaborate on and make use of, rather than having to bring results to a front end to share. Because for large data, we want to keep it in place. We don't want to move it that often, because that takes quite a lot of time and a lot of resources and a lot of energy. So we're trying to be very energy efficient as well. So Arkouda does include those capabilities, so that a team that may be looking together at a dataset or derived products can do so quite easily by sharing resource locators for those datasets.
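As a sketch of what that sharing might look like from the Python side, Arkouda documents a register/attach mechanism for giving server-side arrays a name that another session can pick up; the function names below follow that documentation but should be treated as an assumption, and "shared_counts" is a hypothetical label:

```python
import arkouda as ak

ak.connect(server="localhost", port=5555)

# Analyst A: compute an intermediate result and leave it on the server
# under a shared name instead of pulling it back to a laptop.
counts = ak.randint(0, 10, 10**8)      # stand-in for a real derived product
counts.register("shared_counts")       # "shared_counts" is a hypothetical label

# Analyst B, possibly in a different session later: attach to the same
# server-side object by name and keep working where A left off.
same_counts = ak.attach_pdarray("shared_counts")
print(same_counts.sum())
```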
[00:24:22] Unknown:
In terms of the work that you've been doing on Arkouda, as you mentioned, it's a team effort. It's open source, so anybody can contribute to it. But given your expertise in supercompute capabilities and graph analytics, I'm curious what are some of the specific contributions that you've been focused on, some of the specific challenges that you've been addressing, maybe some of the sharp edges that people experience when building on top of Arkouda, or some of the new capabilities that you're hoping to unlock in this framework?
[00:24:51] Unknown:
We've been building out graph analytics that, as I mentioned, are quite important for solving real world grand challenges. And often graph analytics look easy on paper, but when you go to implement them, you run into performance issues. For instance, the high degree vertices that were mentioned before could really slow down a graph analytic. So what we're doing is analyzing the performance of algorithms on a wide variety of inputs to try to make sure that we have the capability to solve many instances quite fast using the graph analytics that we built into Arkouda. And, for example, we're implementing new algorithms.
For instance, one algorithm is called triangle centrality, which a colleague of mine invented. It's a centrality that looks at the importance of vertices based on how many triangles, and the distribution of triangles, are around them. This analytic, I think, is quite interesting and is a peer analytic to other centrality measures like betweenness centrality, closeness centrality, degree centrality, and others. So we're always looking for new, highly capable analytics that will provide new functionality, and then how do we implement those with high efficiency and also productively for the end user in the Arkouda framework.
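The precise triangle centrality definition is in the published work, but the flavor of the computation, ranking vertices by how much triangle structure surrounds them, can be sketched with a plain per-vertex triangle count; this is a simplified illustrative proxy, not the published measure:

```python
import networkx as nx

G = nx.karate_club_graph()

# Number of triangles each vertex participates in.
tri = nx.triangles(G)
total_triangles = sum(tri.values()) / 3  # each triangle is counted at all three vertices

# Rank vertices by their share of the graph's triangle structure.
score = {v: t / total_triangles for v, t in tri.items()}
print(sorted(score, key=score.get, reverse=True)[:5])
```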
[00:26:15] Unknown:
In terms of the applications of this framework, as you mentioned, it's focused primarily at data scientists and data analysts who are trying to address some of these real world problems. And I'm wondering what are some of the ways that you've maybe seen Arkouda used to address some of the problems of data scale and data maintenance as well, where maybe a data engineer would use it to understand what is the distribution of data assets or asset categorization that I have in my lake where I'm, you know, working at terabyte or petabyte or exabyte scale.
[00:26:49] Unknown:
Arkouda really provides a capability that doesn't exist today. So if you're using a data lake and you're doing analytics, you would have to pull information out of that lake and find another processing system where you can run your analytics or queries. And a data engineer would have to maintain that lake, which may be federated across one or more systems. What we're really doing is providing a new capability for the productive use of those large data sets. So by productive, I mean I want a data scientist to be able to ask queries on terabytes or even petabytes of data without the need for a data engineer in the middle to gate access and provide the services that a data scientist would need in order to ask those questions. So I wanna make the data accessible.
I wanna make it productive to be able to access those large data sets, even combine large data sets together. So, again, we're trying to cut out barriers to solve these really large data science problems, and to be able to do it by removing all of the roadblocks and all of the friction that we would normally face for solving these very large problems.
[00:28:06] Unknown:
It's time to make sense of today's data tooling ecosystem. Go to dataengineeringpodcast.com/rudder to get a guide that will help you build a practical data stack for every phase of your company's journey to data maturity. The guide includes architectures and tactical advice to help you progress through 4 stages: starter, growth, machine learning, and real time. Go to dataengineeringpodcast.com/rudder today to drop the modern data stack and use a practical data engineering framework. In terms of the data management aspect of this, I'm curious, what are some of the specifics in terms of the types of data that you're able to work with? So thinking in terms of structured versus semi structured versus unstructured or binary and things like that, and some of the organizational aspects of how best to work to the strengths of Arkouda as to how you organize and structure the data so that it's able to parallelize and take advantage of being able to shard and work across the data in parallel, in isolation from each other?
[00:29:12] Unknown:
Great question. So natively, Arkouda manages one-dimensional arrays, so collections of 1-D arrays. In our graph analytics, we built new data structures to have a native graph data structure to do these graph analytics on. And for many applications, we have data sets where we can decompose them into sets of these 1-D arrays. That said, the user doesn't have to worry about sharding or other partitioning techniques. The data will sit on a back end, and Chapel will have the responsibility of managing the distribution of that dataset across the available compute resources. So that sophistication is built into Chapel, and it means that we don't have to be as concerned about it because we have quite a sophisticated compiler and runtime system that's managing many of those aspects.
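A rough sketch of that column-oriented, 1-D-array view, assuming the ak.GroupBy interface described in Arkouda's documentation; the column names and data are made up for illustration:

```python
import arkouda as ak

ak.connect(server="localhost", port=5555)

# A "table" is just a set of aligned 1-D arrays (columns);
# src_port and bytes_sent are hypothetical column names with synthetic data.
src_port = ak.randint(0, 1024, 10**8)
bytes_sent = ak.randint(40, 1500, 10**8)

# Group by one column and aggregate another; how the underlying data is
# distributed across the cluster is handled by the Chapel back end.
g = ak.GroupBy(src_port)
ports, total_bytes = g.sum(bytes_sent)
print(ports[:5], total_bytes[:5])
```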
[00:30:09] Unknown:
And so in terms of the opportunities that you see as far as algorithmic advances for being able to work across these large datasets and being able to accelerate the time to insight, I'm wondering what are some of the open questions or unrealized capabilities of how to expand the functionality of things like Arkouda and just general large scale data analysis?
[00:30:36] Unknown:
Up to now, we really haven't had platforms where we could experiment with multi terabyte datasets productively in real time. And so this really opens your imagination for new types of analytics. For instance, let me focus on the graph space. There are some very capable tools out there that operate for analytics on graphs, for instance, on our laptops and small clusters. But once the graph becomes larger than a certain size, those frameworks typically don't allow those analytics to be run. They will be too time consuming or the graph will just be too large for the analytic, and we can't explore that space. So I believe with Arkouda, we'll actually have the ability for the first time to ask questions that we thought were never possible previously on some of these large datasets.
That may lead to new insights or even new algorithms and analytics for interrogating and really getting more insights from these large datasets.
[00:31:38] Unknown:
In terms of your experience of working with Arkouda and supporting teams who are building analyses on top of Arkouda, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:31:49] Unknown:
Right now, Arkouda has a dozen to 20 plus users. It's really in its infancy providing this very highly capable framework. And so we're also looking forward to seeing people download it, clone it from the Git repository, and use it, and hearing more about those success stories. It's been quite useful for the enterprise and the users that we've seen pick it up and use it. Many of them are working in places with large data sets, for instance, in cybersecurity detecting cyber threats. And some are working in social network analysis with very large social networks. And we hope to continue to see success stories from Arkouda in the coming months and years as more and more users adopt this highly productive and capable framework.
[00:32:39] Unknown:
And in your own experience of working on Arkouda and contributing to it and using it for some of your own research, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:32:50] Unknown:
My background comes from high performance computing and supercomputing. So at this point, nothing really surprises me. But my pleasant surprise was that the Chapel compiler that we're using in the background is quite sophisticated. So many things that we normally would have to do by hand on other systems, we found that the Chapel compiler was able to handle and find the performance that we needed. There's a great support team at HPE supporting this Chapel framework, and we've really been learning new ways of implementing our code that can better leverage the Chapel compiler. So it's been a great experience, and we are also pleased that we're getting the performance that we anticipated, and that there were no major bottlenecks or roadblocks to getting the performance that we expected on these large datasets.
[00:33:46] Unknown:
For people who are interested in being able to work at these scales in an interactive fashion, what are the cases where Arkouda is the wrong choice and they're better suited going with some of these other parallel compute frameworks?
[00:33:59] Unknown:
That's a great question. So if your dataset isn't massive, you probably don't wanna go through the effort of setting up Arkouda. If you can solve it easily today on your laptop or on the systems that you have, then you should stick with what you have. But if you find that you have datasets that you can't process, or that there's a speed bump keeping your analysts from being able to productively ask questions of those datasets, then maybe in those cases you should consider Arkouda. So, again, if things work well for you today, then maybe you're not the ideal candidate for moving to Arkouda. It's only when you start facing these issues of having datasets that are too large, or not having the performance that you're seeking, near real time or interactive performance on these massive datasets, that you should consider looking at Arkouda.
[00:34:55] Unknown:
As you continue to iterate on and contribute to the Arkouda framework, what are some of the things that you have planned for the near to medium term or any problem spaces that you're excited to dig into?
[00:35:06] Unknown:
We're very excited in a few areas. For instance, stringology: how do we design data structures that can process strings of text quite well? This is important when we're analyzing documents or looking at unstructured text. We're trying to build those capabilities into Arkouda, and we have funding from the US National Science Foundation to explore these areas in Arkouda. Another area that I'm quite excited by is looking at what we would think of as table joins, which normally are going to be expensive operations within databases.
Here, we're looking at Arkouda and what it means to do a join of the datasets that we have, and whether there is a way to optimize those to do it in the fastest way possible. So there's a lot of hard work that we're gonna do at New Jersey Institute of Technology to try to build out new capabilities for processing different types of large datasets with the Arkouda framework.
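One way to express a simple join-style filter with today's primitives is membership testing, sketched below with ak.in1d (assumed from Arkouda's documentation) on hypothetical key columns; a true optimized server-side join is exactly the kind of capability being explored:

```python
import arkouda as ak

ak.connect(server="localhost", port=5555)

# Two "tables", each represented here by a single key column of synthetic data.
left_keys = ak.randint(0, 10**6, 10**8)
right_keys = ak.randint(0, 10**6, 10**7)

# Keep only the left rows whose key also appears on the right, a crude
# building block for a join; the heavy lifting stays on the server.
mask = ak.in1d(left_keys, right_keys)
matched = left_keys[mask]
print(matched.size)
```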
[00:36:07] Unknown:
Are there any other aspects of Arkouda or large scale graph analytics or being able to do interactive analysis on large scale datasets that we didn't discuss yet that you'd like to cover before we close out the show? I think this is an exciting area. We already know datasets are getting larger,
[00:36:24] Unknown:
but we also have to understand that often we get data not as just a big block of data to process, but as a stream of data. That stream may be updated every millisecond, every second, every hour, every day, and so on. And we wanna have tools that can process those data streams. And what I want to be able to do is build out new tools and new capabilities to look at massive streaming data analytics. I want to get away from using these resources just for doing forensic analysis after something egregious happened, where we're trying to explore, for instance, in a cyber hack, how did they get in? What did they destroy? What did they exfiltrate?
How do we protect against it? That is after the damage has been done. Where I wanna move to is predictive analytics. Can we take these data streams and detect that there's a change, a pattern, or something emerging that we can protect against before some egregious event? So I'm really excited by building out more tools for predictive analytics on these massive datasets.
[00:37:37] Unknown:
For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as the biggest gap in the tooling or technology that's available for data management today.
[00:37:52] Unknown:
I think the biggest gap that we have is really the seamless integration of multiple tools. So there's a great number of tools and suites out there, some commercial, some open source, and they serve many different purposes. But when we look at the data science environment, we see the ability to, for instance, use Jupyter Notebooks or workflows that we can record, which I think is a great advance. But what I'd like to see more of is compatibility among multiple tool sets so that we can move between the different tools and environments out there. For instance, I have datasets where sometimes I wanna look at them as unstructured datasets. Other times, I wanna view them as a graph. Other times, I wanna view them in a different light as well. And I wanna be able to move through tools that specialize in that view of the data and to be able to do it seamlessly without having to modify those datasets or go through different workflows on different systems.
[00:38:54] Unknown:
Yeah. It's definitely a very real problem these days as we get into specialization of these different tools, and I'm definitely excited for some of these investments that are happening in the metadata layer where maybe we can use that as the interchange point without having to do all kinds of custom integration between these different tool sets.
[00:39:13] Unknown:
Exactly. I think you hit the nail on the head.
[00:39:15] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you are doing on the Arkouda framework. It's definitely a very interesting platform with interesting capabilities that it's unlocking. So definitely excited to see that continue to evolve and grow in terms of capabilities and adoption. So I appreciate all the time that you and the other members of the team are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. And I really hope anyone out there looks up Arkouda
[00:39:42] Unknown:
and works on their large scale data science problems with productive and capable tool sets. Thanks again for chatting.
[00:39:56] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with David Bader Begins
Overview of Arkouda
Target Audience and Use Cases for Arkouda
Comparison with Other Tools
Open Source and Community Contributions
Hardware Requirements and Performance
Collaboration and Team Use
Future Opportunities and Algorithmic Advances
When Arkouda is the Wrong Choice
Future Plans and Exciting Areas of Research
Predictive Analytics and Streaming Data
Biggest Gap in Data Management Tooling
Closing Remarks