Summary
Exploratory data analysis works best when the feedback loop is fast and iterative. This is easy to achieve when you are working on small datasets, but as they scale up beyond what can fit on a single machine those short iterations quickly become long and tedious. The Arkouda project is a Python interface built on top of the Chapel compiler to bring back those interactive speeds for exploratory analysis on horizontally scalable compute that parallelizes operations on large volumes of data. In this episode David Bader explains how the framework operates, the algorithms that are built into it to support complex analyses, and how you can start using it today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing David Bader about Arkouda, a horizontally scalable parallel compute library for exploratory data analysis in Python
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Arkouda is and the story behind it?
- What are the main goals of the project?
- How does it address those goals?
- Who is the primary audience for Arkouda?
- What are some of the main points of friction that engineers and scientists encounter while conducting exploratory data analysis (EDA)?
- What kinds of behaviors are they engaging in during these exploration cycles?
- When data scientists run up against the limitations of their tools and environments how does that impact the work of data engineers/data platform owners?
- There have been a number of libraries/frameworks/utilities/etc. built to improve the experience and outcomes for EDA. What was missing that made Arkouda necessary/useful?
- Can you describe how Arkouda is implemented?
- What are some of the novel algorithms that you have had to design to support Arkouda’s objectives?
- How have the design/goals/scope of the project changed since you started working on it?
- How has the evolution of hardware capabilities impacted the set of processing algorithms that are viable for addressing considerations of scale?
- What are the relative factors of scale along space/time axes that you are optimizing for?
- What are some opportunities that are still unrealized for algorithmic optimizations to expand horizons for large-scale data manipulation?
- For teams/individuals who are working with Arkouda can you describe the implementation process and what the end-user workflow looks like?
- What are the most interesting, innovative, or unexpected ways that you have seen Arkouda used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Arkouda?
- When is Arkouda the wrong choice?
- What do you have planned for the future of Arkouda?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Arkouda
- NJIT == New Jersey Institute of Technology
- NumPy
- Pandas
- NetworkX
- Chapel
- Massive Graph Analytics Book
- Ray
- Dask
- Bodo
- Stinger Graph Analytics
- Bears-R-Us
- 0MQ
- Triangle Centrality
- Degree Centrality
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it's reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels, all thanks to over 50 quality checks, extensive column level lineage, and over 20 connectors across the data stack. In addition, data discovery is made easy through Sifflet's information rich data catalog with a powerful search engine and real time health statuses. Listeners of the podcast will get $2,000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2 week free trial. Find out more at dataengineeringpodcast.com/sifflet today. That's S-I-F-F-L-E-T. Your host is Tobias Macey. And today, I'm interviewing David Bader about Arkouda, a horizontally scalable parallel compute library for exploratory data analysis in Python. So, David, can you start by introducing yourself?
[00:02:03] Unknown:
Sure. It's nice to see you. My name is David Bader. I'm a distinguished professor and director of the Institute for Data Science at the New Jersey Institute of Technology, where this past fall I also founded the Department of Data Science, which has degree offerings in data science.
[00:02:22] Unknown:
And do you remember how you first got started working in the area of data?
[00:02:25] Unknown:
My work with data actually goes back decades upon decades. So I've always been fascinated by graph analytics and gaining understanding from large datasets, going back literally to the 1980s. So this has been a passion of mine.
[00:02:43] Unknown:
So in terms of the Arkouda project, I'm wondering if you can share a bit about what it is and some of the story behind how it came to be and why you decided to build this tool and some of the problems that it's aimed at solving.
[00:02:55] Unknown:
Sure. Great question. Arkouda, which is the Greek word for bear, is an open source framework for big data, and it's available from GitHub, so anyone can check it out. We noticed the tension in data science where we have productivity languages like Python, where on a desktop or laptop, many programmers are able to write code when the data fits on their machine. It's very easy to learn all of these great tools like NumPy and Pandas and NetworkX. But then when you have a large dataset, and by large, when your datasets overwhelm what you can fit onto your laptop or desktop, so those datasets tend to be multi terabyte in size, the number of tools quickly falls off, and you need to use supercomputers, where I have quite an extensive background.
And what we thought about was how do we make supercomputing for large data as productive as using Python? How do we combine that productivity with performance? So Arkouda was born about 2 years ago. And, again, as an open source project where the end user can write their analytics in a Jupyter notebook with constructs that look very similar to NumPy. But in reality, all of the secret sauce and development kicks in and they're actually using a back end supercomputer with an open source compiler called Chapel that is doing a lot of the heavy lifting.
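To make that concrete, here is a minimal sketch of the NumPy-like workflow, assuming a reachable arkouda_server instance and the connection and array-creation names (ak.connect, ak.randint) from the project's documentation; the host, port, and array sizes are placeholders rather than recommended settings:

```python
import arkouda as ak

# Connect to a running arkouda_server; host and port here are placeholders.
ak.connect(server="localhost", port=5555)

# These arrays live on the Chapel back end, not in the local Python process.
a = ak.randint(0, 100, 10**9)
b = ak.randint(0, 100, 10**9)

# NumPy-style expressions; each operation runs in parallel on the server.
c = a + b
print(c.sum(), c.min(), c.max())

ak.disconnect()
```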
[00:04:37] Unknown:
In terms of the core audience and the focus of what you're trying to solve for them, I'm curious if you can talk to how you thought about that formulation as you started building Arkouda.
[00:04:49] Unknown:
What we realized was that Python is really the lingua franca of data scientists. Everybody is able to work in Python. And we wanted to make big data analytics accessible to those that know how to use Python and to put their workflows in Jupyter Notebooks, for instance. This is really an area that is affecting all enterprises, all organizations, because we know data just keeps increasing in size. And when we put datasets together, for instance, we often get datasets that are larger than the space on our workstations. They tend to be multiple terabytes and more, even tens of terabytes.
So we're really addressing a need that the enterprise has for being able to rapidly and productively do analytics once those datasets are just so large. And we wanna have interactivity. So it isn't like supercomputing of decades ago where you come to it with a problem and it crunches on it for a while and gives you an answer back the next day. We want to have near time responsiveness, just like our Jupyter Notebook, where we hit return and we get the answer, even if our data set is, say, 25 terabytes in size.
[00:06:08] Unknown:
In terms of the exploratory data analysis aspect of the data life cycle, it's generally the domain of the data scientist. And when they do start to hit those points where their data doesn't fit on their laptop and they need to actually start turning to things like Arcuda and parallel compute, how does that generally manifest in terms of who they're asking for help with being able to solve that problem, how they start to think about approaching that problem, and just how that propagates into the broader team and the broader organization as far as the impact on their ability to be productive and who they're leaning on to help them solve those problems?
[00:06:45] Unknown:
Well, as you mentioned, once you get stuck at that point where your data grows larger than your resources, you hit a speed bump. You're looking to ask experts, how do I solve this? What tools are out there? How do I get access to resources? And what we're trying to do is democratize data science so that anyone who can program in Python will be able to use Arkouda and escape that barrier without the need for finding a parallel computing expert, such as myself, to show them how to do some programming for a supercomputer.
So we're trying to make it turnkey so that anyone can just turn the knob when their datasets get larger and be able to modify their code so that they can still do exploratory data analysis, meaning looking at datasets in real time, exploring what if questions on their data, and being able to use tools that seamlessly let them scale from their desktop to a supercomputer.
[00:07:46] Unknown:
And at the point where they stop being able to manage that interactivity, it's another aspect of the speed bump of, you know, they do get to the point where they have to submit their batch job to a supercompute cluster or, in the past decade, you know, put a MapReduce job into a Hadoop cluster to be able to figure out what comes out the other side. What are some of the ways that that impacts their ability to be productive and some of the problematic behaviors that that might encourage or lead to if they do have these time delays of being able to ask and answer questions?
[00:08:16] Unknown:
Often, we want to be productive, meaning that we want to be able to ask questions of our data and get the answers in the same time that we're thinking about it, so that we can explore these datasets. And if that turnaround time or those transactions take minutes, hours, days, then we lose that train of thought. And so it's very important for a data scientist to be able to operate in near real time. So as they pose questions, they see the answer right away, and that can steer them towards what's the next question to ask. So I should mention with Arkouda, we are really focused on the analyst, the end user, and giving them a new capability that is very similar to how they've used Python and NumPy, pandas, and other such constructs.
But with just a slight modification, Arkouda will drop in to replace NumPy and give you an incredible capability for basic data science constructs. But also, we've been building out a rich set of libraries on graph analytics. I'm very proud that one of the areas of expertise in my lab is graph analytics at a large scale. And in fact, last week, we had a brand new book come out that I edited on massive graph analytics. So I think this is one specialization of data science that is also really interesting, and Arkouda powers our graph analytics as well.
[00:09:42] Unknown:
That brings up an interesting point about the types of analysis and the types of data that you're working with and how you think about the capabilities to work into Arkouda, because there are definitely a number of other projects out there for being able to scale compute and data access beyond the bounds of a single computer. I'm thinking in terms of projects like Ray or Dask. There's also another one, I think, called Bodo. And I'm curious, what was missing in those solutions that made something like Arkouda necessary, and some of the ways that you think about the specific problem sets that Arkouda is well suited to and when you might actually want to lean on some of those other frameworks for different problems?
[00:10:24] Unknown:
Great question. So we were faced with trying to solve some real world grand challenges. For instance, in cybersecurity, where often we're collecting up information about network traffic, and we're trying to identify cyber threats and give attribution to those threats. And there, we have to operate in near real time, working with analysts, who are trained in basic data science using Python and so forth, to formulate their questions and problems. And these datasets are humongous. Often in a large enterprise, we're looking at datasets that could be tens of terabytes in size. So we wanted something very seamless.
Our analysts could just pick up a new tool, take the existing code, and with just slight modifications, be able to scale beyond, say, tens of gigabytes to tens of terabytes and beyond. So that was really the goal for Arkouda: to make this easy and productive so that analysts don't need to learn a new framework, a new language, and all of the challenges that come along with trying to figure out a new environment. And, again, one that scales for this extraordinary size of datasets that I think we're gonna see more and more as we collect more data and put datasets together.
[00:11:47] Unknown:
And so in terms of Arkouda itself, can you talk a bit about some of the implementation and the ways that you thought about approaching this problem and maybe some of the unique algorithms and capabilities that you've baked into it to be able to power these interactive use cases on larger than single machine datasets?
[00:12:08] Unknown:
Arkouda was started by a team at the US Department of Defense, and we have been contributing to Arkouda since its start. So I should first mention that there's a team behind Arkouda, and I've been responsible for building out the graph analytics, along with my very capable researchers at the New Jersey Institute of Technology, where we've been focusing on adding the capability to look at data as a graph with relationships and to build in capabilities for solving standard graph analytic questions. For instance, are there communities within the dataset? Are there paths between particular vertices?
And other sorts of features in the graph to understand and explore the dataset as a graph. Let me take a step back. When I think of data as a graph, it's really looking at data through a lens where we have objects in our data, which we think of as vertices in the graph. And when these objects interact with each other, we have an edge induced inside of the graph. So there are many problems, whether in cybersecurity, social media analysis, personalized health, and more, that we can represent as these types of graphs, move into this graph abstraction, and then solve with some very powerful graph analytics.
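To make the abstraction concrete, here is a small illustrative sketch of the vertices-and-edges view of interaction data. It uses NetworkX on a toy in-memory example rather than Arkouda's distributed graph structures, and the host names are made up:

```python
import networkx as nx

# Each record is an observed interaction between two objects,
# e.g. a network connection between two hosts.
interactions = [
    ("host_a", "host_b"),
    ("host_b", "host_c"),
    ("host_d", "host_e"),
]

G = nx.Graph()
G.add_edges_from(interactions)  # objects become vertices, interactions become edges

# "Are there communities within the dataset?" (here: connected components)
print(list(nx.connected_components(G)))

# "Are there paths between particular vertices?"
print(nx.has_path(G, "host_a", "host_c"))
```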
[00:13:34] Unknown:
In that graph analytics space, I know that there are a number of interesting capabilities, as well as an associated set of interesting challenges that factor into it. One of the things that I know is most common that people encounter is the question of supernodes and how to handle those, and then the question of how you have approached some of those problems in Arkouda and maybe some of the algorithmic aspects that you've had to develop to be able to support those use cases.
[00:14:08] Unknown:
Arkouda isn't the only open source framework that I've developed for graphs. Over the years, I've developed many frameworks, or I should say I've developed several frameworks for graphs. And there are some specialized frameworks for streaming graphs, and that's graphs where we get edges from a fire hose, and we wanna ask questions of that graph as it changes over time. So we were one of the earliest groups to build a streaming graph analytic framework that we called Stinger. And in many of these graph analytics, we're faced with challenges when we find what you called supernodes. Sometimes we call them high degree vertices.
And these often are problematic if we're trying to partition a graph or we're looking at information flow, and these nodes may bias the results that we see or really make it a challenge. So often when we're running a graph analytic, we can threshold. For instance, we're looking at vertices above a certain degree and below a certain degree. And in this way, we can find other methods for handling these, what you call supernode vertices, in the graph. Let me just give you an example to make it more concrete. We have a social network, and the degree of a person in the social network is the number of friends that they have.
And I may be trying to analyze how many friends are between me and you on the shortest path between friends. But if we're both friended with, say, a superstar, then it doesn't really make sense to count our connection through that superstar. We want to find connections just through our ordinary friends to connect me and you. And so often, we'll threshold out when there's a friend and they have a billion friends out there. Well, it isn't as interesting. And our algorithms let us filter off those vertices to try to answer these questions. The inverse of the 6 degrees of Kevin Bacon. That's right. In fact, my students were very interested to see, in the IMDB, the movie database on the Internet, was Kevin Bacon really center, I should say central, to the actors of all movies? And so we actually did that analysis.
Turned out he was just a common actor, unfortunately, but a great story nevertheless.
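As a toy illustration of that thresholding idea, again using NetworkX on a small synthetic graph rather than Arkouda, high degree vertices can simply be dropped before the path question is asked; the degree cutoff here is arbitrary, not a recommended value:

```python
import networkx as nx

# A synthetic graph with a handful of very high degree hub vertices.
G = nx.barabasi_albert_graph(1000, 3, seed=42)

DEGREE_CUTOFF = 50  # illustrative threshold only
hubs = [v for v, d in G.degree() if d > DEGREE_CUTOFF]

H = G.copy()
H.remove_nodes_from(hubs)  # filter out the "supernodes"

# Ask the path question between two surviving vertices on the filtered graph.
remaining = sorted(H.nodes())
u, v = remaining[0], remaining[-1]
if nx.has_path(H, u, v):
    print(u, v, nx.shortest_path_length(H, u, v))
```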
[00:16:39] Unknown:
As far as the Arkouda project itself, you mentioned it started off as a project in the Department of Defense. I'm curious if you can talk to some of the ways that it has evolved and grown in terms of the goals and scope of the project and any changes in the implementation details of how you've approached the problems that you're trying to solve with it?
[00:16:59] Unknown:
Arkouda is open source. So on GitHub, if you go to the repositories under the group Bears-R-Us, so bears hyphen r hyphen us, and look under Arkouda, you'll see talks and all of the source code in a repository and a lot of the discussion spaces and papers related to Arkouda. So it's a completely open and widely shared project. The goals were really to make a new framework that everyone around the world can use to do productive large scale data analysis. I'm quite excited by it because it's starting out as an open source project, and we'd love for people to contribute. And we have collaborators at this point around the world who are helping to make Arkouda a reality.
It's gaining maturity. The project started in 2019, and it's undergone a number of changes. For instance, there's a sub repository for all of our graph analytic contributions and ways to create new modules to build on top of the core Arkouda framework. Arkouda operates with a user on the front end interfacing through their Jupyter notebook or with Python, and they connect through ZeroMQ, a message queue, going to a supercomputer in the back end running the HPE Cray Chapel compiler that's also open source. And so we, as developers, are developing all of the plumbing to go from Python on the front end all the way through to the supercomputer in the back, so that analysts who wish to use Arkouda don't see all of that complexity. They just see a productive environment where they can scale up their datasets to tens of terabytes.
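For readers unfamiliar with that plumbing, the front end and the Chapel server exchange messages over ZeroMQ. The sketch below shows only the generic pyzmq request/reply pattern, not Arkouda's actual message format, and the URL and payload are placeholders:

```python
import zmq

# Generic ZeroMQ request/reply client; the URL and message body are
# placeholders, not Arkouda's real wire protocol.
context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://localhost:5555")

socket.send_string("hello server")   # the client sends a command...
reply = socket.recv_string()         # ...and blocks until the server replies
print(reply)
```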
[00:18:49] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. Given that it is relying on this supercompute framework, I'm wondering how that maps to the actual hardware requirements that are necessary to be able to run Arkouda.
[00:20:04] Unknown:
So Chapel should not be a barrier because it's open source, and it runs on most available compute platforms from desktops to clusters to dedicated supercomputers. And you can acquire your own resources to run Chapel and load your datasets, and then manipulate them through the Arkouda front end.
[00:20:28] Unknown:
Given the fact that there has been so much evolution in hardware capabilities and the ways that hardware is consumed, you know, with things like the cloud, but also with the evolution of different CPU architectures and the rise of ARM, I'm wondering how that evolution has impacted the types of processing algorithms and the appetite for space time trade offs and how you approach those algorithmic aspects of being able to work on these data processing problems at these scales?
[00:21:01] Unknown:
Great question. The Chapel compiler originally came out of a US DARPA (that's the Defense Advanced Research Projects Agency) project about 20 years ago called High Productivity Computing Systems. And the compiler, with 20 years of work in it now, was built to be able to take advantage of different processor generations and to do a lot of the transformations. So the performance engineering is built into the compiler. Originally, Cray, the supercomputing company, built this compiler, and then recently HPE acquired Cray. And HPE has a fantastic team, project managers and developers, who've been working with Chapel for quite a number of years and put a lot of sophistication into that compiler to be able to leverage the new processor architectures.
So much of that comes with Chapel, and that's one of the reasons why we decided to use Chapel as the back end compiler for the Arkouda framework.
[00:22:07] Unknown:
As far as the adoption of Arkouda and how it factors into the development and analysis capabilities of a team or an organization, I'm wondering if you can talk to some of the process of getting it set up and the end user workflow.
[00:22:25] Unknown:
Arkouda, again, is an early project. It's open source, so every organization can look at the source code and be able to import it quite easily. We've been working in my lab on a tutorial for getting users started with Arkouda. I teach with it, students are able to use it, and we have tutorials available for Arkouda. So anyone who's interested, I suggest they head over to the GitHub website for Arkouda and explore the resources that we have there. It's pretty simple to install, and it works well. So I encourage everyone here to try it out.
[00:23:04] Unknown:
As far as the collaboration process around being able to build these analyses with Arcuda, I'm wondering if there are any programming patterns or approaches to how to structure the analyses that helps to make it so that multiple people can be able to take advantage of maybe intermediate result sets that are generated by each other and just some of the aspects of being able to use this in a team environment?
[00:23:31] Unknown:
That's a great question. And, in fact, the large datasets are stored in the back end where Chapel has access to them. And there are constructs to be able to keep intermediate results in the back end for others to collaborate on and make use of, rather than having to bring results to a front end to share. Because for large data, we want to keep it in place. We don't want to move it that often, because that takes quite a lot of time and a lot of resources and a lot of energy. So we're trying to be very energy efficient as well. So Arkouda does include those capabilities, so that a team that may be looking together at a dataset or derived products can do so quite easily by sharing resource locators for those datasets.
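As a sketch of what that sharing might look like from the Python side, Arkouda documents a register/attach mechanism for giving server-side arrays a name that another session can pick up; the function names below follow that documentation but should be treated as an assumption, and "shared_counts" is a hypothetical label:

```python
import arkouda as ak

ak.connect(server="localhost", port=5555)

# Analyst A: compute an intermediate result and leave it on the server
# under a shared name instead of pulling it back to a laptop.
counts = ak.randint(0, 10, 10**8)      # stand-in for a real derived product
counts.register("shared_counts")       # "shared_counts" is a hypothetical label

# Analyst B, possibly in a different session later: attach to the same
# server-side object by name and keep working where A left off.
same_counts = ak.attach_pdarray("shared_counts")
print(same_counts.sum())
```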
[00:24:22] Unknown:
In terms of the work that you've been doing on Arkouda, as you mentioned, it's a team effort. It's open source, so anybody can contribute to it. But given your expertise in supercompute capabilities and graph analytics, I'm curious what are some of the specific contributions that you've been focused on, some of the specific challenges that you've been addressing, maybe some of the sharp edges that people experience when building on top of Arkouda, or some of the new capabilities that you're hoping to unlock in this framework?
[00:24:51] Unknown:
We've been building out graph analytics that, as I mentioned, are quite important for solving real world grand challenges. And often graph analytics look easy on paper, but when you go to implement them, you run into performance issues. For instance, the high degree vertices that were mentioned before could really slow down a graph analytic. So what we're doing is analyzing the performance of algorithms on a wide variety of inputs to try to make sure that we have the capability to solve many instances quite fast using the graph analytics that we built into Arkouda. And, for example, we're implementing new algorithms.
For instance, one algorithm is called triangle centrality, which a colleague of mine invented. It's a centrality that looks at the importance of vertices based on how many triangles, and the distribution of triangles, are around them. This analytic, I think, is quite interesting and is a peer analytic to other centrality measures like betweenness centrality, closeness centrality, degree centrality, and others. So we're always looking for new, highly capable analytics that will provide new functionality, and then how do we implement those with high efficiency and also productively for the end user in the Arkouda framework.
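The precise triangle centrality definition is in the published work, but the flavor of the computation, ranking vertices by how much triangle structure surrounds them, can be sketched with a plain per-vertex triangle count; this is a simplified illustrative proxy, not the published measure:

```python
import networkx as nx

G = nx.karate_club_graph()

# Number of triangles each vertex participates in.
tri = nx.triangles(G)
total_triangles = sum(tri.values()) / 3  # each triangle is counted at all three vertices

# Rank vertices by their share of the graph's triangle structure.
score = {v: t / total_triangles for v, t in tri.items()}
print(sorted(score, key=score.get, reverse=True)[:5])
```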
[00:26:15] Unknown:
In terms of the applications of this framework, as you mentioned, it's focused primarily at data scientists and data analysts who are trying to address some of these real world problems. And I'm wondering what are some of the ways that you've maybe seen Arkouda used to address some of the problems of data scale and data maintenance as well, where maybe a data engineer would use it to understand what is the distribution of data assets or asset categorization that I have in my lake where I'm, you know, working at terabyte or petabyte or exabyte scale.
[00:26:49] Unknown:
Arkouda really provides a capability that doesn't exist today. So if you're using a data lake and you're doing analytics, you would have to pull information out of that lake and find another processing system where you can run your analytics or queries. And a data engineer would have to maintain that lake, which may be federated across one or more systems. What we're really doing is providing a new capability for the productive use of those large data sets. So by productive, I mean I want a data scientist to be able to ask queries on terabytes or even petabytes of data without the need for a data engineer in the middle to gate access and provide the services that a data scientist would need in order to ask those questions. So I wanna make the data accessible.
I wanna make it productive to be able to access those large data sets, even combine large data sets together. So, again, we're trying to cut out barriers to solve these really large data science problems, and to be able to do it by removing all of the roadblocks and all of the friction that we would normally face for solving these very large problems.
[00:28:06] Unknown:
It's time to make sense of today's data tooling ecosystem. Go to dataengineeringpodcast.com/rudder to get a guide that will help you build a practical data stack for every phase of your company's journey to data maturity. The guide includes architectures and tactical advice to help you progress through 4 stages: starter, growth, machine learning, and real time. Go to dataengineeringpodcast.com/rudder today to drop the modern data stack and use a practical data engineering framework. In terms of the data management aspect of this, I'm curious, what are some of the specifics in terms of the types of data that you're able to work with? So thinking in terms of structured versus semi structured versus unstructured or binary and things like that, and some of the organizational aspects of how best to work to the strengths of Arkouda as to how you organize and structure the data so that it's able to parallelize and take advantage of being able to shard and work across the data in parallel, in isolation from each other?
[00:29:12] Unknown:
Great question. So natively, Arkouda manages one-dimensional arrays, so collections of 1-D arrays. In our graph analytics, we built new data structures to have a native graph data structure to do these graph analytics on. And for many applications, we have data sets where we can decompose them into sets of these 1-D arrays. That said, the user doesn't have to worry about sharding or other partitioning techniques. The data will sit on a back end, and Chapel will have the responsibility of managing the distribution of that dataset across the available compute resources. So that sophistication is built into Chapel, and it means that we don't have to be as concerned about it because we have quite a sophisticated compiler and runtime system that's managing many of those aspects.
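A rough sketch of that column-oriented, 1-D-array view, assuming the ak.GroupBy interface described in Arkouda's documentation; the column names and data are made up for illustration:

```python
import arkouda as ak

ak.connect(server="localhost", port=5555)

# A "table" is just a set of aligned 1-D arrays (columns);
# src_port and bytes_sent are hypothetical column names with synthetic data.
src_port = ak.randint(0, 1024, 10**8)
bytes_sent = ak.randint(40, 1500, 10**8)

# Group by one column and aggregate another; how the underlying data is
# distributed across the cluster is handled by the Chapel back end.
g = ak.GroupBy(src_port)
ports, total_bytes = g.sum(bytes_sent)
print(ports[:5], total_bytes[:5])
```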
[00:30:09] Unknown:
And so in terms of the opportunities that you see as far as algorithmic advances for being able to work across these large datasets and being able to accelerate the time to insight, I'm wondering what are some of the open questions or unrealized capabilities of how to expand the functionality of things like Arkouda and just general large scale data analysis?
[00:30:36] Unknown:
Up to now, we really haven't had platforms where we could experiment with multi terabyte datasets productively in real time. And so this really opens your imagination for new types of analytics. For instance, let me focus on the graph space. There are some very capable tools out there that operate for analytics on graphs, for instance, on our laptops and small clusters. But once the graph becomes larger than a certain size, those frameworks typically don't allow those analytics to be run. They will be too time consuming or the graph will just be too large for the analytic, and we can't explore that space. So I believe with Arkouda, we'll actually have the ability for the first time to ask questions that we thought were never possible previously on some of these large datasets.
That may lead to new insights or even new algorithms and analytics for interrogating and really getting more insights from these large datasets.
[00:31:38] Unknown:
In terms of your experience of working with Arkouda and supporting teams who are building analyses on top of Arkouda, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:31:49] Unknown:
Right now, Arkouda has a dozen to 20 plus users. It's really in its infancy providing this very highly capable framework. And so we're also looking forward to seeing people download it, clone it from the Git repository, and use it, and hearing more about those success stories. It's been quite useful for the enterprise and the users that we've seen pick it up and use it. Many of them are working in places with large data sets, for instance, in cybersecurity detecting cyber threats. And some are working in social network analysis with very large social networks. And we hope to continue to see success stories from Arkouda in the coming months and years as more and more users adopt this highly productive and capable framework.
[00:32:39] Unknown:
And in your own experience of working on Arkouda and contributing to it and using it for some of your own research, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:32:50] Unknown:
My background comes from high performance computing and supercomputing. So at this point, nothing really surprises me. But my pleasant surprise was that the Chapel compiler that we're using in the background is quite sophisticated. So many things that we normally would have to do by hand on other systems, we found that the Chapel compiler was able to handle and find the performance that we needed. There's a great support team at HPE supporting this Chapel framework, and we've really been learning new ways of implementing our code that can better leverage the Chapel compiler. So it's been a great experience, and we are also pleased that we're getting the performance that we anticipated, and that there were no major bottlenecks or roadblocks to getting the performance that we expected on these large datasets.
[00:33:46] Unknown:
For people who are interested in being able to work at these scales in an interactive fashion, what are the cases where Arkouda is the wrong choice and they're better suited going with some of these other parallel compute frameworks?
[00:33:59] Unknown:
That's a great question. So if your dataset isn't massive, you probably don't wanna go through the effort of setting up Arkouda. If you can solve it easily today on your laptop or on the systems that you have, then you should stick with what you have. But if you find that you have datasets that you can't process, or that there's a speed bump keeping your analysts from being able to productively ask questions of those datasets, then maybe in those cases you should consider Arkouda. So, again, if things work well for you today, then maybe you're not the ideal candidate for moving to Arkouda. It's only when you start facing these issues of having datasets that are too large, or not having the performance that you're seeking, near real time or interactive performance on these massive datasets, that you should consider looking at Arkouda.
[00:34:55] Unknown:
As you continue to iterate on and contribute to the Arkouda framework, what are some of the things that you have planned for the near to medium term or any problem spaces that you're excited to dig into?
[00:35:06] Unknown:
We're very excited in a few areas. For instance, stringology: how do we design data structures that can process strings of text quite well? This is important when we're analyzing documents or looking at unstructured text. We're trying to build those capabilities into Arkouda, and we have funding from the US National Science Foundation to explore these areas in Arkouda. Another area that I'm quite excited by is looking at what we would think of as table joins, which normally are going to be expensive operations within databases.
Here, we're looking at Arkouda and what it means to do a join of the datasets that we have, and whether there is a way to optimize those to do it in the fastest way possible. So there's a lot of hard work that we're gonna do at New Jersey Institute of Technology to try to build out new capabilities for processing different types of large datasets with the Arkouda framework.
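One way to express a simple join-style filter with today's primitives is membership testing, sketched below with ak.in1d (assumed from Arkouda's documentation) on hypothetical key columns; a true optimized server-side join is exactly the kind of capability being explored:

```python
import arkouda as ak

ak.connect(server="localhost", port=5555)

# Two "tables", each represented here by a single key column of synthetic data.
left_keys = ak.randint(0, 10**6, 10**8)
right_keys = ak.randint(0, 10**6, 10**7)

# Keep only the left rows whose key also appears on the right, a crude
# building block for a join; the heavy lifting stays on the server.
mask = ak.in1d(left_keys, right_keys)
matched = left_keys[mask]
print(matched.size)
```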
[00:36:07] Unknown:
Are there any other aspects of Arkouda or large scale graph analytics or being able to do interactive analysis on large scale datasets that we didn't discuss yet that you'd like to cover before we close out the show? I think this is an exciting area. We already know datasets are getting larger,
[00:36:24] Unknown:
but we also have to understand that often we get data not as just a big block of data to process, but as a stream of data. That stream may be updated every millisecond, every second, every hour, every day, and so on. And we wanna have tools that can process those data streams. And what I want to be able to do is build out new tools and new capabilities to look at massive streaming data analytics. I want to get away from using these resources just for doing forensic analysis after something egregious happened, where we're trying to explore, for instance, in a cyber hack, how did they get in? What did they destroy? What did they exfiltrate?
How do we protect against it? That is after the damage has been done. Where I wanna move to is predictive analytics. Can we take these data streams and detect that there's a change, a pattern, or something emerging that we can protect against before some egregious event? So I'm really excited by building out more tools for predictive analytics on these massive datasets.
[00:37:37] Unknown:
For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as the biggest gap in the tooling or technology that's available for data management today.
[00:37:52] Unknown:
I think the biggest gap that we have is really the seamless integration of multiple tools. So there's a great number of tools and suites out there, some commercial, some open source, and they serve many different purposes. But when we look at the data science environment, we see the ability to, for instance, use Jupyter Notebooks or workflows that we can record, which I think is a great advance. But what I'd like to see more of is compatibility among multiple tool sets so that we can move between the different tools and environments out there. For instance, I have datasets where sometimes I wanna look at them as unstructured datasets. Other times, I wanna view them as a graph. Other times, I wanna view them in a different light as well. And I wanna be able to move through tools that specialize in that view of the data and to be able to do it seamlessly without having to modify those datasets or go through different workflows on different systems.
[00:38:54] Unknown:
Yeah. It's definitely a very real problem these days as we get into specialization of these different tools, and I'm definitely excited for some of these investments that are happening in the metadata layer where maybe we can use that as the interchange point without having to do all kinds of custom integration between these different tool sets.
[00:39:13] Unknown:
Exactly. I think you hit the nail on the head.
[00:39:15] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you are doing on the Arkouda framework. It's definitely a very interesting platform with interesting capabilities that it's unlocking. So definitely excited to see that continue to evolve and grow in terms of capabilities and adoption. So I appreciate all the time that you and the other members of the team are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. And I really hope anyone out there looks up Arkouda
[00:39:42] Unknown:
and works on their large scale data science problems with productive and capable tool sets. Thanks again for chatting.
[00:39:56] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with David Bader Begins
Overview of Arkouda
Target Audience and Use Cases for Arkouda
Comparison with Other Tools
Open Source and Community Contributions
Hardware Requirements and Performance
Collaboration and Team Use
Future Opportunities and Algorithmic Advances
When Arkouda is the Wrong Choice
Future Plans and Exciting Areas of Research
Predictive Analytics and Streaming Data
Biggest Gap in Data Management Tooling
Closing Remarks