Summary
In-memory computing provides significant performance benefits, but it brings challenges for managing failures and scaling up. Hazelcast is a platform for managing stateful in-memory storage and computation across a distributed cluster of commodity hardware. On top of this foundation, the Hazelcast team has also built a streaming platform for reliable, high-throughput data transmission. In this episode Dale Kim shares how Hazelcast is implemented, the use cases that it enables, and how it complements on-disk data management systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Tree Schema is a data catalog that is making metadata management accessible to everyone. With Tree Schema you can create your data catalog and have it fully populated in under five minutes when using one of the many automated adapters that can connect directly to your data stores. Tree Schema includes essential cataloging features such as first class support for both tabular and unstructured data, data lineage, rich text documentation, asset tagging and more. Built from the ground up with a focus on the intersection of people and data, your entire team will find it easier to foster collaboration around your data. With the most transparent pricing in the industry – $99/mo for your entire company – and a money-back guarantee for excellent service, you’ll love Tree Schema as much as you love your data. Go to dataengineeringpodcast.com/treeschema today to get your first month free, and mention this podcast to get 50% off your first three months after the trial.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Dale Kim about Hazelcast, a distributed in-memory computing platform for data intensive applications
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Hazelcast is and its origins?
- What are the benefits and tradeoffs of in-memory computation for data-intensive workloads?
- What are some of the common use cases for the Hazelcast in-memory grid?
- How is Hazelcast implemented?
- How has the architecture evolved since it was first created?
- How is the Jet streaming framework architected?
- What was the motivation for building it?
- How do the capabilities of Jet compare to systems such as Flink or Spark Streaming?
- How has the introduction of hardware capabilities such as NVMe drives influenced the market for in-memory systems?
- How is the governance of the open source grid and Jet projects handled?
- What is the guiding heuristic for which capabilities or features to include in the open source projects vs. the commercial offerings?
- What is involved in building an application or workflow on top of Hazelcast?
- What are the common patterns for engineers who are building on top of Hazelcast?
- What is involved in deploying and maintaining an installation of the Hazelcast grid or Jet streaming?
- What are the scaling factors for Hazelcast?
- What are the edge cases that users should be aware of?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Hazelcast used?
- When is Hazelcast Grid or Jet the wrong choice?
- What is in store for the future of Hazelcast?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Hazelcast
- Istanbul
- Apache Spark
- OrientDB
- CAP Theorem
- NVMe
- Memristors
- Intel Optane Persistent Memory
- Hazelcast Jet
- Kappa Architecture
- IBM Cloud Paks
- Digital Integration Hub (Gartner)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Tree Schema is a data catalog that is making metadata management accessible to everyone. With Tree Schema, you can create your data catalog and have it fully populated in under 5 minutes when using 1 of the many automated adapters that can connect directly to your data stores. Tree Schema includes essential cataloging features such as first class support for both tabular and unstructured data, data lineage, rich text documentation, asset tagging, and more. Built from the ground up with a focus on the intersection of people and data, your entire team will find it easier to foster collaboration around your data. You'll love Tree Schema as much as you love your data. Go to dataengineeringpodcast.com/treeschema today to get 1 month free. And if you contact them and mention this podcast, you can get an additional 50% off the first 3 months after the trial. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster.
With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode, that's l I n o d e, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Dale Kim about Hazelcast, the distributed in memory computing platform for data intensive applications. So, Dale, can you start by introducing yourself?
[00:01:54] Unknown:
Hi, Tobias. Thanks for having me on your podcast. My name is Dale Kim. I'm the senior director of technical solutions at Hazelcast, and I'm responsible for the go to market strategy of our in memory computing technologies, which essentially means that I talk to our engineering team, our product management team, analysts, customers, partners, and so on to try to get a feel for the different requirements for some of the computing needs that our customers have. And so this often entails large scale data management and data processing requirements for some of these modern architectures today.
[00:02:25] Unknown:
And do you remember how you first got involved with data management?
[00:02:28] Unknown:
Yeah. Actually, I have a degree in computer science, and so I've been a technologist for a very long time. 1 of the first companies I worked at was a relational database company, and that was the time when relational databases were really hot, so suddenly there was a lot of focus on what you could do with data online. And from there, I kind of drifted over towards more of the nonstructured, or non relational, data types. So the next company I worked with was in full text search, even before the Internet or World Wide Web were household names. And so having the ability to search for textual information in a way that is very consistent with the way we think about finding information was very fascinating. And from there, I've just worked for a number of data management platform companies in a variety of different fields. And I think that everybody is dealing with data today in some form or another, even in a private, consumer oriented way. And I think the technologies that are coming out to be able to address some of these newer ideas and newer challenges are fascinating. So I've been with data management pretty much from the start.
[00:03:30] Unknown:
So now you've ended up with Hazelcast. So I'm wondering if you can give a bit more of a description about what it is and some of the origin story for that.
[00:03:38] Unknown:
What fascinates me about Hazelcast is its focus on high performance, so high throughput data and low latency responsiveness. You hear about all these different data management platforms that talk about these things, but Hazelcast's advantage is around in memory. And in memory isn't a new concept. It's been around for a while, but there have been some limitations on what it could do in the past, and some of these limitations are being mitigated so that in memory speeds are opening up to more and more companies. And Hazelcast was founded a little more than 10 years ago, actually, in Istanbul, Turkey, by a couple of very smart engineers, and they came to Silicon Valley to start Hazelcast as a formal company. And it was all about in memory computing. And so the first product was the IMDG product, which stands for the in memory data grid. So very much like a database, but with a bit more capability in terms of distributed computing: ways to simplify building applications that can be spread across multiple nodes in a cluster, and thus enable parallelization much more simply. And so from the early roots, it was all about trying to get applications to run faster, but at the same time maintaining some of these enterprise qualities like security and reliability and availability. So ensuring that you're not getting speed at any cost, but getting the right amount of speed that you need to address your use cases while also protecting your data. And we've added on stream processing since then, and we have a set of technologies that work extremely well together and are fitting in quite well with some of the types of use cases that people are building today.
[00:05:11] Unknown:
And you mentioned that it's not being built at the expense of some of the reliability and durability guarantees that you might care about, particularly if you're working on mission critical applications. So I'm wondering if you can dig a bit more into some of the benefits and the potential trade offs of in memory compute, particularly for data intensive workloads and things that are going to be operating on stateful sets of operations.
[00:05:36] Unknown:
Yeah. The benefits of in memory computing have largely to do with the fact that you have fast access to data that's stored in memory. And so I've heard some people say that this notion of in memory computation or in memory processing is redundant, when in fact, if you think about it, the processing isn't done in memory. The processing is done in the CPU or, these days, increasingly more in the GPU. And in memory simply means that all of the data is stored within memory and not necessarily spilled out to disk. And so when you have a system that's designed to optimize that pattern where you have all your data in memory, that means that you can get fast access to data and a lot of fast processing, and be able to deliver on some of these use cases that have very narrow windows for service level agreements. So you get performance, but at the same time, when you have a fast system, you need to incorporate some of the typical characteristics of a distributed system, like replication in a variety of ways, and you need to have consistent replication. After doing some competitive research, we've seen at least 1 technology where, at certain levels of throughput, it pauses some of the replication to be able to handle that throughput. Most people won't notice it, but it's 1 of those things where, if you're not watching, you could potentially have a big problem: when your data isn't replicated and nodes go down and you get failures, you might see a lot of unexpected data loss when you thought that all of the data protection capabilities were in place. But for us, we don't make those trade offs when we run our benchmarks. We say, here's what you get in a true production environment in terms of performance, and you can be sure that we keep everything retained for the business continuity that you would expect. And certainly some of the trade offs are pretty clear, and people are pretty familiar with them.
It's mostly about how much data you can store. So you wouldn't use Hazelcast as, say, your system of record for petabytes of data. We're talking more about operational data where you want to process it very quickly. So things like payment processing or fraud detection are good cases where you might have a good amount of data in memory as a lookup, but also have the engine processing in parallel and being able to use that data in almost a transient way. So it's data that's persisted somewhere else, but we put it into our engine so that we can have those very stringent, very data intensive workloads running.
[00:07:56] Unknown:
My understanding is that the actual implementation is as a library that can be embedded into other applications, or it can also be run as a standalone sort of object storage for in memory operations, along with some of the particular algorithms for being able to handle conflict resolution and concurrent access. I'm wondering if you can dig a bit more into some of the common use cases for the data grid that Hazelcast implements.
[00:08:26] Unknown:
Some of the common use cases involve payment processing. So a lot of banks use us as part of the mechanism for authenticating and settling payments. And 1 important part of that is also fraud detection, especially based on machine learning models. Machine learning based fraud detection is potentially a huge competitive advantage for a lot of these companies: the better they are at detecting fraud, and the better they are about not getting false positives, the more real revenue that represents for them. So a lot of these use cases that involve being more accurate, being better with your data for the purposes of cutting costs and adding revenue, are great use cases for Hazelcast. Other use cases involve transmission of data between different nodes.
Another great use case is having a cache as a service. So a lot of companies know that they have applications, both internal and external, that need to access data as quickly as possible, but they may think about building a cache specific to each application. And we've seen among our customers that they quickly recognize that that's not a scalable way to go. It'll take them months to essentially reinvent the wheel. So what if you had a centralized, multi tenant, role based access controlled caching layer that can be shared across multiple development teams? They can all leverage that as their caching layer to provide fast access to data, while also segmenting it properly enough that you get separation of the sensitive data from the more public data. That's a common use case across industries.
And I think there are many types of use cases where fast processing matters, including things like ETL, so very technical use cases that are appropriate for Hazelcast.
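The cache-as-a-service pattern Dale describes can be sketched in plain Java. This is an illustrative toy, not Hazelcast's API: a `ConcurrentHashMap` stands in for a distributed `IMap`, and per-team namespaces stand in for the role-based segmentation he mentions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Toy sketch of a shared, multi-tenant cache layer. In a real deployment a
// distributed map (e.g. a Hazelcast IMap) would replace the ConcurrentHashMap,
// and access controls would enforce the namespace separation.
class SharedCache {
    private final Map<String, Map<String, Object>> namespaces = new ConcurrentHashMap<>();

    // Each team gets its own namespace so sensitive data stays segmented
    // from the more public data.
    Map<String, Object> namespace(String team) {
        return namespaces.computeIfAbsent(team, t -> new ConcurrentHashMap<>());
    }

    // Read-through behavior: the loader (e.g. a call to the system of record)
    // only runs on a cache miss.
    Object getOrLoad(String team, String key, Function<String, Object> loader) {
        return namespace(team).computeIfAbsent(key, loader);
    }
}
```

The point of the sketch is the sharing model: one cache service, many namespaces, rather than each team rebuilding its own caching layer.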
[00:10:11] Unknown:
So in the caching scenario, it sounds like you could use it as a replacement for things such as Redis or Memcached, where you need to be able to have that common access layer to avoid having to do expensive calls out to on disk storage.
[00:10:26] Unknown:
Yeah. Absolutely. And in fact, you know, we've had a number of customers migrate from Redis, not necessarily because of specific differences in the architecture. It's more about what capabilities are provided by Hazelcast. And as I mentioned, we're not only a database, so we provide data storage, but we're also an in memory data grid. And what that means is that you can write applications that are deployed across multiple nodes in a cluster. So it's very much like the Apache Spark model where you have processing that's delivered to the nodes, versus moving data to the node where the computation is. And that notion of moving computation is what people refer to as data locality, and that's still a very important concept in trying to handle some of these large scale data problems.
And being able to add that capability as part of your overall solution, in addition to caching, has become a very powerful capability when using in memory.
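The data-locality idea above can be illustrated in plain Java. This is a hypothetical single-node sketch, not Hazelcast's entry-processor API: the function is applied on the structure that "owns" the key, standing in for shipping computation to the member that holds the data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of data locality: rather than pulling a value across the network,
// mutating it on the caller, and writing it back, the function itself is
// shipped to the node that owns the key and applied in place.
class LocalitySketch {
    // Stands in for the slice of the distributed map this node owns.
    private final Map<String, Integer> localPartition = new HashMap<>();

    void put(String key, int value) { localPartition.put(key, value); }

    // "Executes on the owning node": one hop carries the function, not the data.
    int applyOnOwner(String key, Function<Integer, Integer> fn) {
        return localPartition.compute(key, (k, v) -> fn.apply(v));
    }
}
```

In a real grid the win is that only the (small) function travels over the network, while the (large) data stays put.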
[00:11:20] Unknown:
So in terms of how the Hazelcast grid is implemented, can you dig a bit more into the architecture? And I'm also interested in the approach that's taken for the sort of distribution or balancing of data across the cluster.
[00:11:34] Unknown:
1 thing that the original architects did very well was to model the Hazelcast framework on existing Java best practices. So you can almost think of Hazelcast as a Java object store that's distributed across multiple nodes. Whereas a lot of these primitives within Java are mostly for a single JVM on a single node, imagine if the objects that you're creating in your application could be seen, with proper access, of course, by other applications running in your network. It basically extends the common, familiar programming model of Java into a distributed environment.
And so using that as a starting point for the architecture makes Hazelcast very familiar for a lot of Java developers. And since then, we've added a number of clients in different languages like Go, Node.js, and .NET so that you can use it as an object store and get fast access to data. But that simplicity has always been a hallmark of ours, so that it's very easy to get started and very easy to integrate with your existing system and then start building from there.
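On the distribution question, the mechanism can be sketched with a toy router. Hazelcast hashes the serialized form of each key into one of 271 partitions by default and spreads those partitions across the members; this simplified version uses `hashCode()` and a naive round-robin assignment instead, so it illustrates the shape of the scheme rather than the real implementation.

```java
// Simplified sketch of key-to-node routing in a partitioned data grid.
// Real grids hash serialized key bytes and rebalance partition ownership
// as members join and leave; this toy version keeps the math visible.
class PartitionRouter {
    static final int PARTITION_COUNT = 271;  // Hazelcast's default partition count

    static int partitionId(Object key) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(key.hashCode(), PARTITION_COUNT);
    }

    static int ownerNode(Object key, int clusterSize) {
        // Toy assignment: partition i lives on node (i mod clusterSize)
        return partitionId(key) % clusterSize;
    }
}
```

Because the partition for a key is deterministic, any member can compute where a key lives without a central lookup, which is what makes single-hop reads and writes possible.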
[00:12:41] Unknown:
1 of the first times that I came across Hazelcast was when I was starting to take a look at OrientDB and realized that it was using Hazelcast as its coordination mechanism. So I'm wondering if that's a common approach and just some of the other patterns that you tend to see with developers who are building on top of Hazelcast?
[00:13:01] Unknown:
I do think that a lot of companies are taking advantage of the underlying plumbing of Hazelcast, interestingly enough. So while performance is our big thing, a lot of the appeal comes from the foundation of interprocess communication that we provide. The OrientDB example that you gave is a perfect example, where an open source company saw a foundation and decided to leverage that instead of having to build their own cluster management software. And the interesting thing about Hazelcast is that it has an auto provisioning mechanism: if you build an application and run 2 instances of it, they can be configured to discover each other by broadcasting and automatically form a cluster, so you don't have to pre-configure the other applications per node. When configured to do so, they will discover each other and automatically create that distributed system without a lot of effort. And that's why we can package our system as a very lightweight library without a lot of external dependencies, and you can embed it in your application, and you get that clustering mechanism basically for free simply by embedding that technology, the Hazelcast foundation, into your software.
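The broadcast-based discovery described here is driven by configuration. As a sketch based on the documented Hazelcast XML schema (element names can vary between versions, so check the reference manual for the release you run), enabling multicast discovery looks roughly like this:

```xml
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
  <network>
    <join>
      <!-- Members broadcast on the local network and form a cluster automatically -->
      <multicast enabled="true"/>
      <!-- Explicit member lists are the alternative when multicast is blocked -->
      <tcp-ip enabled="false"/>
    </join>
  </network>
</hazelcast>
```

In environments where multicast is disabled (most clouds), the `tcp-ip` member list or a cloud discovery plugin plays the same role.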
[00:14:13] Unknown:
And in terms of the underlying principles and the system design of Hazelcast, what are some of the ways that it has evolved since it was first built and particularly in recent years?
[00:14:25] Unknown:
Certainly, there are a lot of very low level technical capabilities that we've added on. 1 of the more fascinating ones is our CP subsystem. A lot of distributed systems today follow the AP model, with AP referring to the CAP theorem, where you can have consistency or availability, but not both, in addition to the partition tolerance of a distributed system. Most distributed systems out there are of the AP model, which means that availability is preferred over consistency. So you might get stale data, but you will get an answer. 1 thing that we added, as I mentioned, was the CP subsystem, which is used in very stringent environments where the data has to be consistent. In those environments, it's better to have the right answer, meaning the most up to date data element coming from the store, versus the system being available. So it's okay for you to get an error saying the system's not available, but if I do get an answer, it absolutely has to be correct. 1 implementation of that is as a transaction ID generator at 1 of our financial services customers, where they needed a distributed system so they could handle bigger loads, but they also needed the reliability, so that should a node go down, the other nodes could take over. And they needed the CP subsystem as that counter, or ID generator, so that they had a monotonically increasing number that could represent a transaction ID and be sure that there was no overlap. Because if you think about it, in any financial services environment, a duplication of a number for a transaction ID could be disastrous. So they had some guarantees that they had to deal with, and they had this CP subsystem mechanism along with the associated persistence.
Since all of the memory associated with the consistent data is also saved to disk, they had a very reliable way of handling scale and dealing with accurate, consistent data.
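The transaction-ID requirement above, monotonically increasing and never duplicated across nodes, can be illustrated with a Flake-style generator. This is a pure-Java sketch of the general idea, not Hazelcast's CP subsystem implementation: packing a timestamp, a node ID, and a sequence number means two nodes can never emit the same ID, and IDs from one node always increase.

```java
// Illustrative sketch of a node-unique, monotonically increasing ID.
// Layout: 41 bits of timestamp | 10 bits of node ID | 12 bits of sequence.
// Caveats a real system must handle: clocks moving backwards, and more
// than 4096 IDs requested in a single millisecond.
class FlakeIdSketch {
    private final long nodeId;        // unique per cluster member (0..1023 here)
    private long lastTimestamp = -1;
    private long sequence = 0;

    FlakeIdSketch(long nodeId) { this.nodeId = nodeId; }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence++;               // same millisecond: bump the sequence
        } else {
            sequence = 0;
            lastTimestamp = now;
        }
        return (now << 22) | (nodeId << 12) | sequence;
    }
}
```

The point of routing this through a CP (consensus-backed) group, as in the customer story, is that the uniqueness guarantee survives node failures and network partitions rather than depending on one machine's state.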
[00:16:16] Unknown:
Another interesting aspect of the in memory computing ecosystem is some of the recent developments in hardware, particularly things such as NVMe drives that provide very high speed throughput on disk access and better durability. And there's also been some recent research into memristors for being able to have memory that's actually durable. I'm wondering how that has impacted the overall market for in memory compute, or the ways that Hazelcast is able to leverage some of these new hardware capabilities?
[00:16:50] Unknown:
I think the growth and the evolution of a lot of these NVMe technologies has certainly allowed companies to take advantage of more data and get faster speeds. And 1 observation in the market is that I've seen some companies pivot their messaging from more of an in memory model to more of an NVMe optimized one. And maybe that's a result of them not truly being in memory, because they would have some dependency on spilling to disk, and therefore the NVMe optimization was a better overall story. But we're always seeing more need for faster speed, for the ability to do more in less time. And so I think NVMe will be a great complement to a lot of these in memory technologies like Hazelcast. And as we start exploring other ways that we can take advantage of it, I think it'll be great for the industry. I mean, 1 thing that we do is we have an option where you can store your in memory content to disk in an asynchronous manner, so that if you need some kind of downtime, then you can shut down your entire cluster, do some maintenance, and then start it back up. But rather than repopulating your memory directly from the system of record, you can get back to the state where it was prior to shutting down the system, quickly repopulate the memory, and get going again. So it's a great way to reduce the downtime associated with planned maintenance.
And 1 other technology I'm really excited about is from Intel, their Optane DC persistent memory technology, which runs in 2 modes, basically. 1 is you can use it as persistent memory, as you would expect, in a nonvolatile memory mode. So that's great when you wanna use persistent memory. And again, that fits in with our hot restart store capability, so that you can write to memory, and then after you restart the cluster, you can read it quickly back in. But you can also use the Optane technology as an alternative to RAM that's far more cost effective and more scalable. And so that potentially opens up a new world for a lot of companies who previously might have thought that in memory processing was out of their reach, so they had to trade off on using disk based systems that represented much more latency that maybe they were okay to live with. But having that extra speed and having a more cost effective solution on the hardware side allows them to try to do more things. And, you know, with machine learning and artificial intelligence being really hot these days, a big part of those deployments involves doing a lot of exploration. And if you don't have a lot of performance headroom, you don't really have a lot of room to try new things. And that's where I think in memory systems will continue to grow. The notion of trying to do more things, trying new experiments all at the same time, rather than trying to make a compromise on only the best use case that we can run given our resource limitations.
[00:19:34] Unknown:
You mentioned capacity, and that's something that's worth digging into as well, in terms of how Hazelcast is able to take full advantage of the available memory, and the overhead that's necessary for the Java heap or the other things that are resident in memory, to make sure that you don't have to deal with things like the out of memory killer on Linux. Also, for cases where you have machines that are entering or exiting the cluster, being able to rebalance across the available nodes and not lose state or lose data in the process.
[00:20:07] Unknown:
Yeah. There are a lot of interesting innovations there to address those issues. For 1, the notion of dealing with garbage collection pauses on a Java system is a long running headache for anyone building large scale systems. And to address that, we've built our own memory manager. So with the enterprise version of our software, you can allocate many gigabytes, up to 100 gigabytes of memory, onto your Hazelcast managed heap. And so you avoid a lot of the headaches associated with garbage collection that would impact performance and ongoing operation. So that's 1 way that you can work around some of the limitations of the default Java garbage collector.
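The core trick behind such a memory manager, keeping data where the garbage collector never scans it, can be shown with a toy off-heap store. This is an illustration of the principle only (Hazelcast's High-Density Memory Store is far more sophisticated): records live in a direct `ByteBuffer`, which is allocated outside the garbage-collected Java heap.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Toy off-heap storage: the slab is a direct ByteBuffer, so its contents
// sit outside the GC-managed heap and never contribute to GC pause times.
class OffHeapSlab {
    private final ByteBuffer slab;
    private final int slotSize;

    OffHeapSlab(int slots, int slotSize) {
        this.slab = ByteBuffer.allocateDirect(slots * slotSize);
        this.slotSize = slotSize;
    }

    void put(int slot, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        slab.position(slot * slotSize);
        slab.put((byte) bytes.length);   // 1-byte length prefix (values < 128 bytes)
        slab.put(bytes);
    }

    String get(int slot) {
        slab.position(slot * slotSize);
        int len = slab.get();
        byte[] bytes = new byte[len];
        slab.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The trade-off is that off-heap data must be explicitly serialized and managed, which is exactly the kind of complexity a vendor-supplied memory manager hides.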
And Hazelcast is a true distributed system, so it's designed to deploy across multiple nodes in a cluster. And so the data protection facilities you expect are in place, such that there is replication. If a node goes down, then the remaining nodes will rebalance the data and accommodate that missing node, so that you restore to a state that's fully protective of all your data. And the CP subsystem that I mentioned earlier also has provisions to be able to protect the data, like the file system persistence. But also, the notion of split brain handling is an important consideration. So, for example, what if your cluster is partitioned into 2 due to some network outage? That's a very complicated problem that all distributed systems need to be able to solve. It's not as simple as just keeping on running as 2 independent clusters. You need to determine which cluster is gonna be the master cluster based on which has more nodes. And so you have to do a lot of calculations to determine that, and then decide what that master cluster can do and what the secondary clusters can do. So that split brain logic is handled within Hazelcast as well. And we've recently added some more intelligence around some of these edge cases where a node might appear to be down to some of the cluster, but appear fully available to other members of the cluster. Whenever that happens, and it does happen, unfortunately, due to some of the intricacies of a large scale network, you need to be able to handle it properly. And so we have some logic to decide, based on what the other nodes are seeing, whether that's truly a down node or not, so you don't create any inconsistency as a result of having an incorrect view of that node.
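The "which side keeps running" decision described here boils down to a majority rule. As a deliberately minimal sketch (real split-brain handling also covers tie-breaking, merge policies for re-joining state, and the suspicious-member logic Dale mentions):

```java
// Sketch of the majority rule used when a network partition splits a cluster:
// the side that can still see a strict majority of the original members acts
// as the authoritative (master) cluster; the minority side restricts itself
// until the partition heals and state is merged back.
class SplitBrainSketch {
    static boolean isMasterSide(int membersVisibleHere, int totalClusterSize) {
        return membersVisibleHere > totalClusterSize / 2;  // strict majority
    }
}
```

The strict-majority condition is why an even split is dangerous: neither half qualifies, which is also why odd cluster sizes are commonly recommended for quorum-based systems.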
[00:22:35] Unknown:
You mentioned earlier too that there are streaming capabilities that you have built on top of the data grid. And I had seen a mention of the Jet project recently, so I was interested in digging more into that as well, and some of the ways that it's implemented and the motivation for building it as a streaming engine for in memory compute.
[00:22:54] Unknown:
Yeah. Hazelcast Jet is a very high performance distributed stream processing engine, and so you can definitely compare it to the likes of an Apache Flink or Spark Streaming. And so a lot of the capabilities that you would expect on a stream of data, like windowing, aggregations, and fast processing, and then sources and sinks that you can easily connect to, are there. And 1 of the differences in how we approach Jet is how we can tie it together with a separate data store. Architecturally, Jet was built on top of our IMDG. So with the distributed computing capabilities that IMDG provided, it was relatively easy, and relatively is a very subjective word, but relatively easy to build Jet on top of that, where we have the Jet jobs that are submitted across the different nodes, each handling 1 of the many tasks within a pipeline.
And that pipeline is represented by a directed acyclic graph as a means of identifying all the different subtasks, optimizing each of those, and distributing them across the cluster. And running this on top of IMDG made a lot of sense, because a lot of our customers were trying to do something similar. They wanted to follow the streaming model of reading in data from a technology like Apache Kafka, doing some processing, and then spitting it out to some sort of sink, whether it's a database or even IMDG or back to Kafka. And so when we saw that people were trying to do that, we thought we could probably simplify it for them by providing an API that allows this capability. And we actually provide 2 APIs. 1 is for building your own complex data pipeline in a DAG model, so it's a very low level API. But what's making more sense for most users is our higher level declarative API, where you can specify the different functions or actions to take on your data in a more understandable way. So it allows you to write less code while still getting most of the advantages of being able to handle the stream processing.
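The windowing and aggregation that Jet's declarative API expresses over a live stream can be shown in miniature with a batch of timestamped events. This is a pure-Java sketch of the tumbling-window concept only; a real engine like Jet additionally distributes the work across the cluster and handles watermarks, late events, and fault tolerance.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Tumbling-window count: each event timestamp falls into exactly one
// fixed-size window, and we aggregate (here, count) per window.
class TumblingWindowSketch {
    // timestamps and windowMillis are in milliseconds; the result maps
    // window start time -> event count in that window.
    static Map<Long, Integer> countPerWindow(List<Long> timestamps, long windowMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestamps) {
            long windowStart = (ts / windowMillis) * windowMillis;  // floor to window boundary
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }
}
```

In a streaming pipeline the same aggregation runs incrementally as events arrive, which is what makes results available within the narrow latency windows discussed earlier.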
[00:24:57] Unknown:
Another interesting element of the work that you're doing at Hazelcast is the fact that the foundational technologies are open source, and so anybody can take them off the shelf and build on top of them. So I'm curious how you manage the governance of those open source projects and the guiding heuristic that you use for determining what to build into the open source versions versus your commercial offerings?
[00:25:21] Unknown:
We absolutely encourage the community to add new capabilities, make new suggestions, and add whatever functionality will suit their needs. We're in the process of adding a whole set of new capabilities to serve this notion of technology convergence, especially on the in-memory side, where there is a convergence of in-memory databases, in-memory data grids, and in-memory analytics. There are some provisions we're working on for this year to build out that larger scale, more comprehensive environment. At the same time, there are some capabilities that we restrict to the enterprise version. Those tend to be the capabilities that the open source community doesn't want to build, as far as I've seen throughout my career. Things like the high-density memory store, which involves our own off-heap memory management. That's certainly a complicated product, and so it's not something we would necessarily see the community work on. Business continuity is the second big area where we build in more capabilities to ensure 24/7 operation. And then, of course, security. I like to joke that if you're a good engineer, you'll say, I can build those capabilities myself, so I'll take the open source and just work on those capabilities.
So that's what a good engineer would say, but a great engineer would say, I'm not gonna risk my career building those things and take the fall for any problems that occur as a result of trying to build a highly available or secure system myself. Why not let a company like Hazelcast, with all of its expertise and experience, build those really critical pieces, and we'll pay for it rather than trying to do it ourselves?
[00:27:00] Unknown:
So for developers who are building on top of Hazelcast or contributing new functionality to it, what is actually involved in getting started working with it, and what interfaces are available to them for use as primitives within their own applications?
[00:27:17] Unknown:
Yeah. We follow the standard model. We have our source code available on GitHub, so you're welcome to fork it, make your modifications, and open pull requests, and our team will take a look at those and add the different capabilities. It's very much like any other open source company you've seen with that availability, and many contributors are familiar with that model and are able to make contributions here and there for the different capabilities. And like I said, we always welcome more input. Certainly, we don't have the type of community that an Apache project would, but we still get a lot of good contributions.
And it is a fairly technical and complicated product, so we encourage anybody who has that background to take a look and see what they can do to add more capabilities on top of something that we think could be really big as a result of this movement toward higher performance and more in-memory focused systems.
[00:28:15] Unknown:
For people who are building their application in Java, it seems that they would just add the Hazelcast library as a dependency to their Maven or Gradle specification and build it as they would any other Java project. But for people who are building in other languages, what's actually involved in getting a cluster of the Hazelcast data grid set up so that they can interact with it and start developing and experimenting with it?
[00:28:41] Unknown:
Yeah. Since we supply a JAR file, you can simply run a command line script that starts up Hazelcast as a server. The model you described was embedding it within your application, which gives you a self-contained Hazelcast application. But you can also use the JAR file to run it as a server, deploying it across your servers in a cluster. We then provide clients that can talk to that cluster and use it as an in-memory data store, so you can build applications on clients that run separately from the Hazelcast cluster while still getting similar functionality.
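The distinction between the two deployment models can be sketched in a few lines of Python. This is a toy illustration of the topology only, not Hazelcast's actual client API; the shared dict below stands in for a running server cluster, and no wire protocol is modeled:

```python
class EmbeddedMap:
    """Embedded mode: the map lives inside the application's own process."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class MapClient:
    """Client-server mode: the application holds only a thin client, while
    the data lives in a separate cluster that many clients can share."""
    def __init__(self, cluster):
        self._cluster = cluster   # stands in for a remote server cluster
    def put(self, key, value):
        self._cluster[key] = value
    def get(self, key):
        return self._cluster.get(key)

cluster = {}                       # fake "cluster" shared by all clients
app_a, app_b = MapClient(cluster), MapClient(cluster)
app_a.put("session-42", "alice")   # one client's write...
shared = app_b.get("session-42")   # ...is visible to every other client
```

The point of the client-server topology is exactly this: separate applications, possibly in different languages, all see one shared in-memory store, whereas an embedded map is private to its own process.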
[00:29:22] Unknown:
For people who are building and maintaining a cluster of Hazelcast servers, what are some of the operational considerations that they should be aware of, or potential edge cases that they should guard against?
[00:29:33] Unknown:
A lot has to do with data scale. If you're looking at a persistent storage mechanism that's dealing with, say, hundreds of terabytes or maybe even petabytes of data, we're not gonna say, let's see if that works for us. This is all about the operational aspects of your data, processing it at very high speed. So definitely think about Hazelcast as another component in a stack rather than necessarily a replacement. We're very good at complementing relational databases, NoSQL databases, and streaming environments. If you think of us as a new set of capabilities rather than a replacement, that's where you should really focus in terms of how we fit in.
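That complementary role, an in-memory layer in front of a database rather than a replacement for it, is essentially the cache-aside pattern. Here is a toy Python sketch (all names invented; this is not a Hazelcast API):

```python
class CacheAside:
    """Toy cache-aside pattern: serve reads from an in-memory store and fall
    back to the slower system of record only on a miss."""
    def __init__(self, system_of_record):
        self._cache = {}
        self._backing = system_of_record   # e.g. a database lookup function
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._backing(key)         # expensive trip to the database
        self._cache[key] = value           # keep it in memory for next time
        return value

db_calls = []
def slow_db_lookup(key):
    db_calls.append(key)                   # record each trip to the database
    return f"row-for-{key}"

store = CacheAside(slow_db_lookup)
first = store.get("user:7")                # miss: hits the database
second = store.get("user:7")               # hit: served from memory
```

The database remains the system of record; the in-memory layer only absorbs read traffic, which is the "complement, not replace" relationship described above.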
[00:30:14] Unknown:
In terms of the ways that you've seen Hazelcast being used, what are some of the most interesting, innovative, or unexpected projects that you've seen built with it?
[00:30:19] Unknown:
Yeah. I think there's a pattern emerging that is very interesting in terms of how Jet and IMDG are being used together. The notion of the Kappa architecture is one that was created by Jay Kreps, the CEO of Confluent, and he talked about how you have a streaming platform like Kafka, and you read in data, do some processing, potentially some indexing, and then write it out to a database or some other store for analytics. It's a very simple model, but it's extremely powerful. A lot of people might think of the Kappa architecture as a real time system only, when in fact one of the provisions within that architecture is that if there's any mistake in your code, or if you want to make any changes, you can go back to Kafka, replay that data stream, and process it again. And that description sounds very much like batch processing.
And so what we've discovered in some of the use cases that we're working on with customers is, what if you could do that batch processing in a very fast way, so that you're essentially mimicking a real time system and creating indexes and aggregations on demand? In fact, one of the really innovative and almost crazy implementations that I've seen was from a very big technology company that's a customer of ours. I can't talk about too many of the details, but the solution sounds very simple in terms of what it provides. It's essentially a recommendation engine for a certain industry segment. Business people log into their system, run a query, and get a recommendation based on their query. You might think a data warehouse would be able to provide that response, but it can't, because there's some processing and some artificial intelligence built into their architecture that needs to happen first. So you might think they would add Apache Spark, and now you have a data warehouse complemented with Apache Spark, and that would provide you the answer. It turns out they've done some testing, and a configuration like that would take many hours, over 10 hours, to give you a response.
And if you think about how people do analytics and run queries, they typically want to submit follow-up queries to further refine the answers. If you're limited to essentially one query per day, a 10 hour response time isn't going to work. So they did a lot of research, and they determined that with a fast stream processing engine, they could do all types of calculations on demand and get an answer in a matter of minutes. The underlying foundation is, in fact, the Kappa architecture, where they're not necessarily storing all the data within the recommendation engine. They have their repositories and their systems of record. They'll read in only the appropriate data based on the query, do a lot of processing, run through a lot of AI algorithms, save some data, and do some more processing. At any given time, they could be running hundreds of Jet jobs within their system. All of that processing gives them a comprehensive recommendation list after a few minutes, which is well within the range of what their customers expect. So this notion of on-demand indexing is, I think, going to be extremely powerful in the near future. One other use case that I found interesting was trade monitoring within capital markets for the back office folks. The front office folks, the traders, have these expensive high end systems that provide real time analysis, and that's expected because they're the revenue generators. But what about the finance people? What about the regulatory compliance people or the risk people? They don't necessarily have the same budget. So how can they get a near real time system to do analytics? If there's something strange going on in the market, they want to be able to assess their risk and their position within the context of the rest of the market.
What if they could just read all the relevant data for a given query, like what's going on with, say, IBM stock? Why is it rising so quickly? They can pull that data, process it really quickly, index it, and then run additional queries against the results stored in IMDG.
So, again, the underlying model is that Kappa architecture, but used for an on demand analytics type of scenario.
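The replay property at the heart of the Kappa architecture can be shown with a toy Python log. This stands in for Kafka retention plus a reprocessing job; all names below are invented for illustration:

```python
class ReplayableLog:
    """Toy append-only log, Kappa-style: the log is the source of truth, and
    any derived view can be rebuilt by replaying it with new logic."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def replay(self, aggregate_fn):
        """Recompute a derived view from the full event history."""
        view = {}
        for event in self._events:
            aggregate_fn(view, event)
        return view

log = ReplayableLog()
for amount in (10, 20, 30):
    log.append({"user": "u1", "amount": amount})

# First version of the processing logic: total spend per user.
def total(view, e):
    view[e["user"]] = view.get(e["user"], 0) + e["amount"]

# A bug fix or new requirement: count events instead. No data migration is
# needed; just replay the same log with the new function.
def count(view, e):
    view[e["user"]] = view.get(e["user"], 0) + 1

totals = log.replay(total)
counts = log.replay(count)
```

Replaying the whole log with new logic is exactly the "batch processing in disguise" property described above; the on-demand indexing use case just does it fast enough to feel interactive.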
[00:34:34] Unknown:
As somebody who has been working with the Hazelcast product and supporting customers, what are some of the most interesting or unexpected or challenging lessons that you've learned while working with Hazelcast and with the in-memory computing ecosystem?
[00:34:48] Unknown:
I think one of the challenges we've faced is the limits in terms of how people view what in-memory computing can do. A great example of what people should be thinking about is, what is that additional thing I could be doing? If I can go back to the fraud detection model, we have one customer who, because they recognized that they could gain a lot more revenue and cut a lot of the costs associated with fraud, wanted to get the most accurate fraud detection system possible. Using Hazelcast allowed them to try new types of algorithms simultaneously to come up with a composite score that represented a better prediction of which transactions could be fraudulent.
Now, most people wouldn't think, I need faster speed because I'm gonna try a lot of different things. But since they did, they were able to get some real value there. So my recommendation to a lot of folks is, don't think only in terms of what you have to do now, but in terms of what else you could be doing, and that's where a lot of the performance will provide additional benefits. It's about being able to do more things, not only about meeting your specific SLAs. I think a lot of people get trapped in thinking that's their goal, only trying to fit within the time frame that they're given. Rather, they should be thinking about what else they can do as a result of investing in in-memory.
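Running several detection algorithms on the same transaction and blending them into a composite score can be sketched like this. The heuristics and weights below are invented for illustration; the customer's actual models would be far more sophisticated:

```python
def composite_fraud_score(txn, scorers):
    """Blend several detection algorithms into one weighted score.

    `scorers` is a list of (weight, score_fn) pairs; each score_fn returns a
    value in [0, 1], where higher means more suspicious.
    """
    total_weight = sum(w for w, _ in scorers)
    return sum(w * fn(txn) for w, fn in scorers) / total_weight

# Two toy heuristics evaluated against the same transaction.
def large_amount(txn):
    return 1.0 if txn["amount"] > 10_000 else 0.0

def odd_hour(txn):
    return 1.0 if txn["hour"] < 6 else 0.0

txn = {"amount": 25_000, "hour": 3}
score = composite_fraud_score(txn, [(2, large_amount), (1, odd_hour)])
```

The speed argument is that the faster each scorer runs, the more of them you can afford to evaluate per transaction, which is what improves the composite prediction.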
[00:36:08] Unknown:
For people who are considering in-memory compute, or taking a look at the Hazelcast IMDG or Jet products, what are the cases where that's the wrong choice?
[00:36:18] Unknown:
Definitely for IMDG, if you're thinking about using it as some type of system of record, that probably isn't the greatest choice. Even if you think you don't have enough data to overrun your available memory, you'll probably be in an environment where that data continues to grow. So using it purely as a storage mechanism is probably not a great choice. In fact, there are NoSQL databases out there that you could use as a basic cache; you wouldn't get the same low latency, but for the scale of data you might have, that might work. So I think basic caching is not a terrible choice, it's just not ideal. At the same time, think about what else you could be doing on top of caching, like the distributed computing and the stream processing, so that you can grow your architecture to do far more than you initially imagined.
[00:37:11] Unknown:
And as you continue to build on top of Hazelcast, work with customers, and explore its capabilities, what do you have in store for the future of the product and the business?
[00:37:22] Unknown:
Yeah. On the business side, we have some great opportunities working with some very big partners. I mentioned Intel earlier with their Optane technology, and we're certified to run on it. They're really excited to work with us, as we are with them, on telling the story of how to use these newer storage technologies to a greater extent and get the performance you need for some of your more advanced systems. So we're doing some work with them, and I think that relationship will grow to new levels for us. We've also recently partnered with IBM and jointly closed a huge deal where we fit into their technology known as Cloud Paks. IBM is making a very strong push toward cloud native technologies, and their Cloud Paks are software packages with a variety of technologies embedded so that you can build out a cloud native environment, either in the public cloud or on premises. You get the same agility that you would expect from any cloud deployment. We're working closely with them, and we're embedded and OEMed as part of their Cloud Paks.
So we expect that to be a big story in terms of getting the word out on in-memory. And as I alluded to earlier, we're making a few changes to our architecture to expand our capabilities. So perhaps one day people will think of us as more than just a data grid, which means we could be that converged technology that handles system-of-record data for longer term storage.
[00:38:48] Unknown:
Are there any other aspects of the Hazelcast product, the in-memory computing space, or the work that you're doing at Hazelcast that we didn't discuss that you'd like to cover before we close out the show?
[00:38:59] Unknown:
Yeah. One thing I like talking about, and this has been resonating really well, is the fact that in-memory goes together with stream processing extremely well. Think about event driven architectures today where you're getting data from, say, an IoT environment. The data might be fairly sparse: you might get a machine ID and some device readings and not much more. But for the purposes of analytics, you need a little more information about the machine and about all the different readings you're getting, and that requires a lookup. The fact that Jet is built on top of IMDG means you get this in-memory store for free, and because they work together, collocated within the nodes across a cluster, you get that lookup at the absolute lowest latency. I think that's one of the differences that we're providing in the market, but there are other technologies that have strengths in other areas, where scale might be the important aspect of their data storage, or where certain types of libraries are available.
So depending on what your needs are, in a lot of cases the technologies you commonly use will be appropriate. But for some of these newer use cases, we're seeing that in-memory and streaming working together is a very powerful combination, especially, if I can add to that, at the edge, where there are a lot of restrictions around how much processing power you have, how much physical space you have, and how much access you have to the computing technology for maintenance. Having a lightweight but powerful streaming and in-memory system running at the edge as well as in the cloud provides that edge-to-cloud connectivity for processing data where it's created, while also providing large scale analysis across all of the different data sources that you have.
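The enrichment lookup described above reduces to a join of sparse events against an in-memory reference table, sketched here as a Python toy (field names and sample values are invented; in Jet the lookup would hit a collocated IMDG map rather than a local dict):

```python
# Reference data held in memory, keyed by machine ID (invented sample values).
machine_info = {
    "m-17": {"site": "plant-a", "model": "X200"},
}

def enrich(readings, lookup):
    """Join each sparse reading with reference data via an in-memory lookup,
    mimicking a stream-enrichment stage collocated with its data store."""
    return [
        {**reading, **lookup.get(reading["machine_id"], {})}
        for reading in readings
    ]

# A sparse IoT reading: just an ID and a measurement.
sparse = [{"machine_id": "m-17", "temp_c": 71.5}]
enriched = enrich(sparse, machine_info)
```

Because the reference table lives in memory next to the processing stage, each event is enriched without a network round trip to an external database, which is the latency argument being made above.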
[00:40:54] Unknown:
Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:41:10] Unknown:
I think one of the biggest gaps today is in technologies that allow companies to expand beyond their existing architectures. Certainly, there are ETL tools, there are data integration tools, and there are new sorts of databases and data management platforms that you can put together. But for the most part, connecting them all has been a big challenge. I think one approach that can help alleviate that is using some of these newer technologies, one of which Gartner refers to as a digital integration hub. It acts as a data layer that incorporates data from a variety of sources into one common platform and exposes an API. Back-end processing handles things like ETL from systems of record or large scale storage, whereas the digital integration hub tends to be more front facing, closer to the consumers and the application users.
We've seen these types of systems deployed quite a bit, especially in banks, for the purposes of higher speed access and better responsiveness for customer satisfaction. I think that area has been underserved so far, so putting more focus on these digital integration hubs will, I think, represent the next wave of technology for managing data in this increasingly distributed world.
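The hub-as-a-facade idea can be sketched as a toy Python class: one front-facing API assembling a unified view from several back-end sources. Everything here is invented for illustration; a real digital integration hub would add caching, security, and change data capture:

```python
class IntegrationHub:
    """Toy digital integration hub: one front-facing API over several
    back-end sources, each represented here by a simple fetch function."""
    def __init__(self, sources):
        self._sources = sources   # name -> fetch(customer_id) function

    def customer_view(self, customer_id):
        """Assemble one unified record from every registered source."""
        view = {"id": customer_id}
        for name, fetch in self._sources.items():
            view[name] = fetch(customer_id)
        return view

# Hypothetical back ends stubbed out as lambdas.
hub = IntegrationHub({
    "crm":     lambda cid: {"name": "Ada"},
    "billing": lambda cid: {"balance": 12.50},
})
view = hub.customer_view("c-9")
```

The design point is that consumers call one API and never learn which system of record each field came from, which is what makes the hub "front facing" in the sense described above.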
[00:42:44] Unknown:
Thank you very much for taking the time today to join me and share the work that you're doing at Hazelcast and the insights that you have into the use cases for in-memory compute and storage. It's definitely a very interesting problem space, and it's exciting to see some of the ways that it can complement traditional architectures. I definitely look forward to seeing the ways that it evolves, and I appreciate all of your time and effort in helping to drive that forward. So thank you again for your time, and I hope you have a good rest of your day. Thank you very much, Tobias. It was great talking to you. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts at data engineering podcast.com with your story. And to help other people find the show, please leave a review on Itunes and tell your friends and coworkers.
Introduction to Dale Kim and Hazelcast
Overview of Hazelcast and Its Origins
Benefits and Trade-offs of In-Memory Computing
Common Use Cases for Hazelcast
Hazelcast Architecture and Evolution
Introduction to Hazelcast JET
Open Source Governance and Community Contributions
Innovative Use Cases and Customer Stories
Lessons Learned and Future Directions
Biggest Gaps in Data Management Tooling