Summary
A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Upsolver is and how it got started?
- What are your goals for the platform?
- There are a lot of opinions on both sides of the data lake argument. When is it the right choice for a data platform?
- What are the shortcomings of a data lake architecture?
- How is Upsolver architected?
- How has that architecture changed over time?
- How do you manage schema validation for incoming data?
- What would you do differently if you were to start over today?
- What are the biggest challenges at each of the major stages of the data lake?
- What is the workflow for a user of Upsolver and how does it compare to a self-managed data lake?
- When is Upsolver the wrong choice for an organization considering implementation of a data platform?
- Is there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house?
- What features or improvements do you have planned for the future of Upsolver?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Upsolver
- Data Lake
- Israeli Army
- Data Warehouse
- Data Engineering Podcast Episode About Data Curation
- Three Vs
- Kafka
- Spark
- Presto
- Drill
- Spot Instances
- Object Storage
- Cassandra
- Redis
- Latency
- Avro
- Parquet
- ORC
- Data Engineering Podcast Episode About Data Serialization Formats
- SSTables
- Run Length Encoding
- CSV (Comma Separated Values)
- Protocol Buffers
- Kinesis
- ETL
- DevOps
- Prometheus
- Cloudwatch
- DataDog
- InfluxDB
- SQL
- Pandas
- Confluent
- KSQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. Your host is Tobias Macey. And today, I'm interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease. So, Yoni, could you start by introducing yourself?
[00:00:54] Unknown:
Hi, Tobias. Yeah, I'd be happy to. So I'm Yoni, the CTO of Upsolver. I started Upsolver in 2014, and we've been building a data lake platform since around the end of 2015, beginning of 2016. Originally, I learned how to program in the Israeli army, and I've been doing that ever since. And do you remember how you first got involved in the area of data management? Yeah. So I guess, you know, the army uses tons of data. Basically, all my projects were data projects, and every time I worked with data, in the army and after, I think 90% of the work was getting it into a usable state, and then 10% of the work was getting insights out of the data. So I think it was a natural progression. You build the same data management system again and again and again for each project.
And then eventually, you're like, why are we doing this each time? Why don't we have just a a system that does it for us? So so, yeah, I think that was the that was the motivation.
[00:01:54] Unknown:
And so can you give a bit of an overview about what Upsolver is both as a platform and a business and how it first got started? Yeah. Upsolver as a I mean, we bill it as a data lake platform.
[00:02:07] Unknown:
And, basically, the idea is that if you're using a data lake, you should really be using Upsolver, or let's say Upsolver or an equivalent system, to manage your data. And it started because we were originally in advertising; that was the space. The company was actually a product company, and we decided to build a data lake because we had huge amounts of data. And I think literally 90% of our effort went into managing the data, and it really slowed down development. I think there are a lot of tools out there that help, especially open source tools. There's Spark and Kafka and all these tools that help you work with the data, but it's very ad hoc. You need to really put everything together, so you're basically repeating the same work over and over again. So once we realized that we were putting in so much effort, that so much of the company was based on just managing the data, together with our desire to build interesting software products, we decided to go in that direction. And what are your overall goals for the platform
[00:03:19] Unknown:
and, what you're hoping to
[00:03:21] Unknown:
solve for people who are choosing it as their platform of choice for a data lake? So I think the data lake itself is a pretty new concept, and right now it's still at the level where people are debating: is it good? Is it bad? Should I actually be using a data lake? And I strongly believe that, except for a few edge cases, data is gonna be moving into data lakes. And then I would just want Upsolver to be the de facto standard data lake platform. Basically, I want people to stop suffering with their data and start enjoying it, which is something that's very hard to do at the moment. So, yeah, I think that's where I would want Upsolver to be. And as you mentioned, there are
[00:04:06] Unknown:
still a lot of discussions about whether or not data lakes make sense in terms of being the platform for people's data or people should be using data warehouses or some other solution. So I'm curious what your thoughts are on when data lakes are the right choice for a platform and whether you think that they should be the primary component or simply used as 1 stage of the overall life cycle of the data and that it should ultimately end up in a data warehouse?
[00:04:37] Unknown:
Yeah. So, first of all, I think "data lake" isn't well defined at the moment. A lot of people have different opinions of what a data lake is. So when I'm talking about a data lake, I'm talking about a system that raw data is coming into. So the first, or first and a half, step within your data pipeline is gonna be the data lake. And then on top of the data lake, you have a lot of data discovery and data management, so you're getting value out of the data lake. I think a lot of people think of a data lake as just kind of a place to put your data, and then we'll figure it out later. And obviously, when you think about it that way, it's not very valuable. Maybe you'll get value later, but building something for some unknown objective isn't great. So really, the idea, at least when I think about a data lake, is that you want to be getting a lot of services out of it. So why would I want a data lake? Why am I not happy with my database, or with Excel for that matter? So, you know, they talk about the three Vs: data velocity, variety, and volume. And I think really, even if you have only one of those Vs, you're probably gonna want to go in the direction of a data lake. So if you have a large volume of data, then the cheap storage really makes a huge difference. I mean, with a data lake you're basically gonna be paying for compressed data on blob storage, so you're paying roughly $20 per month per terabyte.
I mean, that's the cost of compressed data. Whereas, I mean, you'll easily pay 10 times as much in in any alternative, basically. Like, there isn't there isn't any alternative that isn't gonna have the data on hard drives that are that are costing you a lot more and multiple replicas, etcetera, etcetera. So that would be volume. And then velocity is is kinda similar. I mean, but with velocity comes volume. If you have tons of events coming in all the time, then, obviously, you're just gonna have a lot of data in general. But aside from that, also, data that's changing very, very quickly tends to be very challenging for non data lake systems. So so you're gonna be spending a lot of effort kind of just managing your data, which, again, is not what you wanna do. I mean, most people have a business that's that's a real business. You wanna you wanna get something out of the data, but you don't wanna be coddling it all the time. And then the third thing is data variety.
So you have a lot of different data sources and a lot of different types of data. And with a database, or with any data management system where first I do the management and then I get value, you're gonna be doing a lot of extra work. A lot of the data just isn't that interesting, so you're never gonna be touching it. And just figuring out what the schema is, how I should be looking at it, how I should pivot it, that's a lot of stuff that you don't even necessarily know ahead of time. So I think even if you have low data volume but high data variety, you might wanna use a data lake, just so that you have a place where you can dump all your data and right away see what it is, how you're supposed to work with it, things like that. And one of the difficulties
[00:07:44] Unknown:
of data lakes in particular because of what you're mentioning of variety of data or changing data is being able to manage the schema to be able to ensure that you're able to analyze the information once it lands. So what are some of the ways that you approach that challenge of being able to ensure that the information that gets stored into the data lake is ultimately useful and that you're not just writing out a bunch of records that don't have any sort of common format or common schema that will just end up causing greater pain down the road, rather than doing any sort of upfront verification
[00:08:24] Unknown:
before it ends up getting written to disk? Yeah. So I think that's a pretty common problem. Often data is coming from sources that you don't even control. You have a Kafka stream, it's coming from mobile devices out in the world, and some of them just send garbage, basically. Some of it isn't even parsable, some of it is in completely the wrong schema, your streams get mixed up. These things happen in the real world. So I think the tooling is super important there, and in Upsolver, that's basically what we're focusing on: giving you visibility into messy data. The first thing, maybe not obviously, is that you usually don't wanna discard any data. If it's a single record that doesn't pass validation, okay, that kind of makes sense to discard, but we don't know that ahead of time. Often you're gonna have a stream of data, and suddenly it shifts because someone released a new version, and all your data is of a new schema that doesn't pass validation. And then are you just gonna discard your entire data stream until you fix it? That's awful. So I think the most important thing, first of all, is just to get the data in there. You don't wanna do any validation as you're ingesting data. You wanna make sure that everything is stored properly, everything is visible. And then from there, you wanna do your data cleanup within the data lake. That's really the paradigm that we're promoting.
Data comes in, it's stored in your object storage, and then you run cleanup ETL operations, which are gonna filter out records that aren't interesting or split a single stream into multiple streams. So, yeah, I think basically the difficulty is insufficient tooling. Once the data is in, I have to be able to see it, and Upsolver goes a long way toward helping you do that. Obviously, without being able to see what your data is, a data lake doesn't really add much. The data is there, but I don't know if it's broken or not; I'm gonna find out anyway in the downstream systems. But if it gives you the services so that I can see how many things don't pass validation within the data lake, and I can run the ETLs and clean it up within the data lake itself, then I'm saving a lot of work.
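As a rough illustration of the "land everything first, clean it up inside the lake" pattern described here, a minimal sketch might look like the following. It is not Upsolver's implementation; the bucket name, prefixes, and the event_type routing field are hypothetical, and boto3 against S3 stands in for whatever object store is in use.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"           # hypothetical bucket
RAW_PREFIX = "raw/clicks/2018-11-01/"  # events landed as-is at ingestion time

def cleanup_batch():
    """Re-read already-landed raw events, filter junk, and split by event type."""
    routed = {}   # event_type -> cleaned records
    dropped = []  # records we filter out (kept and queryable, not deleted)

    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=RAW_PREFIX)
    for page in pages:
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            for line in body.splitlines():
                try:
                    record = json.loads(line)
                except ValueError:
                    dropped.append(line.decode("utf-8", "replace"))
                    continue
                # "event_type" is an assumed field; substitute whatever drives your split
                event_type = record.get("event_type")
                if not event_type:
                    dropped.append(record)
                    continue
                routed.setdefault(event_type, []).append(record)

    # One cleaned stream per event type; filtered records stay visible in the lake too.
    for event_type, records in routed.items():
        s3.put_object(
            Bucket=BUCKET,
            Key=f"cleaned/{event_type}/2018-11-01/part-0000.json",
            Body="\n".join(json.dumps(r) for r in records).encode(),
        )
    if dropped:
        s3.put_object(
            Bucket=BUCKET,
            Key="quarantine/clicks/2018-11-01/part-0000.json",
            Body="\n".join(json.dumps(r) if isinstance(r, dict) else r for r in dropped).encode(),
        )

if __name__ == "__main__":
    cleanup_batch()
```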
[00:10:39] Unknown:
And what are some of the overall shortcomings of data lakes as a platform, and some of the challenges that are unique to that particular approach?
[00:10:49] Unknown:
So I think I mean, I think of a data lake as, I mean, as similar to a database. Kind of like a database with a different, like, with a different paradigm, but but it's supposed to give you pretty much the same the same value, like cheaper, bigger, etcetera, but aside from that. So so I think the the main shortcoming is just that there's very little, very little in the space now. Like, you have Spark that lets you and and Presto and Apache drill and and things like that that let you query the data, but but there isn't anything that's really like I mean, they're all open source tools, which are which are great, but there isn't anything that kind of pulls it all together. So I think generally, the the main shortcoming, I think that's gonna disappear in a few years. I mean, especially, obviously, with upsolver.
But also, I'm sure there's gonna be a lot of competition doing similar things, and the cloud services themselves, Amazon and Microsoft and Google, are gonna be adding a lot more features around that to support it directly. But, yeah, I think it's just very immature right now, so you don't have a lot of tools that just run on top of the data and give you value. Maybe the other thing is that if you have small data, like a dataset that's just static, I have a few million records and I wanna crunch that, then data lakes do tend to bring in some overhead, as opposed to databases or Excel, which are really good for that. Data lakes are built with large streaming data in mind. So if your data isn't large streaming data, then possibly you're gonna be making an effort to fit it into that square hole when it's really round data, and maybe you're better off using different tools. So I think it does depend on the type of data that you're using.
[00:12:38] Unknown:
And so for the Upsolver platform itself, can you describe the overall architecture and the major components of how it's put together, for people to be able to manage these large volumes, variety, and velocity of data?
[00:12:54] Unknown:
Yeah, I'd be happy to. So first of all, Upsolver runs in the cloud. Not exclusively, let's say, but it's definitely much happier there. The first thing is that you're usually talking about huge amounts of data, so you have to be very, very cost efficient. So the entire architecture is headless, just to make it easier to scale and easier to manage, and the whole thing runs on spot instances, or the equivalent of spot instances in the different clouds, again just to save money, since we're using object storage as the central repository: S3 in Amazon, Blob Storage in Azure, etcetera. So all the data is really cheap, but you wanna make sure the processing is also as cheap as possible on these volumes of data. So basically, anywhere we could cut costs relative to the amount of data, that's what we chose to do, and that kind of shows in the architecture at each step. I mentioned that object storage is where we store data, so that's really a basic tenet of the entire architecture.
Anything we do is gonna be stored within your object storage. There isn't any, like, repository on the side or or anything like that. So your all your data is stored in a single place and and all of your ancillary services are also based on that. So so again, like saving money and also, simplicity because everything's everything's in the same place. Scheduling in upsolver is the first class citizen. It's basically I would even say, like, the scheduling is the top level thing that then everything is built around up because you have data coming in all the time. And, you know, I think it's usually the the other way around in in open source big data projects where you start with writing your scripts, your ETLs, and then you figure out how to schedule them with an external system. And I think that leaves a lot of kind of a lot of efficiency on the table. When when you start with the scheduling, then, I mean, you have a lot more power as far as as far as dependencies between tasks and and just making everything a lot more seamless. So that's a big thing, for upsolver.
And I think I mean, not like the last thing, but but kind of even maybe the biggest thing is, how we index data within the data lake. So, basically, we have proprietary object called the materialized view. It's basically a a file format of data stored in the data lake, but that lets us query large amounts of data in real time. So I think very similar to the architecture of of, key value stores like Cassandra or or Redis, but built directly on top of object storage instead of, using local, local storage for it. I mean, object storage is super fast, really, and it got much faster in the last even year, I would say. So, so you're not paying a penalty for that at all, but but it can scale much, much better. So I think when you combine that with, with everything else, that that's what really gives you kind of the visibility into your data that you really need. And 1 of the things that you mentioned in there that I find particularly interesting is the use of spot instances
[00:16:15] Unknown:
as a cost mitigation strategy, which makes sense in terms of the pricing, but it also introduces a lot of challenge and complexity in terms of being able to ensure that you're able to deploy and scale these instances quickly and consistently as well as that given the fact that you're processing these datasets, there is a certain amount of statefulness to the computation. So I'm curious how you manage the cases where spot instances get removed in the middle of processing, ensuring that you don't end up just dropping those tasks and being able to pick back up where you left off? Yeah. So, I mean, basically,
[00:16:58] Unknown:
the whole architecture is pretty much built to make sure that you can preempt an instance without losing anything. So first of all, the headless architecture is a big part of it. Basically, any server is just a worker, kind of similar to a Spark worker, but you don't have the master server that would need to be a non-spot instance. You just don't have that. And then all of the work that our system does is from object storage to object storage. So you have a well-defined chunk of work that you need to do: it starts from one or many blobs and ends with one or many blobs. And since that's a well-defined unit of work, any server can do it.
Multiple servers could do it at the same time, if there's split-brain syndrome or something like that, and nothing bad is gonna happen; it's not gonna break anything. So once you have that, a server can die and some other server is gonna pick up and do the work instead. So basically, it becomes a question of latency, which is super important, by the way, not just the fact that a server can go down in the middle of a computation. You definitely wanna make sure you're not introducing a five or ten minute lag in the processing of the data. A lot of our customers use Upsolver for serving real-time data, so they want the data to be as fresh as possible, and adding a few minutes is definitely out of the question. So there is a lot of complexity there, but I think that's dwarfed by the complexity you'd have if you could actually get data correctness issues when a server suddenly falls over in the middle.
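To make the "well-defined unit of work" idea concrete, here is a minimal sketch of an idempotent blob-to-blob task. It assumes S3 via boto3 and hypothetical key names; it is not Upsolver's scheduler, just an illustration of why a preempted spot instance only costs latency, not correctness.

```python
import gzip
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket

def run_work_unit(input_keys, output_key):
    """One self-contained unit: read fixed input blobs, write one output blob.

    The output key is derived from the inputs rather than from the worker, so a
    preempted task can be re-run by any other server, and two servers racing on
    the same unit simply write the same object twice -- no coordination, no lost
    state, only added latency.
    """
    records = []
    for key in input_keys:
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        records.extend(json.loads(line) for line in body.splitlines())

    # Example transformation: keep only events carrying a user id (assumed field).
    kept = [r for r in records if r.get("user_id")]

    s3.put_object(
        Bucket=BUCKET,
        Key=output_key,
        Body=gzip.compress("\n".join(json.dumps(r) for r in kept).encode()),
    )

# A scheduler (the piece treated as a first-class citizen above) would hand out
# units like this; here one unit is spelled out by hand.
run_work_unit(
    input_keys=[
        "raw/clicks/2018-11-01/00/blob-0.json",
        "raw/clicks/2018-11-01/00/blob-1.json",
    ],
    output_key="processed/clicks/2018-11-01/00/part-0000.json.gz",
)
```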
[00:18:40] Unknown:
And another piece that you mentioned is that you have your own specific data format for being able to store information in object storage. So I'm curious at what point the data is transformed into that format and how it compares to some of the open source options such as Avro or Parquet or ORC files?
[00:19:03] Unknown:
Yeah. So first of all, to be clear, the data lake itself is stored in either Avro or Parquet, and it's completely open; you can read it with any external system you want. These indexes are used either for gathering metadata on the data itself, or else when you specifically build one because you want it for a use case, for example doing aggregations on your data or things like that. So it's an additional service on top of the data, which is in an open format. And our materialized view format, I would compare it most to the SSTables of Cassandra. Basically, it's a key-value store format: a file whose purpose is to be loaded into memory and then answer get-by-key requests. If you think about, for example, Upsolver's UI, where given a data stream you can see all the different fields, the top values, and the percentage of each value within the field's distribution, that's a lot of get-by-key requests. And I think the majority of operational workloads are gonna be something like that, which is what, for example, Cassandra or Redis would do. So the purpose of the file format is to fill that niche. And if I'm talking about the difference between our file format and Cassandra's, well, you could also compare it to Parquet, I think, but Parquet is a columnar format, so it's more for scanning a file rather than getting a specific row. We are leaning on the same principles, though: you want your data to be as compressed as possible, and columnar formats are very good at compressing, because data tends to be similar to other data within the same column, but not similar to other columns. So one of the things we did within this format is that while we are storing it as a row-based format, so you can deserialize a single row instead of needing to look up each and every column value, we are compressing it per column, using arithmetic coding and tokenization.
But basically, the purpose is that we want to get the maximum compression for each value, but within a row. So there are a few things you can't really do, like run-length encoding, which is very common in columnar formats, and we're basically trying to mitigate those issues with very powerful compression. What it lets you do is store a lot of data in a relatively small file in object storage, and then when you load that into memory, you can just hold it there as is. Whereas alternative formats, like the Cassandra SSTable files, are actually decompressed when loaded into memory, so the server can only hold in memory a factor of ten less than would otherwise be possible if it was stored compressed.
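As a toy sketch of the general idea of a compressed, get-by-key materialized view held in memory, the snippet below compresses each row and only inflates the row that a lookup touches. It is deliberately simplified: Upsolver's actual format compresses per column with arithmetic coding and tokenization, whereas zlib per row here merely stands in for "stays compressed until queried". The rows and field names are hypothetical.

```python
import json
import zlib

def build_view(rows, key_field):
    """Build an in-memory key -> compressed-row map (a toy 'materialized view').

    Rows stay individually compressed in memory, so a server can hold far more
    of them than if they were decompressed up front; only the row that a
    get-by-key request touches is ever inflated.
    """
    view = {}
    for row in rows:
        view[row[key_field]] = zlib.compress(json.dumps(row).encode())
    return view

def get_by_key(view, key):
    blob = view.get(key)
    return None if blob is None else json.loads(zlib.decompress(blob))

# Hypothetical aggregated stream: counts per user, the kind of get-by-key
# lookup a UI or an enrichment ETL would issue constantly.
rows = [
    {"user_id": "u1", "events_seen": 42, "last_seen": "2018-11-01T10:00:00Z"},
    {"user_id": "u2", "events_seen": 7,  "last_seen": "2018-11-01T10:03:00Z"},
]
view = build_view(rows, key_field="user_id")
print(get_by_key(view, "u1"))  # {'user_id': 'u1', 'events_seen': 42, ...}
```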
[00:22:03] Unknown:
And so the fact that you keep the source data in that Avro or Parquet or whichever other file format, I think is definitely valuable to the end user because it enables them to use whatever other processing or analysis or extraction tools they want to use for, other purposes beyond what they're doing with Upsolver while at the same time being able to take advantage of the optimizations that you're doing with your file format, which sounds to me as though it's, in some ways more of the metadata layer of being able to track the catalog of what data you have and what schemas are available without having to go all the way back to the source files every time to be able to introspect, what schemas are there, where it's stored, what, sort of partitioning,
[00:22:56] Unknown:
strategies are being used, etcetera. Yeah. Yeah. Totally. I mean, you really nailed the, like, the differentiation. So, I mean, yeah, you know, they're often I just wanna see what's in the data. I wanna open the stream and say, like, you know, this field, what are actually the values? Did I ever see anyone with with, to be deleted equals true in this in this stream. Like, I I wanna just know what the contents of my data is. And then asking that question of the raw data can often like, it's a process. It's work. I wanna just see it. It's it's a simple question. Give me the simple answer right away. So so for that, yeah, you have like, that's the, kind of metadata catalog level level, and that's I'd say that's step 1 and a half to step 2 of of data process. But then you actually have the, you know, your systems. You have, like, I wanna actually put some view of this data into a database because my website uses it, or or I want to transform it into a dataset to run machine learning, or I wanna create a report on top of that. So so all these, either I'm gonna use Spark or or any other tool that just takes this data and processes it and does whatever and and outputs it somewhere, or I would use upsolver as a tool that that does an ETL, takes the data transforms it, aggregates it, etcetera, and outputs it to to a table or to to some other downstream system.
But but, yeah, like, you know, it's whatever tool is the best at that point. But definitely, you're gonna want to at least be able to see your data and make sure that it's all managed, it's being retained properly, and and it's in kind of a a consistent format that you're not gonna have, like, surprises down the road. And
[00:24:36] Unknown:
as we mentioned earlier, one of the challenges of data lakes is being able to manage schema consistency and schema validation at the ingestion point. So I'm curious what strategies you're using for validating the schema of the files as they come in, and how you handle records that don't match a specified schema for whatever reason, while still being able to ensure that you're not just throwing data away arbitrarily because of some sort of bug at some step in the pipeline. Yeah. So I'm gonna divide data into two buckets.
[00:25:12] Unknown:
You have self-describing data and non-self-describing data. I think data in the wild at this point is actually something like 90% JSON files, so that's self-describing. And as long as it's valid JSON, you can basically ingest the data regardless of what's in it, and it'll be fine. On the other side you have, for example, Avro used per record instead of as a closed file, or CSVs without headers, or protobuf, all these formats that aren't actually self-describing, where you need to have an external schema. And those are a lot trickier, because you could actually have, I mean, we have a customer that has Avro data, and their schema evolved, and the old schema obviously couldn't read the new data and vice versa. And so basically everything just crashes.
And so the way we deal with those situations is that we basically divide our ingestion into two steps. The first is taking the data from wherever it is, often Kafka or Kinesis or something like that, and just copying it into object storage as is. So basically not doing anything, just making sure that it's gonna persist at least until it's finished parsing. And then the parsing of the data takes it from the original format and basically builds the data lake, so it creates Avro files or Parquet files, depending on how you configured it. But everything has to pass validation at that point, so it has to be legal data. I'm not checking that the fields are correct, just checking that I actually know how to read it. And anything that doesn't pass validation is put aside, and you get alerts in the UI and things like that so you can know, okay, there was a problem, and then you have tooling that allows you to fix it. If it's just a matter of schema evolution, for example, you can say, okay, this is actually the new schema, and then re-ingest the broken data. But let's say that's only relevant for situations where you have data that isn't self-describing.
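A rough sketch of that two-step pattern, with hypothetical bucket and key names, using boto3 and pyarrow as stand-ins for the actual ingestion machinery: step one persists the raw bytes untouched, step two parses them into Parquet and sets aside anything that fails, surfacing the counts for alerting rather than silently dropping records.

```python
import io
import json
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket

def step1_persist_raw(messages, target_key):
    """Step 1: copy the bytes off the stream untouched -- no parsing, no validation."""
    s3.put_object(Bucket=BUCKET, Key=target_key, Body=b"\n".join(messages))

def step2_parse_into_lake(raw_key, parquet_key, error_key):
    """Step 2: build the lake (Parquet here). Records that fail parsing are set
    aside rather than discarded, and the counts are returned for alerting."""
    raw = s3.get_object(Bucket=BUCKET, Key=raw_key)["Body"].read()
    good, bad = [], []
    for line in raw.splitlines():
        try:
            good.append(json.loads(line))
        except ValueError:
            bad.append(line)

    if good:
        buf = io.BytesIO()
        pq.write_table(pa.Table.from_pylist(good), buf, compression="snappy")
        s3.put_object(Bucket=BUCKET, Key=parquet_key, Body=buf.getvalue())
    if bad:
        s3.put_object(Bucket=BUCKET, Key=error_key, Body=b"\n".join(bad))

    return {"parsed": len(good), "failed_validation": len(bad)}

# Usage would pair the two steps per batch, e.g. raw/... in step 1 and
# lake/...parquet plus errors/... in step 2, with the returned counts feeding alerts.
```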
So when data is self describing, it's gonna go in anyways. You're gonna see the data. And then often what you're gonna see is that in the last 5 minutes, suddenly this new field appear or or this field that I count on, it's suddenly not there because it changed name or something like that. And that's usually something you're gonna find out at your very down the stream system. Suddenly, I have my my web app, and it hasn't received data in the last 5 minutes because I don't know why, the data is just gone. So that's basically a situation where where your your schema evolved or changed or your data changed in a way that you didn't expect, and then it breaks your ETLs and and eventually your your end end system. And then what you want is the tooling to basically take look at all the steps going backwards until you can see where what the root cause is and then correct it, in retrospect.
And that's really what the data lake is meant to do, because you can get visibility into all of your data at any time scale and drill down as much as you want. So you can see, okay, the problem started at this minute, and this is what the data looked like before, this is what it looked like after, and then you can change the ETLs. And since it's all stored in the data lake, you can also change the ETLs retroactively. You can say, from the point where the data changed, or a bit before, now I'm gonna do the new transformation. So, yeah, if I'd summarize that, I would say use schema-on-read if at all possible, if your data is self-describing, and try to move all of the validation and transformations down the funnel to places where you have full visibility into the data.
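As an illustration of the "a field suddenly appeared or disappeared" checks that schema-on-read makes possible, here is a minimal, self-contained sketch that compares the field paths seen in two windows of a self-describing stream; the windows and records are hypothetical.

```python
import json

def field_paths(record, prefix=""):
    """Flatten a (possibly nested) JSON record into a set of dotted field paths."""
    paths = set()
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths |= field_paths(value, prefix=f"{path}.")
        else:
            paths.add(path)
    return paths

def schema_of(batch):
    """Union of field paths seen across a batch of records."""
    seen = set()
    for record in batch:
        seen |= field_paths(record)
    return seen

def drift_report(previous_batch, latest_batch):
    before, after = schema_of(previous_batch), schema_of(latest_batch)
    return {
        "new_fields": sorted(after - before),      # e.g. a new app version started sending these
        "missing_fields": sorted(before - after),  # e.g. eventType silently renamed to EventType
    }

# Hypothetical five-minute windows of a self-describing (JSON) stream.
previous = [json.loads('{"eventType": "click", "user": {"id": "u1"}}')]
latest   = [json.loads('{"EventType": "click", "user": {"id": "u2"}, "experiment": "b"}')]
print(drift_report(previous, latest))
# {'new_fields': ['EventType', 'experiment'], 'missing_fields': ['eventType']}
```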
[00:28:55] Unknown:
I'm also curious how you handle monitoring and alerting on the data as it's flowing in. But before we get to that, how has the overall architecture of Upsolver itself changed over the time that you've been working on it, from when you started to where it is now? And if you were to start over today, is there anything that you would do differently?
[00:29:17] Unknown:
When we started out, we were kind of saying, okay, object storage is the central repository of everything, so obviously we need to support S3 and Google Cloud Storage and Azure Blob Storage, and that's it. And then we're completely cloud portable: you just switch out the storage and everything else runs the same. And I think over time we realized, first of all, that there are a ton of very valuable services that each different cloud provides, and also that there are a lot of gotchas and things that don't work exactly the same between clouds. So we started out in a completely cloud-agnostic way, and I could have said, yeah, sure, we support all clouds, but in reality we kind of supported none of them. And if I look at our Amazon deployment now, we use CloudWatch, we use CloudFormation, we use Kinesis, EC2 with spot instances, Elastic IPs, ALB.
Like, we're using, I don't know, like, 8 to 10 different Amazon specific products that, like, they're not portable at all. They're just systems within Amazon, but they they make the system so much more robust that it's really worth it. And, also, seamless integrations. You know? Like, someone new who wants to start using our system, there's a whole DevOps process that is completely avoided, of getting permissions and spinning up servers, creating a VPC, all all these things that that, you know, you need to like, you could have a a a sheet of of to dos with, like, 12 bullets, or you have just a single cloud formation script. He clicks create, and that's it, and you're up. So, like, having the seamless integration into the cloud, I think it's like it's a huge barrier of entry. It means you don't actually need a DevOps person to manage your data, which is huge. It's like a complete a complete paradigm shift. So, yeah, I think that's, like, just integrating more deeply within within each each different cloud platform was a big change and something we probably started too late. Like, we probably should have started there of saying, okay. We'll support Amazon in the beginning and then move on to the other clouds, and instead we were kind of like, let's be cloud agnostic.
And,
[00:31:32] Unknown:
and then yeah. And then we had to to shift in that direction. And so now going back to the question of monitoring and alerting on the data as it's flowing in, I'm wondering what systems or strategies you're using to enable that and some of the common problems that might occur or some of the more challenging aspects of being able to provide that to users of Upsolver?
[00:31:56] Unknown:
Yeah. So I think monitoring systems have started consolidating slowly. Common ones that we would see are Prometheus, CloudWatch, Datadog, InfluxDB. There are a few others, but in general you have, let's say, five to ten common monitoring players, and they all look pretty much the same. So in that sense the world has become a lot easier: instead of everyone rolling their own monitoring, you have very common interfaces. So it becomes very easy for a system like ours to integrate directly into the customer's monitoring and alerting systems, and they're already getting alerts from there, so you don't have to add an additional layer of complexity. And then as far as issues with the data itself, I think the two most common issues are, first, that data changes and then ETLs fail. So basically, I depend on a field called eventType, and the lowercase t became an uppercase T because some developer thought that makes more sense, which is probably true, but that breaks the system from there downwards. And I think the second most common is just that one of the systems in the pipeline breaks. So suddenly data isn't arriving in Kafka anymore; it's an empty stream and we don't know why. So I think in both cases, the most important thing is to know about this as quickly as possible. And invariably, you're gonna know about it first from the downstream system.
Like, in the end, I have a a database that I care about, that that I want to make sure is is getting data all the time and that's where I'm gonna feel it the most. But since the data lake is the step in the middle, it can be an early warning. Like, it can know before the downstream system that something's that something's happening. But, yeah, I think the most important thing there is to be able to rapidly fix these problems. First of all, you know, it's early discovery, but also being able to quickly I mean, correcting your data, obviously, like, taking the the uppercase t and changing it back to the lower lowercase t, but also then taking the data that exists, that's broken, and and doing a very quick transformation to get it to get it in, to to be able to use it. And you don't want that to take a week of development because these things happen all the time. You want it to be a few minutes. It's also usually a simple fix.
So I think that's kind of the biggest added value of having Upsolver, or a data lake in general, in the middle of that pipeline. It's more in the corrective measures than in the alerting itself, but obviously, yes, all the different metrics of data arriving, what the velocity is, how many events, etcetera, what's actually happening, all of that you can get out of the data lake directly.
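A minimal sketch of the "early warning" metrics a data lake can push into a customer's existing monitoring, here assuming CloudWatch via boto3 (Prometheus, Datadog, or InfluxDB would look much the same); the namespace, metric names, and stream name are hypothetical.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_stream_freshness(stream_name, events_in_window, minutes_since_last_event):
    """Push per-stream lake metrics so existing alerting can fire before the
    downstream database goes quiet."""
    cloudwatch.put_metric_data(
        Namespace="DataLake/Ingestion",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "EventsInWindow",
                "Dimensions": [{"Name": "Stream", "Value": stream_name}],
                "Timestamp": datetime.datetime.utcnow(),
                "Value": float(events_in_window),
                "Unit": "Count",
            },
            {
                "MetricName": "MinutesSinceLastEvent",
                "Dimensions": [{"Name": "Stream", "Value": stream_name}],
                "Timestamp": datetime.datetime.utcnow(),
                "Value": float(minutes_since_last_event),
                "Unit": "None",
            },
        ],
    )

# An alarm on EventsInWindow == 0 (or MinutesSinceLastEvent above a threshold)
# is the "stream suddenly went empty" early warning described above.
report_stream_freshness("clicks", events_in_window=0, minutes_since_last_event=7)
```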
[00:34:44] Unknown:
And so the sort of major components of the life cycle of data in a data lake platform, or in Upsolver in particular, are the ingestion, the storage and management, and then the egress of the data from that platform. So I'm curious what you have found to be some of the biggest challenges
[00:35:02] Unknown:
at each of those major stages of the life cycle of the data as it's being as it's working its way through the data lake? Yeah. So I think in all cases, right, they're actually all very challenging if if you're doing it yourself. Because then you have to do everything, for each individual stream and, you know, you have to decide on a format for each individual object. But then let's say, if you're using a data lake platform, then in ingestion, I think the most challenging thing is is usually gonna be actually managing the the Kafka. And I've actually seen a lot of customers starting with Kafka and then migrating to to Kinesis or, Azure Event Hub just because you don't have to manage it. Like, Kafka is is a super robust system, but it can sometimes break for for strange reasons. So so that, I guess, to a to a certain extent also just managing permissions between the different systems, making sure everything is is kind of secure, but at the same time, you do have access and and you don't accidentally lose permissions, which then also would cause data to stop streaming. So I think that's as far as the data ingestion.
Then in the object storage, I mean, really it's it's decide like, you know, it's a lot of decisions. It's not there isn't anything technically difficult about putting data in a parquet file on s 3 or at least for the most part because there are actually also several different formats of parquet and some of them work with the with Spark, some of them work with, with Presto, etcetera. But there are a lot of decisions to be made. You have to decide, like, what the folder structure is gonna be, what the file format's gonna be, what the what type of compression you're gonna use, the kind of what's the largest file size before you wanna split it into several files, and how are you gonna discover all these new files that you created. So so I think that's like, if you're building a data lake yourself, like, I really wouldn't recommend that. I would say that's kind of similar to saying, you know, I'll decide to use Oracle because it's an amazing database, but but I won't use their storage layer. I'll build the the Oracle files myself. I feel like that's kind kind of a a the level of the comparison there of of kind of rolling your own storage layer. But, yeah, I think it's it's mostly making the decisions of, like, kind of the right decisions that are gonna scale and and that you're not gonna regret them down the road. And changing that is is very very difficult once a system, once a system exists. And then getting the data out, so data is usually like, you know, when you think about data, you you usually the examples are always flat. It's always like, you know, we think of data as databases. And maybe in 5 years that won't be the case anymore. But for now, like, when people think about data, they think of CSV files, but that isn't isn't data. Data is is hierarchical always. 90% of it is JSON. So you have a schema that has records, and within the records, additional records and arrays that might have nulls in them and might have all sorts of different stuff in them that that so, like, there's a lot of variety, but also, like, you know, the the the tools of transforming data are either SQL, which works very well on flat data but doesn't really support hierarchical data at all, or else writing code. And writing code really sucks. I mean, you have bugs and, like, you want a declarative language. There's a reason SQL is so popular. So I think that's another like, when you're talking about getting the data out of the data lake, like, taking this data and often your output is gonna actually be a flat format or or just a key value. Like, I wanna know for each user how many times I saw them. Like, super, like, super simple, the the output, but but the transformations you have to do on hierarchical data is are are really challenging.
And that's something that we we specifically address in upsolver. Kind of the reason we we chose not to go with SQL as as the, as the language of of our transformations, but also just adding a lot of kind of a lot of support, a lot of native support of of hierarchical data and what happens when you do a transformation on a nested field and you bring the field out to an unnested field. So all these things kind of happen, seamlessly, and that I think is a huge challenge. Like, if you already had flat data and you're transforming it to a different flat format, that's usually not gonna be too challenging. But if you had a nested format, then then that can be quite quite difficult. And as far as, outputs, I think maybe the most common output we see today within Amazon is, is athena.
Athena is a system that basically wraps a managed Presto, and Presto is a SQL engine over S3. So basically, it lets you query lots of files and get results in seconds. You can query terabytes of data and get results in a few seconds. And that's, let's say, the most common output from Upsolver, because people just want SQL on top of the data, usually for analytics or reporting, things like that. And there, I think the main challenge is managing the metadata of the output itself, and that's something we also try to do in Upsolver. Maybe it's a bit beyond the boundaries of what a data lake platform should be doing; it's more the responsibility of the downstream system, but since it doesn't do that, we kind of added that service. But then if you're talking about downstream systems, getting data into a downstream system is usually not too challenging. Replaying a large amount of historical data into a downstream system, that can be very challenging. Often the systems that you actually build your product on top of have limited throughput. They're usually like, I'm a database and I can take up to 5,000 records per second, or I'm a data warehouse and I can take up to 100,000 records per second. But if I have 100 billion records of backlog and I wanna change my transformation and put all of it in there, that can be very challenging. I think that's probably the main driver of why a lot of our customers are choosing to use Athena: because the data itself is stored in S3, you can transform as much data as you want; it has limitless throughput. So I think that's really a main driver for the adoption of that product.
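To illustrate the hierarchical-to-flat transformation challenge discussed above, here is a small, generic sketch that flattens nested objects into dotted columns and explodes arrays into one row per element, the kind of flat output you might then write as Parquet for Athena. The event shape is hypothetical, and this is not Upsolver's transformation language.

```python
def flatten(record, prefix=""):
    """Flatten nested objects into dotted columns; explode arrays into one
    output row per element (a cross-product if several arrays are present)."""
    rows = [{}]
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            nested = flatten(value, prefix=f"{column}.")
            rows = [dict(r, **n) for r in rows for n in nested]
        elif isinstance(value, list):
            exploded = []
            for element in value:
                part = flatten(element, prefix=f"{column}.") if isinstance(element, dict) else [{column: element}]
                exploded.extend(part)
            rows = [dict(r, **e) for r in rows for e in exploded] if exploded else rows
        else:
            for r in rows:
                r[column] = value
    return rows

# Hypothetical event: a nested user object plus an array of items.
event = {
    "user": {"id": "u1", "country": "IL"},
    "items": [{"sku": "a", "price": 3}, {"sku": "b", "price": 5}],
}
for row in flatten(event):
    print(row)
# {'user.id': 'u1', 'user.country': 'IL', 'items.sku': 'a', 'items.price': 3}
# {'user.id': 'u1', 'user.country': 'IL', 'items.sku': 'b', 'items.price': 5}
```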
[00:41:05] Unknown:
And one of the things that you mentioned there is that a lot of the data lakes that are built in house, self managed, or put together from various open source components are using SQL as the layer for doing transformations or processing of the data once it's landed on the file system, but that you have your own interface for natively supporting these hierarchical data structures. So I'm wondering if you can talk a bit about what the workflow looks like for somebody who's using Upsolver, the types of user that you're targeting with your platform, and how that compares to some of these self managed or self hosted data lake systems?
[00:41:49] Unknown:
I think when you're looking at a a self managed data lake system, I mean, the target audience is almost invariably gonna be a big data engineer. So it's someone who knows about big data, who who has, like who's very familiar with the challenges of, of of kind of running all these systems, in a distributed and and high throughput environment. And then, you know, the tools of choice are usually gonna be Spark, where you're gonna be doing, let's say, a certain percentage SQL and then a certain percentage just coding kind of in or using systems like, Pandas or something like that, in order to to manage your data. But, I mean, the end user, the person who actually needs the data I mean, the big data engineer, their their role is to make the data accessible to the, data analysts, data scientists, or developers who are actually gonna be building stuff on top of it. So in upsolver, we're kind of trying to cut out the middleman in a way. So I think our target audience is gonna be people who don't know much about big data, but do know a lot about their data or or want to know a lot about their data. So it's gonna be either developers, data analysts, or data scientists, and kind of trying to give tools to those people that that, you know, their expertise is much more around what they're building than managing the data that's gonna feed it. I think for for 2 out of 3, probably SQL is definitely the language of choice. And I think it was, let's say it was a kind of a statement for us to say we're not going to be SQL, like, you know, Confluent, the company that, that manages Kafka, came out with KSQL, which is like SQL over streaming data. So they really went in that direction. And when we saw that, we were like, well, did we actually make a mistake, not going with the SQL interface? And definitely it is a barrier a barrier of entry because people know SQL and they, you know, it's it's very fluent to them to start to just start writing their code and it kind of does what they expect. I mean, obviously, you can't translate SQL from 1 engine to another really, but, but but it more or less does what they expect. And so that would be for data scientists and and data analysts. I mean that's their bread and butter. That's what they know how to use. And and I think we have a good offering in that sense because it's very visual, kind of even lowering the barrier of entry a bit more, to people whose SQL isn't their first language, let's say. So so for them, I mean, having a visual interface, I'd say more similar even to Excel than, than anything else. I mean, Excel is something you can really, like, a layperson who hasn't worked with Excel can more or less just start working with, with it without, like, too much introduction. So so I think that's kind of the direction we're going, getting, like, an immediate feedback loop to the operations you're doing and having everything work in a lot more of a of a visual way on top of your data. And I think for developers, definitely it's incredible because I mean developers you know they're used to writing code, they're used to like I mean their tooling was awful And then having this kind of visual representation where you can also build, things with with kind of a code like interface is is is amazing for them. And mostly developers weren't like there aren't many developers whose first language is SQL, let's say. So they're kind of replacing a second language with a different second language that's much more expressive for what they're trying to do. 
So for that user, I think it's a it's it's very valuable. And have you found there to be
[00:45:11] Unknown:
a particular scale or level of data maturity for an organization at which point they would be better served by moving their data lake platform to be fully managed in house, built on top of their own tooling or by pulling together some of the open source options, as opposed to using Upsolver? Or do you think that Upsolver solves the majority use case regardless of the scale or maturity of the organization?
[00:45:38] Unknown:
Well, I mean, you know, at, like, Facebook scale, they would probably have their own up solver that that they built internally. So, I mean but, like, if I'm taking aside the the companies that are actually building a a data lake platform themselves also. You know, I think that that most companies, like or any company would would get value out of having Upsolver at least as the first step. So, I mean, it could be that you have a lot of, like, a lot of data infrastructure that's already running on top of your on top of your data, and you don't wanna migrate that. But on the other hand, you do want to get insights and you wanna be able to see your data immediately as it streams in. So I think in that sense, even if you're not doing a full on migration, if you're not, like, saying, okay, opsolver is gonna be my data lake. All my data lake services are there. But adding it as a step between your streaming data and all the operations you're doing right now, I think that adds a lot of value. Yeah. I I think that it's it's definitely more challenging for us, when when you have, you know, when you have a a blue ocean, when a company doesn't really have data infrastructure yet, they have a lot of data and they wanna use it, but they don't don't necessarily know how. I mean, that's that's definitely, the easiest situation for us to to add a lot of value. Whereas if you have a company that has a lot of data infrastructure that that a lot of people are managing and a lot of existing ETLs that are that are kind of legacy and difficult to maintain but but they need to be maintained and they need to be transferred over. I think that's a lot more a lot more challenging but I don't think that it's the size of the company so much. I mean even a very large company with with tons of data and tons of of big data engineers etcetera, I think will get a lot of value out of using Upsolver.
But I do think that that depending on how much legacy infrastructure you have, that can add complexity to any kind of migration or adding another system in the middle that changes anything. Like like usually these systems, like you change a comma somewhere and everything just completely crashes, so you have to be very careful, and that that's a, I think, a big barrier to entry. And so looking forward, what are some of the major features or improvements that you have planned for the future of upsolver that you're most excited by? Yeah. So I mean, well, where to start? I think most of it would be around lowering the barrier of entry. I feel like upsolver gives a ton of value to people who are actually in, who are using it on a day to day basis, but kind of getting the word out there, getting new people to try it out, and and having it kind of like just experiencing it is kind of the the main hurdle I would say. So definitely, you know, as far as the platform itself, I would definitely wanna see more like high level primitives.
So, like, supporting I mean, today, we we support joins, for example, between data streams, but not in a not not as a primitive. You have to build the join yourself and adding, let's say, a joined stream between 2 streams, I think, would be very valuable, to reduce complexity. So so, I mean, that kind of thing is is something that that that I'm personally excited about, but I think that the thing that's really gonna drive adoption the most is better cloud integrations. It's already pretty good, but but the better it is, kind of the less steps you have integrating, the the happier people are gonna be to try to do it. And just in general, lowering the barrier of entry, adding SQL as an alternative to our to our language, is something we're we're working on, and I'm I'm very excited about. Not because I think that SQL is the right language to do these things, but I do think that that that it's just the language people know. So giving them the opportunity to, to adopt or to try out upsolver without needing to learn something new is is is very, very valuable. But, yeah, it's basically, you know, getting the word out there. I think it's, I think it's super valuable. Like, I think it's, you know, people who have used our system versus after having done data project, not even necessarily big data project, but data projects before that. They're like, I would never work at an organization that doesn't use a system like this because it's like, you're just, like, underground kind of. You're trying to to dig your your way out. So, yeah, it's super exciting.
[00:49:49] Unknown:
And so with that, I'll have you add your preferred contact information to the show notes for anybody who wants to follow the work that you're doing and stay up to date with what Upsolver is working on. And so as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:12] Unknown:
I mean, I think the biggest gap is just that there isn't really a consolidated platform yet; you have to build everything from disparate tools. So any data management effort, if you're not using Upsolver, of course, any data project, let's say, is actually a project. It's not a data task, it's a data project. And I definitely see that consolidating in the next few years. Even putting aside Upsolver, which I obviously hope is gonna be a leader in the space, I think you're also gonna see a lot of companies starting to package different open source tools together, like Spark together with Spark Streaming, together with Airflow or Flink, taking a bunch of these different tools that exist today, which, if you patch them together, are gonna do what you want, and selling it as a single unified product. So I think that's it. It's just having everything communicate with each other without needing to specifically manage each component. Alright. Well, thank you very much for taking the time today to discuss the work that you're doing with Upsolver and some of the,
[00:51:28] Unknown:
as you said, whys and wherefores behind data lakes in general. So I appreciate that, and, I look forward to seeing that the work you continue doing with Upsolver. And I just wanna say thank you, and I hope you enjoy the rest of your day. Thank you very much. It was my pleasure.
Introduction and Guest Introduction
Yoni Iny's Background and Motivation for Upsolver
Overview of Upsolver and Its Goals
Data Lakes vs. Data Warehouses
Challenges of Managing Data Lakes
Upsolver's Architecture and Cost Efficiency
Handling Spot Instances and Data Processing
Data Storage Formats and Metadata Management
Schema Management and Validation
Evolution of Upsolver's Architecture
Monitoring and Alerting in Data Lakes
Challenges in Data Lifecycle Management
Upsolver's Interface and Target Users
Scale and Maturity of Organizations Using Upsolver
Future Features and Improvements for Upsolver
Biggest Gaps in Data Management Tooling
Closing Remarks