Summary
With the wealth of formats for sending and storing data, it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
- This is your host Tobias Macey and today I’m interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems.
Interview
- Introduction
- How did you first get involved in the area of data management?
- What are the main serialization formats used for data storage and analysis?
- What are the tradeoffs that are offered by the different formats?
- How have the different storage and analysis tools influenced the types of storage formats that are available?
- You’ve each developed a new on-disk data format, Avro and Parquet respectively. What were your motivations for investing that time and effort?
- Why is it important for data engineers to carefully consider the format in which they transfer their data between systems?
- What are the switching costs involved in moving from one format to another after you have started using it in a production system?
- What are some of the new or upcoming formats that you are each excited about?
- How do you anticipate the evolving hardware, patterns, and tools for processing data to influence the types of storage formats that maintain or grow their popularity?
Contact Information
- Doug:
- Julien:
- @J_ on Twitter
- Blog
- julienledem on GitHub
Links
- Apache Avro
- Apache Parquet
- Apache Arrow
- Hadoop
- Apache Pig
- Xerox PARC
- Excite
- Nutch
- Vertica
- Dremel White Paper
- CSV
- XML
- Hive
- Impala
- Presto
- Spark SQL
- Brotli
- Zstandard
- Apache Drill
- Trevni
- Apache Calcite
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today.
Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. This is your host, Tobias Macey, and today I'm interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems. So, Doug, could you start by introducing yourself?
[00:01:25] Unknown:
Yeah. I'm Doug Cutting. I've been building software for 30 or more years now, often with a serialization component. The last 15 or so years have been dominated by work on open source, most notably Hadoop. But for the purposes of this podcast, I think we're gonna talk about a project I started called Apache Avro.
[00:01:51] Unknown:
How did you first get involved in the area of data management?
[00:01:55] Unknown:
I worked on search engines for a long time: at Xerox PARC early in my career, at Apple in the early 90s, and on web search at Excite in the late 90s. So we were always building systems that were analyzing large data sets, big collections of text, and through that I ended up working on a project called Nutch, which led to Hadoop. And then it turned out that even though we built Hadoop really to support the building of search engines, people ended up using it a lot to manage all kinds of other data. So I sort of fell into the data management business through search engines.
[00:02:38] Unknown:
And, Julian, how about yourself?
[00:02:40] Unknown:
Yeah. So I'm honored to be interviewed along with Doug Cutting. Back when I was at Yahoo, I was working on the content platform and got to use Hadoop early on. I was there when Avro started, and I remember seeing Doug presenting this new format at the time. After that, I worked at Twitter, where I started the Parquet project in collaboration with the Impala team, working on a columnar format to improve our storage needs. And from there, I got involved in a bunch of Apache projects like Apache Pig and Apache Arrow, which is a columnar representation for in-memory data.
And from then on, I started being involved with more Apache projects. As for how I got into the data storage space: when I was at Twitter, we had Hadoop on one hand, which was very flexible and very scalable. We could have a lot of machines doing a lot of things like machine learning or analytics or any kind of code you want. So it's very flexible, and you can do a lot of things, but it was still very much file system oriented. A lot of the data was in flat files and not that efficient. And next to the Hadoop cluster, we had Vertica, which is a columnar database that lays out the data in a columnar representation to be much more efficient at retrieving the data from disk and doing analytics.
And so Vertica was much lower latency to answer queries, but at the same time it was not as flexible and it was limited to SQL. Right? It couldn't run anything else. It was kind of a black box. And so what we tried to do was make Hadoop more of a database and less of a file system, starting from the ground up with a columnar representation. So kind of state of the art things from the C-Store paper, which is the academic work that started the Vertica company, and using the Dremel paper, which describes this way of storing nested data structures in a columnar representation, and getting into how we make Hadoop more of a database and more efficient at retrieving and processing data. At that time, I did a lot of reading between the lines of the Dremel paper to understand, you know, the missing parts that were not described there on how to use this representation in a more generic way.
[00:05:33] Unknown:
The reason that I invited the both of you onto this episode is that a recurring question in a lot of the conversations I've had with people in the context of data engineering and data management is which serialization format they should use for storing and processing their data. Because as the big data and data analytics spaces have continued to grow and expand in importance and capabilities, there are ever more different ways to store your data, and that introduces a lot of confusion. I'm wondering if we can start off by briefly summarizing some of the main serialization formats that are available and in active use for data storage and analysis, and some of the trade offs that they each provide.
[00:06:19] Unknown:
I can dive in if you like. I mean, I think classically in database systems, the data was captive. The format it was stored in was controlled by the people who created the database and wasn't a published standard format. And I think these days we've got this open source ecosystem of data processing projects and data storage projects, where interchange between systems is common and useful. So to some degree it's a new problem, this having a serialization format, since the interchange that was done before was relatively uncommon. It wasn't a primary storage format, so it wasn't very optimized.
So we had things like CSV and XML, and we still see those a lot because that is what a lot of applications can easily generate; they are well known standard formats. The problem with XML is that it's verbose and slow to process, and CSV, and XML to some degree, doesn't do a very good job of really letting you store data structures with named fields that you can process quickly. There are some other, more technical details too: it's useful to be able to chop files into chunks that you can process in parallel, splitting files as we tend to call it, and you'd like a format that is splittable.
You'd also like a format that's compressible, and those two can be at odds. Coming up with a format that is both splittable and compressible is hard. You can't just take a CSV file, compress it, and then chop it up; that doesn't work. You've got to have a series of compressed chunks inside the file that you can find. So some formats have developed over time as we've learned that this is what we need to be able to interchange data between these different components, between systems like Hadoop and Spark and Impala and Hive and all these different things. It's really handy to be able to try different tools on a single dataset, and to be able to generate data from one tool and ingest it into another, and do so efficiently. So there's been a real demand for that, and Hadoop started with some formats which weren't very good for interchange, and so did Hive.
And so the formats we're talking about are really second generation, designed to address this. Avro is a format that was designed to address all these challenges: to be splittable, to be compressible, to have some metadata to give you standalone datasets that you can pick up and see what the fields are and what the data structure is, but still process efficiently, and to work across components written in different programming languages. There weren't a lot of things out there like that. There wasn't really anything I could find that met all those requirements, which is what led to Avro. Avro stores things a record at a time, in order: it has a complete record, and then another complete record, and then another complete record.
And that's not always the most efficient way to process things. In a lot of cases, what you'd like to do is see all the values of a particular field in a record at once. So if you've got a million records in a file and they all have a date in them, you'd like to just process all the dates, for example, and not see all the other fields in all those records. And so then you want a columnar format, and that's really what Parquet is about: responding to that need. So it's yet a generation beyond Avro, optimizing a really common access pattern, but also sharing all the other elements of being efficient, supporting compression, and being language independent and system independent so that it can work as an interchange format, but optimized for particular kinds of analysis and access patterns that you see in data systems, that Avro is not optimized for.
Is that fair, Julien?
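To make the access pattern Doug describes concrete, here is a minimal sketch using the pyarrow library; the file name and column name are hypothetical placeholders, not something from the conversation.

```python
# Minimal sketch of columnar column projection with Parquet.
# Assumes pyarrow is installed; "events.parquet" and its "date"
# column are hypothetical placeholders.
import pyarrow.parquet as pq

# Read only the "date" column: a columnar file lets the reader skip
# the bytes of every other field instead of scanning whole records.
dates = pq.read_table("events.parquet", columns=["date"])
print(dates.num_rows, dates.column("date")[:5])
```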
[00:10:33] Unknown:
Yeah, I think that's fair. And likewise, Parquet is more efficient in certain access patterns that are very common when you do a SQL query, but Avro is going to be more efficient in a lot of other access patterns. When you have a lot of pipelines that transform data, that read all the data and write all the data, Avro is going to be more efficient, or in streaming use cases where you want to reduce latency, where you read a single record at a time and you want records as soon as possible while processing streaming data. And Parquet is much more efficient when you're doing a SQL query, for example, because you write the data once, and so you can spend more time compressing better or doing a different layout than the row-oriented one.
But on the other hand, you're going to access data from very different points of view, like selecting only some of the columns, and so it's very beneficial to have this columnar layout to access that data very quickly. And it compresses a lot better, and there are a lot of things you can do to speed up the analytics side of data processing. But you mentioned, Tobias, that people have to choose. I think what we may hope for over time is to have better abstractions. Again, it's still a little bit a remnant of the starting point of Hadoop as this distributed file system, and the ecosystem has slowly evolved, adding layers on top of that.
And it's becoming more and more of a database, and having those abstraction layers on top kind of makes it seamless whether the data is row oriented or columnar oriented and what format it's in. Right? Because depending on the use case, you may want different layouts, and it makes it difficult for people to take advantage of this if everything is hard coded against the file formats. So we're evolving slowly toward better abstractions, and again, it becomes more of a database, but more deconstructed. Because, like something Doug said, and related to what I was talking about with Vertica: Vertica was this black box that you had to import your data into to do queries, and it was much faster at analysis than Hadoop.
But once the data was inside of it, there was nothing else you could do with the data other than querying it through the SQL query engine. With Parquet and Avro, you keep all the flexibility of the Hadoop ecosystem. Right? You can use many different query engines, many different machine learning libraries, or a lot of different programming frameworks, and it works with all of those. So you keep your options open. There's no importing your data into a silo anymore. You have your data in one place, and there are a lot of different things you can do to make systems work together with different file formats or storage formats.
[00:13:57] Unknown:
Yeah. And it sounds like, particularly in some of the conversations that I've had, a lot of the confusion was bundled up in the idea that people need to pick one format and then figure out a way to use it across every aspect of their system. Where it seems like what would be more beneficial is, for instance, using row oriented formats such as Avro in a streaming context where you're gonna be processing one record at a time, or for data archival where you might need to find all of the information about a particular record at some future date. But then if you're going to be doing live analytical queries where all of the data is going to be housed in something like Hadoop or Hive, then you would, in most cases, be better served by having it in a columnar format such as Parquet or some of the other formats available for that.
And then maybe just using the ETL pipelines as a means of transforming the row oriented data into column oriented data, so that you're gaining the benefits of each format in the context in which it's best suited?
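A rough sketch of the pipeline pattern Tobias describes, converting a row-oriented Avro file into a columnar Parquet copy for analytics. It assumes the fastavro and pyarrow libraries and a reasonably recent pyarrow; the file names are hypothetical.

```python
# Hypothetical ETL step: read a row-oriented Avro file and write a
# columnar Parquet copy for analytical queries.
# Assumes fastavro and pyarrow are installed; file names are placeholders.
import fastavro
import pyarrow as pa
import pyarrow.parquet as pq

with open("events.avro", "rb") as fo:
    records = list(fastavro.reader(fo))   # stream of dict records

table = pa.Table.from_pylist(records)     # pivot rows into columns
pq.write_table(table, "events.parquet")   # columnar, compressed copy
```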
[00:15:06] Unknown:
Yeah. So there are always going to be exceptions. In many analytics use cases, a columnar representation is better, but there are always corner cases where it becomes more expensive, especially if you're going to access most of the columns every time. So it depends. And it helps a lot to have better metadata abstractions. One of them is the Hive Metastore, which can be used as an abstraction, but there are more and more showing up, and different companies have built different abstractions that hide from the users what format the data is actually in.
And I think that's very important, to have these kinds of capabilities.
[00:15:53] Unknown:
Yeah. And a lot of times, too, people are judging some of the more involved and elaborate formats against run of the mill systems that are using things like JSON or, as you mentioned, CSV and XML. And so really any format that's more suited to an analytics workload in general is probably going to gain them a number of benefits versus what they had been using.
[00:16:18] Unknown:
Yeah, for sure. You know, the plain text formats like CSV and JSON are gonna be a lot slower. They're nice in that you can look at them a little more easily without a tool, but you're gonna have some real performance impact and storage size impact. But also, in selecting a format, you don't wanna think about hyper optimizing for a particular application. I think people are realizing data is an asset. You wanna land data in a good, strong format that will last a long time, that you can use in as many different applications as you can, and not have to reproduce it in a lot of ways. Now there are times when you might wanna transform it into a different format as an optimization, to have a dataset which is derived from another one, but then have a pipeline so you can do that repeatedly.
And think of one as being generated from the other, while keeping the original format. You know, one of the reasons I went down the route of building Avro was that I was worried about a proliferation of data formats. If you wanna have an ecosystem, and each component has its own data formats, then the number of translations between the different formats in all the systems gets to be exponential. What you really want is some common formats that are usable by a lot of systems.
So I think there might be an optimal format for a given application, but if you kept every dataset in that format, you might end up fragmenting your data and getting less value from it in the end. So there's a trade off between optimizing a single system versus having maximal reuse of your data and enabling easy experimentation and longevity of your data. You wanna curate a data collection which is gonna last a long time. So I think there's a little more to it than just the performance.
[00:18:29] Unknown:
Even beyond things like Avro and Parquet, there are a number of other formats that are available. And sometimes it can be difficult to determine whether some of them are superseded by newer formats or if each of them is particularly tuned for a given use case. Some of the ones that I'm thinking of are Thrift and ORC, and then there are newer formats such as Arrow that are being promoted as a way to provide easy interoperability between languages and systems in an in-memory context, for being able to bridge those divides. So I'm just wondering if either of you have any particular insight into the broader landscape of how some of the formats have evolved, if there are any that somebody who is just starting now should avoid because they have been superseded, or if each format is still relevant for the particular case it was designed for.
[00:19:28] Unknown:
Thrift and protocol buffers are interesting. They're very good serialization systems, but they don't include a file format standard. So if you're talking about data that you're going to pass around as files, there isn't a standard one for protocol buffers and Thrift. Various people have stored data in them, but it's a little more challenging to make a standalone file in those, because they have a compiler that takes the IDL and generates the readers and writers for various programming languages. You could embed the IDL, or embed a reference to it, but it's a little awkward for building a standalone file which you could pass between institutions, say. So I wouldn't recommend looking to Thrift or protocol buffers for a file format. For an RPC system, that's really their sweet spot; that's where they've been used a lot and tend to be used very successfully. So that's for data on the wire rather than data on disk in a file.
ORC, which you also mentioned, is a competitor to Parquet; I think it's safe to use the word competitor. It started shortly after Parquet was started. It has minor pros and cons. I think it's unfortunate that we have another format that is so similar in its capabilities to Parquet, but maybe Julien wants to speak more to the pros and cons of ORC versus Parquet.
[00:20:58] Unknown:
So, yeah, I can give a little bit of the history. I think Thrift, protocol buffers, and Avro preceded Parquet, and Parquet tried to be complementary to them. One of the things they define is this IDL and how you define your type system. And Avro is definitely better at all the pipeline-type code, when you need to understand the schema and do transformations; it makes it easier to deal with schema evolution and understanding your schema, and it's more self describing, passing the schema along with the data. So Parquet tries not to redefine the IDL, but just define a columnar format that can be complementary to those things. Right? So you have a seamless replacement where you can keep the same IDL that you're using with Avro, for example, that describes your type system, and use this columnar representation on disk when it's convenient, when it's the right use case. So maybe you were using Avro before, and you can still use Avro as your model, using the Avro file format, which is row oriented, when that's useful.
And you can swap to the Parquet columnar representation when it's better for SQL analysis. So that's one end. And as for the history of Parquet versus ORC, I think back in the day there was this need for a columnar representation on disk for Hadoop. Right? I had this use case when I was at Twitter: I was trying to make Hadoop more like Vertica. There was this need, and there was a little bit of overlap in people working on those columnar formats. And you start talking about it when it's ready, right? You publicize it and you say, hey, look, it's open source, we're trying to build this, we think there's a need for it. So it's a little unfortunate, but back in the day I connected with the Impala team, who were trying to do something as well.
And later on, we connected with other teams and kind of grew the Parquet community, but there was this parallel effort. So the representation of nested data structures is different: Parquet uses the Dremel model and ORC is using a different model. But they're going to have very similar characteristics, because they are trying to solve the same problem. I think Parquet has been better at integrating with the ecosystem. From the beginning, I was really aware that I didn't want to build another proprietary file format. You know, it's the same problem: if you import your data into a database, then you can use it only in your database.
I really wanted it to become a standard for the ecosystem. So from the beginning, from the community building point of view, I put in a lot of work making sure people's opinions were integrated into the design. Like, the Apache Drill team had some needs for new types, and we integrated their needs. The Impala team was coming with a C++ native code execution engine, so the Parquet format is very language agnostic, and we merged our designs early on to create Parquet. And so it's been very open, making sure people would come and get what they need. A team at Netflix did the work of integrating with Presto, and they had some special needs because they were using Amazon S3 at the time, so we did the work to make sure it would work well for their use case as well. And by just being open, at some point you reach a critical mass, and more and more people start using it, because there are enough teams and projects using it that it makes sense for people to reuse the same format instead of inventing their own. So I think that was part of the success of Parquet: being very open and very inclusive in the community early on. And, you know, Spark SQL started using Parquet, and we didn't even have to help them. Right? They just decided to do it and they did it, and once it was done, they talked about it.
So, you know, the effort you put in early on to be inclusive paid off pretty well, and now Parquet is pretty much supported everywhere. I think, technically, the characteristics of Parquet are going to be very similar to ORC. But what makes it more valuable, I think, and again, being the Parquet guy, I'm biased, but something that was important to me early on was to make sure that we were making something standard that would keep the flexibility of Hadoop. The beauty of the ecosystem is that there are all those tools you can use, and you're not siloed in one tool because of the storage layer you pick. And so the last part is talking about Arrow.
So it's kind of the next step. We talked about serialization formats, and about Avro and Parquet as a storage layer on top of Hadoop and HDFS. Arrow is thinking about the same problem, but in main memory, because the access patterns and the characteristics, the latency of accessing main memory compared to accessing disks, are different. So when you're storing data in memory, there are similarly benefits to using a columnar representation in memory, and that is Arrow, but the trade offs are different. Right? The latency of accessing memory versus disk is different, and you want to optimize more for the throughput of the CPU, whereas in Parquet you want to optimize more for the speed of getting the data off of disk.
So there are different trade offs that warrant a different format, and that's where Arrow comes from for in-memory processing. And as technology evolves, we used to have less main memory and more disk, and now there's more and more main memory, and there are more tiers showing up. We used to have spinning disks; now you have SSDs with flash memory, and you also have NVMe, nonvolatile memory, which is flash but in the DIMM slots. And so you have different characteristics: the latency of accessing the data and the throughput of reading the data are different. Right? So you have different trade offs, and also the cost of storage: how much main memory versus how much NVMe versus how much SSD versus how much spinning disk storage you have. And so those different trade offs will apply. You have more of a range of where you store the data and how fast you can access it and process it.
And so all those things are very interesting. That's where things now are more on a spectrum: Arrow is more on the in-memory end, and Parquet is more on the on-disk end of optimizing the layout for query processing. And in the future, there are going to be interesting evolutions in which one is more efficient where. That's where abstracting away where the data is stored and making this more managed, like in a database, is going to be interesting in simplifying that problem for end users. Ideally, Arrow is something that end users don't need to see or be aware of. I mean, they can be aware of it, but they don't need to write in their code that they're reading Arrow or writing Arrow. It's more something that's managed, like in a database.
[00:29:10] Unknown:
That's a good distinction, that Arrow will tend to be used within tools, and that maybe people will indicate they want to use Arrow as the format to pass things between two systems, but it's not a persistent format in the way that Avro and Parquet are. Anyway, they're all three very complementary use cases: Avro, Parquet, and Arrow.
[00:29:36] Unknown:
Yeah. So those are the three categories. Right? When you were listing all those serialization formats: you have the row oriented formats, the columnar formats on disk for persistence, and the columnar in-memory representation for processing. And for Arrow, we started from the community we had already built while building Parquet, right, all this getting people together. So hopefully for Arrow we can manage to have a single representation that becomes that much more valuable because it's interoperable.
Right? If we can agree on having that same representation in memory, then things are more efficient, because you don't need to convert from one format to the other, and also simpler, because you don't need to write all those conversions from one format to the other. So there are a lot of benefits to agreeing early on on what the format is going to be and building on top of that, which is what we're doing with Arrow.
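To make the in-memory, interoperable representation concrete, here is a minimal sketch that builds an Arrow record batch and serializes it over Arrow's IPC stream format, the mechanism tools can use to hand columnar data to one another without per-system conversions. It assumes the pyarrow library; the column names are hypothetical.

```python
# Minimal sketch: build a columnar batch in memory and move it through
# Arrow's IPC stream format. Assumes pyarrow; column names are placeholders.
import pyarrow as pa

batch = pa.record_batch(
    [pa.array([1, 2, 3]), pa.array(["click", "view", "click"])],
    names=["user_id", "event"],
)

# Serialize with the IPC stream format, the way one tool might hand
# columnar data to another without converting it to rows.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

# The receiving side gets the same columnar layout back, no re-parsing.
reader = pa.ipc.open_stream(sink.getvalue())
for received in reader:
    print(received.num_rows, received.to_pydict()["event"])
```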
[00:30:35] Unknown:
One of the questions that I had in here as well is how important it is for a data engineer to determine which format they're going to use for storing their data, and what the switching costs are if they come to the realization that the format they chose at the outset doesn't match their access patterns. But from our conversation earlier, particularly about Avro being usable as a format in multiple different contexts, it seems like what's more important is just making sure that all of the data you store is in the same format, so that your tooling can be unified no matter what you're trying to do with it, and that if you do need different access patterns, then at that point you do the transformation for that particular use case. I'm just wondering if I'm representing that accurately.
[00:31:32] Unknown:
That sounds right to me. If you know that your access patterns tend to be SQL and you tend to get batches of data at a time, then Parquet could be your primary format; and if you've got more streaming cases and you're not doing SQL as much, then Avro might be the primary format. Converting between those two can be done pretty much losslessly (there are probably a few edge cases) and automatically, so you're not stuck forever. So knowing a bit about your applications and then picking, for a given dataset, the better of those two is probably a good path for most folks.
[00:32:16] Unknown:
Yeah. So the Java libraries of Parquet have been designed to have this drop-in capability. Say you use Avro for designing the model of all your data, and say you use MapReduce jobs for doing ETL: you can just replace the output format to be Parquet instead of Avro, and it's very flexible. From a programming API standpoint, you still read and write Avro objects, but under the hood you can swap between the Avro row oriented format and the Parquet columnar format, and it's pretty much seamless.
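The drop-in swap Julien describes lives in Parquet's Java integration; as a rough Python analogue of the same idea, here is a hypothetical helper where the record model stays the same and only the on-disk format changes. It assumes fastavro and pyarrow; the function and parameter names are made up for illustration.

```python
# Hypothetical sketch: one writer API, two on-disk formats. The records
# (plain dicts matching an Avro schema) stay the same; only the output
# format is swapped. Assumes fastavro and pyarrow; names are made up.
import fastavro
import pyarrow as pa
import pyarrow.parquet as pq

def write_records(records, path, avro_schema, fmt="avro"):
    """Write the same logical records as row-oriented Avro or columnar Parquet."""
    if fmt == "avro":
        with open(path, "wb") as out:
            fastavro.writer(out, fastavro.parse_schema(avro_schema), records)
    elif fmt == "parquet":
        pq.write_table(pa.Table.from_pylist(records), path)
    else:
        raise ValueError(f"unknown format: {fmt}")
```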
[00:33:00] Unknown:
1 of the other things that I'm curious about is particularly given the level of maturity for both of these formats and some of the others that are available, what the current evolutionary aspects of the formats are, what's involved in continuing to maintain them, and if there are any features that you are adding or considering adding, and then also the challenges that are that have been associated with building and maintaining those formats?
[00:33:29] Unknown:
My experience with file formats is that they're things you don't wanna change very quickly, because compatibility is so important. People don't wanna have to rewrite their datasets; they wanna be able to take a dataset that they created five years ago and process it today using the latest versions of the software. If you version the format a lot, then it can be really tricky to guarantee that you can read it. You also wanna, in many cases, guarantee that things generated by a new application can be read by an old application. So you need both forward and backward compatibility, because in the way most organizations work, they don't update all their systems in parallel.
So you really have very few opportunities to change the format itself. What we tend to focus on is improving the tools, the usage, the APIs, the integration with programming languages, higher level ways of defining types, things like that, rather than, at least this is the case for Avro, extending the basic format, because we can't do that without breaking people, and people need to be able to rely on the format having both that forward and backward compatibility.
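A minimal sketch of the kind of compatibility Doug is describing, at the schema level: with Avro, a reader can apply a newer schema, with a defaulted field, to data written with an older one. It assumes the fastavro library; the schemas and field names are hypothetical.

```python
# Hypothetical sketch of Avro schema evolution (assumes fastavro).
# Old writers produced records without "region"; a newer reader schema
# adds the field with a default, so old files stay readable.
import io
import fastavro

writer_schema = fastavro.parse_schema({
    "type": "record", "name": "Event",
    "fields": [{"name": "user_id", "type": "long"}],
})
reader_schema = fastavro.parse_schema({
    "type": "record", "name": "Event",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "region", "type": "string", "default": "unknown"},
    ],
})

buf = io.BytesIO()
fastavro.writer(buf, writer_schema, [{"user_id": 42}])   # "old" data
buf.seek(0)

for record in fastavro.reader(buf, reader_schema):       # "new" reader
    print(record)   # {'user_id': 42, 'region': 'unknown'}
```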
[00:34:44] Unknown:
Yeah. So like Doug said, Parquet is evolving slowly for those same reasons. Right? First, we need to maintain backwards compatibility forever: when something has been written, we need to make sure you're always going to be able to read it. And the forward compatibility means that when you add features to a file, you want the old readers to be able to read the data. They're not going to take advantage of the new features, but they're still going to be able to read data that has been written with the new library in a way that still works. So, for example, some of the new features that are being added to Parquet, and there are some discussions about them, some of them are very simple things. There have been better compression algorithms in the past few years, whether it's Brotli from Google or Zstandard from Facebook.
They provide a better compression ratio and better compression speed. So those are relatively simple, but you need to make sure it's clear for people that when they start using the new compression, only the new versions of the libraries will be able to read it. And then there are other things that are more advanced, like Bloom filters, for example. There are different things that need to be taken into account when we add Bloom filters. First, Parquet is a language agnostic format, so you can't just make a Java implementation, for example, and say, hey, it's done.
We need to make sure that there's going to be a Java and a C++ implementation, and we need to make sure we have a spec, so that we document the binary format in the spec. It's not just, look, there's a Bloom filter feature and here's the API to access it. We actually define every bit of the file format in the spec as well, so that it can be implemented in both languages, in Java and native code, and it's going to be consistent. And so we're doing cross compatibility testing and things like that. Other things that are challenging are more about semantic behavior. So, for example, in Parquet, we added timestamps as a type.
Right? From the beginning, you had ints and floats and variable-length values like strings. And when adding timestamps, there are actually a lot of ways you can interpret a timestamp, and the SQL spec has different things like timestamp with time zone or without time zone. And it's a little bit challenging to make sure that the semantics are understood the same way across the entire ecosystem. So that's where you need to make sure there's good communication between communities. And there's a lot of work; it's not just code, it's also collaborating between communities, because you want to make sure that when you write a timestamp in Spark SQL, there's no time zone problem when you read it with Hive.
And so you interpret the data the same way between Spark SQL, Hive, Impala, Drill, and all those query engines and systems that use the Parquet format. So it's a little bit challenging sometimes, and sometimes it's slow moving, but it's people's data. Right? It's not transient. Once it's stored, you want to make sure it's stored correctly. And this is a persistent system, so you want to make sure you're going to be able to read it in several years, and your data is not going to become obsolete as the library evolves.
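A minimal sketch of opting into one of the newer codecs Julien mentions when writing Parquet, using the pyarrow library; as he notes, files written this way need readers new enough to know the codec. The file and column names are hypothetical.

```python
# Hypothetical sketch: write the same table with a newer codec (zstd)
# versus the long-supported snappy codec. Assumes pyarrow; note that
# only readers that know the newer codec can open the zstd file.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "region": ["eu", "us", "eu"]})

pq.write_table(table, "events_snappy.parquet", compression="snappy")
pq.write_table(table, "events_zstd.parquet", compression="zstd")
```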
[00:38:57] Unknown:
How do you think that the evolution of hardware and the patterns and tools for processing data are going to influence the types of storage formats that either maintain or grow in popularity?
[00:39:13] Unknown:
So I think I touched a little bit on that earlier. With the evolving hardware, there are a lot of things changing at the moment, whether it's SSDs, which have very different characteristics from spinning disks, or NVMe, which is basically something that's cheaper than memory with slightly more latency of access. The data is shifting: you have more and more tiers of storage, with different characteristics of how much it costs to store the data there and how fast it is to retrieve and process it. And so this is going to influence how people store data. And that kind of explains why you have Parquet and Arrow, and why you want to be able to convert from one to the other really fast and have different trade offs on how much you compress your data.
Because you're comparing the speed of IO versus the speed of CPU. And so it's going to be very interesting. The other technology aspect that's coming in is the GPU. People are using GPUs more for data processing, and Arrow has actually been used there: there's the GOAI group that is defining a columnar in-memory representation for GPU processing, and they're using Arrow as a standard for interoperability and exchanging data between different GPU based processing systems. And GPUs are also getting more and more memory. One of the problems with the GPU is the high cost of transferring data from main memory to GPU memory compared to the speed of the GPU itself. The GPU can process data really quickly, but it's costly to move the data from main memory to the GPU.
But you can see a pattern where GPUs are getting more and more memory, because they're being used more and more for data analytics and machine learning and not just for, you know, video games. So it's going to be interesting to see how those evolve. And these different trade offs of main memory storage versus spinning disk storage are going to shape a little bit how we do storage and how we improve the layouts and the compression. You know, having more or less compression, whether you want more speed or more compact storage, it's going to be very interesting.
[00:42:09] Unknown:
Yeah. I'll just second what Julien has said, for the most part. There are all these time space trade offs you're making, where if you have something that's completely uncompressed, it can be very fast to process in memory, but it might take up a lot of memory. And if you compressed it a bit, you could store more of it in memory, and so you'd be able to get more work done before you had to hit some slower form of storage. Those sorts of trade offs are very tricky, and they're very sensitive to the relative performance of these different tiers of storage.
We're starting to see some very fast persistent tiers which change things. You can start to think of things that are accessed within a few cycles as storage systems, because the memory persists. So we'll see what ends up being the most effective formats. Arrow is an interesting thing to track, and sort of fighting against all of that is this need to have standard interchange formats. You don't wanna adopt a format for a fringe architecture.
You really don't wanna keep a lot of your data in a format unless you've got an ecosystem of applications which can share it in that format and take advantage of it, for which it's an efficient format. So Avro, Parquet, and Arrow are each designed for sweet spots of today's ecosystem, and I suspect they will survive for quite some time, for many years yet, but it's not unlikely that some other formats will join them as this sort of storage hierarchy evolves.
[00:44:04] Unknown:
And are there any other topics that you think we should cover before we start to close out the show?
[00:44:09] Unknown:
One amusing anecdote, perhaps. Before, or probably around the same time, that Julien was starting Parquet, I created myself a columnar format called Trevni, and I tried to reproduce what was in the Dremel paper. Julien mentioned the missing bits in the paper; I could never recreate them, and so I came up with yet another way of representing hierarchical structure within a columnar file format. And then Julien came along and bested me, because he was actually able to understand that Dremel paper and implement it fully, and also really develop a strong community around it. Trevni hadn't caught on in any quarters yet, and so the wisest thing to do was to let people forget about it, because we don't need multiple formats that are very similar, filling the same niche. So I'm pleased that Parquet came along and replaced Trevni, to the degree that Trevni ever had a spot. Anyway, it was mostly that I couldn't figure out those missing bits that Julien did figure out in that Dremel paper. They were pretty quick and breezy in parts of it.
[00:45:32] Unknown:
Yes. It's a little hand wavy in the Dremel paper, and I had to hit my head several times to figure out what was going on. I felt really bad for a while about, you know, kind of replacing Trevni. But I'm glad we're on good terms.
[00:45:56] Unknown:
You know, it was good that Trevni hadn't caught on. If people had built systems around it and had large amounts of data in it, then to some degree we would have had to commit to preserving compatibility with it, but it never really got that critical mass before Parquet showed up and started to become significantly more popular. There's no ill will. T-r-e-v-n-i: it's "invert" spelled backwards, for no good reason.
[00:46:29] Unknown:
Is there anything else that you think we should talk about before we close out the show? No, that's it, I think. Well, for anybody who wants to follow the work that both of you are up to and the state of the art with your respective serialization formats, I'll have you add your preferred contact information to the show notes. And then, just for one last question to give people things to think about, can you each share the one thing that is at the top of your mind in the data industry that you're most interested in and excited about? Doug, how about you go first? Sure. I mean, I'm
[00:47:05] Unknown:
just fundamentally excited by this notion of an open source based ecosystem of data software. I think we're really seeing an explosion of capabilities for people to get value from data in a way that we didn't in prior decades, and I think we're gonna continue to see this: the power that people have at their fingertips will explode, and so will the possibilities. You know, this year we're talking a lot about machine learning and deep learning, and I don't know what it'll be next year, but there will be something, and it'll be able to really take off, and it'll be something that is useful. It's not just hype, because this ecosystem is driven by users.
It's the nature of this loosely coupled set of open source projects. So I'm continually amazed by that and continue to be excited. I think that's gonna deliver more good things to people. And, Julien?
[00:48:13] Unknown:
Yeah, I agree with this. You can see this deconstructed data stack, where the database used to be very siloed, you know, a fully integrated stack. But in this ecosystem, each component is kind of becoming a standard independently. So you have Parquet as a columnar file format, but you also have other components of this deconstructed database. Like Calcite, the database optimizer that has been used in many projects; it's the optimizer layer of a database as a reused component. Parquet is the columnar storage layer, and Arrow is the in-memory processing component. And those things being reused adds a lot of flexibility to the system, because you store your data, and then you can have many different components that start interacting with each other. And you have the choice between different types of SQL analysis, different types of machine learning, different types of plain ETL, and more streaming.
And all those things can interact together in an efficient way. That's where things like Parquet and Arrow are contributing: helping interconnect all those things efficiently. Because initially, the lowest common denominator formats like CSV or JSON or XML were the starting points, because they were easy and supported everywhere, but they were not very efficient. And now we're getting to that second generation, where we're looking at what common patterns all those systems need and what's the efficient way of having them communicate. And that's where the columnar representations, things like Arrow and Parquet, are there for analysis, and for more streaming things or the ETL side, Avro is the better representation.
And so you have those standards that evolve and that enable this deconstructed database, right? All those elements that are very flexible, loosely coupled, and can interact with each other. So I think the next component that is starting to evolve is a better metadata layer: knowing what all our schemas are, how they evolve, what the storage characteristics are, how we take advantage of our storage layer or the interconnection between systems. And it's going to become more and more of that very powerful, very flexible deconstructed database.
[00:51:07] Unknown:
Well, I really appreciate both of you taking time out of your day to join me and go deep on serialization formats. It's definitely been very educational and informative for me, and I'm sure for my listeners as well. So thank you again for your time, and I hope you each enjoy the rest of your evening.
[00:51:25] Unknown:
Thank you.
[00:51:26] Unknown:
Thanks, Tobias. It's fun to find somebody who actually cares about these things. They're kind of the boring backwater of big data.
Introduction to Guests and Topic
Doug Cutting's Background and Apache Avro
Julien Le Dem's Background and Parquet Project
Serialization Formats Overview
Choosing the Right Serialization Format
Thrift, Protocol Buffers, and ORC
Apache Arrow and In-Memory Processing
Maintaining and Evolving Serialization Formats
Impact of Hardware Evolution on Storage Formats
Closing Thoughts and Future Directions