Summary
With the wealth of formats for sending and storing data, it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
- This is your host Tobias Macey and today I’m interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems.
Interview
- Introduction
- How did you first get involved in the area of data management?
- What are the main serialization formats used for data storage and analysis?
- What are the tradeoffs that are offered by the different formats?
- How have the different storage and analysis tools influenced the types of storage formats that are available?
- You’ve each developed a new on-disk data format, Avro and Parquet respectively. What were your motivations for investing that time and effort?
- Why is it important for data engineers to carefully consider the format in which they transfer their data between systems?
- What are the switching costs involved in moving from one format to another after you have started using it in a production system?
- What are some of the new or upcoming formats that you are each excited about?
- How do you anticipate the evolving hardware, patterns, and tools for processing data to influence the types of storage formats that maintain or grow their popularity?
Contact Information
- Doug:
- Julien:
- @J_ on Twitter
- Blog
- julienledem on GitHub
Links
- Apache Avro
- Apache Parquet
- Apache Arrow
- Hadoop
- Apache Pig
- Xerox PARC
- Excite
- Nutch
- Vertica
- Dremel White Paper
- CSV
- XML
- Hive
- Impala
- Presto
- Spark SQL
- Brotli
- Zstandard
- Apache Drill
- Trevni
- Apache Calcite
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today.
Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. This is your host, Tobias Macey, and today I'm interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems. So, Doug, could you start by introducing yourself?
[00:01:25] Unknown:
Yeah. I'm Doug Cutting. I've been building software for 30 or more years now, often with a serialization component. The last 15 or so years have been dominated by work on open source, most notably Hadoop. But for the purposes of this podcast, I think we're gonna talk about a project I started called Apache Avro.
[00:01:51] Unknown:
How did you first get involved in the area of data management?
[00:01:55] Unknown:
I worked on search engines for a long time: at Xerox PARC early in my career, at Apple in the early 90s, and on web search at Excite in the late 90s. So we were always building systems that were analyzing large data sets, big collections of text, and through that I ended up working on a project called Nutch, which led to Hadoop. And then it turned out that even though we built Hadoop really to support the building of search engines, people ended up using it a lot to manage all kinds of other data. So I sort of fell into the data management business through search engines.
[00:02:38] Unknown:
And, Julian, how about yourself?
[00:02:40] Unknown:
Yeah. So I'm honored to be interviewed along with Doug Cutting. Back when I was at Yahoo, I was working on the content platform and got to use Hadoop early on. I was there when Avro started, and I remember seeing Doug presenting this new format at the time. After that, I worked at Twitter, where I started the Parquet project in collaboration with the Impala team, working on a columnar format to improve our storage needs. And from there, I got involved in a bunch of Apache projects like Apache Pig and Apache Arrow, which is a columnar representation for in-memory data.
And from then on, I started being involved with more Apache projects. As for how I got into the data storage space: when I was at Twitter, we had Hadoop on one hand, which was very flexible and very scalable. We could have a lot of machines doing a lot of things like machine learning or analytics or any kind of code you want. So it's very flexible, and you can do a lot of things, but it was still very much file system oriented. A lot of the data was in flat files and not that efficient. And next to the Hadoop cluster, we had Vertica, which is a columnar database that lays out the data in a columnar representation to be much more efficient at retrieving the data from disk and doing analytics.
And so Vertica was much lower latency to answer queries, but at the same time it was not as flexible and it was limited to SQL. Right? It couldn't run anything else. It was kind of a black box. And so what we tried to do was make Hadoop more of a database and less of a file system, starting from the ground up with a columnar representation. So kind of state of the art things from the C-Store paper, which is the academic work that started the Vertica company, and using the Dremel paper, which describes this way of storing nested data structures in a columnar representation, and getting into how we make Hadoop more of a database and more efficient at retrieving and processing data. At that time, I did a lot of reading between the lines of the Dremel paper to understand, you know, the missing parts that were not described there on how to use this representation in a more generic way.
[00:05:33] Unknown:
The reason that I invited the both of you onto this episode is that a recurring question in a lot of the conversations I've had with people in the context of data engineering and data management is which serialization format they should use for storing and processing their data. Because as the big data and data analytics spaces have continued to grow and expand in importance and capabilities, there are ever more different ways to store your data, and that introduces a lot of confusion. I'm wondering if we can start off by briefly summarizing some of the main serialization formats that are available and in active use for data storage and analysis, and some of the trade offs that they each provide.
[00:06:19] Unknown:
I can dive in if you like. I mean, I think classically in database systems, the data was captive. The format it was stored in was controlled by the people who created the database and wasn't a published standard format. And I think these days we've got this open source ecosystem of data processing projects and data storage projects, where interchange between systems is common and useful. So to some degree it's a new problem, this having a serialization format, since the interchange that was done before was relatively uncommon. It wasn't a primary storage format, so it wasn't very optimized.
So we had things like CSV and XML, and we still see those a lot because that is what a lot of applications can easily generate; they are well known standard formats. The problem with XML is that it's verbose and slow to process, and CSV, and XML to some degree, doesn't do a very good job of really letting you store data structures with named fields that you can process quickly. There are some other, more technical details too: it's useful to be able to chop files into chunks that you can process in parallel, splitting files as we tend to call it, and you'd like a format that is splittable.
You'd also like a format that's compressible, and those two can be at odds. Coming up with a format that is both splittable and compressible is hard. You can't just take a CSV file, compress it, and then chop it up; that doesn't work. You've got to have a series of compressed chunks inside the file that you can find. So some formats have developed over time as we've learned that this is what we need to be able to interchange data between these different components, between systems like Hadoop and Spark and Impala and Hive and all these different things. It's really handy to be able to try different tools on a single dataset, and to be able to generate data from one tool and ingest it into another, and do so efficiently. So there's been a real demand for that, and Hadoop started with some formats which weren't very good for interchange, and so did Hive.
And so the formats we're talking about are really second generation, designed to address this. Avro is a format that was designed to address all these challenges: to be splittable, to be compressible, to have some metadata to give you standalone datasets that you can pick up and see what the fields are and what the data structure is, but still process efficiently, and to work across components written in different programming languages. There weren't a lot of things out there like that. There wasn't really anything I could find that met all those requirements, which is what led to Avro. Avro stores things a record at a time, in order: it has a complete record, and then another complete record, and then another complete record.
And that's not always the most efficient way to process things. In a lot of cases, what you'd like to do is see all the values of a particular field in a record at once. So if you've got a million records in a file and they all have a date in them, you'd like to just process all the dates, for example, and not see all the other fields in all those records. And so then you want a columnar format, and that's really what Parquet is about: responding to that need. So it's yet a generation beyond Avro, optimizing a really common access pattern, but also sharing all the other elements of being efficient, supporting compression, and being language independent and system independent so that it can work as an interchange format, but optimized for particular kinds of analysis and access patterns that you see in data systems, that Avro is not optimized for.
Is that fair, Julien?
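To make the access pattern Doug describes concrete, here is a minimal sketch using the pyarrow library; the file name and column name are hypothetical placeholders, not something from the conversation.

```python
# Minimal sketch of columnar column projection with Parquet.
# Assumes pyarrow is installed; "events.parquet" and its "date"
# column are hypothetical placeholders.
import pyarrow.parquet as pq

# Read only the "date" column: a columnar file lets the reader skip
# the bytes of every other field instead of scanning whole records.
dates = pq.read_table("events.parquet", columns=["date"])
print(dates.num_rows, dates.column("date")[:5])
```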
[00:10:33] Unknown:
Yeah, I think that's fair. And likewise, Parquet is more efficient in certain access patterns that are very common when you do a SQL query, but Avro is going to be more efficient in a lot of other access patterns. When you have a lot of pipelines that transform data, that read all the data and write all the data, Avro is going to be more efficient, or in streaming use cases where you want to reduce latency, where you read a single record at a time and you want records as soon as possible while processing streaming data. And Parquet is much more efficient when you're doing a SQL query, for example, because you write the data once, and so you can spend more time compressing better or doing a different layout than the row-oriented one.
But on the other hand, you're going to access data from very different points of view, like selecting only some of the columns, and so it's very beneficial to have this columnar layout to access that data very quickly. And it compresses a lot better, and there are a lot of things you can do to speed up the analytics side of data processing. But you mentioned, Tobias, that people have to choose. I think what we may hope for over time is to have better abstractions. Again, it's still a little bit a remnant of the starting point of Hadoop as this distributed file system, and the ecosystem has slowly evolved, adding layers on top of that.
And it's becoming more and more of a database, and having those abstraction layers on top kind of makes it seamless whether the data is row oriented or columnar oriented and what format it's in. Right? Because depending on the use case, you may want different layouts, and it makes it difficult for people to take advantage of this if everything is hard coded against the file formats. So we're evolving slowly toward better abstractions, and again, it becomes more of a database, but more deconstructed. Because, like something Doug said, and related to what I was talking about with Vertica: Vertica was this black box that you had to import your data into to do queries, and it was much faster at analysis than Hadoop.
But once the data was inside of it, there was nothing else you could do with the data other than querying it through the SQL query engine. With Parquet and Avro, you keep all the flexibility of the Hadoop ecosystem. Right? You can use many different query engines, many different machine learning libraries, or a lot of different programming frameworks, and it works with all of those. So you keep your options open. There's no importing your data into a silo anymore. You have your data in one place, and there are a lot of different things you can do to make systems work together with different file formats or storage formats.
[00:13:57] Unknown:
Yeah. And it sounds like, particularly in some of the conversations that I've had, a lot of the confusion was bundled up in the idea that people need to pick one format and then figure out a way to use it across every aspect of their system. Where it seems like what would be more beneficial is, for instance, using row oriented formats such as Avro in a streaming context where you're gonna be processing one record at a time, or for data archival where you might need to find all of the information about a particular record at some future date. But then if you're going to be doing live analytical queries where all of the data is going to be housed in something like Hadoop or Hive, then you would, in most cases, be better served by having it in a columnar format such as Parquet or some of the other formats available for that.
And then maybe just using the ETL pipelines as a means of transforming the row oriented data into column oriented data, so that you're gaining the benefits of each format in the context in which it's best suited?
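A rough sketch of the pipeline pattern Tobias describes, converting a row-oriented Avro file into a columnar Parquet copy for analytics. It assumes the fastavro and pyarrow libraries and a reasonably recent pyarrow; the file names are hypothetical.

```python
# Hypothetical ETL step: read a row-oriented Avro file and write a
# columnar Parquet copy for analytical queries.
# Assumes fastavro and pyarrow are installed; file names are placeholders.
import fastavro
import pyarrow as pa
import pyarrow.parquet as pq

with open("events.avro", "rb") as fo:
    records = list(fastavro.reader(fo))   # stream of dict records

table = pa.Table.from_pylist(records)     # pivot rows into columns
pq.write_table(table, "events.parquet")   # columnar, compressed copy
```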
[00:15:06] Unknown:
Yeah. So there are always going to be exceptions. In many analytics use cases, a columnar representation is better, but there are always corner cases where it becomes more expensive, especially if you're going to access most of the columns every time. So it depends. And it helps a lot to have better metadata abstractions. One of them is the Hive Metastore, which can be used as an abstraction, but there are more and more showing up, and different companies have built different abstractions that hide from the users what format the data is actually in.
And I think that's very important, to have these kinds of capabilities.
[00:15:53] Unknown:
Yeah. And a lot of times, too, people are judging some of the more involved and elaborate formats against run of the mill systems that are using things like JSON or, as you mentioned, CSV and XML. And so really any format that's more suited to an analytics workload in general is probably going to gain them a number of benefits versus what they had been using.
[00:16:18] Unknown:
Yeah, for sure. You know, the plain text formats like CSV and JSON are gonna be a lot slower. They're nice in that you can look at them a little more easily without a tool, but you're gonna have some real performance impact and storage size impact. But also, in selecting a format, you don't wanna think about hyper optimizing for a particular application. I think people are realizing data is an asset. You wanna land data in a good, strong format that will last a long time, that you can use in as many different applications as you can, and not have to reproduce it in a lot of ways. Now there are times when you might wanna transform it into a different format as an optimization, to have a dataset which is derived from another one, but then have a pipeline so you can do that repeatedly.
And think of one as being generated from the other, while keeping the original format. You know, one of the reasons I went down the route of building Avro was that I was worried about a proliferation of data formats. If you wanna have an ecosystem, and each component has its own data formats, then the number of translations between the different formats in all the systems gets to be exponential. What you really want is some common formats that are usable by a lot of systems.
So I think there might be an optimal format for a given application, but if you kept every dataset in that format, you might end up fragmenting your data and getting less value from it in the end. So there's a trade off between optimizing a single system versus having maximal reuse of your data and enabling easy experimentation and longevity of your data. You wanna curate a data collection which is gonna last a long time. So I think there's a little more to it than just the performance.
[00:18:29] Unknown:
Even beyond things like Avro and Parquet, there are a number of other formats that are available. And sometimes it can be difficult to determine whether some of them are superseded by newer formats or if each of them is particularly tuned for a given use case. Some of the ones that I'm thinking of are Thrift and ORC, and then there are newer formats such as Arrow that are being promoted as a way to provide easy interoperability between languages and systems in an in-memory context, for being able to bridge those divides. So I'm just wondering if either of you have any particular insight into the broader landscape of how some of the formats have evolved, if there are any that somebody who is just starting now should avoid because they have been superseded, or if each format is still relevant for the particular case it was designed for.
[00:19:28] Unknown:
Thrift and protocol buffers are interesting. They're very good serialization systems, but they don't include a file format standard. So if you're talking about data that you're going to pass around as files, there isn't a standard one for protocol buffers and Thrift. Various people have stored data in them, but it's a little more challenging to make a standalone file in those, because they have a compiler that takes the IDL and generates the readers and writers for various programming languages. You could embed the IDL, or embed a reference to it, but it's a little awkward for building a standalone file which you could pass between institutions, say. So I wouldn't recommend looking to Thrift or protocol buffers for a file format. For an RPC system, that's really their sweet spot; that's where they've been used a lot and tend to be used very successfully. So that's for data on the wire rather than data on disk in a file.
ORC, which you also mentioned, is a competitor to Parquet; I think it's safe to use the word competitor. It started shortly after Parquet was started. It has minor pros and cons. I think it's unfortunate that we have another format that is so similar in its capabilities to Parquet, but maybe Julien wants to speak more to the pros and cons of ORC versus Parquet.
[00:20:58] Unknown:
So, yeah, I can give a little bit of the history. I think Thrift, protocol buffers, and Avro preceded Parquet, and Parquet tried to be complementary to them. One of the things they define is this IDL and how you define your type system. And Avro is definitely better at all the pipeline-type code, when you need to understand the schema and do transformations; it makes it easier to deal with schema evolution and understanding your schema, and it's more self describing, passing the schema along with the data. So Parquet tries not to redefine the IDL, but just define a columnar format that can be complementary to those things. Right? So you have a seamless replacement where you can keep the same IDL that you're using with Avro, for example, that describes your type system, and use this columnar representation on disk when it's convenient, when it's the right use case. So maybe you were using Avro before, and you can still use Avro as your model, using the Avro file format, which is row oriented, when that's useful.
And you can swap to the Parquet columnar representation when it's better for SQL analysis. So that's one end. And as for the history of Parquet versus ORC, I think back in the day there was this need for a columnar representation on disk for Hadoop. Right? I had this use case when I was at Twitter: I was trying to make Hadoop more like Vertica. There was this need, and there was a little bit of overlap in people working on those columnar formats. And you start talking about it when it's ready, right? You publicize it and you say, hey, look, it's open source, we're trying to build this, we think there's a need for it. So it's a little unfortunate, but back in the day I connected with the Impala team, who were trying to do something as well.
And later on, we connected with other teams and kind of grew the Parquet community, but there was this parallel effort. So the representation of nested data structures is different: Parquet uses the Dremel model and ORC is using a different model. But they're going to have very similar characteristics, because they are trying to solve the same problem. I think Parquet has been better at integrating with the ecosystem. From the beginning, I was really aware that I didn't want to build another proprietary file format. You know, it's the same problem: if you import your data into a database, then you can use it only in your database.
I really wanted it to become a standard for the ecosystem. So from the beginning, from the community building point of view, I put in a lot of work making sure people's opinions were integrated into the design. Like, the Apache Drill team had some needs for new types, and we integrated their needs. The Impala team was coming with a C++ native code execution engine, so the Parquet format is very language agnostic, and we merged our designs early on to create Parquet. And so it's been very open, making sure people would come and get what they need. A team at Netflix did the work of integrating with Presto, and they had some special needs because they were using Amazon S3 at the time, so we did the work to make sure it would work well for their use case as well. And by just being open, at some point you reach a critical mass, and more and more people start using it, because there are enough teams and projects using it that it makes sense for people to reuse the same format instead of inventing their own. So I think that was part of the success of Parquet: being very open and very inclusive in the community early on. And, you know, Spark SQL started using Parquet, and we didn't even have to help them. Right? They just decided to do it and they did it, and once it was done, they talked about it.
So, you know, the effort you put in early on to be inclusive paid off pretty well, and now Parquet is pretty much supported everywhere. I think, technically, the characteristics of Parquet are going to be very similar to ORC. But what makes it more valuable, I think, and again, being the Parquet guy, I'm biased, but something that was important to me early on was to make sure that we were making something standard that would keep the flexibility of Hadoop. The beauty of the ecosystem is that there are all those tools you can use, and you're not siloed in one tool because of the storage layer you pick. And so the last part is talking about Arrow.
So it's kind of the next step. We talked about serialization formats, and about Avro and Parquet as a storage layer on top of Hadoop and HDFS. Arrow is thinking about the same problem, but in main memory, because the access patterns and the characteristics, the latency of accessing main memory compared to accessing disks, are different. So when you're storing data in memory, there are similarly benefits to using a columnar representation in memory, and that is Arrow, but the trade offs are different. Right? The latency of accessing memory versus disk is different, and you want to optimize more for the throughput of the CPU, whereas in Parquet you want to optimize more for the speed of getting the data off of disk.
So there are different trade offs that warrant a different format, and that's where Arrow comes from for in-memory processing. And as technology evolves, we used to have less main memory and more disk, and now there's more and more main memory, and there are more tiers showing up. We used to have spinning disks; now you have SSDs with flash memory, and you also have NVMe, nonvolatile memory, which is flash but in the DIMM slots. And so you have different characteristics: the latency of accessing the data and the throughput of reading the data are different. Right? So you have different trade offs, and also the cost of storage: how much main memory versus how much NVMe versus how much SSD versus how much spinning disk storage you have. And so those different trade offs will apply. You have more of a range of where you store the data and how fast you can access it and process it.
And so all those things are very interesting. That's where things now are more on a spectrum: Arrow is more on the in-memory end, and Parquet is more on the on-disk end of optimizing the layout for query processing. And in the future, there are going to be interesting evolutions in which one is more efficient where. That's where abstracting away where the data is stored and making this more managed, like in a database, is going to be interesting in simplifying that problem for end users. Ideally, Arrow is something that end users don't need to see or be aware of. I mean, they can be aware of it, but they don't need to write in their code that they're reading Arrow or writing Arrow. It's more something that's managed, like in a database.
[00:29:10] Unknown:
That's a good distinction, that Arrow will tend to be used within tools, and that maybe people will indicate they want to use Arrow as the format to pass things between two systems, but it's not a persistent format in the way that Avro and Parquet are. Anyway, they're all three very complementary use cases: Avro, Parquet, and Arrow.
[00:29:36] Unknown:
Yeah. So those are the three categories. Right? When you were listing all those serialization formats: you have the row oriented formats, the columnar formats on disk for persistence, and the columnar in-memory representation for processing. And for Arrow, we started from the community we had already built while building Parquet, right, all this getting people together. So hopefully for Arrow we can manage to have a single representation that becomes that much more valuable because it's interoperable.
Right? If we can agree on having that same representation in memory, then things are more efficient, because you don't need to convert from one format to the other, and also simpler, because you don't need to write all those conversions from one format to the other. So there are a lot of benefits to agreeing early on on what the format is going to be and building on top of that, which is what we're doing with Arrow.
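To make the in-memory, interoperable representation concrete, here is a minimal sketch that builds an Arrow record batch and serializes it over Arrow's IPC stream format, the mechanism tools can use to hand columnar data to one another without per-system conversions. It assumes the pyarrow library; the column names are hypothetical.

```python
# Minimal sketch: build a columnar batch in memory and move it through
# Arrow's IPC stream format. Assumes pyarrow; column names are placeholders.
import pyarrow as pa

batch = pa.record_batch(
    [pa.array([1, 2, 3]), pa.array(["click", "view", "click"])],
    names=["user_id", "event"],
)

# Serialize with the IPC stream format, the way one tool might hand
# columnar data to another without converting it to rows.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

# The receiving side gets the same columnar layout back, no re-parsing.
reader = pa.ipc.open_stream(sink.getvalue())
for received in reader:
    print(received.num_rows, received.to_pydict()["event"])
```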
[00:30:35] Unknown:
One of the questions that I had in here as well is how important it is for a data engineer to determine which format they're going to use for storing their data, and what the switching costs are if they come to the realization that the format they chose at the outset doesn't match their access patterns. But from our conversation earlier, particularly about Avro being usable as a format in multiple different contexts, it seems like what's more important is just making sure that all of the data you store is in the same format, so that your tooling can be unified no matter what you're trying to do with it, and that if you do need different access patterns, then at that point you do the transformation for that particular use case. I'm just wondering if I'm representing that accurately.
[00:31:32] Unknown:
That sounds right to me. If you know that your access patterns tend to be SQL and you tend to get batches of data at a time, then Parquet could be your primary format; and if you've got more streaming cases and you're not doing SQL as much, then Avro might be the primary format. Converting between those two can be done pretty much losslessly (there are probably a few edge cases) and automatically, so you're not stuck forever. So knowing a bit about your applications and then picking, for a given dataset, the better of those two is probably a good path for most folks.
[00:32:16] Unknown:
Yeah. So the Java libraries of Parquet have been designed to have this drop-in capability. Say you use Avro for designing the model of all your data, and say you use MapReduce jobs for doing ETL: you can just replace the output format to be Parquet instead of Avro, and it's very flexible. From a programming API standpoint, you still read and write Avro objects, but under the hood you can swap between the Avro row oriented format and the Parquet columnar format, and it's pretty much seamless.
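The drop-in swap Julien describes lives in Parquet's Java integration; as a rough Python analogue of the same idea, here is a hypothetical helper where the record model stays the same and only the on-disk format changes. It assumes fastavro and pyarrow; the function and parameter names are made up for illustration.

```python
# Hypothetical sketch: one writer API, two on-disk formats. The records
# (plain dicts matching an Avro schema) stay the same; only the output
# format is swapped. Assumes fastavro and pyarrow; names are made up.
import fastavro
import pyarrow as pa
import pyarrow.parquet as pq

def write_records(records, path, avro_schema, fmt="avro"):
    """Write the same logical records as row-oriented Avro or columnar Parquet."""
    if fmt == "avro":
        with open(path, "wb") as out:
            fastavro.writer(out, fastavro.parse_schema(avro_schema), records)
    elif fmt == "parquet":
        pq.write_table(pa.Table.from_pylist(records), path)
    else:
        raise ValueError(f"unknown format: {fmt}")
```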
[00:33:00] Unknown:
1 of the other things that I'm curious about is particularly given the level of maturity for both of these formats and some of the others that are available, what the current evolutionary aspects of the formats are, what's involved in continuing to maintain them, and if there are any features that you are adding or considering adding, and then also the challenges that are that have been associated with building and maintaining those formats?
[00:33:29] Unknown:
My experience with file formats is that they're things you don't wanna change very quickly, because compatibility is so important. People don't wanna have to rewrite their datasets; they wanna be able to take a dataset that they created five years ago and process it today using the latest versions of the software. If you version the format a lot, then it can be really tricky to guarantee that you can read it. You also wanna, in many cases, guarantee that things generated by a new application can be read by an old application. So you need both forward and backward compatibility, because in the way most organizations work, they don't update all their systems in parallel.
So you really have very few opportunities to change the format itself. What we tend to focus on is improving the tools, the usage, the APIs, the integration with programming languages, higher level ways of defining types, things like that, rather than, at least this is the case for Avro, extending the basic format, because we can't do that without breaking people, and people need to be able to rely on the format having both that forward and backward compatibility.
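A minimal sketch of the kind of compatibility Doug is describing, at the schema level: with Avro, a reader can apply a newer schema, with a defaulted field, to data written with an older one. It assumes the fastavro library; the schemas and field names are hypothetical.

```python
# Hypothetical sketch of Avro schema evolution (assumes fastavro).
# Old writers produced records without "region"; a newer reader schema
# adds the field with a default, so old files stay readable.
import io
import fastavro

writer_schema = fastavro.parse_schema({
    "type": "record", "name": "Event",
    "fields": [{"name": "user_id", "type": "long"}],
})
reader_schema = fastavro.parse_schema({
    "type": "record", "name": "Event",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "region", "type": "string", "default": "unknown"},
    ],
})

buf = io.BytesIO()
fastavro.writer(buf, writer_schema, [{"user_id": 42}])   # "old" data
buf.seek(0)

for record in fastavro.reader(buf, reader_schema):       # "new" reader
    print(record)   # {'user_id': 42, 'region': 'unknown'}
```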
[00:34:44] Unknown:
Yeah. So like Doug said, Parquet is evolving slowly for those same reasons. Right? First, we need to maintain backwards compatibility forever: when something has been written, we need to make sure you're always going to be able to read it. And the forward compatibility means that when you add features to a file, you want the old readers to be able to read the data. They're not going to take advantage of the new features, but they're still going to be able to read data that has been written with the new library in a way that still works. So, for example, some of the new features that are being added to Parquet, and there are some discussions about them, some of them are very simple things. There have been better compression algorithms in the past few years, whether it's Brotli from Google or Zstandard from Facebook.
They provide a better compression ratio and better compression speed. So those are relatively simple, but you need to make sure it's clear for people that when they start using the new compression, only the new versions of the libraries will be able to read it. And then there are other things that are more advanced, like Bloom filters, for example. There are different things that need to be taken into account when we add Bloom filters. First, Parquet is a language agnostic format, so you can't just make a Java implementation, for example, and say, hey, it's done.
We need to make sure that there's going to be a Java and a C++ implementation, and we need to make sure we have a spec, so that we document the binary format in the spec. It's not just, look, there's a Bloom filter feature and here's the API to access it. We actually define every bit of the file format in the spec as well, so that it can be implemented in both languages, in Java and native code, and it's going to be consistent. And so we're doing cross compatibility testing and things like that. Other things that are challenging are more about semantic behavior. So, for example, in Parquet, we added timestamps as a type.
Right? From the beginning, you had ints and floats and variable-length values like strings. And when adding timestamps, there are actually a lot of ways you can interpret a timestamp, and the SQL spec has different things like timestamp with time zone or without time zone. And it's a little bit challenging to make sure that the semantics are understood the same way across the entire ecosystem. So that's where you need to make sure there's good communication between communities. And there's a lot of work; it's not just code, it's also collaborating between communities, because you want to make sure that when you write a timestamp in Spark SQL, there's no time zone problem when you read it with Hive.
And so you interpret the data the same way between Spark SQL, Hive, Impala, Drill, and all those query engines and systems that use the Parquet format. So it's a little bit challenging sometimes, and sometimes it's slow moving, but it's people's data. Right? It's not transient. Once it's stored, you want to make sure it's stored correctly. And this is a persistent system, so you want to make sure you're going to be able to read it in several years, and your data is not going to become obsolete as the library evolves.
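A minimal sketch of opting into one of the newer codecs Julien mentions when writing Parquet, using the pyarrow library; as he notes, files written this way need readers new enough to know the codec. The file and column names are hypothetical.

```python
# Hypothetical sketch: write the same table with a newer codec (zstd)
# versus the long-supported snappy codec. Assumes pyarrow; note that
# only readers that know the newer codec can open the zstd file.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "region": ["eu", "us", "eu"]})

pq.write_table(table, "events_snappy.parquet", compression="snappy")
pq.write_table(table, "events_zstd.parquet", compression="zstd")
```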
[00:38:57] Unknown:
How do you think that the evolution of hardware and the patterns and tools for processing data are going to influence the types of storage formats that either maintain or grow in popularity?
[00:39:13] Unknown:
So I think I touched a little bit on that earlier. With the evolving hardware, there are a lot of things changing at the moment, whether it's SSDs, which have very different characteristics from spinning disks, or NVMe, which is basically something that's cheaper than memory with slightly more latency of access. The data is shifting: you have more and more tiers of storage, with different characteristics of how much it costs to store the data there and how fast it is to retrieve and process it. And so this is going to influence how people store data. And that kind of explains why you have Parquet and Arrow, and why you want to be able to convert from one to the other really fast and have different trade offs on how much you compress your data.
Because you're comparing the speed of IO versus the speed of CPU. And so it's going to be very interesting. The other technology aspect that's coming in is the GPU. People are using GPUs more for data processing, and Arrow has actually been used there: there's the GOAI group that is defining a columnar in-memory representation for GPU processing, and they're using Arrow as a standard for interoperability and exchanging data between different GPU based processing systems. And GPUs are also getting more and more memory. One of the problems with the GPU is the high cost of transferring data from main memory to GPU memory compared to the speed of the GPU itself. The GPU can process data really quickly, but it's costly to move the data from main memory to the GPU.
But you can see a pattern where GPUs are getting more and more memory, because they're being used more and more for data analytics and machine learning and not just for, you know, video games. So it's going to be interesting to see how those evolve. And these different trade offs of main memory storage versus spinning disk storage are going to shape a little bit how we do storage and how we improve the layouts and the compression. You know, having more or less compression, whether you want more speed or more compact storage, it's going to be very interesting.
[00:42:09] Unknown:
Yeah. I'll just second what Julien has said, for the most part. There are all these time space trade offs you're making, where if you have something that's completely uncompressed, it can be very fast to process in memory, but it might take up a lot of memory. And if you compressed it a bit, you could store more of it in memory, and so you'd be able to get more work done before you had to hit some slower form of storage. Those sorts of trade offs are very tricky, and they're very sensitive to the relative performance of these different tiers of storage.
We're starting to see some very fast persistent tiers which change things. You can start to think of things that are accessed within a few cycles as storage systems, because the memory persists. So we'll see what ends up being the most effective formats. Arrow is an interesting thing to track, and sort of fighting against all of that is this need to have standard interchange formats. You don't wanna adopt a format for a fringe architecture.
You really don't wanna keep a lot of your data in a format unless you've got an ecosystem of applications which can share it in that format and take advantage of it, for which it's an efficient format. So Avro, Parquet, and Arrow are each designed for sweet spots of today's ecosystem, and I suspect they will survive for quite some time, for many years yet, but it's not unlikely that some other formats will join them as this sort of storage hierarchy evolves.
[00:44:04] Unknown:
And are there any other topics that you think we should cover before we start to close out the show?
[00:44:09] Unknown:
One amusing anecdote, perhaps. Before, or probably around the same time, that Julien was starting Parquet, I created myself a columnar format called Trevni, and I tried to reproduce what was in the Dremel paper. Julien mentioned the missing bits in the paper; I could never recreate them, and so I came up with yet another way of representing hierarchical structure within a columnar file format. And then Julien came along and bested me, because he was actually able to understand that Dremel paper and implement it fully, and also really develop a strong community around it. Trevni hadn't caught on in any quarters yet, and so the wisest thing to do was to let people forget about it, because we don't need multiple formats that are very similar, filling the same niche. So I'm pleased that Parquet came along and replaced Trevni, to the degree that Trevni ever had a spot. Anyway, it was mostly that I couldn't figure out those missing bits that Julien did figure out in that Dremel paper. They were pretty quick and breezy in parts of it.
[00:45:32] Unknown:
Yes. It's a little hand wavy in the Dremel paper, and I had to hit my head several times to figure out what was going on. I felt really bad for a while about, you know, kind of replacing Trevni. But I'm glad we're on good terms.
[00:45:56] Unknown:
You know, it was good that Trevni hadn't caught on. If people had built systems around it and had large amounts of data in it, then to some degree we would have had to commit to preserving compatibility with it, but it never really got that critical mass before Parquet showed up and started to become significantly more popular. There's no ill will. T-r-e-v-n-i: it's "invert" spelled backwards, for no good reason.
[00:46:29] Unknown:
Is there anything else that you think we should talk about before we close out the show? No, that's it, I think. Well, for anybody who wants to follow the work that both of you are up to and the state of the art with your respective serialization formats, I'll have you add your preferred contact information to the show notes. And then, just for one last question to give people things to think about, can you each share the one thing that is at the top of your mind in the data industry that you're most interested in and excited about? Doug, how about you go first? Sure. I mean, I'm
[00:47:05] Unknown:
just fundamentally excited by this notion of an open source based ecosystem of data software. I think we're really seeing an explosion of capabilities for people to get value from data in a way that we didn't in prior decades, and I think we're gonna continue to see this: the power that people have at their fingertips will explode, and so will the possibilities. You know, this year we're talking a lot about machine learning and deep learning, and I don't know what it'll be next year, but there will be something, and it'll be able to really take off, and it'll be something that is useful. It's not just hype, because this ecosystem is driven by users.
It's the nature of this loosely coupled set of open source projects. So I'm continually amazed by that and continue to be excited. I think that's gonna deliver more good things to people. And, Julien?
[00:48:13] Unknown:
Yeah, I agree with this. You can see this deconstructed data stack, where the database used to be very siloed, you know, a fully integrated stack. But in this ecosystem, each component is kind of becoming a standard independently. So you have Parquet as a columnar file format, but you also have other components of this deconstructed database. Like Calcite, the database optimizer that has been used in many projects; it's the optimizer layer of a database as a reused component. Parquet is the columnar storage layer, and Arrow is the in-memory processing component. And those things being reused adds a lot of flexibility to the system, because you store your data, and then you can have many different components that start interacting with each other. And you have the choice between different types of SQL analysis, different types of machine learning, different types of plain ETL, and more streaming.
And all those things can interact together in an efficient way. That's where things like Parquet and Arrow are contributing: helping interconnect all those things efficiently. Because initially, the lowest common denominator formats like CSV or JSON or XML were the starting points, because they were easy and supported everywhere, but they were not very efficient. And now we're getting to that second generation, where we're looking at what common patterns all those systems need and what's the efficient way of having them communicate. And that's where the columnar representations, things like Arrow and Parquet, are there for analysis, and for more streaming things or the ETL side, Avro is the better representation.
And so you have those standards that evolve and that enable this deconstructed database, right? All those elements that are very flexible, loosely coupled, and can interact with each other. So I think the next component that is starting to evolve is a better metadata layer: knowing what all our schemas are, how they evolve, what the storage characteristics are, how we take advantage of our storage layer or the interconnection between systems. And it's going to become more and more of that very powerful, very flexible deconstructed database.
[00:51:07] Unknown:
Well, I really appreciate both of you taking time out of your day to join me and go deep on serialization formats. It's definitely been very educational and informative for me, and I'm sure for my listeners as well. So thank you again for your time, and I hope you each enjoy the rest of your evening.
[00:51:25] Unknown:
Thank you.
[00:51:26] Unknown:
Thanks, Tobias. It's fun to find somebody who actually cares about these things. They're kind of the boring backwater of big data.
Introduction to Guests and Topic
Doug Cutting's Background and Apache Avro
Julien Le Dem's Background and Parquet Project
Serialization Formats Overview
Choosing the Right Serialization Format
Thrift, Protocol Buffers, and ORC
Apache Arrow and In-Memory Processing
Maintaining and Evolving Serialization Formats
Impact of Hardware Evolution on Storage Formats
Closing Thoughts and Future Directions