Summary
A major concern that comes up when selecting a vendor or technology for storing and managing your data is vendor lock-in. What happens if the vendor fails? What if the technology can’t do what I need it to? Compilerworks set out to reduce the pain and complexity of migrating between platforms, and in the process added an advanced lineage tracking capability. In this episode Shevek, CTO of Compilerworks, takes us on an interesting journey through the many technical and social complexities that are involved in evolving your data platform, and describes the system that they have built to make it a manageable task.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Shevek about Compilerworks and his work on writing compilers to automate data lineage tracking from your SQL code
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Compilerworks is and the story behind it?
- What is a compiler?
- How are you applying compilers to the challenges of data processing systems?
- What are some use cases that Compilerworks is uniquely well suited to?
- There are a number of other methods and systems available for tracking and/or computing data lineage. What are the benefits of the approach that you are taking with Compilerworks?
- Can you describe the design and implementation of the Compilerworks platform?
- How has the system changed or evolved since you first began working on it?
- What programming languages and SQL dialects do you currently support?
- Which have been the most challenging to work with?
- How do you handle verification/validation of the algebraic representation of SQL code given the variability of implementations and the flexibility of the specification?
- Can you talk through the process of getting Compilerworks integrated into a customer’s infrastructure?
- What is a typical workflow for someone using Compilerworks to manage their data lineage?
- How does Compilerworks simplify the process of migrating between data warehouses/processing platforms?
- What are the most interesting, innovative, or unexpected ways that you have seen Compilerworks used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Compilerworks?
- When is Compilerworks the wrong choice?
- What do you have planned for the future of Compilerworks?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Compilerworks
- Compiler
- ANSI SQL
- Spark SQL
- Google Flume Paper
- SAS
- Informatica
- Trie Data Structure
- Satisfiability Solver
- Lisp
- Scheme
- Snooker
- Qemu Java API
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
[00:00:18] Unknown:
When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data in motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand lets you identify data quality issues and their root causes from a single dashboard. With Databand, you'll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand today to sign up for a free 30-day trial and to take control of your data quality. Your host is Tobias Macey. And today, I'm interviewing Shevek about Compiler Works and his work on writing compilers to automate data lineage tracking from your SQL code. So, Shevek, can you start by introducing yourself?
[00:01:50] Unknown:
Hi. I'm Shevek. I'm technical founder of Compiler Works. I guess I started writing compilers, I don't know, by accident 20 years ago probably. You've given me the introduction challenge, which is to figure out why. I think it's because they're just 1 of the hard problems, and an awful lot of things that are put out there as languages aren't really compilers. They're just a syntax stuck on top of an executor. And I think almost all toy languages, whenever anybody says I've invented a language, they haven't actually invented a language. There's no semantics to the language itself. They've just stuck a syntax on top of an executor. And as I've gone on with writing compilers, I found that the real challenges are where there's an actual semantic transformation between the language being expressed and the language and the target system.
And that's when it really starts to get interesting. I'm not really interested in parsers. Parsers, when you speak to people who learn about compilers in university, they have been taught to write parsers. And I think 1 of the reasons for this is it's really fun and entertaining to write a 12-week course about parsers, and you can teach it very slowly, and you can do LL and LR and Earley's algorithm and railroads and all of these other algorithms that nobody in their right mind would use because they're theoretically interesting. But please teach people about languages and compilers and semantics. Because until you're talking about mapping between semantic domains, you're not really doing the job. Given your accidental introduction to compilers and subsequent fascination with them, how did you end up in the area of data management?
I'm gonna be brutally honest: it's where the money is. What you've got out here in the world of data management is a vast world of enterprise languages, each of which has a single vendor. And by the nature of a single vendor, they get to charge what they please and tell you what you can do with it. And so there's a traditional joke about exactly what filling you would like in your sandwich. You can have this marvelous enterprise capability, but you have to have all of these restrictions with it. And it started with taking those restrictions off the enterprise language and just saying, what else could we do if we had a truly generic open compiler for each of these proprietary languages?
And actually, that philosophy started a little bit earlier because there were a lot of languages out there. Like, there's no point writing a commercial compiler for C these days unless you're doing something particularly obscure with, like, FPGAs. But, fundamentally, C is free. That doesn't necessarily make it easy. And so if you go back into my GitHub, you'll see that 1 of my earliest projects was writing C preprocessor implementations natively in a whole set of languages. Because back in the nineties particularly, anytime you wanted a preprocessor (this was before we had the whole modern world of assorted preprocessors and templating languages), people just used either CPP or M4.
But if you were working in, say, Java and somebody defined something using the C preprocessor, you didn't have a ready implementation. And so that philosophy of writing things that were compatible with other things and just opening up the world has been an underlying philosophy of what I've been building for a lot longer than data processing languages. It just turns out that data processing languages are a commercially viable place to do it.
[00:05:14] Unknown:
As good a reason as any to be in the business. And so that brings us to what you're doing at Compiler Works. And I imagine you've sort of given us the prelude to the story behind it, but I'm wondering if you can just share a bit more about sort of the business that you've built there and some of the motivation and story behind how you ended up building this company to address this problem of lock in by data processing systems.
[00:05:39] Unknown:
The reason you build a new database is because you have a new capability. And then we get this marvelous phrase, ANSI SQL, which is a myth about as credible as Bigfoot. Lots of people claim to have seen it, but nobody actually has. And so now we come back to this question of translating between semantic domains. I have 2 SQL vendors. I have code on 1. I want to run it on the other. The code is syntactically similar because the standards committee wrote this great big expensive document that dropped some hints about how you might want your language to look if you were to consider making it look like something that might look like this. But the languages don't fundamentally do the same thing. Simple case, like what does division mean? What's 1 divided by 10?
Now, if your database server happens to operate in integers, the answer is 0. If your database server happens to operate in numerics, the answer is 0.1 with 1 decimal place of precision. If your database server happens to operate in floating points, then the answer is something very close to 0.1, but not exactly the same, because that's not an exact binary number. And so now, even with that very trivial case, I've motivated something beyond ANSI in terms of translating between database servers.
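To make that concrete, here is a minimal Java sketch of the three behaviours just described, with Java's `int`, `BigDecimal`, and `double` standing in for a database's integer, numeric, and floating-point types (an illustration only, not CompilerWorks code):

```java
import java.math.BigDecimal;

public class OneDividedByTen {
    public static void main(String[] args) {
        // Integer semantics: the fraction is truncated.
        System.out.println(1 / 10);                                 // 0

        // Numeric/decimal semantics: an exact decimal result at some precision.
        System.out.println(BigDecimal.ONE.divide(BigDecimal.TEN));  // 0.1

        // Floating-point semantics: the closest representable binary value,
        // which is not exactly one tenth.
        System.out.println(new BigDecimal(0.1));                    // 0.1000000000000000055511151231257827...
    }
}
```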
So Compiler Works has 2 products. 1 is we translate code from 1 data processing platform to another, and we do it correctly, and that's correctly with a capital C. And the other thing that we do is we compile the code, and we do static analysis, and we tell you what it is that you need to know about this code. And a significant part of that is things like, if I change this, what will the effect be on my organization as a whole? So you might imagine that you're writing code that produces a particular table on a particular server, and you do or don't make a particular error, or you do or don't make a particular change. Somewhere, 10 levels removed from you, in a different suborganization of an organization that employs tens of thousands of people, somebody is affected by this change.
Who are they? How are they affected? Do they need to be told? Did they make a critical business decision based on that? And what do we need to do in order to keep this organization running? And should you really be considering making this change before you made it? Or if you already made the change, what do you now need to do to fix up? So that's the static analysis side. Definitely a lot of interesting things to dig into there. And lineage in particular is an area that has been gaining a lot of focus and interest lately from a number of different parties and attempts at addressing it in different ways. So I definitely like the sort of compiler driven static analysis motivation for it. And
[00:08:18] Unknown:
I'm wondering before we get too much further down that road, you've already given a bit of an overview about some of the differences between parsing and compiling. But for the purposes of kind of laying the groundwork for the rest of the conversation, can you just give sort of your definition of what is a compiler and how is that relevant to this overall space of language translation and data lineage?
[00:08:42] Unknown:
I usually describe a compiler as something that turns something from 1 language into another. In the case of lineage, what we're doing is we're turning something from the underlying source code into an algebraic model, and then that computer algebra model is the system of which we can ask questions regarding, you know, what happened, what the consequences are, where the lineage is. It's sort of interesting to think about the compiler, particularly the lineage side. It's like, are we a compilers company, or are we a computer algebra company? I suspect, really, we're the computer algebra company because that's where the hard stuff is.
I think lineage analysis getting popular is an interesting question. Because if I, for instance, write some piece of code that says, you will generate a piece of SQL that takes this table, and it processes like that, and it generates that table, well, I've got lineage from my source table to my target table. But now we fall into a hole, which is this. If the language that I invented in order to generate this SQL is just as expressive as SQL, then that language is going to be flamingly complicated, and it's going to have all of these semi-joins and stuff. Typically, the reason people like these languages is to make them not be as expressive as SQL, because they want to make them more accessible to developers, or they wanna have clicky, droppy boxes or something like that.
Now assuming that you followed that road, which is basically universal, now you have the problem that your language is insufficiently expressive to do the thing the analyst wants to do. And so now what happens is you have this text box or this field or something where you type in a fragment of underlying SQL. And now what you've got is not a language, but a macro preprocessor, which doesn't actually know what's going on in that fragment of underlying SQL that the developer typed in. And so all of these tools, they start out saying, yes, you're gonna build your thing in our nice GUI workflow. We're gonna show you this nice GUI workflow, and that will give you the lineage. But you don't really have the lineage because you don't have enough expressiveness to really do the job. Therefore, the developers had to type some custom code into a box, and you don't understand that custom code. Therefore, you don't have lineage.
And if what's happening is that you're subject to something like CCPA or GDPR, where you are going to jail if you get this wrong, then you don't have a lineage tool. You actually need to look at the code that the machine really executed and analyze that code accurately at a column level, and then you have a lineage model, and then you've got a chance of not going to jail. But anything less, we do not define as lineage.
[00:11:21] Unknown:
Yeah. And then another motivation for trying to reconstruct lineage is that the best practice, quote, unquote, for data processing these days is to spread your logic across these 5 different systems, where you need to run this tool to get your data out of this source system into this other location, this other tool to process your data in that location, this other tool to build an analysis on top of that preprocessed data, and then this other tool to actually show it to somebody to make a decision based on it, and then they input other data back into this other system, and we pull it back out again. And so trying to reconstruct that flow of operations using an additional set of language processing to, you know, post into an API that stores the data in a database or
[00:12:06] Unknown:
trying to analyze the query logs from your data warehouse that has a very limited scope view of the entirety of the data life cycle and trying to sort of piece all this back together? Well, the query and audit logs typically give you a good start because they're at least partly written by security people, and the security people say, you must tell us everything that's gonna go on. The fragmentation of the data infrastructure, which is the other thing that you alluded to, is very real. And I think this leads to a situation where in a typical installation, we're processing multiple languages and stringing them together. And sometimes that stringing together is standardized and sometimes it's bespoke. But the challenge in putting together a lineage is to be able to identify a column in the ODS up front in, say, the web tier where a user has interacted with something and list all of the back end BI dashboards that that column affected and then describe the effect of that user interaction on each of those dashboards in a human readable way. That's the challenge.
[00:13:06] Unknown:
Absolutely. And so to that point, you've mentioned a little bit about some of these enterprise processing languages and the tool specific semantics about how they manage that processing, and you've discussed some of the wonderful joys of the SQL ecosystem and trying to translate across those. And I'm wondering if you can just give an overview of the specific language implementations and areas of focus that you're building on top of for compiler works, whether you're focused primarily on the SQL layer and being able to generate, you know, these transformations and this lineage for the databases or if you're also venturing out into things like Spark jobs or arbitrary Python or Java code and things like that? Yes. So you run into
[00:13:53] Unknown:
a number of issues as you walk around the data infrastructure. So the SQL languages are, for the most part, statically analyzable. There are a couple of holes in them, and there are 1 or 2 that have type systems that lead 1 to lose hair. From there, 1 can fairly easily go out to the BI dashboards, particularly the richer products. So we're talking about some of the flow based languages. And at this point, as a compilers company, 1 ends up writing a new piece of technology because basically, all of the SQL languages are tied into the relational algebra or some variant or some set of extensions thereof. It's always amused me that so many of the papers on relational optimization start with a phrase something like, without loss of generality, we shall assume that the only Boolean connective is and, which basically means that you're not allowed to use or and you're not allowed to use not. Well, guess what? If you do make that assumption, the world becomes really easy and really simple, and they're denying the existence of outer joins. It's really easy to write an academic research paper on optimization if you only deal with and, but I disagree with the without loss of generality.
You've lost the whole of the real world there. Anyway, I digress. Sorry. So, yes, there's a bunch of data flow languages and BI dashboarding systems that actually work effectively with data flow and data process management. So here we're talking about the Informaticas, the Tableaus, the things in that range, the DataStages. So we grew out to work with those, and then we have a secondary core that speaks to the same computer algebra engine that deals with these data flow style languages. Spark is a sort of a mixture because, you know, on the front end, Spark SQL looks like SQL. The question then is (and I'm gonna speak generically, not specifically about Spark SQL), what's the strength of the join optimizer before you compile down to a data flow language, and are you really a data flow language?
The seminal paper, I think, for people wanting to understand why data flow languages have benefits is probably the Google Flume paper, particularly the statistics about the reduction in MapReduce jobs by doing delayed evaluation. But once you get out of that and you get into the ETL languages, you also run into things like SAS. And so now you end up with questions like, how do you port, let's say, Informatica to Spark? And I picked those 2 because they're both data flow languages, but Informatica has this fundamental property that computation is sequential, which is to say that if you set the value of a read/write port, that value remains assigned and remains visible to the next data record.
And so you can actually generate a datum by saying, if the record number is 1, set the value to x. If the record number is not 1, just read x. And in an MPP system, you would get x in 1 record and null in every other record. But in both SAS and Informatica, you get the same value of x everywhere. And this is a sort of hard semantic difference that makes it very, very difficult to map between languages. You know, this is where we break out of the traditional job of compiling. We actually have to up engineer into user intent. The user said this.
If you're compiling C or Java and you're compiling it down to x86, the user said this, therefore do this. And if you do anything else, it's your fault. But if you're compiling some of these languages, it's like, the user said this. We've had a look around. We think they really meant this part of what they said, and every other part of what they said was irrelevant or a consequence of the implementation. And, therefore, we're gonna generate high-performance code for the target that preserves the thing they meant and discards the rest, and that's hard.
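As a hypothetical sketch of the read/write port semantics described above (not CompilerWorks code, just Java lists standing in for a stream of records), this contrasts a sequential engine, where a port value set on record 1 is still visible to later records, with an MPP-style engine, where each record is evaluated independently:

```java
import java.util.*;
import java.util.stream.*;

public class PortSemantics {
    public static void main(String[] args) {
        List<Integer> records = List.of(1, 2, 3, 4, 5);

        // Sequential engine: a read/write port keeps its value from one record
        // to the next, so records 2..5 still see the value set on record 1.
        String port = null;
        List<String> sequential = new ArrayList<>();
        for (int rec : records) {
            if (rec == 1) port = "x";   // set on the first record only
            sequential.add(port);       // later records read the retained value
        }

        // MPP-style engine: every record is evaluated independently, so only
        // record 1 ever observes the assignment.
        List<String> parallel = records.parallelStream()
                .map(rec -> rec == 1 ? "x" : null)
                .collect(Collectors.toList());

        System.out.println("sequential: " + sequential); // [x, x, x, x, x]
        System.out.println("parallel:   " + parallel);   // [x, null, null, null, null]
    }
}
```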
[00:17:38] Unknown:
Yes. Exactly. The technology is easy. It's the people that are hard as with everything that has to do with computers.
[00:17:45] Unknown:
I don't envy you editing that long term because I went very long winded.
[00:17:49] Unknown:
No. There's nothing to edit there. Continuing on the point of the semantics being the hard part of translating these data processing languages, and you mentioned earlier the fact that at the core, you think you're more of a computer algebra company than a compiler company. I'm wondering if you can discuss a bit of the sort of abstract modeling and mathematical representations that you use as the intermediate layer for translating between and among these different languages and generating the lineage analysis that is 1 of the value adds of what you're doing there? I won't,
[00:18:28] Unknown:
but I will say some interesting corollaries. But I hope you will forgive me for not answering the question as you directly asked it, which is a very interesting question. Absolutely. There's an old party trick where you take a floating point value, and you go around the loop a million times, and you add 1 to this floating point value. And the question now is, what's the value of that floating point value? And the answer is, it's not a million. It's about 65,000 because, eventually, you reach the point where the exponent ticks over. And at the point where you're not seeing the last integer digit anymore because your exponent's ticked over, adding 1 to a floating point value has no effect.
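The exact point at which the accumulator stalls depends on the float width in play; for a 32-bit IEEE float it is 2^24. A small Java sketch of the phenomenon (illustration only):

```java
public class FloatAccumulation {
    public static void main(String[] args) {
        // Keep adding 1.0f to a 32-bit float. Once the accumulator reaches 2^24
        // (16,777,216), the gap between adjacent floats exceeds 1, so x + 1.0f == x
        // and the loop makes no further progress.
        float x = 0.0f;
        for (int i = 0; i < 20_000_000; i++) {
            x += 1.0f;
        }
        System.out.println(x); // 1.6777216E7, not 2.0E7
    }
}
```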
Processors are weird. There's another party trick where you write an array of memory, you fill it with random numbers, and then you add all the numbers into an accumulator. But then you try doing that iterating backwards, and then you try doing that iterating in a random order, and you see what the performance difference is. And it turns out that processor hardware and memory prefetch and so on is an absolutely delicious thing as long as you're reading memory forwards. It sort of manages if you're reading memory backwards, and it falls flat on its face, throws its hands up in the air, and screams if you read memory in a random order, to the tune of about a 200-to-1 performance penalty.
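A rough way to see this for yourself is the toy micro-benchmark below (a sketch only; JIT warm-up and allocator behaviour add noise, and the exact ratio will vary by machine):

```java
import java.util.Random;

public class AccessOrder {
    public static void main(String[] args) {
        int n = 1 << 25;                 // ~32M ints (~128 MB per array); shrink if memory is tight
        int[] data = new int[n];
        int[] order = new int[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {
            data[i] = rnd.nextInt();
            order[i] = i;
        }
        // Fisher-Yates shuffle: we will still visit every element exactly once,
        // just in a random order.
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int t = order[i]; order[i] = order[j]; order[j] = t;
        }

        long sum = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += data[i];          // forwards: prefetcher-friendly
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += data[order[i]];   // random order: defeats the prefetcher
        long t2 = System.nanoTime();

        System.out.printf("sequential %d ms, random %d ms (sum=%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sum);
    }
}
```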
So now let's think about a tree data structure. On paper, a tree data structure has logarithmic complexity. Brilliant. And, academically, we ignore the constants. But what a tree data structure looks like to a processor with memory prefetch is random order access. And so what that means is that the constant is an order of magnitude larger than anybody thinks it is, which is why, on paper, a heap and a tree have the same performance, but in practice a heap is so much faster, because you start to fit into cache lines. Now computer algebra systems look awfully like random order access to memory, and I think 1 of the most interesting problems in any sort of computer algebra (you'll even find it in SAT solvers, where people are optimizing C, and they're changing the order of the structs in the internals of the SAT solver such that the hotter fields are up at the top end of the struct, because that's the bit that will fit into cache) is memory ordering. We actually get multiple orders of magnitude by having a solver within the computer algebra engine, which itself works out what order to do things in, so that we don't appear to be accessing the algebra structure in random order. The more you can do on a piece of memory while it's in cache before you drop it out of cache, the better. And then, of course, the other entertaining question is, wait, you do all of this in Java? Isn't Java some language where you're a million miles away from the processor?
Actually, I happen to think Java and the JVM are a beautiful, beautiful setup, because you get to be a million miles away from the processor when you want to be. But when you actually want to get down low, you've got control of everything down to memory barriers. And at that point, you're pretty much able to write assembler. And so it's the joy of the language. They say 90-something percent of your code doesn't need optimizing, and they're right. So the question is, can you ignore 90% of the job and do the 1% of the job? I think
[00:21:47] Unknown:
the JVM is 1 of the greatest feats of modern engineering for allowing that. Just for point of reference for people who are listening and following along, I'll clarify that when you say tree data structure, you're speaking of a trie, spelled T-R-I-E, not tree.
[00:22:01] Unknown:
Well, either will do. A B-tree, a trie (a tree with an I), you know, an RB-tree. Anything where you're effectively allocating nodes into main memory and then making those nodes point to each other, and particularly where your allocator is, you know, some sort of slab allocator that's mixing your tree nodes up with other things. You know, if you just allocate a tree, then maybe, yes, your root node is allocated at the start of RAM and everything else is allocated sequentially. But the moment you start rotating and mutating a tree, then a tree walk looks like random order memory access again.
[00:22:34] Unknown:
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Digging more into the technical architecture of what you're building at Compiler Works, can you give a bit of an overview about the workflow and the life cycle of a piece of code? I guess the data is irrelevant here since you're working at the level of the code, but the life cycle of a piece of code as it enters the Compiler Works system, your processing thereof, and then the representation that you generate on the other side for the end users of the system?
[00:23:41] Unknown:
Yes. And I'm going to answer about 3 quarters of that, of course. So let's deal with the thing that we deal with at the start. Everybody knows about lexing and parsing. Lexing and parsing are not necessarily as immediate as everybody thinks they are. So, for instance, we're taught that foo is an identifier and 5 is an integer, and that foo5 is an identifier because it's something that starts with a letter. And then you ask the question, well, what is 5foo? And the answer is it's an illegal identifier because it's an identifier that starts with a digit. But if you go into any SQL dialect and you type select 5foo, what you will get is the value 5 aliased to the name foo, because we implicitly, as humans, assume that there needed to be a space between the 5 and the foo. But if you follow the textbook instruction of how to write a lexer and parser, you actually get the bug that I just described. If you do it the way they taught you in school, you get that bug.
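A toy Java sketch of that difference, contrasting a "textbook" lexer that rejects an identifier-looking token starting with a digit with a SQL-style lexer that splits it into a number followed by an identifier (hypothetical code for illustration, not how any particular engine, or CompilerWorks, is implemented):

```java
import java.util.*;
import java.util.regex.*;

public class SelectFiveFoo {

    // SQL-style lexing: maximal munch per token class, so "5foo" splits into the
    // number 5 followed by the identifier foo, i.e. SELECT 5 AS foo.
    static List<String> sqlLex(String sql) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|[0-9]+|\\S").matcher(sql);
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    // "Textbook" rule: an identifier may not start with a digit, so a run of digits
    // immediately followed by letters ("5foo") is rejected as a malformed token.
    static List<String> textbookLex(String sql) {
        if (sql.matches(".*\\b\\d+[A-Za-z_].*")) {
            throw new IllegalArgumentException("identifier may not start with a digit: " + sql);
        }
        return sqlLex(sql); // otherwise tokenize normally
    }

    public static void main(String[] args) {
        String sql = "select 5foo";
        System.out.println(sqlLex(sql));          // [select, 5, foo] -> 5 aliased to foo
        try {
            System.out.println(textbookLex(sql));
        } catch (IllegalArgumentException e) {
            System.out.println("textbook lexer: " + e.getMessage());
        }
    }
}
```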
So now it gets a little bit interesting because the first part of writing a compiler for an enterprise language is working out what its structure is. What are we even being given there? So let's take a language that exports itself as XML. You know, there's a number of them out there. So now you've got a load of XML, and this XML has words in it, such as such-and-such ID equals foo. Well, what does that foo refer to? It refers to some other foo. XML can really only represent a tree structure, and all of these languages are data flow structures. Therefore, they must be representing graphs. Therefore, there must be linkages within this XML. So the first stage is looking at a load of samples, working out what the semantics of the language are.
Now here we have an advantage because most of these languages were written as relatively thin skins on top of their executors. And so if you know the capabilities of the executor, and I kid you not, but actually reading historical papers about how memory allocators worked and things like that will give you a lot of insight into things like the extent of variables. You learn when values get reset or when they get deallocated just by knowing what technology was available to the authors of the language at the time they wrote the language. So having worked out what all of the linkages are, you now have a symbol table, and having done the parse, you do the compile, which is symbol table type check, operation selection, very classic compiler work. And now what you have is you have a compiled binary in the semantics of the source language, which are not necessarily atomic semantics.
And so now what you need to do is to break those semantics down, where some of those semantics may be quite large, into effectively atoms, meaningful atoms. So now we will end up with something like 32-bit integer addition with exception on overflow, and you might even get an annotation about what the exception is on overflow. And now you've got the set of challenges where, if you're doing lineage analysis, you have a whole set of computer algebra rules that will tell you what you need to know about this thing. Am I doing this? Am I doing the same thing twice, when you're looking for, you know, matching regions of algebra? Without going too much into details, you could do something like our Wigner distance on a computer algebra data structure or something like that. It's fundamentally hard. But things like that are available for saying, why are my marketing department and sales department, computing the sales figures, coming up with different numbers? Because now they're disagreeing over who gets how much money. And that's the sort of problem that needs to be solved here.
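To pin down the "atoms" point from above: even something as small as integer addition splits into distinct atoms depending on overflow behaviour. A quick Java illustration, with Java's own operators standing in for the atoms (illustration only):

```java
public class AdditionAtoms {
    public static void main(String[] args) {
        int a = Integer.MAX_VALUE;

        // Plain 32-bit addition silently wraps around.
        System.out.println(a + 1);                        // -2147483648

        // "32-bit integer addition with exception on overflow" is a different atom
        // with different semantics, even though both are written "+" in source code.
        try {
            System.out.println(Math.addExact(a, 1));
        } catch (ArithmeticException e) {
            System.out.println("overflow: " + e.getMessage()); // overflow: integer overflow
        }
    }
}
```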
Emitting code is actually a whole new set of challenges. So this is for the case of migration from platform to platform, because if 1 just takes the semantics and emits them to the target platform, you get code that has a number of issues. First, you get non-optimality. You get the fact that it's not using the correct idioms of the target platform. You also get the fact that it's ugly. We've all seen computer-generated code, and nobody wants to maintain it. And a significant part of generating code for a target platform is working out what code to generate which is idiomatic for the target platform, idiomatic for the particular development team that gave you the input code, and human readable and human maintainable.
So to give you a trivial case, there was an old joke about, I stole the artificial intelligence source code from the government laboratory for artificial intelligence, and I'm gonna prove it by dumping the last 5 lines. This joke came out when Lisp and Scheme were the popular languages, and the punchline of the joke was 5 lines of closing brackets. If you just do machine-generated code, which everybody has done at 1 point, you either fail at 1 + 2 * 3, or you generate 5 lines of closing brackets. And that's the sort of problem that is non-obvious to the emitter. And I've also alluded, when I said idiomatic to the developers who wrote the original source, to the fact that there are things that you have to preserve about the original source which are not necessarily semantics of the language, but which are in fact idioms of the development team in question.
[00:28:50] Unknown:
Yeah. It's definitely an interesting aspect to the problem because, as you point out, there are certain implicit meanings, or sort of meanings that are a side effect of the structure of the code, that have nothing to do with the semantics of the code or its sort of computational intent, but that do help in terms of the cognitive and organizational complexity management for the team that is writing and maintaining the code, and that they might want to maintain in the output because of things like splitting logic on team boundaries, for instance.
[00:29:22] Unknown:
Yes. And generating code that a machine will accept is vastly easier than generating code that a customer will accept. I mean, the general market approach to doing machine translation is you write a parser, you jump up and down on the parse tree, and then you emit the parse tree, and you say, this is the target language, and you wave the ANSI SQL flag as hard and as loudly as you can. And you replace some function names while you're at it. But the moment you type check, you've now done things like inserting casts. And if you generate code that contains all of those casts, there are 2 things here, both of which you can't do. 1 is to generate code that contains all of those casts because your human maintainer will say, nope.
And the other is to assume that the target language does the same implicit type conversions, or even fundamentally has the same types as the source language. And the answer to that is, nope. You cannot divide 1 by 10 in any financial institution unless you know exactly what you are doing. Absolutely.
[00:30:21] Unknown:
And so to that point, it's interesting to dig into some of the verification and validation process of the intermediate representation of the language and the onboarding approach to bringing new target platforms or new source platforms under the umbrella of Compiler Works and just the overall effort that's involved in actually doing the research to understand the capabilities and semantics of those systems.
[00:30:48] Unknown:
Yes. And then you start to speak to, well, what kind of company are we? Are we a computer algebra company, or are we a set of research historians? What do I know about unheard of platform x? Well, who wrote it? When did they write it? Where did they write it? And an awful lot of that gets folded into the initial development of a language. We are utterly test driven. You basically have to be. And so starting out with a new language, it really is just about passing test cases and building customer acceptability. There are other parts of this question which I apologize I'm not going to answer, and I'm trying to fish things out that I can say because 1 of the things that we developed over the years is the ability to implement a compiler for a language in a shockingly short space of time.
Once upon a time, we actually signed a contract to do a language in a space of time which was, you know, bordering on professional responsibility. And, of course, we did it and we hit it. And the thing that we didn't publish was we actually did it in less time than that because 1 of the things that we know how to do is to understand languages and put together an implementation of a language in a very short space of time. But a lot of this comes from having a core where, for instance, types only behave in certain ways. And if you can express all of the ways in which types behave and types interrelate, then you can describe a language in terms of, for instance, its type system.
And that, to us, is a tool that we have available. It's sort of interesting when you're mapping between languages where types have inheritance and polymorphism to a language where types maybe have inheritance and polymorphism but have different relationships between themselves. So at that point, something which was a polymorphism conversion in 1 language is an explicit type conversion in another language. Understanding of types is very, very important.
[00:32:50] Unknown:
Absolutely. Especially when dealing with data.
[00:32:53] Unknown:
Yes. Date times are the killers. Because even if you know that you've got a date time, there's 1 hour in every year that doesn't exist and 1 hour in every year that exists twice. And then people do things like, okay, so adding 1 to a date time is simple. All you have to know is whether that particular language interprets the 1 as days or milliseconds. But then you get into all sorts of craziness, like, if I take a time and I convert its time zone, did I get the same instant, or did I get what we call a time zone attach? So is 1 PM BST, in Pacific, like 9 PM or whatever it is, 7 PM PST, or is it 1 PM PST? And different database servers do different things when given this operator, and that again blows ANSI SQL out of the water.
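Java's java.time API happens to expose exactly those two choices, so here is a small sketch of "same instant" versus "time zone attach" (an illustration of the two behaviours; which one a given database server picks is precisely the point in question):

```java
import java.time.*;

public class TimeZoneAttach {
    public static void main(String[] args) {
        ZonedDateTime london = ZonedDateTime.of(2021, 7, 1, 13, 0, 0, 0,
                ZoneId.of("Europe/London"));          // 1 PM BST

        // Same instant, re-expressed on the US west coast: 5 AM that morning.
        System.out.println(london.withZoneSameInstant(ZoneId.of("America/Los_Angeles")));
        // 2021-07-01T05:00-07:00[America/Los_Angeles]

        // "Time zone attach": keep the 1 PM wall-clock reading and swap the zone label.
        System.out.println(london.withZoneSameLocal(ZoneId.of("America/Los_Angeles")));
        // 2021-07-01T13:00-07:00[America/Los_Angeles]
    }
}
```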
And then what you actually end up doing is just figuring out how the database server does it internally and then modeling that. And then you're back into history. You're back into reading source code. The Postgres source code is 1 of the marvelous resources on the planet because it tells you how a lot of the database servers out there work, and then you wanna know when they forked, what they did, why they did it, who did it, and so on. Yes.
[00:34:04] Unknown:
And I'll agree that Postgres is definitely a marvelous resource, and it's fascinating the number of systems that have either been built directly on top of it or, you know, inspired by it even if not taking the code verbatim.
[00:34:16] Unknown:
Yes. It's also, I mean, you've got this challenge when you say we're Postgres compatible and you sort of adopt the Postgres mutator, and people don't typically want to do anything with the Postgres mutator. And compared to almost every commercial dialect, the Postgres mutator has a fundamental weakness in its handling of time zones, which nobody has ever seen fit to correct. I suspect that core Postgres can't, which is that you don't have a time stamp that keeps its time zone, and the time zone handling in Postgres is basically to be avoided if you want to get the right answer.
And yet, Oracle, Teradata, BigQuery, everybody else does it right. So, yeah, Postgres is a wonderful resource, but I do wonder that people base things on it with that weakness.
[00:35:02] Unknown:
Most people who start basing their systems on top of Postgres haven't done enough of the homework to recognize that as a failing before they're already halfway through implementation.
[00:35:11] Unknown:
I think the majority of people who have a good idea and they want to get a demo of their good idea out as fast as possible really don't think about the consequences of their decisions on the first 2 or 3 days. And I think there is a phenomenal bias among developers starting out to imagine that because something gives you a very fast day 1, that it will give you a very fast day 3. And they think, okay. We'll get 6 to 12 months down the line, and then we'll rewrite it. And I think that for an experienced developer, the crossover point with technologies is around day 3, not month 3, and this is a big, big mistake.
We made some very interesting technological decisions about things that we were and were not going to do with this company right at the start of the company, and they paid off. And some of those decisions were that we were going to do a lot more hard work that was necessarily obvious. And we've watched people come up behind us and say, we're gonna make different technological decisions and suffer the consequences of those things and sort of run into a wall. But the number of times that I've been told that, for instance, we want to develop the back end in Node because that way we get to use the same models on the front end and the back end.
It's like whoopie doo. Got no type checker. Okay, TypeScript. I'm looking at you. Okay. Whoopee doo. Let's, what? And your experienced developer's gonna sit down with a decent web framework with a DI framework and everything else, and we'll have you know, you're gonna be overtaken by the end of day 3 at best, and there are companies out there that know this.
[00:36:55] Unknown:
Yes. As you were talking about technologies that have been thrown together to get a fast solution, the first thing that came to mind was JavaScript. So I appreciate that you called it out explicitly.
[00:37:05] Unknown:
I like JavaScript as a language, but I also have this rule about writing shell scripts, which is the moment you find yourself using anything like arrays, you're in the wrong language. I have this sort of set of criteria that tell you that you're in the wrong language. There were things I very much like about the JavaScript ecosystem, and there were things that I would definitely go to it for. However, it does make me kind of sad to see it slowly reinventing or rediscovering or hitting many of the problems that other languages have hit. Another example was some years ago, there was a great big fuss about the ability of an attacker to generate hash collisions, putting perturbations into hash tables.
And somebody pointed out that if I generated the correct set of SYN packets in TCP and sent spoof SYN packets to a remote Linux kernel, because it used a predictable hash and it was a list hash, we could convince the kernel to put all of those SYN packets into the same chain in the list hash, and it denies service to the kernel because it was spending all of its time walking this linear chain rather than benefiting from the hash table. And I watched the same bug get discovered in Perl, which taught it to use perturbation of hashes. And then I waited something like 3 or 5 years for somebody to point out that, actually, the same bug existed, and I forget whether it was either Python or PHP.
And then you get into this world where developers say, hang on a minute. My hash iteration order changed. You're not allowed to do that. And then you say, yes, you are. It says so on the tin. And so there's this whole pattern of watching developers discover solutions to problems that other languages have already solved. It's like, once you've discovered it in Perl, which I think might have been the first 1, or it might have been the second, but I'd have to check, go around all of the other languages and look and make sure. Don't wait 5 years. And the same thing's true for the JavaScript ecosystem. Like, they've waited 15 years to reinvent certain things.
[00:39:00] Unknown:
Yes. Developers have remarkably short memories and attention spans, at least in certain respects. Correct. And so bringing us back to what you're building at Compiler Works and the sort of usage of it as a static analysis and lineage generation platform, I'm wondering if you can just talk through some of the overall process of integrating Compiler Works into a customer's infrastructure and workflow, and some of the user interactions and processes and systems that people will use Compiler Works for and build on top of the Compiler Works framework?
[00:39:40] Unknown:
So what you'll find is that most of the data processing platforms out there have some sort of log or some sort of standard presentation of their metadata. And we at Compiler Works aim to make everything as easy as possible, by which I mean we take that standard presentation of the metadata. If you're working BigQuery, we take the BigQuery logs. If you're working Redshift, we take the audit logs. If you're working Teradata, we take the various things that Teradata throws at us. And having basically given the Compiler Works dumper permission to access these logs, it makes a dump and it pulls them into the product, and the rest of it is automated.
Because the fundamental thing that we operate on is if it's possible for the underlying platform to understand that code, it's possible for us to understand the code. We have all of the temporal information. We have all of the metadata. We have all of the semantic information. And from then on, it's all gravy. We pull the logs. We put it up into the user interface. We make the data available as APIs. And from then on, you can just explore lineage much as you've seen in our video presentations.
[00:40:42] Unknown:
In terms of the migration process, you've discussed a lot of this already, so we can probably skip through this question a little bit. But what is the overall process of actually doing the migration from platform a to platform b and especially doing the validation that the answer that you get on the other side of the transformation, at least close enough matches the answer that you were getting before you made the migration, and then maybe a little bit of some of the
[00:41:12] Unknown:
reasons that people actually perform those migrations in the first place. So lineage is totally easy. You can usually get up and running with a Compiler Works lineage in a few minutes, as long as it takes you to pull the log. You pull the log, you run it. Migration tends to be, in practice, a little bit hairier because the customer's presentation of their code is not standard. A significant percentage of customers preprocess their code or something. I mean, this is actually where some of the enterprise languages are nicer. With the more capable enterprise languages, while the compilers that we have to write for them are much tougher, the customers tend to present their code in a more standardized form because the language itself is more capable.
When you get a relatively incapable language, the customer tends to mess with it, procedurally generate it, do all sorts of things. It's almost like they're treating the underlying language just as an executor. So the first question you have to ask is, what's your presentation of your code? How did you mess with it? Once you've got a hold of the presentation of the code, what you do with Compiler Works is you specify what the input language stack is. And this is actually quite nice because in Compiler Works, you can take a language that generates another language or contains another language or that preprocesses another language and say, this is a language stack. You're going to absorb this, you're going to transpile, and you're going to emit to a target language stack that has some of the same preprocessing or management capabilities as your source language stack.
And this is yet another hint to say that writing a purely academic Oracle to Postgres compiler isn't enough because the Oracle exists within the context of something else and may be incomplete and so on and so on. And again, if you don't do that, you fail human acceptability. So the start of a migration process is get the code, work out how it's specified, tell Compiler Works how this customer currently specifies their code, tell Compiler Works how the customer wants their code specified, and then run it for the migration. And that process, I have walked into a meeting room and done it cold in an hour, and this could be done. You know, given that the customer typically doesn't know the answer, usually, they don't know the answer to the target platform. They've been sold something by a vendor. They think it's a marvelous idea, and you say, how do you want to use this target platform? And they say, we don't know.
And then we make a recommendation, and we work with their advisors or someone to make that recommendation and get it right. 1 of the things that you get after this is that we have a lot of versatility with respect to doing the migration job, not just converting code. Testing is an interesting 1. Customers vary in what they will accept. As I said with the 1 divided by 10 example, I'm very, very precise in how we convert. We have customers who absolutely lean on us for that, and they say, I want this accurate down to the last dollar. If you're dealing with financials, sometimes they care down to the last dollar. I'm slightly avoiding naming names here. If you're dealing with some of the markets that we deal with, they're happy with anything that's within 5%.
And now, there's another thing that gets slightly interesting, which is if you're dealing with financials, you'll always use decimal types for data. I have seen people in certain markets use floating point types for data. And the consequence of that is that if you do a sum of a float, you could get any answer at all. It's not like you will probably get an answer that's within 5% of the result. People don't understand floating point arithmetic. You could get anything. And the difficult cases are the ones where the customer's done something like that, and the target platform does something in a deterministic but different order to the source platform's deterministic order.
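Here is a hedged Java sketch of why evaluation order matters for floats: the same multiset of values summed in two different, each perfectly deterministic, orders gives two different totals (an illustration only; real workloads and magnitudes will vary):

```java
import java.util.*;

public class FloatSumOrder {
    public static void main(String[] args) {
        Random rnd = new Random(1);
        double[] values = new double[1_000_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = rnd.nextDouble() * Math.pow(10, rnd.nextInt(12)); // mixed magnitudes
        }

        double forward = 0;
        for (double v : values) forward += v;       // one deterministic order

        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double ascending = 0;
        for (double v : sorted) ascending += v;     // another deterministic order

        // Same values, different accumulation order, (almost certainly) different rounding.
        System.out.println(forward);
        System.out.println(ascending);
        System.out.println(forward == ascending);   // almost certainly false
    }
}
```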
Now you get customers who write code where the result of the code wasn't well defined, but the source platform had to execute it sufficiently deterministically that they think that's the right answer. And now you have to sit down with a customer. You say, dear customer, we love you. However, you did not, in the source language, say what you think you said. Can we now please work with you? And there's a marvelous piece of education there. With a good customer, you could really help them to improve their infrastructure as a whole. And that's also where we describe the static analysis side of the lineage product as, tell me the things I need to know.
Am I in my infrastructure doing something that is odd? 1 of the funniest cases I ever saw was somebody had taken code from Oracle that said: a, space, exclamation mark, space, equals, space, b. Now in Oracle, this means not-equals, because you've got a not and you've got an equals. That's a not-equals, because now we're going back to what does the lexer do. In C, exclamation mark equals, that's a token. In Oracle, exclamation mark and then equals are separate tokens, and it's the parser that puts them together into a not-equals. Postgres was written by C developers, therefore exclamation mark equals has to be a token. So what does a, space, exclamation mark, space, equals, space, b mean? It means a-factorial equals b. It executes.
It does not return an error. It doesn't give you remotely the same answer. So it is a legitimate static analysis to say, did we use the factorial operator? Because we almost definitely didn't mean to.
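A toy Java sketch of the two tokenization strategies just described for the expression "a ! = b" (hypothetical illustrative code, not how either database or CompilerWorks is implemented):

```java
import java.util.*;
import java.util.regex.*;

public class BangEquals {

    // Oracle-style: '!' and '=' are separate lexer tokens; the parser later folds
    // an adjacent '!' '=' pair into a single not-equals operator, even across whitespace.
    static List<String> oracleTokens(String sql) {
        List<String> raw = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z_]\\w*|[!=<>]").matcher(sql);
        while (m.find()) raw.add(m.group());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < raw.size(); i++) {
            if (raw.get(i).equals("!") && i + 1 < raw.size() && raw.get(i + 1).equals("=")) {
                out.add("<>");   // parser-level not-equals
                i++;
            } else {
                out.add(raw.get(i));
            }
        }
        return out;
    }

    // Postgres-style: "!=" is a single lexer token, so it only exists when the two
    // characters are adjacent; a lone '!' is the postfix factorial operator.
    static List<String> postgresTokens(String sql) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z_]\\w*|!=|!|=").matcher(sql);
        while (m.find()) out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        String sql = "a ! = b";
        System.out.println(oracleTokens(sql));   // [a, <>, b]   -> a not-equals b
        System.out.println(postgresTokens(sql)); // [a, !, =, b] -> (a!) = b, i.e. factorial
    }
}
```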
[00:46:34] Unknown:
Yes. That is a hilarious bug.
[00:46:40] Unknown:
What is equally puzzling is the number of these things that we discover and find in source code, and we say, how long has this been in here? And the answer is, this has been in here for years. It is generating a production dataset. It's breaking the production dataset, and nobody noticed. And so you start to ask questions like, under what circumstances do you as a customer notice an error in the production dataset? The most common answer we get is because data is missing. But if data is present, the customer tends to assume it's correct. I used to teach undergraduate Java, and you'd get into a lab, and you'd say to a student, you're going to simulate a cannonball. You're gonna fire it into the air at 30 meters a second. Gravity, we'll assume, is 9.81, and you're going to model the position of this cannonball at 1 second intervals and tell me when it hits the ground. Great. Okay. Well, I can do basic calculus, and so I can say, okay, it's gonna hit the ground in about 6 seconds. Fine. So they'd write their code, and they'd run their code, and they'd very proudly present me their answer. The cannonball hits the ground in 25 seconds, and I would say to them, are you sure?
The tone of voice is critical here. And it took them a couple of months to work out that I would ask, are you sure, in exactly that same tone of voice, regardless of whether or not they had the right answer. Because their duty of care was the same. It didn't matter whether I knew they had the right answer. I was not going to be the oracle. They were going to make sure.
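For readers who want to try the lab themselves, here is one hedged sketch of the exercise as described, using the same 1-second steps (the closed-form answer is 2*v/g, about 6.1 seconds, so a result of 25 seconds should prompt the "are you sure?"):

```java
public class Cannonball {
    public static void main(String[] args) {
        final double g = 9.81;   // m/s^2
        double velocity = 30.0;  // m/s, fired straight up
        double height = 0.0;     // m
        int t = 0;               // seconds elapsed

        // Step the simulation at 1-second intervals until the ball comes back down.
        do {
            velocity -= g;        // lose one second's worth of speed to gravity
            height += velocity;   // then move for one second at the new velocity
            t++;
        } while (height > 0);

        // Prints 6 with this step size, close to the closed-form ~6.1 seconds.
        System.out.println("hit the ground after ~" + t + " seconds");
    }
}
```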
[00:48:03] Unknown:
It's definitely remarkable, the amount of that sort of cavalier attitude that exists in the space of working with data and dealing with analyses, just assuming that because the computer says it, it's correct,
[00:48:19] Unknown:
and not being critical of the processes that gave you that answer in the first place. And you spoke briefly about testing. So the naive answer to testing is, if the target platform gives the same answer as the source platform, great, you're golden. And that is in fact the easy case. There's a lot of cases where the target platform gives a different answer to the source platform, and there's an awful lot of reasons why that might arise, many of which have nothing to do with whether the translation was in fact accurate and preserved the semantics expressed by the source code.
[00:48:46] Unknown:
That almost makes me think that people should just use Compiler Works to migrate their code to a different system to see if it gives them a different answer and points them in the direction of finding that they had some horrible mistake for the past 10 years. Well, that's exactly why we have our lineage. You run lineage over your code, and it will tell you whether you had a horrible mistake, and you don't need a target platform for that. And I imagine too that, by virtue of being able to take a source language and then, you know, generate a different destination language, that will also help people with doing sort of trial evaluations of multiple different systems in the case where they're trying to make a decision and see sort of how it actually plays out in, you know, letting my engineers play with it, letting my, you know, financial people play with it and see what the answers look like. And I'm wondering what the frequency of that type of engagement is in your experience.
[00:49:34] Unknown:
Almost universal because 1 of the things you had to bear in mind when you're doing a semantic mapping is the required semantics might not exist on the target platform. So now you've got a group of developers on the source platform where you'll find some master developer, and he will find you some hairy piece of code, and he will say, this is the hairiest thing on the source platform, can you convert it to the target platform? And in the old world, somebody would sit down and they'd convert that piece of code, and they'd say, yes. But what he's given you isn't the hairiest piece of code for the target platform. He's given you the hairiest piece of code for the source platform. So an engagement for us looks like, here's all the code for the source platform.
Can you qualify the entire code base against the target platform? And the answer is, yes, if you hold on a minute or 2, we can actually give you that answer. And then we can say, in this file over here is this operation which is really simple on the source platform, because the source platform happens to have that operator, but the target platform doesn't and has no way to emulate it. Definitely an interesting aspect and side effect of the varying semantics of programming languages and processing systems. Yes. And 1 of the fundamental assumptions of the compilers world is that the target platform can do the thing. This is what makes it a very interesting compilers problem, because that's not true. The target platform cannot necessarily do the thing. And in the world where your language is just a grammar, a syntax stuck on top of the target executor, of course you can do the thing, because you just glued a keyword to every instruction in the target.
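(A hedged sketch of what that looks like in SQL, with hypothetical table and column names: some operators that are missing on a target can be emulated mechanically, the analogue of the cast-up, operate, cast-back-down trick that comes up just below; others have no rewrite at all, and those are the ones worth finding before the migration starts.)

```sql
-- Source platform (Redshift / Teradata / Oracle style): MEDIAN is built in.
SELECT region, MEDIAN(amount) AS median_amount
FROM sales
GROUP BY region;

-- PostgreSQL has no MEDIAN aggregate, but it can be emulated with an
-- ordered-set aggregate, so a compiler can rewrite it mechanically.
SELECT region,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY amount) AS median_amount
FROM sales
GROUP BY region;
```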
So, yes, this is in a way a case where there isn't a workaround. It's not just a case where the instruction set isn't dense. I mean, even compiling C to Intel, the Intel instruction set isn't dense. You can't do every basic arithmetic operation on every combination of widths of words. So sometimes you have to cast up, do your arithmetic operation, and then cast back down again. In databases, there are conversions between platforms that have things that can't be done. And so the ability to run Compiler Works over a code base and say whether this could even be done, based on some simple operation, is golden for a customer. Yes. The side effect of SQL not being Turing complete. And not fundamentally having assignment.
You can sort of use sub selects to do a little bit of functional programming, but the lack of assignment bites. And then you end up in weird corner cases, like: if emulating a particular piece of semantics requires you to reference a value more than once, and you don't have assignment or an assignment-like operator, does the target platform then reevaluate a subtree, where that subtree might, for instance, contain a sub select with an arbitrarily complex join? Expensive was the word I was looking for there. Expensive is such a marvelous word in industry.
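(A minimal sketch of that corner case in a PostgreSQL-style dialect, with a hypothetical orders table: the closest thing SQL has to assignment is naming an intermediate result, and whether that even guarantees single evaluation is itself platform-dependent.)

```sql
-- The expensive aggregate is referenced twice below. Without a way to name it,
-- the subselect would have to be written out twice and the engine trusted to
-- notice the duplication. Whether a named WITH block is computed once or
-- re-expanded at each reference varies by platform and version; PostgreSQL 12+
-- lets you pin it down explicitly with MATERIALIZED.
WITH customer_totals AS MATERIALIZED (
    SELECT customer_id, SUM(amount) AS total
    FROM orders                            -- arbitrarily complex join could go here
    GROUP BY customer_id
)
SELECT customer_id, total
FROM customer_totals
WHERE total > (SELECT AVG(total) FROM customer_totals);   -- second reference
```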
[00:52:19] Unknown:
Or if the sub select happens to be evaluated at 2 different points in time, where the query does not have snapshot isolation and somebody inserted a record in the midst of the query being executed.
[00:52:32] Unknown:
Yes. So you've got repeatable read, and then you've got things like stable functions. Like, if the thing that you had to duplicate, for instance, read the clock. Most database servers are smart about this: when you call the clock function, or rather, they will actually publish multiple clock functions. 1 of those clock functions reads the time at the beginning of the query and compiles that time into the query as a constant, so that when you do something with respect to now, you are always treating the same now, even if your query takes a minute to run. But they will often also have another clock function, which means the actual millisecond instant that the mutator hit that opcode.
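(PostgreSQL, for instance, exposes exactly this pair of behaviors; a minimal illustration:)

```sql
-- now(): frozen at the start of the transaction/query, so every reference in
--        the statement sees the same "now", however long the query runs.
-- clock_timestamp(): the wall-clock instant the function is actually
--        evaluated, so it can differ from row to row within a single query.
SELECT now()             AS query_now,
       clock_timestamp() AS right_now
FROM generate_series(1, 3);
-- query_now repeats identically on all 3 rows;
-- right_now typically creeps forward a little on each row.
```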
And now you get customers who confuse the 2, and sometimes it matters and sometimes it doesn't. And if you're running on a parallel database server or whatever, you start to get different answers. So, yes, it's not just about data. I think what I'm doing here is I'm broadening one's view of what isolation and subtree duplication and so on and so on really do to you. So I'm sure that we could probably continue this conversation
[00:53:36] Unknown:
ad infinitum, but both of us do have things to do, so I'll start to draw us to a close. And so to that end, I'm wondering if you can just share some of the most interesting or innovative or unexpected ways that you've seen the Compiler Works platform used. I think the ones that we love most are the ones in the lineage product
[00:53:52] Unknown:
where, because we show consequences at a distance, somebody's looking at maintaining a column, and we say, it affects such and such a business report, you probably should think before you do that. And the user says, no, it doesn't. It can't possibly. And then they click the button on Compiler Works that says, explain yourself, and we say, this is how it does it. And they have that moment. And I think the best mails that I get, and we get them quite often, are not just the ones where we gave the user a revelation. It's where we gave the user a revelation that fundamentally disagreed with what they believed about their infrastructure and really opened their eyes to it. Those are the ones that I most enjoy. The ones where people run it because it's accurate and they say, okay, if I do this, I'm not gonna go to jail, that's great. But the ones where it contradicts them and substantiates
[00:54:43] Unknown:
itself are the ones that I love. I could definitely see that being a gratifying experience. And in terms of your experience of building the Compiler Works system and working with the code and working with the customers, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? There's an interesting definition of technical debt,
[00:55:02] Unknown:
where we define technical debt as not a thing that's done wrong, but a thing that's done wrong that causes you to have to do other things wrong. It's really only a debt that matters if you have to pay interest on it, in that sense. So a lot of it is knowing when to incur technical debt and how much interest you're paying on it. Compilers, I've compared to playing snooker. I can go up to a snooker table, and I can roll a ball into a pocket. If I'm lucky, I can hit that ball with another ball and get it to go into a pocket. I'm not skilled enough to take a stick and hit the 1st ball with the stick so it hits the 2nd ball so it goes into a pocket. That's a skill with 3 levels of indirection, which I don't have. Compilers, you have to be forever thinking about everything you're doing at 3 levels of indirection, because you're writing the compiler, and a lot of customers don't even send you the code. So the customer says, my data was 42, and it should have been 44.
I'm not even necessarily going to tell you what the code was, and now you have to fix it in the compiler. So now you're playing snooker blindfold. That's tough. And 1 of the things that comes out of this is, because the majority of the code that ever runs through your compiler is never gonna show up in a support issue, and you're never gonna see it, what this means is that you mustn't cheat.
[00:56:18] Unknown:
Absolutely. Do it right. Prove it right. And for people who are interested in performing some of these platform migrations, what are the cases where Compiler Works is the wrong choice?
[00:56:34] Unknown:
There are cases we come across where the target platform has so little resemblance to the source, by which I mean the desired target code has so little resemblance to the source, that it's not really a migration. It's a version 2 of your product. We come across people who try this, and for that, Compiler Works is the wrong choice. I would tend to advise those people, you know, people use a platform migration as an opportunity to do a product version 2, and this isn't always as stunning an idea as it seems. You might want to consider separating the platform migration and the version 2 product, because I think any developer who's been around the block a couple of times knows that the double set of unknowns is going to hit you.
And so we speak to people, and we say, do the migration. Do it apples for apples, and then do the maintenance on the target platform. Because the other thing about not doing a relatively clean migration is that you now don't have a test suite. You can't compare target platform behavior with source platform behavior because you explicitly specify target platform behavior to be different. So we tend to advise people to do 1 thing at a time. But if you wanted to do them both together, we would start to lose relevance.
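(For what an apples-for-apples comparison can look like in practice, a hedged sketch, assuming the same job's output from each platform has been landed somewhere queryable; the schema and table names here are hypothetical.)

```sql
-- Rows the source platform produced that the migrated code did not...
SELECT * FROM source_run.daily_revenue
EXCEPT
SELECT * FROM target_run.daily_revenue;

-- ...and rows the migrated code produced that the source never did.
SELECT * FROM target_run.daily_revenue
EXCEPT
SELECT * FROM source_run.daily_revenue;

-- Both queries coming back empty is the easy "golden" case from earlier. Any
-- rows that do appear are the starting point for deciding whether the
-- difference is a translation bug or a legitimate platform semantic difference.
```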
[00:57:52] Unknown:
As you continue to iterate on the platform and work with customers and build out capabilities for Compiler Works, what are some of the things that you have planned for the near to medium term, or any projects that you're particularly excited to work on? Lots more languages
[00:58:07] Unknown:
and shortening the time to that moment. The latest versions of Compiler Works that we've shipped have a completely redesigned user interface for the lineage, and we've done a lot of work there to put exactly the right things on screen so that you can look at the screen, and the answer to your question is there. That's hard work. That's visual design work. It's got nothing whatsoever to do with compilers, and now you've got a compilers team saying, now we have to do all of this user psychology and so on and so on to do the visual design. But then it goes back into the compilers team. Like, the user tells you what they need in order to make a decision: a user has an error somewhere in their data infrastructure, and they want to know how to fix it. Really, what they want is the list of tasks they have to perform, in order, in order to fix that. And we can produce that, but we can also make it visible why, and justify it. But the moment you've gone up to the front end and decided that that's what your user story is and you need to do that, now you have to go back into the back end and make sure that the back end is generating all of the necessary metadata to feed into the static analysis, so that that visualization can be generated.
And so it's a very tight loop between user story visualization front end and pretty hardcore compiler engineering.
[00:59:21] Unknown:
Absolutely. And at the risk that this is probably a subject for another podcast episode entirely, what are some of the applications of compilers that you see the potential for in the data ecosystem specifically that you might decide you want to tackle someday? There are applications
[00:59:38] Unknown:
of compilers that I particularly enjoy. 1 of my favorites, which is on GitHub, was the QEMU Java API. What it does is it takes the QEMU source code, which itself has some sort of JSON-ish preprocessor, and there's another compiler that compiles that JSON-ish code into a Java API which allows you to remote control a QEMU virtual machine. Now, 1 could have sat down and tracked QEMU and written this thing longhand and said, I'm going to write a remote control interface to QEMU, such that I can add disks and remove disks and so on and so on on the fly. But it made far more sense to do it as a compilers problem because it now tracks upstream QEMU. And if they add a new capability, well, guess what? You rebuild your Java API by running this magic compiler over QEMU, and you've got a new QEMU remote control interface.
And the reason that I particularly love that piece of code as an application of compilers was that now I can write a JUnit test case that runs in Gradle, in JUnit, in Jenkins, in all of my absolutely standard test infrastructure, installs a storage engine on the virtual machines, writes a load of data to the storage engine, causes 3 hard drives to fail, and proves that the storage engine continues operating in the presence of 2 failed hard drives, all in JUnit. Now, normally, when people start talking about doing that sort of infrastructure testing, they have to invent a whole world and a whole framework for doing this, and yet one 200-ish-line compiler run over the QEMU source code gave the capability to suddenly write a simple, elegant, readable test in the standard testing framework that allows you to do hardware-based testing of situations that don't even arise in the normal testing world. That's where I start to love compilers as a solution to things, and that's why I think I will always have a thing for compilers, whether it's data processing or not. Yes. Definitely amazing the number of ways that compilers
[01:01:47] Unknown:
are and can be used, and the amount of time that people spend overlooking compilers as a solution to their problem, to their detriment, and at the extreme cost of time and effort put into overengineering a solution that could have been solved with a compiler.
[01:02:03] Unknown:
People think about it as, like, you could do this by hand. I could sit down and write bindings for libopengl. But if I actually want to... like, how many method calls or how many function calls are there in OpenGL? If I actually want to call OpenGL from Java, I probably need to generate a Java binding against the C header file for OpenGL, several thousand function calls, and that's a job for a compiler. It happens to be a job for a C preprocessor as well. I think I know which 1 they used.
[01:02:34] Unknown:
Alright. Well, are there any other aspects of the work that you're doing at Compiler Works or the overall space of data infrastructure and data platform migrations that we didn't discuss yet that you'd like to cover before we close out the show? I think we should have a long talk about compilers in the abstract sometime because we'll get into a very rich, probably a very opinionated,
[01:02:53] Unknown:
and probably a very detailed territory, 1 day. I will say, don't be afraid to learn. 1 of the things that I think makes me a little bit odd in this world is that I actually didn't study all of the standard reference works. We study a lot of history. Most of the people who slapped these things together didn't study the standard reference works. So by all means, take the course. I had some excellent professors whom I loved who put us through the standard compilers course. But I will say that the standard compilers course, and even some of the advanced compilers courses that I've watched because the universities have been publishing them online, they don't really touch on this.
They don't really want to touch on type checking. They just about do basic things like flow control. Get out there and learn, and be self-taught, and dig into it, and don't be afraid to do that. And, 1 day... I have a shelf of books: I went to 1 of the publishers, and I said, give me every book you've got on compilers. And it's 1 of my intentions 1 day to read them, but I haven't yet. So my closing thought would be,
[01:04:02] Unknown:
even if it's not compilers, whatever it is, go for it. Well, for anybody who wants to follow along with you and get in touch, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:04:19] Unknown:
I think as we move away from some of the enterprise languages, and we move into, particularly, some of the data flow systems that we've got these days, we are moving into a world where the languages become harder to analyze and maintain. We're accessing underlying platform semantics through APIs, not through languages, and I think that that is going to have a cost. I predict doom. Definitely something to consider. It might be strawberry-flavored doom. I don't know what sort of doom.
[01:04:56] Unknown:
Alright. Well, it has truly been a joy speaking with you today. So thank you for taking the time, and thank you for all of the time and effort you're putting into the work that you're doing at Compiler Works. It's definitely a very interesting business and an interesting approach to a problem that many people are interested in solving. So thank you for all the time and effort on that, and I hope you enjoy the rest of your day. Thank you for such a marvelous set of questions. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Shevek and Compiler Works
Accidental Introduction to Compilers
Challenges in Data Management and SQL Translation
Understanding Compilers and Their Relevance
Language Implementations and Focus Areas
Abstract Modeling and Mathematical Representations
Technical Architecture and Workflow
Integrating Compiler Works into Customer Infrastructure
Evaluating and Testing Platform Migrations
Customer Revelations and Lessons Learned
Future Plans and Exciting Projects