Summary
A major concern that comes up when selecting a vendor or technology for storing and managing your data is vendor lock-in. What happens if the vendor fails? What if the technology can’t do what I need it to? Compilerworks set out to reduce the pain and complexity of migrating between platforms, and in the process added an advanced lineage tracking capability. In this episode Shevek, CTO of Compilerworks, takes us on an interesting journey through the many technical and social complexities that are involved in evolving your data platform, and describes the system that they have built to make it a manageable task.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Shevek about Compilerworks and his work on writing compilers to automate data lineage tracking from your SQL code
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Compilerworks is and the story behind it?
- What is a compiler?
- How are you applying compilers to the challenges of data processing systems?
- What are some use cases that Compilerworks is uniquely well suited to?
- There are a number of other methods and systems available for tracking and/or computing data lineage. What are the benefits of the approach that you are taking with Compilerworks?
- Can you describe the design and implementation of the Compilerworks platform?
- How has the system changed or evolved since you first began working on it?
- What programming languages and SQL dialects do you currently support?
- Which have been the most challenging to work with?
- How do you handle verification/validation of the algebraic representation of SQL code given the variability of implementations and the flexibility of the specification?
- Can you talk through the process of getting Compilerworks integrated into a customer’s infrastructure?
- What is a typical workflow for someone using Compilerworks to manage their data lineage?
- How does Compilerworks simplify the process of migrating between data warehouses/processing platforms?
- What are the most interesting, innovative, or unexpected ways that you have seen Compilerworks used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Compilerworks?
- When is Compilerworks the wrong choice?
- What do you have planned for the future of Compilerworks?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Compilerworks
- Compiler
- ANSI SQL
- Spark SQL
- Google Flume Paper
- SAS
- Informatica
- Trie Data Structure
- Satisfiability Solver
- Lisp
- Scheme
- Snooker
- Qemu Java API
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
[00:00:18] Unknown:
When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data in motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand lets you identify data quality issues and their root causes from a single dashboard. With Databand, you'll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand today to sign up for a free 30-day trial and to take control of your data quality. Your host is Tobias Macey. And today, I'm interviewing Shevek about Compiler Works and his work on writing compilers to automate data lineage tracking from your SQL code. So, Shevek, can you start by introducing yourself?
[00:01:50] Unknown:
Hi. I'm Shevek. I'm technical founder of Compiler Works. I guess I started writing compilers, I don't know, by accident 20 years ago probably. You've given me the introduction challenge, which is to figure out why. I think it's because they're just 1 of the hard problems, and an awful lot of things that are put out there as languages aren't really compilers. They're just a syntax stuck on top of an executor. And I think almost all toy languages, whenever anybody says I've invented a language, they haven't actually invented a language. There's no semantics to the language itself. They've just stuck a syntax on top of an executor. And as I've gone on with writing compilers, I found that the real challenges are where there's an actual semantic transformation between the language being expressed and the language and the target system.
And that's when it really starts to get interesting. I'm not really interested in parsers. Parsers, when you speak to people who learn about compilers in university, they have been taught to write parsers. And I think 1 of the reasons for this is it's really fun and entertaining to write a 12-week course about parsers, and you can teach it very slowly, and you can do LL and LR and Earley's algorithm and railroads and all of these other algorithms that nobody in their right mind would use because they're theoretically interesting. But please teach people about languages and compilers and semantics. Because until you're talking about mapping between semantic domains, you're not really doing the job. Given your accidental introduction to compilers and subsequent fascination with them, how did you end up in the area of data management?
I'm gonna be brutally honest: it's where the money is. What you've got out here in the world of data management is a vast world of enterprise languages, each of which has a single vendor. And by the nature of a single vendor, they get to charge what they please and tell you what you can do with it. And so there's a traditional joke about exactly what filling you would like in your sandwich. You can have this marvelous enterprise capability, but you have to have all of these restrictions with it. And it started with taking those restrictions off the enterprise language and just saying, what else could we do if we had a truly generic open compiler for each of these proprietary languages?
And actually, that philosophy started a little bit earlier because there were a lot of languages out there. Like, there's no point writing a commercial compiler for C these days unless you're doing something particularly obscure with, like, FPGAs. But, fundamentally, C is free. That doesn't necessarily make it easy. And so if you go back into my GitHub, you'll see that 1 of my earliest projects was writing C preprocessor implementations natively in a whole set of languages. Because back in the nineties particularly, anytime you wanted a preprocessor (this was before we had the whole modern world of assorted preprocessors and templating languages), people just used either CPP or M4.
But if you were working in, say, Java and somebody defined something using the C preprocessor, you didn't have a ready implementation. And so that philosophy of writing things that were compatible with other things and just opening up the world has been an underlying philosophy of what I've been building for a lot longer than data processing languages. It just turns out that data processing languages are a commercially viable place to do it.
[00:05:14] Unknown:
As good a reason as any to be in the business. And so that brings us to what you're doing at Compiler Works. And I imagine you've sort of given us the prelude to the story behind it, but I'm wondering if you can just share a bit more about sort of the business that you've built there and some of the motivation and story behind how you ended up building this company to address this problem of lock in by data processing systems.
[00:05:39] Unknown:
The reason you build a new database is because you have a new capability. And then we get this marvelous phrase, ANSI SQL, which is a myth about as credible as Bigfoot. Lots of people claim to have seen it, but nobody actually has. And so now we come back to this question of translating between semantic domains. I have 2 SQL vendors. I have code on 1. I want to run it on the other. The code is syntactically similar because the standards committee wrote this great big expensive document that dropped some hints about how you might want your language to look if you were to consider making it look like something that might look like this. But the languages don't fundamentally do the same thing. Simple case, like what does division mean? What's 1 divided by 10?
Now, if your database server happens to operate in integers, the answer is 0. If your database server happens to operate in numerics, the answer is 0.1 with 1 decimal place of precision. If your database server happens to operate in floating points, then the answer is something very close to 0.1, but not exactly the same, because that's not an exact binary number. And so now, even with that very trivial case, I've motivated something beyond ANSI in terms of translating between database servers.
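To make that concrete, here is a minimal Java sketch of the three behaviours just described, with Java's `int`, `BigDecimal`, and `double` standing in for a database's integer, numeric, and floating-point types (an illustration only, not CompilerWorks code):

```java
import java.math.BigDecimal;

public class OneDividedByTen {
    public static void main(String[] args) {
        // Integer semantics: the fraction is truncated.
        System.out.println(1 / 10);                                 // 0

        // Numeric/decimal semantics: an exact decimal result at some precision.
        System.out.println(BigDecimal.ONE.divide(BigDecimal.TEN));  // 0.1

        // Floating-point semantics: the closest representable binary value,
        // which is not exactly one tenth.
        System.out.println(new BigDecimal(0.1));                    // 0.1000000000000000055511151231257827...
    }
}
```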
So Compiler Works has 2 products. 1 is we translate code from 1 data processing platform to another, and we do it correctly, and that's correctly with a capital C. And the other thing that we do is we compile the code, and we do static analysis, and we tell you what it is that you need to know about this code. And a significant part of that is things like, if I change this, what will the effect be on my organization as a whole? So you might imagine that you're writing code that produces a particular table on a particular server, and you do or don't make a particular error, or you do or don't make a particular change. Somewhere, 10 levels removed from you, in a different suborganization of an organization that employs tens of thousands of people, somebody is affected by this change.
Who are they? How are they affected? Do they need to be told? Did they make a critical business decision based on that? And what do we need to do in order to keep this organization running? And should you really be considering making this change before you made it? Or if you already made the change, what do you now need to do to fix up? So that's the static analysis side. Definitely a lot of interesting things to dig into there. And lineage in particular is an area that has been gaining a lot of focus and interest lately from a number of different parties and attempts at addressing it in different ways. So I definitely like the sort of compiler driven static analysis motivation for it. And
[00:08:18] Unknown:
I'm wondering before we get too much further down that road, you've already given a bit of an overview about some of the differences between parsing and compiling. But for the purposes of kind of laying the groundwork for the rest of the conversation, can you just give sort of your definition of what is a compiler and how is that relevant to this overall space of language translation and data lineage?
[00:08:42] Unknown:
I usually describe a compiler as something that turns something from 1 language into another. In the case of lineage, what we're doing is we're turning something from the underlying source code into an algebraic model, and then that computer algebra model is the system of which we can ask questions regarding, you know, what happened, what the consequences are, where the lineage is. It's sort of interesting to think about the compiler, particularly the lineage side. It's like, are we a compilers company, or are we a computer algebra company? I suspect, really, we're the computer algebra company because that's where the hard stuff is.
I think lineage analysis getting popular is an interesting question. Because if I, for instance, write some piece of code that says, you will generate a piece of SQL that takes this table, and it processes like that, and it generates that table, well, I've got lineage from my source table to my target table. But now we fall into a hole, which is this. If the language that I invented in order to generate this SQL is just as expressive as SQL, then that language is going to be flamingly complicated, and it's going to have all of these semi-joins and stuff. Typically, the reason people like these languages is to make them not be as expressive as SQL, because they want to make them more accessible to developers, or they wanna have clicky, droppy boxes or something like that.
Now assuming that you followed that road, which is basically universal, now you have the problem that your language is insufficiently expressive to do the thing the analyst wants to do. And so now what happens is you have this text box or this field or something where you type in a fragment of underlying SQL. And now what you've got is not a language, but a macro preprocessor, which doesn't actually know what's going on in that fragment of underlying SQL that the developer typed in. And so all of these tools, they start out saying, yes, you're gonna build your thing in our nice GUI workflow. We're gonna show you this nice GUI workflow, and that will give you the lineage. But you don't really have the lineage because you don't have enough expressiveness to really do the job. Therefore, the developers had to type some custom code into a box, and you don't understand that custom code. Therefore, you don't have lineage.
And if what's happening is that you're subject to something like CCPA or GDPR, where you are going to jail if you get this wrong, then you don't have a lineage tool. You actually need to look at the code that the machine really executed and analyze that code accurately at a column level, and then you have a lineage model, and then you've got a chance of not going to jail. But anything less, we do not define as lineage.
[00:11:21] Unknown:
Yeah. And then another motivation for trying to reconstruct lineage is that the best practice, quote, unquote, for data processing these days is to spread your logic across these 5 different systems, where you need to run this tool to get your data out of this source system into this other location, this other tool to process your data in that location, this other tool to build an analysis on top of that preprocessed data, and then this other tool to actually show it to somebody to make a decision based on it, and then they input other data back into this other system, and we pull it back out again. And so trying to reconstruct that flow of operations using an additional set of language processing to, you know, post into an API that stores the data in a database or
[00:12:06] Unknown:
trying to analyze the query logs from your data warehouse that has a very limited scope view of the entirety of the data life cycle and trying to sort of piece all this back together? Well, the query and audit logs typically give you a good start because they're at least partly written by security people, and the security people say, you must tell us everything that's gonna go on. The fragmentation of the data infrastructure, which is the other thing that you alluded to, is very real. And I think this leads to a situation where in a typical installation, we're processing multiple languages and stringing them together. And sometimes that stringing together is standardized and sometimes it's bespoke. But the challenge in putting together a lineage is to be able to identify a column in the ODS up front in, say, the web tier where a user has interacted with something and list all of the back end BI dashboards that that column affected and then describe the effect of that user interaction on each of those dashboards in a human readable way. That's the challenge.
[00:13:06] Unknown:
Absolutely. And so to that point, you've mentioned a little bit about some of these enterprise processing languages and the tool specific semantics about how they manage that processing, and you've discussed some of the wonderful joys of the SQL ecosystem and trying to translate across those. And I'm wondering if you can just give an overview of the specific language implementations and areas of focus that you're building on top of for compiler works, whether you're focused primarily on the SQL layer and being able to generate, you know, these transformations and this lineage for the databases or if you're also venturing out into things like Spark jobs or arbitrary Python or Java code and things like that? Yes. So you run into
[00:13:53] Unknown:
a number of issues as you walk around the data infrastructure. So the SQL languages are, for the most part, statically analyzable. There are a couple of holes in them, and there are 1 or 2 that have type systems that lead 1 to lose hair. From there, 1 can fairly easily go out to the BI dashboards, particularly the richer products. So we're talking about some of the flow based languages. And at this point, as a compilers company, 1 ends up writing a new piece of technology because basically, all of the SQL languages are tied into the relational algebra or some variant or some set of extensions thereof. It's always amused me that so many of the papers on relational optimization start with a phrase something like, without loss of generality, we shall assume that the only Boolean connective is and, which basically means that you're not allowed to use or and you're not allowed to use not. Well, guess what? If you do make that assumption, the world becomes really easy and really simple, and they're denying the existence of outer joins. It's really easy to write an academic research paper on optimization if you only deal with and, but I disagree with the without loss of generality.
You've lost the whole of the real world there. Anyway, I digress. Sorry. So, yes, there's a bunch of data flow languages and BI dashboarding systems that actually work effectively with data flow and data process management. So here we're talking about the Informaticas, the Tableaus, the things in that range, the DataStages. So we grew out to work with those, and then we have a secondary core that speaks to the same computer algebra engine that deals with these data flow style languages. Spark is a sort of a mixture because, you know, on the front end, Spark SQL looks like SQL. The question then is (and I'm gonna speak generically, not specifically about Spark SQL), what's the strength of the join optimizer before you compile down to a data flow language, and are you really a data flow language?
The seminal paper, I think, for people wanting to understand why data flow languages have benefits is probably the Google Flume paper, particularly the statistics about the reduction in MapReduce jobs by doing delayed evaluation. But once you get out of that and you get into the ETL languages, you also run into things like SAS. And so now you end up with questions like, how do you port, let's say, Informatica to Spark? And I picked those 2 because they're both data flow languages, but Informatica has this fundamental property that computation is sequential, which is to say that if you set the value of a read/write port, that value remains assigned and remains visible to the next data record.
And so you can actually generate a datum by saying, if the record number is 1, set the value to x. If the record number is not 1, just read x. And in an MPP system, you would get x in 1 record and null in every other record. But in both SAS and Informatica, you get the same value of x everywhere. And this is a sort of hard semantic difference that makes it very, very difficult to map between languages. You know, this is where we break out of the traditional job of compiling. We actually have to up engineer into user intent. The user said this.
If you're compiling C or Java and you're compiling it down to x86, the user said this, therefore do this. And if you do anything else, it's your fault. But if you're compiling some of these languages, it's like, the user said this. We've had a look around. We think they really meant this part of what they said, and every other part of what they said was irrelevant or a consequence of the implementation. And, therefore, we're gonna generate high-performance code for the target that preserves the thing they meant and discards the rest, and that's hard.
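As a hypothetical sketch of the read/write port semantics described above (not CompilerWorks code, just Java lists standing in for a stream of records), this contrasts a sequential engine, where a port value set on record 1 is still visible to later records, with an MPP-style engine, where each record is evaluated independently:

```java
import java.util.*;
import java.util.stream.*;

public class PortSemantics {
    public static void main(String[] args) {
        List<Integer> records = List.of(1, 2, 3, 4, 5);

        // Sequential engine: a read/write port keeps its value from one record
        // to the next, so records 2..5 still see the value set on record 1.
        String port = null;
        List<String> sequential = new ArrayList<>();
        for (int rec : records) {
            if (rec == 1) port = "x";   // set on the first record only
            sequential.add(port);       // later records read the retained value
        }

        // MPP-style engine: every record is evaluated independently, so only
        // record 1 ever observes the assignment.
        List<String> parallel = records.parallelStream()
                .map(rec -> rec == 1 ? "x" : null)
                .collect(Collectors.toList());

        System.out.println("sequential: " + sequential); // [x, x, x, x, x]
        System.out.println("parallel:   " + parallel);   // [x, null, null, null, null]
    }
}
```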
[00:17:38] Unknown:
Yes. Exactly. The technology is easy. It's the people that are hard as with everything that has to do with computers.
[00:17:45] Unknown:
I don't envy you editing that long term because I went very long winded.
[00:17:49] Unknown:
No. There's nothing to edit there. Continuing on the point of the semantics being the hard part of translating these data processing languages, and you mentioned earlier the fact that at the core, you think you're more of a computer algebra company than a compiler company. I'm wondering if you can discuss a bit of the sort of abstract modeling and mathematical representations that you use as the intermediate layer for translating between and among these different languages and generating the lineage analysis that is 1 of the value adds of what you're doing there? I won't,
[00:18:28] Unknown:
but I will say some interesting corollaries. But I hope you will forgive me for not answering the question as you directly asked it, which is a very interesting question. Absolutely. There's an old party trick where you take a floating point value, and you go around the loop a million times, and you add 1 to this floating point value. And the question now is, what's the value of that floating point value? And the answer is, it's not a million. It's about 65,000 because, eventually, you reach the point where the exponent ticks over. And at the point where you're not seeing the last integer digit anymore because your exponent's ticked over, adding 1 to a floating point value has no effect.
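The exact point at which the accumulator stalls depends on the float width in play; for a 32-bit IEEE float it is 2^24. A small Java sketch of the phenomenon (illustration only):

```java
public class FloatAccumulation {
    public static void main(String[] args) {
        // Keep adding 1.0f to a 32-bit float. Once the accumulator reaches 2^24
        // (16,777,216), the gap between adjacent floats exceeds 1, so x + 1.0f == x
        // and the loop makes no further progress.
        float x = 0.0f;
        for (int i = 0; i < 20_000_000; i++) {
            x += 1.0f;
        }
        System.out.println(x); // 1.6777216E7, not 2.0E7
    }
}
```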
Processors are weird. There's another party trick where you write an array of memory, you fill it with random numbers, and then you add all the numbers into an accumulator. But then you try doing that iterating backwards, and then you try doing that iterating in a random order, and you see what the performance difference is. And it turns out that processor hardware and memory prefetch and so on is an absolutely delicious thing as long as you're reading memory forwards. It sort of manages if you're reading memory backwards, and it falls flat on its face, throws its hands up in the air, and screams if you read memory in a random order, to the tune of about a 200-to-1 performance penalty.
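A rough way to see this for yourself is the toy micro-benchmark below (a sketch only; JIT warm-up and allocator behaviour add noise, and the exact ratio will vary by machine):

```java
import java.util.Random;

public class AccessOrder {
    public static void main(String[] args) {
        int n = 1 << 25;                 // ~32M ints (~128 MB per array); shrink if memory is tight
        int[] data = new int[n];
        int[] order = new int[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {
            data[i] = rnd.nextInt();
            order[i] = i;
        }
        // Fisher-Yates shuffle: we will still visit every element exactly once,
        // just in a random order.
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int t = order[i]; order[i] = order[j]; order[j] = t;
        }

        long sum = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += data[i];          // forwards: prefetcher-friendly
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += data[order[i]];   // random order: defeats the prefetcher
        long t2 = System.nanoTime();

        System.out.printf("sequential %d ms, random %d ms (sum=%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sum);
    }
}
```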
So now let's think about a tree data structure. On paper, a tree data structure has logarithmic complexity. Brilliant. And, academically, we ignore the constants. But what a tree data structure looks like to a processor with memory prefetch is random order access. And so what that means is that the constant is an order of magnitude larger than anybody thinks it is, which is why, on paper, a heap and a tree have the same performance, but in practice a heap is so much faster, because you start to fit into cache lines. Now computer algebra systems look awfully like random order access to memory, and I think 1 of the most interesting problems in any sort of computer algebra (you'll even find it in SAT solvers, where people are optimizing C, and they're changing the order of the structs in the internals of the SAT solver such that the hotter fields are up at the top end of the struct, because that's the bit that will fit into cache) is memory ordering. We actually get multiple orders of magnitude by having a solver within the computer algebra engine, which itself works out what order to do things in, so that we don't appear to be accessing the algebra structure in random order. The more you can do on a piece of memory while it's in cache before you drop it out of cache, the better. And then, of course, the other entertaining question is, wait, you do all of this in Java? Isn't Java some language where you're a million miles away from the processor?
Actually, I happen to think Java and the JVM are a beautiful, beautiful setup, because you get to be a million miles away from the processor when you want to be. But when you actually want to get down low, you've got control of everything down to memory barriers. And at that point, you're pretty much able to write assembler. And so it's the joy of the language. They say 90-something percent of your code doesn't need optimizing, and they're right. So the question is, can you ignore 90% of the job and do the 1% of the job? I think
[00:21:47] Unknown:
the JVM is 1 of the greatest feats of modern engineering for allowing that. Just for point of reference for people who are listening and following along, I'll clarify that when you say tree data structure, you're speaking of a trie, spelled T-R-I-E, not tree.
[00:22:01] Unknown:
Well, either will do. A B-tree, a trie (a tree with an I), you know, an RB-tree. Anything where you're effectively allocating nodes into main memory and then making those nodes point to each other, and particularly where your allocator is, you know, some sort of slab allocator that's mixing your tree nodes up with other things. You know, if you just allocate a tree, then maybe, yes, your root node is allocated at the start of RAM and everything else is allocated sequentially. But the moment you start rotating and mutating a tree, then a tree walk looks like random order memory access again.
[00:22:34] Unknown:
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Digging more into the technical architecture of what you're building at Compiler Works, can you give a bit of an overview about the workflow and the life cycle of a piece of code? I guess the data is irrelevant here since you're working at the level of the code, but the life cycle of a piece of code as it enters the Compiler Works system, your processing thereof, and then the representation that you generate on the other side for the end users of the system?
[00:23:41] Unknown:
Yes. And I'm going to answer about 3 quarters of that, of course. So let's deal with the thing that we deal with at the start. Everybody knows about lexing and parsing. Lexing and parsing are not necessarily as immediate as everybody thinks they are. So, for instance, we're taught that foo is an identifier and 5 is an integer, and that foo5 is an identifier because it's something that starts with a letter. And then you ask the question, well, what is 5foo? And the answer is it's an illegal identifier because it's an identifier that starts with a digit. But if you go into any SQL dialect and you type select 5foo, what you will get is the value 5 aliased to the name foo, because we implicitly, as humans, assume that there needed to be a space between the 5 and the foo. But if you follow the textbook instruction of how to write a lexer and parser, you actually get the bug that I just described. If you do it the way they taught you in school, you get that bug.
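A toy Java sketch of that difference, contrasting a "textbook" lexer that rejects an identifier-looking token starting with a digit with a SQL-style lexer that splits it into a number followed by an identifier (hypothetical code for illustration, not how any particular engine, or CompilerWorks, is implemented):

```java
import java.util.*;
import java.util.regex.*;

public class SelectFiveFoo {

    // SQL-style lexing: maximal munch per token class, so "5foo" splits into the
    // number 5 followed by the identifier foo, i.e. SELECT 5 AS foo.
    static List<String> sqlLex(String sql) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|[0-9]+|\\S").matcher(sql);
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    // "Textbook" rule: an identifier may not start with a digit, so a run of digits
    // immediately followed by letters ("5foo") is rejected as a malformed token.
    static List<String> textbookLex(String sql) {
        if (sql.matches(".*\\b\\d+[A-Za-z_].*")) {
            throw new IllegalArgumentException("identifier may not start with a digit: " + sql);
        }
        return sqlLex(sql); // otherwise tokenize normally
    }

    public static void main(String[] args) {
        String sql = "select 5foo";
        System.out.println(sqlLex(sql));          // [select, 5, foo] -> 5 aliased to foo
        try {
            System.out.println(textbookLex(sql));
        } catch (IllegalArgumentException e) {
            System.out.println("textbook lexer: " + e.getMessage());
        }
    }
}
```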
So now it gets a little bit interesting because the first part of writing a compiler for an enterprise language is working out what its structure is. What are we even being given there? So let's take a language that exports itself as XML. You know, there's a number of them out there. So now you've got a load of XML, and this XML has words in it, such as such-and-such ID equals foo. Well, what does that foo refer to? It refers to some other foo. XML can really only represent a tree structure, and all of these languages are data flow structures. Therefore, they must be representing graphs. Therefore, there must be linkages within this XML. So the first stage is looking at a load of samples, working out what the semantics of the language are.
Now here we have an advantage because most of these languages were written as relatively thin skins on top of their executors. And so if you know the capabilities of the executor, and I kid you not, but actually reading historical papers about how memory allocators worked and things like that will give you a lot of insight into things like the extent of variables. You learn when values get reset or when they get deallocated just by knowing what technology was available to the authors of the language at the time they wrote the language. So having worked out what all of the linkages are, you now have a symbol table, and having done the parse, you do the compile, which is symbol table type check, operation selection, very classic compiler work. And now what you have is you have a compiled binary in the semantics of the source language, which are not necessarily atomic semantics.
And so now what you need to do is to break those semantics down, where some of those semantics may be quite large, into effectively atoms, meaningful atoms. So now we will end up with something like 32-bit integer addition with exception on overflow, and you might even get an annotation about what the exception is on overflow. And now you've got the set of challenges where, if you're doing lineage analysis, you have a whole set of computer algebra rules that will tell you what you need to know about this thing. Am I doing this? Am I doing the same thing twice, when you're looking for, you know, matching regions of algebra? Without going too much into details, you could do something like our Wigner distance on a computer algebra data structure or something like that. It's fundamentally hard. But things like that are available for saying, why are my marketing department and sales department, computing the sales figures, coming up with different numbers? Because now they're disagreeing over who gets how much money. And that's the sort of problem that needs to be solved here.
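To pin down the "atoms" point from above: even something as small as integer addition splits into distinct atoms depending on overflow behaviour. A quick Java illustration, with Java's own operators standing in for the atoms (illustration only):

```java
public class AdditionAtoms {
    public static void main(String[] args) {
        int a = Integer.MAX_VALUE;

        // Plain 32-bit addition silently wraps around.
        System.out.println(a + 1);                        // -2147483648

        // "32-bit integer addition with exception on overflow" is a different atom
        // with different semantics, even though both are written "+" in source code.
        try {
            System.out.println(Math.addExact(a, 1));
        } catch (ArithmeticException e) {
            System.out.println("overflow: " + e.getMessage()); // overflow: integer overflow
        }
    }
}
```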
Emitting code is actually a whole new set of challenges. So this is for the case of migration from platform to platform, because if 1 just takes the semantics and emits them to the target platform, you get code that has a number of issues. First, you get non-optimality. You get the fact that it's not using the correct idioms of the target platform. You also get the fact that it's ugly. We've all seen computer-generated code, and nobody wants to maintain it. And a significant part of generating code for a target platform is working out what code to generate which is idiomatic for the target platform, idiomatic for the particular development team that gave you the input code, and human readable and human maintainable.
So to give you a trivial case, there was an old joke about, I stole the artificial intelligence source code from the government laboratory for artificial intelligence, and I'm gonna prove it by dumping the last 5 lines. This joke came out when Lisp and Scheme were the popular languages, and the punchline of the joke was 5 lines of closing brackets. If you just do machine-generated code, which everybody has done at 1 point, you either fail at 1 + 2 * 3, or you generate 5 lines of closing brackets. And that's the sort of problem that is non-obvious to the emitter. And I've also alluded, when I said idiomatic to the developers who wrote the original source, to the fact that there are things that you have to preserve about the original source which are not necessarily semantics of the language, but which are in fact idioms of the development team in question.
[00:28:50] Unknown:
Yeah. It's definitely an interesting aspect to the problem because, as you point out, there are certain implicit meanings, or sort of meanings that are a side effect of the structure of the code, that have nothing to do with the semantics of the code or its sort of computational intent, but that do help in terms of the cognitive and organizational complexity management for the team that is writing and maintaining the code, and that they might want to maintain in the output because of things like splitting logic on team boundaries, for instance.
[00:29:22] Unknown:
Yes. And generating code that a machine will accept is vastly easier than generating code that a customer will accept. I mean, the general market approach to doing machine translation is you write a parser, you jump up and down on the parse tree, and then you emit the parse tree, and you say, this is the target language, and you wave the ANSI SQL flag as hard and as loudly as you can. And you replace some function names while you're at it. But the moment you type check, you've now done things like inserting casts. And if you generate code that contains all of those casts, there are 2 things here, both of which you can't do. 1 is to generate code that contains all of those casts because your human maintainer will say, nope.
And the other is to assume that the target language does the same implicit type conversions, or even fundamentally has the same types as the source language. And the answer to that is, nope. You cannot divide 1 by 10 in any financial institution unless you know exactly what you are doing. Absolutely.
[00:30:21] Unknown:
And so to that point, it's interesting to dig into some of the verification and validation process of the intermediate representation of the language and the onboarding approach to bringing new target platforms or new source platforms under the umbrella of Compiler Works and just the overall effort that's involved in actually doing the research to understand the capabilities and semantics of those systems.
[00:30:48] Unknown:
Yes. And then you start to speak to, well, what kind of company are we? Are we a computer algebra company, or are we a set of research historians? What do I know about unheard of platform x? Well, who wrote it? When did they write it? Where did they write it? And an awful lot of that gets folded into the initial development of a language. We are utterly test driven. You basically have to be. And so starting out with a new language, it really is just about passing test cases and building customer acceptability. There are other parts of this question which I apologize I'm not going to answer, and I'm trying to fish things out that I can say because 1 of the things that we developed over the years is the ability to implement a compiler for a language in a shockingly short space of time.
Once upon a time, we actually signed a contract to do a language in a space of time which was, you know, bordering on professional responsibility. And, of course, we did it and we hit it. And the thing that we didn't publish was we actually did it in less time than that because 1 of the things that we know how to do is to understand languages and put together an implementation of a language in a very short space of time. But a lot of this comes from having a core where, for instance, types only behave in certain ways. And if you can express all of the ways in which types behave and types interrelate, then you can describe a language in terms of, for instance, its type system.
And that, to us, is a tool that we have available. It's sort of interesting when you're mapping between languages where types have inheritance and polymorphism to a language where types maybe have inheritance and polymorphism but have different relationships between themselves. So at that point, something which was a polymorphism conversion in 1 language is an explicit type conversion in another language. Understanding of types is very, very important.
[00:32:50] Unknown:
Absolutely. Especially when dealing with data.
[00:32:53] Unknown:
Yes. Date times are the killers. Because even if you know that you've got a date time, there's 1 hour in every year that doesn't exist and 1 hour in every year that exists twice. And then people do things like, okay, so adding 1 to a date time is simple. All you have to know is whether that particular language interprets the 1 as days or milliseconds. But then you get into all sorts of craziness, like, if I take a time and I convert its time zone, did I get the same instant, or did I get what we call a time zone attach? So is 1 PM BST, in Pacific, like 9 PM or whatever it is, 7 PM PST, or is it 1 PM PST? And different database servers do different things when given this operator, and that again blows ANSI SQL out of the water.
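Java's java.time API happens to expose exactly those two choices, so here is a small sketch of "same instant" versus "time zone attach" (an illustration of the two behaviours; which one a given database server picks is precisely the point in question):

```java
import java.time.*;

public class TimeZoneAttach {
    public static void main(String[] args) {
        ZonedDateTime london = ZonedDateTime.of(2021, 7, 1, 13, 0, 0, 0,
                ZoneId.of("Europe/London"));          // 1 PM BST

        // Same instant, re-expressed on the US west coast: 5 AM that morning.
        System.out.println(london.withZoneSameInstant(ZoneId.of("America/Los_Angeles")));
        // 2021-07-01T05:00-07:00[America/Los_Angeles]

        // "Time zone attach": keep the 1 PM wall-clock reading and swap the zone label.
        System.out.println(london.withZoneSameLocal(ZoneId.of("America/Los_Angeles")));
        // 2021-07-01T13:00-07:00[America/Los_Angeles]
    }
}
```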
And then what you actually end up doing is just figuring out how the database server does it internally and then modeling that. And then you're back into history. You're back into reading source code. The Postgres source code is 1 of the marvelous resources on the planet because it tells you how a lot of the database servers out there work, and then you wanna know when they forked, what they did, why they did it, who did it, and so on. Yes.
[00:34:04] Unknown:
And I'll agree that Postgres is definitely a marvelous resource, and it's fascinating the number of systems that have either been built directly on top of it or, you know, inspired by it even if not taking the code verbatim.
[00:34:16] Unknown:
Yes. It's also, I mean, you've got this challenge when you say we're Postgres compatible and you sort of adopt the Postgres mutator, and people don't typically want to do anything with the Postgres mutator. And compared to almost every commercial dialect, the Postgres mutator has a fundamental weakness in its handling of time zones, which nobody has ever seen fit to correct. I suspect that core Postgres can't, which is that you don't have a time stamp that keeps its time zone, and the time zone handling in Postgres is basically to be avoided if you want to get the right answer.
And yet, Oracle, Teradata, BigQuery, everybody else does it right. So, yeah, Postgres is a wonderful resource, but I do wonder that people base things on it with that weakness.
[00:35:02] Unknown:
Most people who start basing their systems on top of Postgres haven't done enough of the homework to recognize that as a failing before they're already halfway through implementation.
[00:35:11] Unknown:
I think the majority of people who have a good idea and they want to get a demo of their good idea out as fast as possible really don't think about the consequences of their decisions on the first 2 or 3 days. And I think there is a phenomenal bias among developers starting out to imagine that because something gives you a very fast day 1, that it will give you a very fast day 3. And they think, okay. We'll get 6 to 12 months down the line, and then we'll rewrite it. And I think that for an experienced developer, the crossover point with technologies is around day 3, not month 3, and this is a big, big mistake.
We made some very interesting technological decisions about things that we were and were not going to do with this company right at the start of the company, and they paid off. And some of those decisions were that we were going to do a lot more hard work that was necessarily obvious. And we've watched people come up behind us and say, we're gonna make different technological decisions and suffer the consequences of those things and sort of run into a wall. But the number of times that I've been told that, for instance, we want to develop the back end in Node because that way we get to use the same models on the front end and the back end.
It's like whoopie doo. Got no type checker. Okay, TypeScript. I'm looking at you. Okay. Whoopee doo. Let's, what? And your experienced developer's gonna sit down with a decent web framework with a DI framework and everything else, and we'll have you know, you're gonna be overtaken by the end of day 3 at best, and there are companies out there that know this.
[00:36:55] Unknown:
Yes. As you were talking about technologies that have been thrown together to get a fast solution, the first thing that came to mind was JavaScript. So I appreciate that you called it out explicitly.
[00:37:05] Unknown:
I like JavaScript as a language, but I also have this rule about writing shell scripts, which is the moment you find yourself using anything like arrays, you're in the wrong language. I have this sort of set of criteria that tell you that you're in the wrong language. There were things I very much like about the JavaScript ecosystem, and there were things that I would definitely go to it for. However, it does make me kind of sad to see it slowly reinventing or rediscovering or hitting many of the problems that other languages have hit. Another example was some years ago, there was a great big fuss about the ability of an attacker to generate hash collisions, putting perturbations into hash tables.
And somebody pointed out that if I generated the correct set of SYN packets in TCP and sent spoof SYN packets to a remote Linux kernel, because it used a predictable hash and it was a list hash, we could convince the kernel to put all of those SYN packets into the same chain in the list hash, and it denies service to the kernel because it was spending all of its time walking this linear chain rather than benefiting from the hash table. And I watched the same bug get discovered in Perl, which taught it to use perturbation of hashes. And then I waited something like 3 or 5 years for somebody to point out that, actually, the same bug existed, and I forget whether it was either Python or PHP.
And then you get into this world where developers say, hang on a minute. My hash iteration order changed. You're not allowed to do that. And then you say, yes, you are. It says so on the tin. And so there's this whole pattern of watching developers discover solutions to problems that other languages have already solved. It's like, once you've discovered it in Perl, which I think might have been the first 1, or it might have been the second, but I'd have to check, go around all of the other languages and look and make sure. Don't wait 5 years. And the same thing's true for the JavaScript ecosystem. Like, they've waited 15 years to reinvent certain things.
[00:39:00] Unknown:
Yes. Developers have remarkably short memories and attention spans, at least in certain respects. Correct. And so bringing us back to what you're building at Compiler Works and the sort of usage of it as a static analysis and lineage generation platform, I'm wondering if you can just talk through some of the overall process of integrating Compiler Works into a customer's infrastructure and workflow, and some of the user interactions and processes and systems that people will use Compiler Works for and build on top of the Compiler Works framework?
[00:39:40] Unknown:
So what you'll find is that most of the data processing platforms out there have some sort of log or some sort of standard presentation of their metadata. And we at Compiler Works aim to make everything as easy as possible, by which I mean we take that standard presentation of the metadata. If you're working BigQuery, we take the BigQuery logs. If you're working Redshift, we take the audit logs. If you're working Teradata, we take the various things that Teradata throws at us. And having basically given the Compiler Works dumper permission to access these logs, it makes a dump and it pulls them into the product, and the rest of it is automated.
Because the fundamental thing that we operate on is if it's possible for the underlying platform to understand that code, it's possible for us to understand the code. We have all of the temporal information. We have all of the metadata. We have all of the semantic information. And from then on, it's all gravy. We pull the logs. We put it up into the user interface. We make the data available as APIs. And from then on, you can just explore lineage much as you've seen in our video presentations.
[00:40:42] Unknown:
In terms of the migration process, you've discussed a lot of this already, so we can probably skip through this question a little bit. But what is the overall process of actually doing the migration from platform a to platform b and especially doing the validation that the answer that you get on the other side of the transformation, at least close enough matches the answer that you were getting before you made the migration, and then maybe a little bit of some of the
[00:41:12] Unknown:
reasons that people actually perform those migrations in the first place. So lineage is totally easy. You can usually get up and running with a Compiler Works lineage in a few minutes, as long as it takes you to pull the log. You pull the log, you run it. Migration tends to be, in practice, a little bit hairier because the customer's presentation of their code is not standard. A significant percentage of customers preprocess their code or something. I mean, this is actually where some of the enterprise languages are nicer. With the more capable enterprise languages, while the compilers that we have to write for them are much tougher, the customers tend to present their code in a more standardized form because the language itself is more capable.
When you get a relatively incapable language, the customer tends to mess with it, procedurally generate it, do all sorts of things. It's almost like they're treating the underlying language just as an executor. So the first question you have to ask is, what's your presentation of your code? How did you mess with it? Once you've got a hold of the presentation of the code, what you do with Compiler Works is you specify what the input language stack is. And this is actually quite nice because in Compiler Works, you can take a language that generates another language or contains another language or that preprocesses another language and say, this is a language stack. You're going to absorb this, you're going to transpile, and you're going to emit to a target language stack that has some of the same preprocessing or management capabilities as your source language stack.
And this is yet another hint to say that writing a purely academic Oracle to Postgres compiler isn't enough because the Oracle exists within the context of something else and may be incomplete and so on and so on. And again, if you don't do that, you fail human acceptability. So the start of a migration process is get the code, work out how it's specified, tell Compiler Works how this customer currently specifies their code, tell Compiler Works how the customer wants their code specified, and then run it for the migration. And that process, I have walked into a meeting room and done it cold in an hour, and this could be done. You know, given that the customer typically doesn't know the answer, usually, they don't know the answer to the target platform. They've been sold something by a vendor. They think it's a marvelous idea, and you say, how do you want to use this target platform? And they say, we don't know.
And then we make a recommendation, and we work with their advisors or someone to make that recommendation and get it right. 1 of the things that you get after this is that we have a lot of versatility with respect to doing the migration job, not just converting code. Testing is an interesting 1. Customers vary in what they will accept. As I said with the 1 divided by 10 example, I'm very, very precise in how we convert. We have customers who absolutely lean on us for that, and they say, I want this accurate down to the last dollar. If you're dealing with financials, sometimes they care down to the last dollar. I'm slightly avoiding naming names here. If you're dealing with some of the markets that we deal with, they're happy with anything that's within 5%.
And now, there's another thing that gets slightly interesting, which is if you're dealing with financials, you'll always use decimal types for data. I have seen people in certain markets use floating point types for data. And the consequence of that is that if you do a sum of a float, you could get any answer at all. It's not like you will probably get an answer that's within 5% of the result. People don't understand floating point arithmetic. You could get anything. And the difficult cases are the ones where the customer's done something like that, and the target platform does something in a deterministic but different order to the source platform's deterministic order.
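Here is a hedged Java sketch of why evaluation order matters for floats: the same multiset of values summed in two different, each perfectly deterministic, orders gives two different totals (an illustration only; real workloads and magnitudes will vary):

```java
import java.util.*;

public class FloatSumOrder {
    public static void main(String[] args) {
        Random rnd = new Random(1);
        double[] values = new double[1_000_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = rnd.nextDouble() * Math.pow(10, rnd.nextInt(12)); // mixed magnitudes
        }

        double forward = 0;
        for (double v : values) forward += v;       // one deterministic order

        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double ascending = 0;
        for (double v : sorted) ascending += v;     // another deterministic order

        // Same values, different accumulation order, (almost certainly) different rounding.
        System.out.println(forward);
        System.out.println(ascending);
        System.out.println(forward == ascending);   // almost certainly false
    }
}
```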
Now you get customers who write code where the result of the code wasn't well defined, but the source platform had to execute it sufficiently deterministically that they think that's the right answer. And now you have to sit down with a customer. You say, dear customer, we love you. However, you did not, in the source language, say what you think you said. Can we now please work with you? And there's a marvelous piece of education there. With a good customer, you could really help them to improve their infrastructure as a whole. And that's also where we describe the static analysis side of the lineage product as, tell me the things I need to know.
Am I in my infrastructure doing something that is odd? 1 of the funniest cases I ever saw was somebody had taken code from Oracle that said: a, space, exclamation mark, space, equals, space, b. Now in Oracle, this means not-equals, because you've got a not and you've got an equals. That's a not-equals, because now we're going back to what does the lexer do. In C, exclamation mark equals, that's a token. In Oracle, exclamation mark and then equals are separate tokens, and it's the parser that puts them together into a not-equals. Postgres was written by C developers, therefore exclamation mark equals has to be a token. So what does a, space, exclamation mark, space, equals, space, b mean? It means a-factorial equals b. It executes.
It does not return an error. It doesn't give you remotely the same answer. So it is a legitimate static analysis to say, did we use the factorial operator? Because we almost definitely didn't mean to.
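A toy Java sketch of the two tokenization strategies just described for the expression "a ! = b" (hypothetical illustrative code, not how either database or CompilerWorks is implemented):

```java
import java.util.*;
import java.util.regex.*;

public class BangEquals {

    // Oracle-style: '!' and '=' are separate lexer tokens; the parser later folds
    // an adjacent '!' '=' pair into a single not-equals operator, even across whitespace.
    static List<String> oracleTokens(String sql) {
        List<String> raw = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z_]\\w*|[!=<>]").matcher(sql);
        while (m.find()) raw.add(m.group());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < raw.size(); i++) {
            if (raw.get(i).equals("!") && i + 1 < raw.size() && raw.get(i + 1).equals("=")) {
                out.add("<>");   // parser-level not-equals
                i++;
            } else {
                out.add(raw.get(i));
            }
        }
        return out;
    }

    // Postgres-style: "!=" is a single lexer token, so it only exists when the two
    // characters are adjacent; a lone '!' is the postfix factorial operator.
    static List<String> postgresTokens(String sql) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z_]\\w*|!=|!|=").matcher(sql);
        while (m.find()) out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        String sql = "a ! = b";
        System.out.println(oracleTokens(sql));   // [a, <>, b]   -> a not-equals b
        System.out.println(postgresTokens(sql)); // [a, !, =, b] -> (a!) = b, i.e. factorial
    }
}
```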
[00:46:34] Unknown:
Yes. That is a hilarious bug.
[00:46:40] Unknown:
What is equally puzzling is the number of these things that we discover and find in source code, and we say, how long has this been in here? And the answer is, this has been in here for years. It is generating a production dataset. It's breaking the production dataset, and nobody noticed. And so you start to ask questions like, under what circumstances do you as a customer notice an error in the production dataset? The most common answer we get is because data is missing. But if data is present, the customer tends to assume it's correct. I used to teach undergraduate Java, and you'd get into a lab, and you'd say to a student, you're going to simulate a cannonball. You're gonna fire it into the air at 30 meters a second. Gravity, we'll assume, is 9.81, and you're going to model the position of this cannonball at 1 second intervals and tell me when it hits the ground. Great. Okay. Well, I can do basic calculus, and so I can say, okay, it's gonna hit the ground in about 6 seconds. Fine. So they'd write their code, and they'd run their code, and they'd very proudly present me their answer. The cannonball hits the ground in 25 seconds, and I would say to them, are you sure?
The tone of voice is critical here. And it took them a couple of months to work out that I would ask, are you sure, in exactly that same tone of voice, regardless of whether or not they had the right answer. Because their duty of care was the same. It didn't matter whether I knew they had the right answer. I was not going to be the oracle. They were going to make sure.
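For readers who want to try the lab themselves, here is one hedged sketch of the exercise as described, using the same 1-second steps (the closed-form answer is 2*v/g, about 6.1 seconds, so a result of 25 seconds should prompt the "are you sure?"):

```java
public class Cannonball {
    public static void main(String[] args) {
        final double g = 9.81;   // m/s^2
        double velocity = 30.0;  // m/s, fired straight up
        double height = 0.0;     // m
        int t = 0;               // seconds elapsed

        // Step the simulation at 1-second intervals until the ball comes back down.
        do {
            velocity -= g;        // lose one second's worth of speed to gravity
            height += velocity;   // then move for one second at the new velocity
            t++;
        } while (height > 0);

        // Prints 6 with this step size, close to the closed-form ~6.1 seconds.
        System.out.println("hit the ground after ~" + t + " seconds");
    }
}
```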
[00:48:03] Unknown:
It's definitely remarkable, the amount of that sort of cavalier attitude that exists in the space of working with data and dealing with analyses, just assuming that because the computer says it, it's correct,
[00:48:19] Unknown:
and not being critical of the processes that gave you that answer in the first place. And you spoke briefly about testing. So the naive answer to testing is, if the target platform gives the same answer as the source platform, great, you're golden. And that is in fact the easy case. There's a lot of cases where the target platform gives a different answer to the source platform, and there's an awful lot of reasons why that might arise, many of which have nothing to do with whether the translation was in fact accurate and preserved the semantics expressed by the source code.
[00:48:46] Unknown:
That almost makes me think that people should just use Compiler Works to migrate their code to a different system to see if it gives them a different answer and points them in the direction of finding that they had some horrible mistake for the past 10 years. Well, that's exactly why we have our lineage. You run lineage over your code, and it will tell you whether you had a horrible mistake, and you don't need a target platform for that. And I imagine too that, by virtue of being able to take a source language and then, you know, generate a different destination language, that will also help people with doing sort of trial evaluations of multiple different systems in the case where they're trying to make a decision and see sort of how it actually plays out in, you know, letting my engineers play with it, letting my, you know, financial people play with it and see what the answers look like. And I'm wondering what the frequency of that type of engagement is in your experience.
[00:49:34] Unknown:
Almost universal because 1 of the things you had to bear in mind when you're doing a semantic mapping is the required semantics might not exist on the target platform. So now you've got a group of developers on the source platform where you'll find some master developer, and he will find you some hairy piece of code, and he will say, this is the hairiest thing on the source platform, can you convert it to the target platform? And in the old world, somebody would sit down and they'd convert that piece of code, and they'd say, yes. But what he's given you isn't the hairiest piece of code for the target platform. He's given you the hairiest piece of code for the source platform. So an engagement for us looks like, here's all the code for the source platform.
Can you qualify the entire code base against the target platform? And the answer is, yes, if you hold on a minute or 2, we can actually give you that answer. And then we can say, in this file over here is this operation which is really simple on the source platform, because the source platform happens to have that operator, but the target platform doesn't and has no way to emulate it. Definitely an interesting aspect and side effect of the varying semantics of programming languages and processing systems. Yes. And 1 of the fundamental assumptions of the compilers world is that the target platform can do the thing. This is what makes it a very interesting compilers problem, because that's not true. The target platform cannot necessarily do the thing. And in the world where your language is just a grammar, a syntax stuck on top of the target executor, of course you can do the thing, because you just glued a keyword to every instruction in the target.
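(A hedged sketch of what that looks like in SQL, with hypothetical table and column names: some operators that are missing on a target can be emulated mechanically, the analogue of the cast-up, operate, cast-back-down trick that comes up just below; others have no rewrite at all, and those are the ones worth finding before the migration starts.)

```sql
-- Source platform (Redshift / Teradata / Oracle style): MEDIAN is built in.
SELECT region, MEDIAN(amount) AS median_amount
FROM sales
GROUP BY region;

-- PostgreSQL has no MEDIAN aggregate, but it can be emulated with an
-- ordered-set aggregate, so a compiler can rewrite it mechanically.
SELECT region,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY amount) AS median_amount
FROM sales
GROUP BY region;
```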
So, yes, this is in a way a case where there isn't a workaround. It's not just a case where the instruction set isn't dense. I mean, even compiling C to Intel, the Intel instruction set isn't dense. You can't do every basic arithmetic operation on every combination of widths of words. So sometimes you have to cast up, do your arithmetic operation, and then cast back down again. In databases, there are conversions between platforms that have things that can't be done. And so the ability to run Compiler Works over a code base and say whether this could even be done, based on some simple operation, is golden for a customer. Yes. The side effect of SQL not being Turing complete. And not fundamentally having assignment.
You can sort of use sub selects to do a little bit of functional programming, but the lack of assignment bites. And then you end up in weird corner cases, like: if emulating a particular piece of semantics requires you to reference a value more than once, and you don't have assignment or an assignment-like operator, does the target platform then reevaluate a subtree, where that subtree might, for instance, contain a sub select with an arbitrarily complex join? Expensive was the word I was looking for there. Expensive is such a marvelous word in industry.
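(A minimal sketch of that corner case in a PostgreSQL-style dialect, with a hypothetical orders table: the closest thing SQL has to assignment is naming an intermediate result, and whether that even guarantees single evaluation is itself platform-dependent.)

```sql
-- The expensive aggregate is referenced twice below. Without a way to name it,
-- the subselect would have to be written out twice and the engine trusted to
-- notice the duplication. Whether a named WITH block is computed once or
-- re-expanded at each reference varies by platform and version; PostgreSQL 12+
-- lets you pin it down explicitly with MATERIALIZED.
WITH customer_totals AS MATERIALIZED (
    SELECT customer_id, SUM(amount) AS total
    FROM orders                            -- arbitrarily complex join could go here
    GROUP BY customer_id
)
SELECT customer_id, total
FROM customer_totals
WHERE total > (SELECT AVG(total) FROM customer_totals);   -- second reference
```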
[00:52:19] Unknown:
Or if the sub select happens to be evaluated at 2 different points in time, where the query does not have snapshot isolation and somebody inserted a record in the midst of the query being executed.
[00:52:32] Unknown:
Yes. So you've got repeatable read, and then you've got things like stable functions. Like, if the thing that you had to duplicate, for instance, read the clock. Most database servers are smart about this: when you call the clock function, or rather, they will actually publish multiple clock functions. 1 of those clock functions reads the time at the beginning of the query and compiles that time into the query as a constant, so that when you do something with respect to now, you are always treating the same now, even if your query takes a minute to run. But they will often also have another clock function, which means the actual millisecond instant that the mutator hit that opcode.
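(PostgreSQL, for instance, exposes exactly this pair of behaviors; a minimal illustration:)

```sql
-- now(): frozen at the start of the transaction/query, so every reference in
--        the statement sees the same "now", however long the query runs.
-- clock_timestamp(): the wall-clock instant the function is actually
--        evaluated, so it can differ from row to row within a single query.
SELECT now()             AS query_now,
       clock_timestamp() AS right_now
FROM generate_series(1, 3);
-- query_now repeats identically on all 3 rows;
-- right_now typically creeps forward a little on each row.
```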
And now you get customers who confuse the 2, and sometimes it matters and sometimes it doesn't. And if you're running on a parallel database server or whatever, you start to get different answers. So, yes, it's not just about data. I think what I'm doing here is I'm broadening one's view of what isolation and subtree duplication and so on and so on really do to you. So I'm sure that we could probably continue this conversation
[00:53:36] Unknown:
ad infinitum, but both of us do have things to do, so I'll start to draw us to a close. And so to that end, I'm wondering if you can just share some of the most interesting or innovative or unexpected ways that you've seen the Compiler Works platform used. I think the ones that we love most are the ones in the lineage product
[00:53:52] Unknown:
where, because we show consequences at a distance, somebody's looking at maintaining a column, and we say, it affects such and such a business report, you probably should think before you do that. And the user says, no, it doesn't. It can't possibly. And then they click the button on Compiler Works that says, explain yourself, and we say, this is how it does it. And they have that moment. And I think the best mails that I get, and we get them quite often, are not just the ones where we gave the user a revelation. It's where we gave the user a revelation that fundamentally disagreed with what they believed about their infrastructure and really opened their eyes to it. Those are the ones that I most enjoy. The ones where people run it because it's accurate and they say, okay, if I do this, I'm not gonna go to jail, that's great. But the ones where it contradicts them and substantiates
[00:54:43] Unknown:
itself are the ones that I love. I could definitely see that being a gratifying experience. And in terms of your experience of building the Compiler Works system and working with the code and working with the customers, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? There's an interesting definition of technical debt,
[00:55:02] Unknown:
where we define technical debt as not a thing that's done wrong, but a thing that's done wrong that causes you to have to do other things wrong. It's really only a debt that matters if you have to pay interest on it, in that sense. So a lot of it is knowing when to incur technical debt and how much interest you're paying on it. Compilers, I've compared to playing snooker. I can go up to a snooker table, and I can roll a ball into a pocket. If I'm lucky, I can hit that ball with another ball and get it to go into a pocket. I'm not skilled enough to take a stick and hit the 1st ball with the stick so it hits the 2nd ball so it goes into a pocket. That's a skill with 3 levels of indirection, which I don't have. Compilers, you have to be forever thinking about everything you're doing at 3 levels of indirection, because you're writing the compiler, and a lot of customers don't even send you the code. So the customer says, my data was 42, and it should have been 44.
I'm not even necessarily going to tell you what the code was, and now you have to fix it in the compiler. So now you're playing snooker blindfold. That's tough. And 1 of the things that comes out of this is, because the majority of the code that ever runs through your compiler is never gonna show up in a support issue, and you're never gonna see it, what this means is that you mustn't cheat.
[00:56:18] Unknown:
Absolutely. Do it right. Prove it right. And for people who are interested in performing some of these platform migrations, what are the cases where Compiler Works is the wrong choice?
[00:56:34] Unknown:
There are cases we come across where the target platform has so little resemblance to the source, by which I mean the desired target code has so little resemblance to the source, that it's not really a migration. It's a version 2 of your product. We come across people who try this, and for that, Compiler Works is the wrong choice. I would tend to advise those people, you know, people use a platform migration as an opportunity to do a product version 2, and this isn't always as stunning an idea as it seems. You might want to consider separating the platform migration and the version 2 product, because I think any developer who's been around the block a couple of times knows that the double set of unknowns is going to hit you.
And so we speak to people, and we say, do the migration. Do it apples for apples, and then do the maintenance on the target platform. Because the other thing about not doing a relatively clean migration is that you now don't have a test suite. You can't compare target platform behavior with source platform behavior because you explicitly specify target platform behavior to be different. So we tend to advise people to do 1 thing at a time. But if you wanted to do them both together, we would start to lose relevance.
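(For what an apples-for-apples comparison can look like in practice, a hedged sketch, assuming the same job's output from each platform has been landed somewhere queryable; the schema and table names here are hypothetical.)

```sql
-- Rows the source platform produced that the migrated code did not...
SELECT * FROM source_run.daily_revenue
EXCEPT
SELECT * FROM target_run.daily_revenue;

-- ...and rows the migrated code produced that the source never did.
SELECT * FROM target_run.daily_revenue
EXCEPT
SELECT * FROM source_run.daily_revenue;

-- Both queries coming back empty is the easy "golden" case from earlier. Any
-- rows that do appear are the starting point for deciding whether the
-- difference is a translation bug or a legitimate platform semantic difference.
```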
[00:57:52] Unknown:
As you continue to iterate on the platform and work with customers and build out capabilities for Compiler Works, what are some of the things that you have planned for the near to medium term, or any projects that you're particularly excited to work on? Lots more languages
[00:58:07] Unknown:
and shortening the time to that moment. The latest versions of Compiler Works that we've shipped have a completely redesigned user interface for the lineage, and we've done a lot of work there to put exactly the right things on screen so that you can look at the screen, and the answer to your question is there. That's hard work. That's visual design work. It's got nothing whatsoever to do with compilers, and now you've got a compilers team saying, now we have to do all of this user psychology and so on and so on to do the visual design. But then it goes back into the compilers team. Like, the user tells you what they need in order to make a decision: a user has an error somewhere in their data infrastructure, and they want to know how to fix it. Really, what they want is the list of tasks they have to perform, in order, in order to fix that. And we can produce that, but we can also make it visible why, and justify it. But the moment you've gone up to the front end and decided that that's what your user story is and you need to do that, now you have to go back into the back end and make sure that the back end is generating all of the necessary metadata to feed into the static analysis, so that that visualization can be generated.
And so it's a very tight loop between user story visualization front end and pretty hardcore compiler engineering.
[00:59:21] Unknown:
Absolutely. And at the risk that this is probably a subject for another podcast episode entirely, what are some of the applications of compilers that you see the potential for in the data ecosystem specifically that you might decide you want to tackle someday? There are applications
[00:59:38] Unknown:
of compilers that I particularly enjoy. 1 of my favorites, which is on GitHub, was the QEMU Java API. What it does is it takes the QEMU source code, which itself has some sort of JSON-ish preprocessor, and there's another compiler that compiles that JSON-ish code into a Java API which allows you to remote control a QEMU virtual machine. Now, 1 could have sat down and tracked QEMU and written this thing longhand and said, I'm going to write a remote control interface to QEMU, such that I can add disks and remove disks and so on and so on on the fly. But it made far more sense to do it as a compilers problem because it now tracks upstream QEMU. And if they add a new capability, well, guess what? You rebuild your Java API by running this magic compiler over QEMU, and you've got a new QEMU remote control interface.
And the reason that I particularly love that piece of code as an application of compilers was that now I can write a JUnit test case that runs in Gradle, in JUnit, in Jenkins, in all of my absolutely standard test infrastructure, installs a storage engine on the virtual machines, writes a load of data to the storage engine, causes 3 hard drives to fail, and proves that the storage engine continues operating in the presence of 2 failed hard drives, all in JUnit. Now, normally, when people start talking about doing that sort of infrastructure testing, they have to invent a whole world and a whole framework for doing this, and yet one 200-ish-line compiler run over the QEMU source code gave the capability to suddenly write a simple, elegant, readable test in the standard testing framework that allows you to do hardware-based testing of situations that don't even arise in the normal testing world. That's where I start to love compilers as a solution to things, and that's why I think I will always have a thing for compilers, whether it's data processing or not. Yes. Definitely amazing the number of ways that compilers
[01:01:47] Unknown:
are and can be used, and the amount of time that people spend overlooking compilers as a solution to their problem, to their detriment, and at the extreme cost of time and effort put into overengineering a solution that could have been solved with a compiler.
[01:02:03] Unknown:
People think about it as, like, you could do this by hand. I could sit down and write bindings for libopengl. But if I actually want to... like, how many method calls or how many function calls are there in OpenGL? If I actually want to call OpenGL from Java, I probably need to generate a Java binding against the C header file for OpenGL, several thousand function calls, and that's a job for a compiler. It happens to be a job for a C preprocessor as well. I think I know which 1 they used.
[01:02:34] Unknown:
Alright. Well, are there any other aspects of the work that you're doing at Compiler Works or the overall space of data infrastructure and data platform migrations that we didn't discuss yet that you'd like to cover before we close out the show? I think we should have a long talk about compilers in the abstract sometime because we'll get into a very rich, probably a very opinionated,
[01:02:53] Unknown:
and probably a very detailed territory, 1 day. I will say, don't be afraid to learn. 1 of the things that I think makes me a little bit odd in this world is that I actually didn't study all of the standard reference works. We study a lot of history. Most of the people who slapped these things together didn't study the standard reference works. So by all means, take the course. I had some excellent professors whom I loved who put us through the standard compilers course. But I will say that the standard compilers course, and even some of the advanced compilers courses that I've watched because the universities have been publishing them online, they don't really touch on this.
They don't really want to touch on type checking. They just about do basic things like flow control. Get out there and learn, and be self-taught, and dig into it, and don't be afraid to do that. And, 1 day... I have a shelf of books: I went to 1 of the publishers, and I said, give me every book you've got on compilers. And it's 1 of my intentions 1 day to read them, but I haven't yet. So my closing thought would be,
[01:04:02] Unknown:
even if it's not compilers, whatever it is, go for it. Well, for anybody who wants to follow along with you and get in touch, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:04:19] Unknown:
I think as we move away from some of the enterprise languages, and we move into, particularly, some of the data flow systems that we've got these days, we are moving into a world where the languages become harder to analyze and maintain. We're accessing underlying platform semantics through APIs, not through languages, and I think that that is going to have a cost. I predict doom. Definitely something to consider. It might be strawberry-flavored doom. I don't know what sort of doom.
[01:04:56] Unknown:
Alright. Well, it has truly been a joy speaking with you today. So thank you for taking the time, and thank you for all of the time and effort you're putting into the work that you're doing at Compiler Works. It's definitely a very interesting business and an interesting approach to a problem that many people are interested in solving. So thank you for all the time and effort on that, and I hope you enjoy the rest of your day. Thank you for such a marvelous set of questions. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Shevek and Compiler Works
Accidental Introduction to Compilers
Challenges in Data Management and SQL Translation
Understanding Compilers and Their Relevance
Language Implementations and Focus Areas
Abstract Modeling and Mathematical Representations
Technical Architecture and Workflow
Integrating Compiler Works into Customer Infrastructure
Evaluating and Testing Platform Migrations
Customer Revelations and Lessons Learned
Future Plans and Exciting Projects