Summary
Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
- This is your host Tobias Macey and today I’m interviewing Ry Walker, CEO of Astronomer, the platform for data engineering.
Interview
- Introduction
- How did you first get involved in the area of data management?
- What is Astronomer and how did it get started?
- Regulatory challenges of processing other people’s data
- What does your data pipelining architecture look like?
- What are the most challenging aspects of building a general purpose data management environment?
- What are some of the most significant sources of technical debt in your platform?
- Can you share some of the failures that you have encountered while architecting or building your platform and company and how you overcame them?
- There are certain areas of the overall data engineering workflow that are well defined and have numerous tools to choose from. What are some of the unsolved problems in data management?
- What are some of the most interesting or unexpected uses of your platform that you are aware of?
Contact Information
Links
- Astronomer
- Kissmetrics
- Segment
- Marketing tools chart
- Clickstream
- HIPAA
- FERPA
- PCI
- Mesos
- Mesos DC/OS
- Airflow
- SSIS
- Marathon
- Prometheus
- Grafana
- Terraform
- Kafka
- Spark
- ELK Stack
- React
- GraphQL
- PostgreSQL
- MongoDB
- Ceph
- Druid
- Aries
- Vault
- Adapter Pattern
- Docker
- Kinesis
- API Gateway
- Kong
- AWS Lambda
- Flink
- Redshift
- NOAA
- Informatica
- SnapLogic
- Meteor
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at www.dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. And to help support the show, you can check out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. This is your host, Tobias Macey, and today I'm interviewing Ry Walker, CEO of Astronomer, a platform for data engineering. So, Ry, could you please introduce yourself?
[00:01:02] Unknown:
Sure, yeah, hi. Glad to be on the podcast. Ry Walker, CEO of Astronomer. We're a two-year-old data engineering platform, and I'm a lifelong entrepreneur; this is my fourth company of substance. I've been doing this stuff for a long time, but a few years ago I got hooked on the whole data revolution, the data science trend, and decided to build a company in the space. We're pretty excited about what we're doing.
[00:01:31] Unknown:
And how did you first get involved in the area of data management, and what is it that interests you about it? We were working on a product called Usercycle.
[00:01:39] Unknown:
We actually took that over from somebody, a friend of mine, and discovered that the biggest challenge in the business was not so much making a sale or explaining the value prop. It was a product analytics company similar to Mixpanel or Kissmetrics, things like that. But the biggest challenge was getting companies to hook their data up to us. Our product could do nothing without data being fed to it. So we spent a lot of time on that; we said, you guys should use Segment, and we built an integration with Segment so that it was easy to get the data to us that way, but not everyone wanted to use Segment. And we realized that, at the time, I think there were 3,000 tools on that marketing tools chart. I don't know if you've seen that on the internet, but we figured every company in our space had to be having the exact same challenge; there was nothing special about our situation.
And so we pivoted away from building an analytics product to building a company that helped organizations wield their data for their own benefit.
[00:02:40] Unknown:
And so that company, I'm assuming, is Astronomer. So I'm wondering if you can give a sort of high level overview about what Astronomer is and some of the origin story.
[00:02:49] Unknown:
It was actually kind of funny. We were at a conference, the Collision Conference in Las Vegas a couple of years ago, and I think it was May 20th. It's funny that I know the exact date, because it was a password I used for a while, the date when we pivoted. We went to that conference on day one as Usercycle, and we just weren't feeling it. It was the first time we were out talking about the product en masse. And during the conference there was a lot of DevOps and dev tooling, a lot of selling to devs, that kind of thing going on in the world. So we decided to pivot that night and come back the next day as this other company, which we had been thinking about but hadn't pulled the trigger on. I had some meetings with VCs the second day, so that day, instead of talking about Usercycle, we talked about the new thing, and it was a way more timely product.
I know this isn't a podcast about startups, but one of the things you generally learn is that the idea you may have had initially is already played out in the world of venture capital, so you have to go somewhere a little bit more on the edge. And data engineering is still a pretty young discipline. Most people don't know the term. Even today, I think it's only now coming into a little bit more understanding of what it is.
[00:04:10] Unknown:
Yeah, a lot of people still tend to ask, what is data engineering, don't you mean data science? Because that's the hot buzzword of the moment, and everybody knows what that is. But when they see data engineering, it's a little more of a nebulous concept to them, particularly because data is so intangible in so many ways, and trying to understand how it pertains to an engineering discipline in and of itself is maybe a little bit foreign to them.
[00:04:36] Unknown:
So, I built a web development company back in '95. This was really early. I'd walk around downtown Cincinnati, the city I was in, and we would unplug people's fax machines, show them the Internet, and say, do you want to buy a website? And they'd say, please plug my fax machine back in, I've got important faxes coming in. And we're kind of in the same situation again. We go to these big companies, companies in the middle of the country, and basically try to explain to them: you have 40 analysts, or let's say you're talking to an insurance company and they have hundreds of people who are using the company's data every day, and it's in a really bad form, a very hard-to-use format, and we explain, you should add a layer of developers to make this data better for your people.
And it's just a foreign concept, but we're doing our best to educate and teach people what's going on in this space.
[00:05:35] Unknown:
And do you find that you have a difficult time getting people on board with using a hosted platform for managing and processing their data, particularly because of the inherent value that their data has, though some companies might not necessarily realize the value contained within the information they're gathering?
[00:05:55] Unknown:
Yeah, I'd say there's definitely some resistance to that. Our platform has some features that we don't talk about too much on the website yet, because they're still kind of new, but you can actually deploy a VPC through our platform that is essentially a private network to their network. We also have VPN capability. And basically, at the end of the day, if you want to send your data out into the world, it has to get out into the world at some point. I think companies that live in a castle, where their data never gets to leave the premises, are really playing at a disadvantage compared to the technology companies that are trying to unseat them, so I tell them that all the time. It's like, you're either going to evolve or die. It's digital Darwinism.
And most of them have tried doing some cloud things, so I'd say, especially here in 2017 compared to 2015, the guards are coming down a lot. Especially when, for example, we're dealing with clickstream data or consuming data from the cloud; obviously that has to be connected to the cloud in some way in order to get the data. But we've got certain features in our product that help to alleviate some of those concerns.
[00:07:11] Unknown:
And some companies are gonna be dealing with data that has regulatory issues associated with it such as HIPAA or FERPA or, you know, SOX compliance or PCI. So I'm wondering if there are any challenges associated with being able to properly secure the environments that those customers' data are flowing through? Do you have to segregate your network in order to make sure that their information doesn't contaminate somebody else's or any other sort of special technical problems that come about from that kind of a situation?
[00:07:39] Unknown:
We avoided any HIPAA-type deals early on because we just didn't want to deal with the headache. But our platform can actually be deployed completely on-prem as well. We built it from day zero knowing that being a pure SaaS company would really limit our ability to work with larger businesses. That kind of gets into some of the architecture stuff, but we built it on top of DC/OS and Apache Mesos so that it was very portable and would run anywhere. Early on we also built in some AWS dependencies that we're now removing. But the vision is that this thing is completely based on open source components, which makes it easy to move around.
[00:08:19] Unknown:
And that was actually one of the other issues I wanted to bring up: a lot of companies, when they hear the idea of open sourcing your core platform, get sort of allergic to the concept. So I'm wondering, when you're in that kind of situation, where all of the pieces of your technical product can be obtained essentially free of charge, what's the motivation for somebody to actually pay you? What are the differentiators that somebody won't necessarily be able to just replicate in-house, so that they keep paying a subscription cost or something like that? Yeah, that's definitely a concern.
[00:08:53] Unknown:
Basically, we're building this thing to be multi-tenant, so it's more dynamic than what you would build yourself. For example, our platform uses Apache Airflow, and we have to figure out ways to essentially generate pipelines based on configuration, whereas if you're just building it for yourself internally, you can literally just drop a Python file in there as the thing. So it's definitely more complicated the way we're doing it, but I would point to the fact that if you're going to try to bring Airflow in-house and Kafka in-house and Spark in-house and all the stuff that's inside our platform, plan on hiring some people to learn and manage that stuff. I'm assuming you've at least played around with those tools; they're not small. They have a lot of surface area to learn and a lot of management to keep them up and running. So that's our argument: yes, you could pull all these open source tools off the shelf. Go ahead and give that a shot. Give yourself three to six months and see how far you get, and we'll talk to you in three to six months; in general you will not make as much progress as you thought you would, unless you have some superstar developers, which a lot of the kinds of companies we're talking to don't have in-house. Right, the old axiom that open source is only free if your time is worth nothing. Yeah, I never heard that before. That's a good one. I'll definitely use that.
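The configuration-driven pipeline generation Ry describes is a common Airflow pattern: instead of hand-writing one DAG file per customer, a loop expands declarative config into pipeline definitions at module import time. Here is a minimal, hedged sketch of that pattern; the config keys and task names are illustrative, not Astronomer's actual schema, and it returns plain dicts where real Airflow code would instantiate DAG and operator objects in the same loop:

```python
# Hypothetical per-customer pipeline config (in practice, loaded from YAML or JSON).
PIPELINES = {
    "salesforce_to_warehouse": {
        "schedule": "@daily",
        "tasks": ["extract_salesforce", "clean_contacts", "load_warehouse"],
    },
    "clickstream_rollup": {
        "schedule": "@hourly",
        "tasks": ["read_events", "aggregate", "load_warehouse"],
    },
}

def build_dag(dag_id, spec):
    """Expand one config entry into a DAG description with linear dependencies.

    With Airflow installed, this would create a DAG object and chain operators
    instead of returning a dict.
    """
    tasks = spec["tasks"]
    return {
        "dag_id": dag_id,
        "schedule": spec["schedule"],
        # Each task runs after the previous one.
        "edges": list(zip(tasks, tasks[1:])),
    }

# Airflow discovers DAGs at import time, so generating them in a loop means
# new pipelines appear as soon as the configuration changes.
dags = {name: build_dag(name, spec) for name, spec in PIPELINES.items()}
```

The payoff is that adding a customer pipeline becomes a config change rather than new code, which is exactly what makes the multi-tenant case harder than the "drop a Python file in" single-team case.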
[00:10:21] Unknown:
And is there a particular sort of size or scope of company that you find tends to actually have the dedicated resource for data engineering versus just having some developers or infrastructure people who are doing double duty for managing the company's data as well as their primary responsibilities?
[00:10:38] Unknown:
Yeah, I mean, I think most data engineers are nonprofessionals, like data scientists, obviously: if they know a little bit of R and Python, they can write some code to pipeline some data. Our experience is that companies are either using tools like Microsoft SSIS or Dell Boomi or Alteryx, various ETL-type technologies, or they're doing nothing. But it depends on what type of company we're talking about. Startups will obviously write scripts and run them under cron, and that can get you a certain way. But the argument we make is that for all companies of any substance, if you have a data scientist, you probably ought to have a data engineer sitting next to them, or else the data scientist is going to spend half their time doing data engineering.
[00:11:24] Unknown:
And so I'm wondering if you can dig a bit deeper into what your data pipelining architecture actually looks like. You mentioned a few of the different components. But as you also said, there are some sort of special configurations that you're using in order to be able to make it more general purpose. And as anybody who's tried to open source a project before has noted, it's a lot more difficult to make something general purpose than it is to just use it for a specific use case.
[00:11:53] Unknown:
Yeah. So our platform sort of starts with Mesos at the lower level, and we use Marathon; we're using Prometheus and Grafana to do the instance-level monitoring, and Terraform to deploy. I would say Airflow is probably the center point of a lot of what we're doing, but we deal with both streaming data and batch jobs. And I actually think the lines are blurring a bit between real time and batch, because, for example, we're basically taking batch data and throwing it onto a Kafka queue sometimes, and then pulling that off in real time. So as data is being processed, let's say from a CSV download, you kind of turn it into real-time data; that's an option instead of just dealing with it batch by batch. And at the same time, with real-time data, oftentimes, if you want to push data into Redshift, you want to use Redshift's COPY command, which means you have to micro-batch it. So it's kind of interesting that we're turning batch data into real time and then turning it back into micro-batches. Anyway, Kafka is pretty important, and Spark is pretty important in our platform.
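That real-time-to-micro-batch step can be pictured as a small buffering layer: accumulate records off the stream and flush a batch once it reaches a size threshold. This is an illustrative sketch, not Astronomer's code; in a real deployment the flush callback would write the batch to S3 and issue a Redshift COPY, and you would also flush on a time limit so slow streams don't sit in the buffer indefinitely:

```python
class MicroBatcher:
    """Buffer streaming records and hand them off in fixed-size batches."""

    def __init__(self, batch_size, flush_fn):
        self.batch_size = batch_size
        self.flush_fn = flush_fn  # e.g. write batch to S3, then run Redshift COPY
        self.buffer = []

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Hand off whatever has accumulated, then reset the buffer.
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

# Usage: collect flushed batches (a stand-in for the S3 + COPY step).
batches = []
batcher = MicroBatcher(batch_size=3, flush_fn=batches.append)
for event in range(7):
    batcher.add({"event_id": event})
batcher.flush()  # drain the partial final batch
```

Seven records with a batch size of three yield two full batches and one partial one, which is the trade-off Ry alludes to: larger batches make COPY efficient, smaller ones keep latency down.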
Then we have our monitoring, our front-end monitoring stuff, so we push all the logs for everything into Elasticsearch, Logstash, and Kibana. Gosh, what am I missing? Our front-end app is React, and we're using GraphQL for our API; we're going to try to get by with our only API being a GraphQL API. I don't know if you saw that GitHub recently released their GraphQL API, so we'll have some nice docs around that. And we're definitely an API-first sort of company, so anything that can happen in the system, we want to be accessible through that API. I think those are the main components. Well, we also have a bunch of different databases like Postgres and Mongo and Redis supporting that. One of the new things we're playing around with is a time series database; I understand Druid is going to be the thing we try, and we're looking at Ceph for data storage. So there are all kinds of fun little mini projects in there for our infrastructure team, but we're kind of in this no-Hadoop sort of mindset. Our platform isn't permanent storage, and it doesn't really do any data science; there are no magical algorithms that happen in there. But I guess the final piece is our own open source connector library, which we call Aries. It's just various SaaS and database source connectors and destination connectors.
So that's the stack we're rolling. Does that sound like a lot?
[00:14:29] Unknown:
Yeah. I mean, as you dig down deeper into any technology company, at a high level you think, oh, there are just a few different components, and then as you dig deeper and deeper, you start uncovering more bits and pieces. Where I'm working, my primary responsibility is hosting the edX platform, and so you think, oh, it's just a Django web app and a couple of databases, and it's like, oh, well, I also have SaltStack to manage all the infrastructure, and then I've got Consul for service discovery and Vault for managing all the secrets, and then RabbitMQ for all the task management, and then I've got the elastic file store for permanent storage of data. Yeah, we have Vault on ours too; I forgot about that. It's amazing how fractal these things become.
[00:15:07] Unknown:
Yeah, and that's where I think it gets challenging. So again, yeah, you can spin up an Airflow thing and get Spark running and have a couple of the components. But the other thing that we're trying to do is build these recipes. I don't know how many data engineers in this world have built a "let's get our data out of Salesforce and put it into our warehouse" pipeline before; I imagine it's been built 50,000 times, let's say, and I think that's too many. So our theory is, let's build a library of recipes of kind of standard best practices. And I understand, maybe there need to be ten standard Salesforce-to-warehouse recipes, but I just think it's horrible that we all have to build everything from scratch, go back to the raw API docs, and try to figure out what the API is providing. So a big part of the next phase of our business is open sourcing this library of recipes that live on top of our platform but could be used even without it, though the easiest way to use them would be there. And then we'll also hook our front end up to those recipes, so that if you just want to grab Salesforce data using recipe number four, you should be able to pull it off the shelf, provide your credentials, say go, and let that stuff flow. If you want to tweak it, fork the recipe, upload that, and run a forked version of it. I don't know if you've seen any systems like that, but that's the vision for what we're up to. And if there's just one little transform we want to do in the middle that's different from the standard, let's put a little spot for that, so that maybe you can just define a mapper function of sorts.
But, yeah, basically we want to make the simple things almost invisible and the hard things possible, so we're going to let you run any Python in the pipeline. And because we're using Mesos, Docker is obviously super involved too. So every time a task starts in our world, it actually fires up a whole Docker container to run that task and then shuts it down. It's pretty cool the way we're dealing with resource isolation.
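The "recipe with an optional mapper" idea can be sketched as a pipeline function with a pluggable transform hook: the stock recipe uses an identity mapper, and a fork just swaps in its own function. This is a hedged illustration of the design, not the Aries API; every name below is hypothetical:

```python
def run_recipe(extract, load, mapper=None):
    """Run a source-to-destination recipe with an optional per-record transform.

    extract: callable returning an iterable of records
    load:    callable accepting a list of records
    mapper:  optional per-record transform; defaults to identity
    """
    mapper = mapper or (lambda record: record)
    load([mapper(record) for record in extract()])

# Stand-ins for, say, a Salesforce source connector and a warehouse sink.
def extract_contacts():
    return [{"name": "Ada", "email": "ADA@EXAMPLE.COM"},
            {"name": "Grace", "email": "Grace@Example.com"}]

# Run the stock recipe as-is...
warehouse = []
run_recipe(extract_contacts, warehouse.extend)

# ...or "fork" it by supplying one custom transform in the middle,
# without touching the extract or load steps.
warehouse_forked = []
run_recipe(
    extract_contacts,
    warehouse_forked.extend,
    mapper=lambda r: {**r, "email": r["email"].lower()},
)
```

The design choice here is that the recipe owns the hard parts (API pagination, credentials, load semantics) while the user-supplied mapper is the one small, safe point of customization.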
[00:17:14] Unknown:
Yeah. Whenever you get to the point of having plugins to map one input to another output, there are always certain complexities involved. So I'm wondering if you're using something like the adapter pattern, where you have a standard data format or representation of the data within the core of your system, and then you use adapters to translate to and from that common representation to get from the source to the destination?
[00:17:36] Unknown:
Yeah. I mean, in clickstream there are definitely some standard formats, but I'd say not quite so much yet elsewhere. I could see that being a big issue; there ought to be standard formats for various types of data, but we haven't quite dug deep into that. Right now, for collecting data, it should be simple to get just the raw data into your database, and it comes in in whatever format the vendor provides. But that's our next challenge: to help do some, call it preprocessing, though it's really more like post processing. After the raw data arrives, what can we do to make it immediately more usable than just having a pile of raw data in there? Our platform is essentially infinitely wide, because there are so many different areas and types of domain, so to the extent we can avoid being the subject matter expert there, we'd love to. We want to be the infrastructure for all this and hopefully let communities manage what the standards ought to be.
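The adapter pattern Tobias describes would look roughly like this: each source adapter normalizes a vendor payload into one canonical record shape, and destination adapters translate back out, so adding a new source never requires touching the destinations. A hedged sketch with made-up vendor field names:

```python
# Canonical event shape used inside the pipeline (illustrative).
def canonical(user_id, event, timestamp):
    return {"user_id": user_id, "event": event, "timestamp": timestamp}

# Source adapters: vendor-specific payload -> canonical record.
def from_vendor_a(payload):
    return canonical(payload["uid"], payload["action"], payload["ts"])

def from_vendor_b(payload):
    return canonical(payload["user"]["id"], payload["eventName"], payload["occurredAt"])

# Destination adapter: canonical record -> e.g. a warehouse row.
def to_warehouse_row(record):
    return (record["user_id"], record["event"], record["timestamp"])

# Any source can now feed any destination through the common shape,
# so N sources and M destinations need N + M adapters, not N * M.
events = [
    from_vendor_a({"uid": "u1", "action": "click", "ts": 1700000000}),
    from_vendor_b({"user": {"id": "u2"}, "eventName": "signup", "occurredAt": 1700000060}),
]
rows = [to_warehouse_row(e) for e in events]
```

The N + M versus N * M adapter count is the practical argument for a canonical format, and it is also why agreeing on that format is the hard, community-level problem Ry points at.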
[00:18:42] Unknown:
And, taking a brief aside, earlier when you were describing your architecture, you made a brief allusion to your infrastructure team. One of the conversations I had with Maxime Beauchemin in a previous episode was about the idea of segregating your data engineering between the data infrastructure and the people who are doing the engineering on the specific data. So I'm wondering what your perspective is on that particular divide, and, I guess, at what point that division actually happens in terms of the overall scale of an operation.
[00:19:11] Unknown:
Yeah. So we have four teams of engineers at Astronomer, which sounds like a lot considering we're an early stage startup, but I'll describe them, and I think one of them most companies wouldn't need. We have what we call the infrastructure team, and they're dealing with Mesos and all the services, deployments, and that kind of stuff. Then we've got what we call the processing team; these are the people who, if you need to build apps that pull data from Kafka and do something with it, work at that level. Then we have our front-end team, and again, I think most companies don't need to build a front end to their data infrastructure. However, even Airbnb, obviously, has a lot of cool front-end tools, and if you want to make these things accessible to the people in your organization, you have to think about that kind of stuff. And finally, we have what we call the implementation team, which actually builds pipelines on top of all this; they generally work in our customers' organizations, doing customer-specific work. So if I were designing a data engineering team, I would say don't mess around. I would start with people who build pipelines, for one, and use a platform like ours so they can be productive. But it's a big investment. We're actually talking to some companies about helping them bootstrap the data engineering organization within their business, and I say plan to hire five to ten people to have enough mass for this thing to feel worthwhile. If you're a company with 3,000 employees, a ten-person team is nothing, right?
[00:20:53] Unknown:
And as you're growing the team at Astronomer and growing the platform, I know that from personal experience, a lot of these sort of technical platforms tend to have an organic evolution to them, which leads to technical debt. And so I'm wondering, what are some of the biggest sources of debt accrual that you're seeing in your particular platform and some of the techniques that you're using to mitigate it?
[00:21:16] Unknown:
Yeah, for sure. When we built the first version of our clickstream system, the philosophy was, intentionally, let's use every single Amazon service we possibly could, because it helped us get it built really quickly; it was amazing how fast we were able to get it done, and we're still trying to pluck some of that stuff out. So I would say the tricky part is that it's hard to change, hard to switch away from the Amazon services. The ones I'm speaking of are Kinesis and API Gateway, which is a pretty cool product that gives you a nice solid endpoint that's not going to go down.
So now we're replacing that with a Go front end, and I think Kong is going to be involved in there somewhere, which I barely know what that is. As CEO, I'm a technical founder, but at this point I don't get to have too much fun writing code. But, yeah, I would say we're plucking out the last of our dependencies on Amazon services so that our entire platform just runs on plain old EC2 instances, and we're also making it so that it'll run on Google Cloud or Azure.
But I'd say that's the biggest one. There are other things that you basically learn as you go. For example, our strategy for pulling data off of the Kinesis stream worked for a period of time, until it stopped working and the applications behind it would start to get behind. So we just recently reorganized it so that each destination the data has to go to has its own dedicated service, whereas before we had the Redshift service and then the everything-else service, which worked fine for a while, but eventually you have to start breaking things down into smaller and smaller pieces. I'd also say we have a big reliance on S3, and we're actually trying to break that reliance, which will probably be the last thing we do.
But it's pretty nice. S3 is obviously a really nice tool, so it's tough not to rely on things like that.
[00:23:39] Unknown:
I think that's the reason a lot of the other object store services have tried to replicate the S3 API as well, to try to ease the transition to their services, because it's such a ubiquitous service. Even people who don't necessarily use EC2 will typically find S3 as their first entree into the Amazon Web Services morass.
[00:23:59] Unknown:
Yeah. And our CTO tells me all the time, he tries to make me feel guilty: every time we decide to pluck one of these things out, it means we're trying to build something that's as reliable as, or similar to, S3. So, yeah, it's a lot of pressure. And I think companies can get a lot done just using the Amazon services; I don't really have a huge problem with that. But it can get expensive. We had a certain customer who was doing what seemed like a pretty naive thing, and they were using a ton of Lambda calls. Lambda is a really cool technology from Amazon too, but it isn't cheap once you get to a certain scale. Our own Amazon bill, I think, was approaching $15,000 to $20,000 a month with all the stuff we have running on it. So it can get pretty pricey pretty quickly when you go to scale, and I think it's worthwhile to try not to be completely tied to it. From the platform perspective,
[00:25:01] Unknown:
are there any sort of dead ends or failures that you have encountered that you've had to sort of back out of or try to pivot around in the process of building out your platform?
[00:25:12] Unknown:
Yeah. Even right now, we've been working on a solution that uses Spark Streaming to process clickstream data off of Kinesis and get that data, in micro-batch form, to Redshift. That seems like a pretty simple process, and the way we were working it looked good on paper. Then we started pushing our production-level volume to it, and the workers couldn't keep up with the stream, even with Spark Streaming and all that. We were about ready to say, I don't know if this is going to work. We finally had a breakthrough this week to figure that stuff out, but the challenge is that there are not very many Spark Streaming experts, period, who are working with it at scale. And a lot of times the tricky thing is that we're basically building experts in all these technologies as we build the company. We've got a 20-person engineering team, and a lot of them are new to a lot of these technologies. So, yeah, especially as new people come up, they do naive things and then figure things out, and we're building that competence. But there are a lot of knobs and switches on, say, Spark, and we're still learning what all of them do, so it's pretty hard to predict when tasks are going to be done for the new features we're working on. We pretty much have to leave some of these things open ended, which is pretty frustrating as a founder trying to deliver features to customers, but this stuff is pretty tough.
[00:26:52] Unknown:
There are some areas of data engineering and the overall workflow that are very well defined and have a number of tools to choose from, such as Airflow, Luigi, or Oozie for workflow management, or things like Spark and Flink for streaming data. But there are other areas that seem to have complete gaps, such as the ability to track the provenance of data. So I'm wondering if there are any particular holes that you're experiencing where you see opportunity for building new tools to add to the ecosystem of data engineering?
[00:27:30] Unknown:
Well, I think the 1 we're contributing is this idea of having an easy way to access best practices. It's a tricky thing, because you can't readily find out the best way to get Salesforce data into Redshift. It could be that Salesforce wrote a guide for that, or that Amazon Redshift wrote a guide, and that's about the most common integration. But from there, it's a very big universe of potential combinations that most people haven't written anything about.
So like I said earlier, it's tricky to always have to start from scratch. A lot of times, junior developers get thrown on these kinds of projects and make a mess of it. We're very inspired by what Ruby on Rails did for web development, and what Node did. And I know you're a Python guy, so Django was a part of that story as well. In other areas of software development, we have these awesome libraries that help us get simple things done really quickly, and I think the infrastructure tools are coming together; it's the higher-level functions where you're kind of on your own. Take cleaning data, for example. We help companies get their Twitter ads, Facebook ads, and LinkedIn ads data all talking together in the same format. It's a shame to me that there are so many devs out there working on that exact same problem at the same time we are, and we're not collaborating on it. So I think open source collaboration on methods for these higher-level tasks is definitely an open area that we're excited about.
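As an illustration of the kind of higher-level task Ry means, here is a minimal sketch of normalizing ad records from different platforms into one common schema. The field names and platform layouts below are invented for the example and do not reflect any real ad API or Astronomer's implementation:

```python
# Each platform reports the same concepts under its own field names;
# map them into one shared schema so downstream reports see one format.
FIELD_MAPS = {
    "twitter":  {"campaign_name": "campaign", "spend_usd": "spend"},
    "facebook": {"name": "campaign", "amount_spent": "spend"},
    "linkedin": {"campaignName": "campaign", "costInUsd": "spend"},
}

def normalize(platform, record):
    mapping = FIELD_MAPS[platform]
    out = {common: record[raw] for raw, common in mapping.items()}
    out["platform"] = platform   # keep the source for later filtering
    return out

rows = [
    normalize("twitter",  {"campaign_name": "launch", "spend_usd": 120.0}),
    normalize("facebook", {"name": "launch", "amount_spent": 95.5}),
]
```

The mapping tables are exactly the kind of shared knowledge that could live in an open source library instead of being rewritten inside every company.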
[00:29:16] Unknown:
Yeah, it's definitely interesting to see how open source has become much more of a standard operating procedure for companies as its overall visibility has grown. Back 10 or 15 years ago, open source was something that those hippie software developers did, not something for a big, serious company to get involved in. Now, if you're a serious organization starting out and you don't use open source and contribute back to it, you're the outlier. And the overall perception of open source as a force multiplier is becoming more popular, because it gets people to buy into your platform and makes sure that other people's ideas are being considered.
[00:30:11] Unknown:
Yeah. Totally. I talked to somebody the other day who said, I can do everything your platform does with SQL Server Integration Services. I'm like, okay, that's fine, bye-bye. There are still those people who are stuck on Microsoft technologies and don't want to leave, and that's fine. For me, I left Microsoft back in 2006; that was the last time I wrote anything in Visual Basic, and C#, I think, had just come out. There's definitely a lot of that still. We've got customers that still have mainframes and minicomputers and AS/400s and all that kind of stuff, so those technologies aren't going away anytime soon. But if you're doing data engineering work, the tools are just amazing, you know, like Airflow.
I just explain to people: you don't have to put all your jobs on a timeline where you hope 1 thing finishes before the next thing starts. Everyone who's done DevOps knows this. Do you have any jobs that are scheduled like that? It's a very nerve-wracking thing, and it's old school. Airflow solves that. It's very lightweight and easy to use, so use it, right? But you have to use modern technologies to get the modern ideas.
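The difference Ry is pointing at, declaring that one task depends on another instead of scheduling it at a time when you hope the first has finished, is the core idea behind Airflow's DAGs. Here is a minimal dependency-driven runner in plain Python; it is a toy sketch of the idea, not Airflow itself, and the three-task pipeline is invented for the example:

```python
# Run tasks in dependency order: a task starts only after every task
# it depends on has actually finished, so there is no timing guesswork.
# (Assumes the dependency graph is acyclic, as a DAG must be.)
def run_dag(tasks, deps):
    done, order = set(), []
    while len(done) < len(tasks):
        for name in tasks:
            if name not in done and deps.get(name, set()) <= done:
                tasks[name]()          # safe: all upstreams finished
                done.add(name)
                order.append(name)
    return order

results = []
tasks = {
    "extract":   lambda: results.append("extracted"),
    "transform": lambda: results.append("transformed"),
    "load":      lambda: results.append("loaded"),
}
deps = {"transform": {"extract"}, "load": {"transform"}}
order = run_dag(tasks, deps)
```

Airflow adds scheduling, retries, and monitoring on top, but the dependency graph replacing the hopeful timeline is the essential shift.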
[00:31:21] Unknown:
Yeah. The ability to have event-based actions is definitely a vast improvement over, like you said, just having time-based scheduling and hoping things finish, or relying on the fact that this happens at the same time every day, until 1 day it doesn't, and then you have to go in and manually figure out why not and what to do about it. It's 1 of the reasons that SaltStack is my preferred platform for infrastructure automation: it has that event-driven core, so it's much easier to be reactive rather than trying to predict the state of things at a particular point in time and take an action based on that supposed state.
[00:32:03] Unknown:
Yeah. I'm not super familiar with SaltStack. Can you tell me real quick what it does? I think you had a blog post about it, but I haven't read it yet.
[00:32:11] Unknown:
Sure. At 1 level, SaltStack is a configuration management tool in the same ilk as Ansible and Chef and Puppet. At another level, it's a framework for cloud automation. It has an event-driven nature: the entire core of the system is an event bus, so you can consume those events and create reactors that trigger on certain event matches. Cool. You can also use it for distributed command and control, where you send a command to a fleet of servers targeted by a particular set of attributes. It has the ability to interact with the cloud providers directly, so you can spin up and tear down infrastructure. It contains a lot of primitives for being able to do essentially whatever you want. But, as you mentioned before, for the higher-level capabilities, where you just have a high-level API of "do this thing" and it handles all the orchestration, they're still working on building up to that point. So as an individual, you can create those orchestration patterns and your own high-level APIs, but the tool itself is much more a low-level framework: here are all the bits and pieces, go ahead and put them together.
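The event-bus-plus-reactors pattern described here can be sketched in a few lines of plain Python. This is a toy illustration of the pattern, not SaltStack's actual API; the event tags, handler, and host names are invented for the example:

```python
# A tiny event bus: reactors register a tag pattern, and any fired
# event whose tag matches that pattern triggers their handler.
import fnmatch

class EventBus:
    def __init__(self):
        self.reactors = []           # list of (tag_pattern, handler)

    def react(self, pattern, handler):
        self.reactors.append((pattern, handler))

    def fire(self, tag, data):
        for pattern, handler in self.reactors:
            if fnmatch.fnmatch(tag, pattern):
                handler(tag, data)

bus = EventBus()
restarted = []
# Reactor: whenever any host reports a failed service, record a restart.
bus.react("service/failed/*",
          lambda tag, data: restarted.append(data["host"]))
bus.fire("service/failed/nginx", {"host": "web-01"})
bus.fire("deploy/finished", {"host": "web-02"})   # no reactor matches
```

The reactive shape is the point: handlers respond to what actually happened on the bus, rather than acting on a predicted state at a scheduled time.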
[00:33:19] Unknown:
Yeah. Well, it's gotta start there. I think it's a really exciting time to be a data engineer. There's so much stuff going on, and it's only going to grow from here. The thing we talk about a lot with our customers is that we really believe we're in the early stages of a data revolution. Think about what percentage of companies have modern machine learning algorithms running in their company; it's less than 1%. Most companies can barely get a BI chart that works and is correct. So I think there'll be a long run for this stuff we're working on over the next 20 years, as general and limited AIs start to come aboard. They're so data hungry; you've got to get data from all these places into the right format. So we're pretty excited about the space.
[00:34:24] Unknown:
Yeah. My particular organization is coming up on that same set of challenges. We actually just recently stood up our first business intelligence dashboard using Redash, so we're starting to populate that and trying to plan out a more cohesive reporting capability. 1 of the things on my plate is building up an overall approach to data management, so I'm definitely keeping an eye on the horizon for things that are coming along and watching the available tools. An interesting piece of data engineering is sourcing the data and figuring out, okay, what are all the information avenues that I have? What can I consume? What can I aggregate together? What are the little hidden pools of data within the organization? That can be invaluable, where you have a picture but it's missing that 1 piece, and somebody in a different department just happens to have an Excel file with the critical data that stitches everything together, so you need to actually go and interview people and understand where that data exists.
[00:35:33] Unknown:
Yeah. We were meeting with an insurance company the other day, and they have a person who pulls data from some raw sources, probably gets CSVs, and pulls it into Microsoft Access to do some additional manipulation to power his Excel report. And I'm just like, wow, that Access data could be useful to other people in the organization, but there's no way they're going to get to it or use it. That's where we think a professional data engineering layer inside the company can tear down those ad hoc processes and make every Excel spreadsheet a little bit simpler. But it's a long journey. That's the thing we talk to our customers about: like any new endeavor, there are going to be false starts, and you're going to run into roadblock after roadblock. But think about the value of having a single pool of fresh, clean data and how that can impact your company, and the only investment is hiring a few developers to go on that journey and make it better. It's a tough thing to do by yourself, for sure, so I hope you'll have some help in what you're working on there.
[00:36:50] Unknown:
Yeah. Absolutely. What you're saying about unifying the availability of data across departments reminds me a lot of the ethos of the DevOps movement: trying to break down silos and collaborate across all units in the business in order to act as a force multiplier, rather than having each of the different pieces working alone without broad, open communication channels. So I definitely see a lot of parallels between data engineering and the movement it's building up to, and the DevOps movement and what it has been trying to carry forward over the past few years. Yeah, I can definitely see the analogies as well. And so, as a platform where you're working with a lot of other people's data and helping them make sense of it, I'm sure you've seen a lot of curious use cases and scenarios. I'm wondering, what are some of the most interesting or unexpected uses of your platform that you're aware of?
[00:37:44] Unknown:
Well, I'd say there was 1 potential 1 that didn't happen yet, or maybe it will; we'll see. We did a project for an insurance company where we were consuming NOAA weather data, and interestingly enough, the data comes in as raster files, essentially bitmaps, and we had to turn that into actual geospatial data. That 1 was kind of interesting. I always think of data as files with numbers in them, not pictures, but obviously data can be pictures. It's actually a really convenient compression scheme for the weather data to be pictures, which I guess makes sense. I'd just been living in a world where that wasn't the case, so it was eye-opening to get exposed to the different ways people are using compression.
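Turning a raster into geospatial data, as described here, amounts to mapping each pixel's row and column to a coordinate and pairing it with the pixel value. A minimal sketch of that conversion; the grid, origin, and cell size below are invented for the example (real NOAA rasters carry this georeferencing metadata in their file headers):

```python
# Convert a raster grid into (lat, lon, value) records: each cell's
# row/column index maps to a coordinate via the grid's origin and
# cell size, and the cell's value becomes the measurement.
def raster_to_points(grid, origin_lat, origin_lon, cell_deg):
    points = []
    for row, cells in enumerate(grid):
        for col, value in enumerate(cells):
            points.append((origin_lat - row * cell_deg,   # rows run north to south
                           origin_lon + col * cell_deg,   # columns run west to east
                           value))
    return points

# Toy 2x2 "rainfall" raster anchored at 40N, 85W on a 0.5-degree grid.
pts = raster_to_points([[0, 3], [7, 1]], 40.0, -85.0, 0.5)
```

The compression win Ry mentions comes from the raster storing only the values in a known order, so the coordinates never have to be stored at all; they are reconstructed from position.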
Another really interesting 1 for me that didn't come to pass, but almost did: if you think about an ETL platform, like Informatica or SnapLogic or any of those tools, they do work and they create metadata. We actually had a customer, basically a data provenance sort of product, who wanted us to build ETL from the ETL tools into a central repository. It felt really meta that we would essentially be consuming data from some of our older-school competitors as the data source.
And I was like, I don't even know how this would look; it would look so weird on the website if I said, oh yeah, Informatica is a data source for Astronomer. I didn't even know if I wanted to do that project, because it would have been all brand-new sources. But hard-to-get data is basically what we thrive on, and that definitely would have been some hard-to-get data. If they have an API, we can get the data out of it, but it definitely would have been a weird 1 for us.
[00:39:49] Unknown:
And as we talk some more and mention the name of your company, I realize that I've forgotten to ask where it actually comes from. Oh, the name? Yeah. So,
[00:39:58] Unknown:
originally, it started as a tool in the Meteor ecosystem. Meteor is a JavaScript framework for building web applications. But the way I explain it these days is that astronomy was sort of the first science that moved us toward facts over dogma. We like to think of astronomy as studying something, doing some math, and then trying to make a prediction about what's going to happen in the future. So we actually think of astronomy as a very early form of data science, and we just think it's a cool, differentiating name that makes sense for the space.
[00:40:45] Unknown:
Yeah. And also, in my other podcast I've spoken to a few astronomers, and the amount of data they're dealing with on a daily basis is, excuse the pun, astronomical.
[00:40:55] Unknown:
Yeah. It really is. And the fun thing is, those people used to have to do all this without machines to do the math, so they had people whose job title was "computer" and all they did was math problems all night long. Luckily, we've got real computing machines now to do that. It's been fun; we've actually learned a lot more about the history of astronomy, and ironically, because of the name of our company, we've attracted amateur astronomers to the company. It's a kind of virtuous cycle where the more we talk about astronomy, the more the people we attract are into that sort of stuff. It's become a little hobby for a lot of the people in the company at this point. Are there any topics that we didn't cover that you would like to talk about before we start to close out the show?
No, not really. I think we covered a lot of the important areas. I'm curious to see where the tools take us. Like I said, I'm very excited for the potential of what comes next, and I just think that the space is
[00:42:10] Unknown:
very exciting, and we're looking forward to being part of it for a long time. Well, with that, I'll have you add your preferred contact information to the show notes so that anybody listening who wants to get in touch or follow what you're up to will be able to do that. Oh, yeah. Okay. And with that, I'd like to thank you very much for taking the time out of your day to tell me more about the work you're doing with Astronomer. It's definitely an interesting platform, and 1 that I plan to keep an eye on. Maybe at some point in the not-too-distant future, I'll actually be taking advantage of it myself. Yeah, that'd be cool.
Introduction and Host Welcome
Guest Introduction: Ry Walker, CEO of Astronomer
Ry Walker's Journey into Data Management
The Birth of Astronomer
Challenges in Data Management
Handling Regulatory Data
Open Source and Platform Differentiation
Data Pipelining Architecture
Technical Debt and Platform Evolution
Gaps in Data Engineering Tools
Future of Data Engineering
Interesting Use Cases and Scenarios
Origin of the Company Name: Astronomer
Closing Remarks