Summary
Every part of the business relies on data, yet only a small team has the context and expertise to build and maintain the workflows and data pipelines that transform, clean, and integrate it. In order for the true value of your data to be realized without burning out your engineers, you need a way for everyone to get access to the information they care about. To help make that a more tractable problem, Blake Burch co-founded Shipyard. In this episode he explains the utility of a low-code solution that lets non-engineers create their own self-serve pipelines, how the Shipyard platform is designed to make that possible, and how it allows engineers to create reusable tasks to satisfy the specific needs of the business. This is an interesting conversation about how to make data more accessible and more useful by improving the user experience of the tools that we create.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- When it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you’re flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn’t disrupt legacy systems. High-growth startups use Molecula’s feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt.
- Your host is Tobias Macey and today I’m interviewing Blake Burch about Shipyard, and his mission to create the easiest way for data teams to launch, monitor, and share resilient pipelines with less engineering
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what you are building at Shipyard and the story behind it?
- What are the main goals that you have for Shipyard?
- How does it compare to other data orchestration frameworks in the market?
- Who are the target users of Shipyard and how does that influence the features and design of the product?
- What are your thoughts on the role of data orchestration in the business?
- How is the Shipyard platform implemented?
- What was your process for identifying the core requirements of the platform?
- How have the design and goals of the system evolved since you first began working on it?
- Can you describe the workflow of building a data workflow with Shipyard?
- How do you manage the dependency chain across tasks in the execution graph? (e.g. task-based, data assets, etc.)
- How do you handle testing and data quality management in your workflows?
- What is the interface for creating custom task definitions?
- How do you address dependencies and sandboxing for custom code?
- What is your approach to developing templates?
- What are the operational challenges that you have had to address to manage scaling and multi-tenancy in your platform?
- What are the most interesting, innovative, or unexpected ways that you have seen Shipyard used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Shipyard?
- When is Shipyard the wrong choice?
- What do you have planned for the future of Shipyard?
Contact Info
- @BlakeBurch_ on Twitter
- Website
- blakeburch on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Shipyard
- Zapier
- Airtable
- BigQuery
- Snowflake
- Docker
- ECS == Elastic Container Service
- Great Expectations
- Monte Carlo
- Soda Data
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey, and today I'm interviewing Blake Burch about Shipyard and his mission to create the easiest way for data teams to launch, monitor, and share resilient pipelines with less engineering. So, Blake, can you start by introducing yourself?
[00:01:48] Unknown:
Hey, and thanks for having me on the podcast, Tobias. I'm Blake Burch, and I'm the co-founder of Shipyard. I'm primarily focused on product development and marketing. Previously, I was leading up the data teams for a digital advertising agency, where we were building end to end data workflows and automation for brands like Sephora, OpenTable, and Gap. Happy to be here. And do you remember how you first got involved in the area of data management? Yeah. So I had an interesting transition over time. I originally was not directly involved in the data field, and I was working as a marketing manager, actually building out and managing marketing campaigns. But a big passion of mine has always been automation. I was doing the same things again and again and looking at the data in the same ways, and I realized, how could I potentially take this data from the services that we were working with, grab the information, and analyze it in the same way every day? So I started doing that by learning SQL. And from that, I figured out, okay, well, I'm manually implementing things. I should probably have a way that I can directly interface with this service via, like, an API. And so I started teaching myself Python, and gradually all of that kind of built itself up as we were building out solutions for, like, bid management, budget management, ad creation, and whatnot for our marketing clients. I was able to build out the data team at the advertising agency, where I was really focused on trying to figure out how we could get in the right data initially, work with clients to use their proprietary datasets to better manage their marketing platforms, and ultimately try and figure out how we could drive the most value from the data we had on hand. So it's been a journey so far, and now I'm at the stage where, from some of the things that I learned previously, I found some opportunities that I wanted to take advantage of in the data ecosystem to help make data teams' lives easier.
[00:03:33] Unknown:
And so that brings us to what you're building at Shipyard. I'm wondering if you can give a bit more detail on the product itself and some of the story behind how it came about and what your motivations were for turning this into a business.
[00:03:45] Unknown:
Yeah. So Shipyard is a data operations platform that really helps teams be able to create solutions easily without having to worry about infrastructure, monitor them without worrying about getting all of that set up appropriately, and share the solutions that they're building not only with the team they're working with, but with the rest of the organization. What kind of brought me on the journey to building out Shipyard is that I realized that a lot of companies had this data on hand, but weren't doing anything with it besides just building out dashboards. They weren't able to easily action on it and automate that data, and it was partially because it's just really hard to get things to production. You might not have the right people with the right skills internally. You might not have data teams that have the level of interest in making sure the infrastructure is set up and maintained in a sustainable fashion. So we wanted to make it as easy as possible to get any solution that was being built up and running. I also noticed from working at an agency that templates were very difficult to make and maintain and usually had a high barrier to entry. We were doing a lot of monotonous code work, which is taking the same dataset for 30 different clients and loading it in the same way every single time with a few slight tweaks, and managing those types of templates. There really wasn't a good solution out there to be able to effectively manage that and also to get it into the hands of non technical users. And the final thing that I kinda noticed early on was that there are a lot of tools popping up across the data ecosystem, but there's no centralized spot to see the end to end data process and verify that everything from the initial extraction, the transformation, the loading, and sending out to vendors is running smoothly. And so a little over a year ago, I set off with a friend of mine who worked in engineering at the previous company as well, and we sat down and tried to figure out how we could build a platform that would really be able to help solve these specific problems that we had seen at the agency and help other teams be able to leverage their data extremely quickly and do it in a way that doesn't require as much technical and DevOps knowledge upfront.
So it's been a journey over the past year to get where we are today. That's kind of our origin story and what we're trying to develop.
[00:06:05] Unknown:
And so there are definitely a number of adjacent products to what you're building, and it seems like it sits in the part of the Venn diagram where a number of them overlap. I'm wondering if you can just talk through some of the evaluation of the market that you went through as you were determining what the sweet spot was for the Shipyard product itself, and the specific problems that you're trying to solve that weren't addressable by some of those adjacent products?
[00:06:29] Unknown:
A lot of the adjacent products that we saw were very heavily rooted in workflow as code. A lot of them were frameworks that were specific to a single language and had that barrier to entry where it was only coders that could really get involved and actively start building with them. One big thing for us was trying to make sure that the system we were building didn't require users to have any sort of proprietary configuration that had to be added to their code and mixed with the business logic. We saw that a lot of tools required that sort of addition, which meant that the code itself got convoluted and you were writing every script to work for the specific system that you were writing it for. So we do try and make sure that if you can write and run something locally, then you can get it up and running on Shipyard without any changes and without any specific configuration files. We also noticed from what a lot of these tools were doing that they oftentimes required you to set up and maintain your own infrastructure, and there were headaches involved with that, where people maybe didn't have the right sort of information to make sure that things were able to dynamically scale. If you had multiple jobs running in parallel, or if you had some jobs that ran for longer periods of time or were more memory intensive, they would sometimes take down the whole system. Or managing virtual environments across multiple different tasks and ensuring that package dependencies weren't causing conflicts. These were all things that you had to focus on if you were using those orchestration frameworks, and we're trying to abstract that and make it as easy as possible to not have to really worry about those infrastructure components. And there's a lot of other aspects as we were kind of evaluating things where, because we don't have the proprietary configuration, we're actually language agnostic. So you can have a Node script that connects to a Python script, connects to a Go script, and each of them can actually share data in between each other, so that you can build modular workflows without having to upload data to one external cloud location, download it back down, and add all that extra stuff to your code. So when we were looking at a lot of these orchestration tools, each of them was made because someone thought there was a way to do it better. We wanted to look at all of them and figure out, okay, how can we take the good bits and streamline this process to get things up and running in a workflow
[00:08:51] Unknown:
for data teams? As you were describing, the initial problem that you're trying to address as well, some of the other products that come to mind, particularly focused on, like, the advertising and marketing capabilities are things like Zapier, IFTTT, or some of these low code or no code platforms like Airtable. And I'm wondering what you saw as some of the lessons to be learned from those products that you wanted to incorporate into Shipyard to make it addressable by teams and some of the shortcomings that you saw to help you overcome the challenges that teams might be running up against as they're trying to deal with more data intensive workloads?
[00:09:27] Unknown:
So it's funny that you mentioned those, because in some instances we have been compared to, like, a Zapier for data. And oftentimes, those low code tools are very focused on if-this-then-that type rules. If one thing happens in a specific location, update a record in an existing data platform. But they oftentimes don't handle large scale datasets. I've used these platforms, and they're great for building out a lot of different types of solutions. But oftentimes there comes a point where you can't quite build what you were trying to do, and there's, like, a middle ground that you needed to accomplish that the tool doesn't handle, and that has to live somewhere. At that point, you've got a tool that's running some of these workflows and then a separate process that's acting as, like, an intermediary.
The way that we built Shipyard initially was that you are able to take any sort of custom code that you have written and get that automated on its own, but we're providing templates upfront that can get maybe 80 to 90% of the work done. So if you need to download data from BigQuery and you need to upload it to S3 or something like that, we can handle that immediately out of the gate without you having to write code. But maybe you need to do, like, a custom transformation layer in the middle that's specific to your business. Because we built the platform out with the idea of running code on your own and templatizing that code for reusability, you're able to actually host that sort of gap script within the same interface where you're building out the end data solution.
[00:11:04] Unknown:
Yeah. It's definitely always the biggest problem with any managed platform where, as you said, you get to 80 or 90% of the way to what you want, and then, you know, you can see everything in one UI, and then all of a sudden you now have to go to two different places because you had to write your own custom solution, or you have to go to five different places because you have to wire two things together across different systems, and sort of the cognitive complexity starts to explode, and your brain starts to leak out of your ears trying to keep track of it all. Yeah. That's definitely the case and something that we heard from a lot of folks when we were initially kind of scoping out the problem. So in terms of the actual main goals for Shipyard, I'm wondering if you can just talk through how you think about sort of the primary focus of it and who your target users are, both in terms of the consumers of the platform, but also the kind of maintainers or developer interface for building out these workflows?
[00:11:55] Unknown:
So we are primarily targeted towards data engineers, analytics engineers, and data scientists that already know how to write some code, but frankly are in kind of a position where they may be strapped for time and resources to get things done, and ultimately a lot is falling on them within the company to try and build out solutions as the scope of data is vastly increasing over the course of the past few years. And so when we think about reaching those individuals, there are really, like, three main goals that we're focused on. The first thing is that we feel like those individuals shouldn't have to ever think about infrastructure again. You see the rise of, like, cloud databases like BigQuery and Snowflake that make it really easy to run hundreds of queries at the same time and scale up effectively, where you're no longer having to worry about, am I overloading the database with too many requests? We think the same thing should be done for the actual data orchestration and processing side of things, where if you need to move data from one location to another, build out alerts, or deploy machine learning models, that should just automatically scale as you need it to, so that you can focus on just building out the solution and making sure that it solves the problem that you need it to. I've talked to a lot of people that are in those types of roles, and a lot of them may not have the exact DevOps skills required for the infrastructure, but a majority aren't really interested in managing it. They just have to because it's part of the job and because there's not a system out there right now that helps them kind of abstract that. A second thing for me that's a big goal for the company is trying to make sure that you can drive action off of your data and really drive value with that data at record speeds. For us, that means making sure that our processing power is extremely fast. But it also means that we have to make the UI something that's seamless, easy to understand, easy to set up, and something where, again, you can build 80 to 90% of that solution in a matter of minutes or days rather than taking weeks to months to test and iterate and get something up and running. And the final thing that's a big goal for us at Shipyard is making sure that when we're showing these end to end data operations life cycles and showing every process that is actively running, we make it really easy to know when, where, and how something actually broke. Because inevitably things will go wrong, but a lot of the existing systems just dump all that data directly into a single log file that you have to go through and parse. You might not know what specific process broke that caused all of these downstream issues. And we think that if you can have all of this within a more visual interface, it'll be easier for not only the data people, but business leaders to be able to tell, here's all the steps that our data is going through, and to have the level of confidence to say that, oh, it broke in this spot, which affects all of these downstream processes.
I can make sure that someone on the data engineering team knows what to look into, ensure that this gets resolved, and we can send out emails to customers or clients and let them know what the issues are. So those three areas are things that we're really trying to address with Shipyard.
[00:15:14] Unknown:
Another interesting element that you touched on there is the exposure of Shipyard as a platform that the business leadership is going to interact with versus being a tool or a dashboard that, you know, the data engineering or the data analyst team is responsible for, who then has to translate to, you know, the other levels of the business. And I'm wondering if you can just talk through your thoughts on the role of data orchestration within an organization and how the formulation of the tool influences the ways that it's being used and the opportunities that it can provide to the business.
[00:15:49] Unknown:
I think a lot of people consider the data orchestration within a business to just be around the data team's initiatives. And I really find that a lot of data is ultimately used to drive some sort of business decisions, and that data orchestration should really be the focus of taking that data and helping these different teams, whether it be finance or HR or marketing, make decisions with quality data that has been automated throughout the entire life cycle. It's not just about trying to get data clean and in a good state where other teams can use it. It's about taking it to the next step and making sure that when you have created larger datasets that the entire organization can access and use, you're able to track how it's being accessed and used, what value it's providing, which teams are actually using it, and to make sure that you know, when you're making updates to upstream processes that could affect the data, these are all the downstream processes and teams that are gonna be affected as part of that. I think oftentimes teams might look at things in a very narrow scope, but I think there's a big opportunity to think about how the data can be used beyond just getting it into a good clean spot for the organization. How can you actually take those decisions and automate them and know how everything is interconnected?
[00:17:10] Unknown:
Digging a bit more into the Shipyard platform itself, can you talk through the architectural aspects of how you've designed it, how it's implemented, and how your particular focus on who the end users are has influenced the ways that you think about building the system?
[00:17:26] Unknown:
From a high level, we are fully hosted on AWS infrastructure. Whenever you are running code directly in the platform, whether that's through code that you directly provide or through the blueprint templates that we provide, everything is being run in Docker containers in isolation. Whenever those Docker containers run, they're actively installing any of the package dependencies to make sure that there are no conflicts between different tasks that are running. All of the tasks that are actively running are being orchestrated across different ECS clusters, and a unique aspect of something that we do is that even though we might have the tasks running across multiple different servers when you are running a workflow,
all tasks that are involved in that workflow are actually running on a shared file system, so that you can immediately access data that's being generated or downloaded and start working on it. And as soon as the workflow itself is completed, we're able to wipe that completely from our system. So that's how we're actually structured in terms of running things and scaling things on our side. From an actual feature development standpoint, we are shifting our focus to really try and make things as simple and as seamless to use as possible. So some good examples of that and how development has changed over time.
Initially, when we built out templates or blueprints, we provided teams with the tools to take any sort of proprietary solution that they might have and turn it into something that could be reused throughout their organization. And we found that no one was using those blueprints initially, because people weren't exactly sure where to start, or they had very common overlapping ideas that they were trying to accomplish, like being able to download data from BigQuery or upload data to Google Cloud Storage. And so we decided we would build out a library of blueprints to integrate with all the common data tools, databases, and data storage that was out there. Once we had that built out, we allowed teams to be able to copy those blueprints to their organization, because we figured, okay, they're going to want to make changes and customize a few things, because we're not gonna be able to cover every use case. But that resulted in areas where teams were running with separate copies of sort of the same code. And so whenever we had to update things, inevitably it made it really difficult to roll out those updates across the board. So our next step was figuring out how we could have a unified system for these templates so that everyone is always using the same version and we can make updates, add inputs, and change the underlying code across the board. That's just one evolution of how we've kind of had to change the scope of what we built. We wanted to have templates initially, but we kept making it easier and easier to use, to where we're now at a point where anyone can hop into the platform and they have 50 plus no code templates available and ready to use. And we've seen similar things across building out the workflows themselves.
Initially, we didn't have a visual interface for building everything out. Over time, we found that that was a little bit more difficult to deal with, because existing orchestration tools oftentimes only let you update the workflows directly within code. So we started actually building out a drag and drop visual UI where you can add additional tasks to the workflow easily. You can change the conditions that it's looking for. And we're currently trying to work on rolling out an all-in-one interface where you can build out the individual components of the tasks and connect everything together as part of a workflow all within
[00:21:03] Unknown:
one system. Digging into the templating piece itself, I'm wondering if you can talk through how those templates actually manifest in terms of the user experience of them, being able to parameterize them or add any sort of wrapping tweaks that they want, but also in terms of how it's actually managed within the context of your system and how you manage the code and the deployment and rollout of those template versions?
[00:21:27] Unknown:
When it comes to templates, all the templates within the platform are code based. So you're able to essentially take any sort of command line interface that you have made and translate it into a UI that a business user, or someone else on the data team themselves, is able to work with. We just kind of provide that transformative layer to be able to link everything up. So within the platform itself, you're able to actually provide inputs that you want the user to fill out, and those inputs can be something like alphanumeric strings. It could be passwords that they need to provide. It could be a drop-down of select options. Each of those different inputs that you would have a user fill out gets error handled in the UI, so you can make sure that the right data is being provided without having to add all of that to your code itself. And whenever they provide an input, that can be passed directly to your script via an argument or via an environment variable, and then used directly in your code for however you need. So that's the initial process of actually setting up and managing a template.
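As a rough sketch of that pattern, a template-style script can accept each input as a command line flag with an environment variable fallback, so the same code runs unchanged on a laptop or behind a form-driven UI. The flag and variable names below are illustrative only, not Shipyard's actual interface.

```python
# Hypothetical template script; flag and environment variable names are illustrative.
import argparse
import os

parser = argparse.ArgumentParser(description="Export a table to a CSV file.")
# Each argument corresponds to an input that a template UI could render as a form
# field (text box, password field, drop-down) and hand to the script at runtime.
parser.add_argument("--source-table", default=os.environ.get("SOURCE_TABLE"))
parser.add_argument("--output-file", default=os.environ.get("OUTPUT_FILE", "export.csv"))
args = parser.parse_args()

if not args.source_table:
    raise SystemExit("source table is required")

print(f"Exporting {args.source_table} to {args.output_file}")
```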
And we built out the tool to be able to look at how those templates are being used across the organization. So you can see every single place that someone is downloading data from BigQuery and verify that all of them are running the right way. And because everything is centralized around that template, if you need to make an update to that template, whether it's adding an input or updating the code, you're able to make that change in one spot and have it proliferate to every single task that is built with that template, which makes it really easy to have a strong, like, management philosophy and a strong ability to troubleshoot where things might be going wrong down the road. And besides the fact that you can update those templates in bulk, you can also connect any code in the platform directly to GitHub. So you can have it pull in the most recent code from a specific repository and a specific branch or tag. So Shipyard starts becoming part of your, like, CI/CD flow, where any new updates you make to GitHub immediately show up in Shipyard, which immediately shows up across all the tasks that you've built. So it's a really easy way to make sure that you have some sort of visibility into how things are being reused throughout the organization and to ensure that nothing's going wrong. And when we built all of our no code templates that everyone starts out with, we were actually building things in the same exact interface that everyone else can, and we open source all the code. So for us, it's important that you have trust that, whatever you're using on Shipyard, you can see what's under the hood and how it's being operated. But you also know that we're building things out the same exact way that you would, and our big focus is to make sure that templates are doing one specific thing very well. So it's usually just something like downloading, uploading, or running a query against a specific service and nothing else, because the workflows themselves allow you to mix and match and pass data between each of those different tasks that are each doing one very specific thing, which makes it a lot easier to troubleshoot
[00:24:47] Unknown:
in case something's going wrong. As far as the actual data interchange piece, I'm wondering if you can talk through how that functions on the platform layer and some of the considerations that people need to be aware of as they're dealing with passing data across those different task boundaries. You know, is there any particular serialization or interchange format that needs to be thought about, and how do you manage things as you start to scale the volume of data that you're passing between those different tasks?
[00:25:16] Unknown:
Yeah. So there are not too many considerations that actually have come into play, because we're handling a lot of the scaling of infrastructure on our side. So in terms of working with large data files, that's something that we see a lot of our customers doing right now without any sort of issues. But the biggest thing to think about is that, if you were developing scripts locally and then trying to run them one after another, and your first script was generating some sort of file on your local system, when you went to run the second script, it would be able to see that file that had been generated. We operate in exactly that same sort of mindset, but the shared file storage is just temporary. So while the workflow is running, all of the tasks that are part of that workflow have access to the data. As soon as it's done, or if it errors out somewhere, the data is immediately wiped. And we also make it so that whenever you're running a workflow, only tasks within that workflow have access to that data. So from a security perspective, you could run the same workflow twice in parallel and each of those workflows would not be able to see or access the other's data. And that makes sure that everything is truly running in isolation. You're not gonna have any conflicts with names of files or anything else like that. But you do have to watch out, just like you would on a local file system, to make sure that you're not having two scripts writing to the same file at the same time causing issues, or maybe overwriting one file or another. You do have to kind of be aware of the file structure that you're creating throughout the workflow.
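A minimal sketch of that hand-off, assuming nothing beyond what is described above: an upstream task writes a file into the workflow's temporary working directory, and a downstream task reads it exactly as it would on a local machine. In practice each half would be its own task; here they simply run back to back in one file.

```python
# Two workflow steps sharing the temporary working directory (combined for illustration).
import csv

# --- Step 1: an upstream task (e.g. a templated download) writes a file ---
with open("orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "amount"])
    writer.writerows([["1001", "19.99"], ["1002", "42.50"]])

# --- Step 2: a downstream task sees the same file, just as it would locally ---
with open("orders.csv", newline="") as f:
    total = sum(float(row["amount"]) for row in csv.DictReader(f))

print(f"Total order value: {total:.2f}")
```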
[00:26:55] Unknown:
In terms of the actual operational challenges that you're working through, because of the fact that you're offering this platform that is intended to scale and offer multi tenancy to your customers and something that you're offering as a service. I'm curious what you have run up against as far as some of the complexities or difficulties of building out this business and being able to actually realize these scaling promises that you're making for your customers and manage the SLAs, especially as you're dealing with, you know, running on AWS and their SLAs and being able to be tolerant in that context?
[00:27:29] Unknown:
It's definitely an interesting challenge, just because we're in a situation where we don't know what the performance of something is going to be until it's run once. But at the point that it has run multiple times, we have a lot more data on it that allows us to be able to orchestrate it and place it into the right place according to how much memory, CPU, data storage, and all that it might be using. But initially, we do have limits in place to try and make sure that there are no resource hogs, where you could have a rogue script that takes down the whole system like you would have locally. Those limitations that we have, though, are not super restrictive. I will say that one of my big frustrations initially was having to work with, like, function-as-a-service systems that are available via cloud services, where you can only run 15-minute scripts and you might not be able to get more than a gig of memory available. We know that when you're working with large quantities of data, sometimes you're going to need to have those longer run times. You're gonna need to have more resources available. And so we are flexible about that. But the way that the platform has been structured, we're primarily scaling horizontally. So we have a lot of processing power available for teams whenever they are initially running jobs on the platform. And the longer things run, the better we're able to orchestrate it. But if there's an influx of new jobs, there might just be a slight delay in terms of how long something takes before it gets scheduled while we're spinning up new servers, immediately orchestrating it across those, and spinning down the servers on our side. One of the unique challenges was that idea of having that shared local storage across a workflow. That's really difficult to accomplish when you might have tasks that are part of a workflow running across multiple different servers, and it's something that we had to overcome. Now when it comes to, like, multi-tenancy, it's not only making sure that we're able to scale effectively for clients, but that we're able to have really secure environments for our clients. We have a very heavily tested API with security checks in place, where we have your typical create, read, update, and delete permissions for almost every single level. So you can guarantee with the types of explicit permissions that are in place that nobody will be able to access things that they shouldn't be able to access.
[00:29:45] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
Another interesting aspect worth digging into is you were discussing being able to optimize the resource allocation for workloads after they've been run multiple times. And I'm curious what your story is as far as being able to manage the observability and learning aspects of your system for being able to intelligently schedule those workloads to be able to optimize your resource usage and reduce costs and how that manifests in terms of the pricing that you're able to pass along to your users?
[00:30:49] Unknown:
It's definitely a work in progress and something that we're continually trying to evolve. For right now, we're mostly making sure that we have the level of information available to ourselves and that we're able to ensure that the back end systems are running effectively and not running into any sort of resource limitations. We don't actually limit the server size or anything else that jobs might be running on right now, because we know a job that could be relatively small right now could eventually be big. That's just something that's always going to be the case, but we do have to try and make sure of it on our end. We're trying to sell the fact that if you're running with any sort of orchestration system that's open source and you're running it on your own infrastructure, you're gonna be running that infrastructure 24/7 and you're probably only gonna be using it about 20% of the time, and you're gonna be running into some of those limitations.
For us, it's important that it's easy to have a kind of pay-as-you-go model and pay for what you use, to where you're more focused on the job running, and running effectively, and we're handling all of the dynamic scaling and the max concurrency without you really ever having to fret about something going down. And, hopefully, that works out favorably for both of us. We're able to help you not think about any of those intricacies, and you're not having to run and manage resources
[00:32:10] Unknown:
all the time. Another interesting aspect of the platform that you're building is that you're giving users the freedom to be able to write their own scripts and their own processes. There are a couple of aspects to that, one being dependency management for their scripts. So if you're writing a Python job and you need to be able to pull in pandas and NumPy or what have you, how do you manage the installation and versioning of those requirements? And then also, because you're running arbitrary code, how are you approaching the sandboxing aspect of it to prevent people trying to break out of the environment that they're able to execute in, and just the overall security implications of running, you know, untested arbitrary code that your users are providing?
[00:32:53] Unknown:
So when users run code directly, the container contains the language that you've selected, to make sure that that's pre-installed on a version of Linux. But at runtime, we're actually installing any dependencies that you provide up front. So there are two ways to provide those package dependencies. You can either provide them directly within the UI, or if your code has a requirements.txt, package.json, or any of those files, we'll automatically find it and install those packages at runtime. So that does actually mean that when your script is running, there is a little bit of additional time involved with making sure that things are set up correctly. But it does make sure that the package versions you're using are entirely accurate every single time. You're not having to worry about which virtual environment this task should specifically be run in. You're not having to worry about making sure that virtual environments mirror each other across all the different servers that you might be running your tasks on. Frankly, you're not having to write out Dockerfiles for every single task that you want to run. We're kind of building that for you. We found that a lot of people working with data, while they might be familiar with Docker, were very rarely actively engaged in writing Dockerfiles and actually hosting and running those themselves. And the way that we've built it, because we're installing at runtime, also means that we should be able to support in the future running different versions of languages as well as different versions of packages simultaneously.
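To illustrate what that means for the author (a hedged sketch, not Shipyard documentation): the only artifacts to supply are the script and a standard dependency manifest, and per the description above the platform finds the manifest and installs it before the run, so there is no Dockerfile or virtual environment to manage. The package pin below is an arbitrary example.

```python
# Example user script. Its accompanying requirements.txt might contain, for instance:
#   pandas==1.3.5
# Per the description above, the platform discovers that file and installs the
# packages at runtime; the author never writes a Dockerfile or manages a virtualenv.
import pandas as pd

df = pd.DataFrame({"order_id": [1001, 1002, 1003], "amount": [19.99, 42.50, 7.25]})
df.to_csv("orders_summary.csv", index=False)
print(f"Wrote {len(df)} rows using pandas {pd.__version__}")
```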
[00:34:31] Unknown:
In terms of the use cases that you've seen people applying Shipyard for and the workflows that they've built, I'm wondering if you can just talk through some of the most interesting or innovative or unexpected ways that it's been used.
[00:34:44] Unknown:
There are two that come to mind. The first one actually resulted in us kind of building out a new feature. There was a team that, whenever they had to process their data, found that it was much faster to run multiple processes in parallel and split out the work across all of them rather than having one large process that was actually transforming the data. And so they were trying to find a way that they could kind of dynamically end up splitting out this job, to where if a million rows came in, they wouldn't have to hard code, like, hey, this particular script needs to do the first 100,000, the next one needs to do the next 100,000, and so on and so forth. They wanted it to kind of be dynamic. And so we actually worked with them to build out a system where the tasks that you run are contextually aware of how they're actively running within the Shipyard platform. So you can know, hey, there are 10 other vessels with the same name running at the same time and I'm number two in line, and that can actually be passed to the code itself. Depending on how much data comes through, it's able to automatically split itself across all those tasks, run them in parallel, and then spit out the data for downstream tasks in the workflow. So that was a really unique case for us to help someone build out and make sure that they were able to have that sort of parallel processing that happened dynamically based on whatever data came in. Another really big workflow that we've seen that was kind of interesting was the idea of a data engineering team that was trying to reduce the amount of tickets that they were getting internally for, hey, my data didn't look right, can you help me? Initially, when we kind of built out the blueprints, we expected them to be pretty heavily used within the data teams themselves but not externally.
We saw this team actually build out a system where their data loading processes were actually something that they gave internal team members access to where they could add an input and say, like, hey. I need you to load data for this specific client. I need it to be between these date ranges, and I need it to be for this specific data source, and they could click run now on their own. So instead of having to address those constant streams of support tickets, they were actually able to help serve themselves with reloading that data. And I definitely felt like that was an interesting use case and kind of opened up a lot of what we're trying to shift towards to figure out how you can potentially get some of these solutions in the hands of not only data team members, but other business teams, throughout the organization.
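As a rough illustration of the dynamic fan-out pattern from the first use case above: if each running copy of a task can learn its position and the total number of copies (the TASK_INDEX and TASK_COUNT variables below are hypothetical names, standing in for whatever context the platform exposes), it can claim its own slice of the rows without any hard-coded ranges.

```python
# Hypothetical fan-out worker; TASK_INDEX and TASK_COUNT are illustrative names only.
import csv
import os

task_index = int(os.environ.get("TASK_INDEX", "0"))  # 0-based position of this copy
task_count = int(os.environ.get("TASK_COUNT", "1"))  # total copies running in parallel

with open("orders.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Take every Nth row so the split adapts to however much data actually arrived.
my_rows = rows[task_index::task_count]
print(f"Copy {task_index + 1} of {task_count} handling {len(my_rows)} of {len(rows)} rows")
```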
[00:37:20] Unknown:
And in your own experience of building out the Shipyard platform and helping to understand and drive the product direction forward and work with your customers, I'm wondering what are the most interesting or challenging lessons that you've learned in the process.
[00:37:33] Unknown:
One of the biggest ones for me is that data orchestration is not something that a lot of organizations are actively seeking out, or even know that they need to be seeking out. A lot of times, they're very focused on finding a specific solution for integrating or working with two different vendors that they might be using. And those two different vendors are different for every organization. Because data orchestration itself is not something that's commonly sought after, you find a lot of people that are still fully dependent on running individually scheduled cron jobs for all of these internal processes and scripts that people have written. You have some teams that are still relying on someone to have an open laptop that's running a script that they can't touch because it might run out of resources. Because it's just not a common category of tool like a CRM is or something like that, a business doesn't quite know that they need a data orchestration tool. But once they find a tool that's able to help them create that solution, they're able to kinda find more and more use cases for it. And I think that's been a really big challenge from a marketing perspective, trying to help teams that aren't necessarily aware of that sort of data orchestration tooling figure out how else they can use it and what other problems it can help them solve internally beyond just the one integration of two vendors that they were trying to build out. I think a secondary thing for me was not knowing exactly how deeply rooted open source technology was within the data ecosystem initially. I'm very much someone that looks at things in terms of what's the cost of my time to build it, and if there's another service that's less than that time, I'll just go and buy it. But there are a lot of folks in the data community that are really passionate about implementing and using open source technology, which I think is a fantastic thing, but they don't always think about the time investment and the money investment of managing and operating the infrastructure itself. And so sometimes it's more of an educational piece that we have to go through to help folks understand that if you go with an open source technology, these are the uphill battles that you might face when it comes to infrastructure support and all those good things that you wouldn't have to face with a more proprietary solution. So I'd say between those two things, those have just been kind of interesting revelations that we've seen along the way. They ultimately just result in needing to have better education in the space, because being in the data space, you kind of take some of the pieces of knowledge that you might have for granted, and you don't always realize that not everyone's at the same stage and not everyone is thinking about things in the same way, and that we're really in the early stages of figuring out how exactly you build these end to end data pipelines. And we have to make sure that everyone is brought up to speed with the right information there. Yeah. It's definitely always amazing to see how far people will go with incremental changes to
[00:40:25] Unknown:
address the immediate pain without realizing how much sort of technical debt they're accruing or how much additional pain they're about to run into because they've got this house of cards that's been piling up behind them, just because they don't know that there is another option. And so you hand it to them, and they say, oh my god, this is amazing. How did I ever live without this? I'm going to go and use this for everything now. Yeah. In terms of people who have realized that data orchestration is an option and that they do have the opportunity for applying it to their business or within their workflows of processing data and analyzing data, what are the cases where Shipyard is the wrong choice and they might be better suited using one of these code-oriented open source frameworks or, you know, a different managed platform that's perhaps more sort of broad based? What are the existing shortcomings of Shipyard that might lead someone to choose a different solution for their data orchestration needs?
[00:41:15] Unknown:
I would say there are three areas where Shipyard would be the wrong choice. The first one is if you have any sort of requirements for your business where everything needs to be maintained and run in your own infrastructure from a security standpoint. Because we are a managed platform, we're not able to work on someone else's existing infrastructure. And frankly, we wouldn't be able to have the level of scalability and the don't-think-about-infrastructure component if we were able to run on your own infrastructure. Another reason why Shipyard would be the wrong choice is if you actively want a workflow-as-code framework. I know that there are some individuals that want to make sure that everything they have is written out in code, and that's something that we're not planning on ever really providing. We do really want to focus on the aspect of not having to have that proprietary logic embedded into your code. The idea is that you could take a script, run it locally, run it on Shipyard, run it on any other system.
And so if you're looking for a framework, we're not able to help solve that right now. And finally, if you don't have someone that actively has working knowledge of how to write scripts to create solutions, then Shipyard isn't going to be the right fit right now. We're trying to figure out how to get to a spot where, if you're armed with SQL, you can create any sort of solution that you might want to. But we're currently not at that state. And like I mentioned earlier, there's that 10% gap of things that you may not be able to actually build unless you have some sort of scripting knowledge, and that's really gonna be necessary for taking full advantage of what Shipyard has to offer.
[00:42:55] Unknown:
One of the other things that we didn't dig into too much, but that you alluded to a little bit, was the kind of versioning and CI/CD aspect of implementing your own custom scripts or blueprints, and then also what your story is for being able to integrate things like data quality checks, where somebody might want to use something like Great Expectations or tools and platforms such as Monte Carlo or Soda Data, and what the opportunities are there.
[00:43:20] Unknown:
So for us, we integrate really well with tools like Great Expectations. We actually wrote a blog post about this at one point, about how the ideal state for a lot of data orchestration is to have a test after every single step and ensure that data is not changing and that it's meeting your expectations throughout the entire process. But for a lot of teams, that's not necessarily sustainable. The teams might not have enough manpower or enough time to be able to have all those tests in place. But within the Shipyard platform, you are able to add in those additional testing tasks and use the output of those tests for actually creating workflow logic. So you could have it to where, if there's any sort of failed test, the step errors out, and then none of the rest of the processes end up running, so that you're effectively able to address the root cause of why your test wasn't able to pass.
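A minimal sketch of that gate, using plain pandas as a stand-in for a dedicated tool like Great Expectations: the check exits with a non-zero status when the data fails, which is the kind of signal an orchestrator can use to stop the downstream steps from running. The file name and rules below are assumptions for illustration.

```python
# Hypothetical data quality gate: a non-zero exit halts the downstream tasks.
import sys

import pandas as pd

df = pd.read_csv("orders.csv")

failures = []
if df["order_id"].isnull().any():
    failures.append("order_id contains nulls")
if (df["amount"] <= 0).any():
    failures.append("amount contains non-positive values")

if failures:
    print("Data quality checks failed: " + "; ".join(failures))
    sys.exit(1)  # failing exit code signals the workflow to stop here

print("Data quality checks passed")
```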
In terms of integrating with external data testing platforms, that's something that you can still do right now, but you wouldn't be able to actually run those tests directly within the platform. But I think the way that we're running testing and data quality right now mirrors the experience that a lot of data engineers are finding, where the testing is an extra step. And I think there's a big opportunity in the market, and for us, to figure out how the testing itself can be integrated as part of the task run, to where whenever you run a step, it is running the testing at the same time. And that resulting output can actually be used for determining if a workflow should continue on or not. And I think there's also a big opportunity when it comes to testing to not only test the data itself and make sure that it looks right, but to look at the metadata of your data and of your execution process and verify, are the run times looking consistent?
Is it always starting at the right time? Is it using generally the same amount of memory or storage space? And to look at those sorts of details to ultimately try and determine if there could potentially be something wrong in your pipeline. I think when it comes to testing right now, it's an extra step. And the goal is to try and get it into something that's a little bit more integrated and something that can be a little bit more context aware and provide some of those proactive
[00:45:40] Unknown:
insights that something might be wrong with your data or with your process as a whole. As you continue to build out the platform and the business of Shipyard, what are some of the things that you have planned for the near to medium term or problems that you're particularly excited to try and tackle?
[00:45:55] Unknown:
So one of them is definitely the aspect of having that integrated testing and having smarter logs and alerting for anything that could change within the execution environment. We think, as we're kind of building out a system that is integrating all of these different data tools, that there's a large opportunity to verify and monitor how the execution of all those tools is running and make sure that that's consistent. We're also really excited about scoping out data snapshots for any of the workflows that you run. Currently, we do delete data as soon as something has finished running, but we think there's a big opportunity for being able to kinda keep it in a holding pattern if something goes wrong so you don't have to start things back from the beginning. If something does go wrong, you're able to fix the issue and have it start over from midway within a workflow. And then other things for us are just increasing the simplicity and speed of getting things up and running. So from a developer experience side of things, that means having some sort of command line interface or an API where teams can actually build out these workflows and tasks in bulk rather than having to go into the UI themselves.
And within the UI, we're trying to make sure that we have an all-in-one interface where you can build out the tasks and the workflow all in one spot, rather than having to kind of build out the framework in your head first, build out the tasks, then connect the tasks together as part of a workflow. So those are all areas that we're excited about implementing and scoping further over the course of the next year. Are there any other aspects of the work that you're doing at Shipyard or the ways that it's being used or just the overall space of data orchestration and its role within the business that we didn't discuss yet that you'd like to cover before we close out the show? I think, for me, one thing that I'm interested to see over the course of the next five years is what happens with all of these different types of data tools that are popping up to ultimately impact a specific sliver of the data operations life cycle. I think there's a big opportunity for some sort of tool to ultimately integrate all these tools together and provide that level of visibility of the end to end workflow. And that's something that I'm super excited about, and that's ultimately why I'm here building Shipyard, trying to make it easy for someone to deploy those solutions and connect things together without worrying about infrastructure.
But I think there's gonna be a real hurt in the future as more and more of these tools come out, where we find a lot of fragmentation and segmentation that ultimately needs to be kind of mended back together in order to make sure that the process that you have that's touching your data is running smoothly.
[00:48:36] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:51] Unknown:
Beyond the integration of multiple tools, I think it's being able to test data in an effective way within your data environment. A lot of times, teams are building off of trust rather than off of knowledge that things look exactly like they should look. But it is a very arduous task to test every single piece of data and every single step as part of the process. I think the more that we can really innovate in the space of trying to look at all of the metadata available to us, the easier it will be to potentially provide some sort of anomaly detection or proactive alerting for the potential of something being wrong in the data, rather than requiring teams to build out all those tests upfront. There's nothing worse than having bad data, because then you're ultimately trying to build up trust within the organization again. And the more that we can have better tooling that helps make sure those flaws aren't making their way to production and into the hands of other people in the organization, the better off data teams are going to be and the more value they're actually gonna be able to provide for their organization.
[00:50:05] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Shipyard and your background and context on how the overall data orchestration problem space can be made more beneficial and more accessible to the business. So I appreciate all the time and energy you've put into building Shipyard and helping to drive the overall ecosystem forward. Thank you again for your time and effort on that, and I hope you enjoy the rest of your day. Thanks. You too. Appreciate it. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Blake Burch: Introduction and Background
Building Shipyard: Origin and Motivation
Evaluating the Market and Differentiation
Target Users and Main Goals of Shipyard
Architectural Aspects of Shipyard
Templating and User Experience
Data Interchange and Operational Challenges
Use Cases and Customer Stories
Lessons Learned and Market Education
Future Plans and Exciting Challenges
Closing Thoughts and Contact Information