Summary
Along with the globalization of our societies comes the need to analyze the geospatial and geotemporal data generated by growth in commerce, communications, and other activities. To make geospatial analytics more maintainable and scalable, a growing number of database engines provide extensions to their SQL syntax that support manipulation of spatial data. In this episode Matthew Forrest shares his experiences of working in the domain of geospatial analytics and applying SQL dialects to his analysis.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription
- StreamSets DataOps Platform is the world's first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that's built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift: those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines, with the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets' Professional Tier receive 2 months free after their first month.
- Your host is Tobias Macey and today I’m interviewing Matthew Forrest about doing spatial analysis in SQL
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what spatial SQL is and some of the use cases that it is relevant for?
- compatibility with/comparison to syntax from PostGIS
- What is involved in implementation of spatial logic in database engines
- mapping geospatial concepts into declarative syntax
- foundational data types
- data modeling
- workflow for analyzing spatial data sets outside of database engines
- translating from e.g. geopandas to SQL
- level of support in database engines for spatial data types
- What are the most interesting, innovative, or unexpected ways that you have seen spatial SQL used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with spatial SQL?
- When is SQL the wrong choice for spatial analysis?
- What do you have planned for the future of spatial analytics support in SQL for the Carto platform?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Carto
- Spatial SQL Blog Post
- Spatial Analysis
- PostGIS
- QGIS
- KML
- Shapefile
- GeoJSON
- Paul Ramsey’s Blog
- Norwegian SOSI
- GDAL
- Google Cloud Dataflow
- GeoBEAM
- Carto Data Observatory
- WGS84 Projection
- EPSG Code
- PySAL
- GeoMesa
- Uber H3 Spatial Indexing
- pgRouting
- SpatiaLite
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Matt Forrest about doing spatial analysis in SQL. So, Matt, can you start by introducing yourself?
[00:02:05] Unknown:
Yeah, hey. My name is Matt Forrest. I head up the solutions engineering team at Carto. We're a geospatial platform that focuses on spatial analysis, and a lot of our work overlaps with spatial SQL, so that's a big part of our day to day, which makes it a great topic for us to chat about. And do you remember how you first got involved in the area of data management? Yeah. I think anyone in geospatial has to work with large or complex datasets. Geospatial data comes in lots of different formats; there are hundreds of individual file types that you have to work with. So by nature, I think anyone who works in GIS or geospatial has some familiarity with data management. That involves a lot of tools from a data engineering perspective. But what I found is that the most efficient way is to bring all those data sources, instead of managing them in different files or locations, into a database. And the best databases for geospatial happen to be the ones using spatial SQL. When I say spatial SQL, that can mean many different things. It could mean a data warehouse, something like BigQuery or Snowflake, and commonly it also means PostGIS, which is the geospatial extension of PostgreSQL.
So that's how I got started working with it. I think a lot of people in GIS don't necessarily start with SQL first. They're working with desktop tools or notebooks that use libraries like GeoPandas, but still working with the core files. And then eventually, as things get complex or you work across an organization or team, there are lots more management layers that you have to put on top of it, and spatial SQL provides a really great way to do that. From a spatial perspective, you get all the benefits of everything that happens in the SQL world, plus some really efficient processes to do the analysis you were already doing on a local or desktop level.
[00:04:02] Unknown:
So one of my most involved forays into doing work with spatial data was a combination of some Python scripting, QGIS, and generating some KML files to produce a PDF of a heat map of network links between different IoT devices on street hardware in San Francisco. And that was a very convoluted and painful process, so I'm definitely sorry that I missed out on the revolution of tools that have become available over the past decade. Well, that sounds very familiar, and
[00:04:37] Unknown:
welcome to our world.
[00:04:39] Unknown:
Yeah. And so I guess before we get too much into the spatial SQL side of things, one of the things you said was that you're working at Carto, where you do a lot of work with spatial analysis. And I'm wondering if you could talk through some of the types of use cases that it's applied to and some of the kinds of analysis that you might be doing on these spatial datasets.
[00:04:58] Unknown:
A lot of geospatial work tends to go back to the interaction of spatial data. You're looking at a few different components, but it's the spatial relationships between different features: which things might touch or border another feature, what's the influence of those features, what's the influence of things within a certain distance. So you start to see and layer these spatial relationships onto each other: how much does one thing overlap with another, and so on. Now, how you use and study that is by applying different statistical methods. Of course, there's also a time series component: how things change over time within those spatial relationships. Do things move or change or progress as you look at them from a spatiotemporal perspective? That's pure spatial data science: using spatial relationships along with statistical methods to understand phenomena in the real world. On top of that, we also use spatial features in machine learning workflows. What the general data science community calls feature engineering, we call spatial feature engineering. It's taking those same methods to say, okay, if you're studying housing prices, what's the median income of the census tract that this house falls within? Or better yet, what's the median income of that census tract plus the surrounding census tracts? That's a very simple example, but you can extrapolate it out over lots and lots of features, and then you start to whittle those down to the things that actually have influence over a housing price. That spatial feature engineering is one area that can be used by data scientists who maybe aren't using geospatial day to day, but one that's really interesting as well.
Now, how we do that: at Carto, we do a lot of workflows in Python. Certainly, a lot of these libraries are surfaced that way. But more and more, what we're seeing is that we can actually bring these workflows into some of the data warehouses or databases through user-defined functions. So we're bringing those in and either creating Python scripts on the database or even just using core SQL to add these statistical methods, and then scaling that up to quite a large level because of the computing power that data warehouses provide you. You can store a ton of data, you don't have to move the data around, you can do it all at that level, and you can use a common language to do it, which is SQL in this case.
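To make that feature engineering pattern concrete, here is a minimal PostGIS-style sketch of the census tract example above; the table and column names (houses, tracts, median_income) are hypothetical, and most warehouse dialects accept very similar syntax:

```sql
-- Hypothetical tables: houses(id, price, geom), tracts(geoid, median_income, geom).
-- For each house: the median income of its census tract, plus the average over
-- that tract and every tract bordering it (ST_Intersects of a tract with
-- itself is true, so the subquery includes the home tract).
SELECT
  h.id,
  t.median_income AS tract_income,
  (SELECT AVG(n.median_income)
     FROM tracts n
    WHERE ST_Intersects(n.geom, t.geom)) AS neighborhood_income
FROM houses h
JOIN tracts t
  ON ST_Contains(t.geom, h.geom);
```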
[00:07:22] Unknown:
And so in terms of the spatial elements of the SQL dialects, you mentioned PostGIS, which is probably the most well-known implementation. I know that there are other plugins for things like MySQL and other database engines, and I'm curious if you can talk through some of the types of functionality that are needed to support these spatial analyses in SQL dialects, and whether there has been any kind of emergent standard that ties together these different database engine implementations.
[00:07:49] Unknown:
Most people are starting to follow the lead of PostGIS, and I would say that across all the different databases and implementations. I'll start from a non-data-warehouse level, because most of the data warehouses have followed the function structure of PostGIS for the most part. There are other spatial databases out there. Like you said, there are implementations for MySQL. There's SpatiaLite, which is an implementation for SQLite, and PostGIS is the most well known and, I think, fully featured; there's a ton of development that's gone into that project. Most spatial functions, well above 90%, are going to start with ST, which is shorthand for spatial type. So it's ST underscore, and then you can see the different functions. If you ever see a function that looks like that, it's probably dealing with geospatial data. And there are lots of functions. You can manipulate geospatial data, turn text into a geometry, simplify geometries, take measurements like the perimeter of a polygon, create geometries. There's a wealth of utilities that help you do and transform these things. And then, going back to spatial feature engineering or spatial relationships, there are things like intersections or overlaps, what falls within another polygon, measuring distances, finding the count of things that fall within an area. There are even some window functions, like creating clusters as an implementation of a spatial DBSCAN; I believe PostGIS and BigQuery both have implementations for that. So you're seeing all of those different types. I would say, for the most part, the standard that PostGIS has set forward, like what ST_Intersects is, carries across most of the different databases and data warehouses. That function, for example, is pretty well used, and you can find it across BigQuery and Redshift and Snowflake, and we're working on an implementation for Databricks at Carto. There's a project called GeoMesa that implements it in the Spark ecosystem as well. So usually, if you see one, it's probably similar, with slight nuances potentially, across all those different implementations.
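As a quick illustration of the ST_ convention, here is a sampler of common functions in PostGIS syntax; most warehouses mirror these names, and the table and column names here are hypothetical:

```sql
-- Construction, measurement, simplification, a containment join, and
-- window-function clustering, all following the ST_ naming convention.
SELECT ST_GeomFromText('POINT(-73.99 40.73)', 4326);       -- text to geometry
SELECT ST_Perimeter(geom) FROM parcels;                    -- measurement
SELECT ST_Simplify(geom, 0.001) FROM coastlines;           -- simplification
SELECT COUNT(*)
FROM stores s
JOIN neighborhoods n ON ST_Within(s.geom, n.geom);         -- containment join
SELECT ST_ClusterDBSCAN(geom, 0.01, 5) OVER () AS cluster_id
FROM incidents;                                            -- spatial DBSCAN as a window function
```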
[00:10:07] Unknown:
given that they might share a function signature, I'm curious what you have found to be the level of consistency in implementation and accuracy, some of the ways that you might validate that, and some of the edge cases and gotchas that you've experienced, where maybe there's the same function name between PostGIS and BigQuery, but there's a difference in the default level of accuracy, like how many decimal places to go out to when you're trying to do some aggregate-level rollup. And then, to your point about creating user-defined functions, also being able to hew to the same function signature and underlying implementation.
[00:10:48] Unknown:
Mhmm. Yes. Some of these deal with data, and some of them deal with the function implementation. From the data side, think of two state borders. You have a shapefile or a GeoJSON, and you put it in your spatial database. Looking at them on the map, you would assume the border follows the exact same line, that a point on one side of the border is the same on the other. But even a slight variance, if a point slightly overlaps and you're using different functions, could throw off the results. There are some really great resources that explain the difference between ST_Intersects versus ST_Within versus ST_Overlaps; a lot of these functions have just slight variances between them within the individual implementations. And then between the different database types, there are different tolerances. I don't know the exact tolerance thresholds for BigQuery versus PostGIS versus Snowflake, but it should be in all the documentation. For example, with ST_Intersects there's a slight tolerance of overlap, very small, that will consider it true as an intersection even if a feature just barely overlaps or falls within that distance.
So it will catch some of those edge cases. You might run ST_Intersects for a specific state, and you might get the bordering states because they just fall within that slight variance, and that might look different between the different implementations. So I always encourage everyone to get very good at reading the docs. All these services are very well documented and go into detail on this. Then understand the differences between function types; you have to read them and understand what you're using, how to use them in bulk, and so on. So use the docs, and use the functions carefully. There are some great resources out there; I'll send some. If you want to get very deep into this, I definitely recommend Paul Ramsey, who's a major contributor to PostGIS. He has a great blog that, as the releases for PostGIS come out, talks about some of the different things they've done to improve the speed of ST_Intersects. There was a previous workflow where people used to subdivide geometries to make the intersections faster and then regroup them, and they've actually implemented that in the core function. So you can follow the development of these tools as they go along, even at a function level. There are lots of great blogs and resources out there to check this stuff out.
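A small sketch of the predicate differences discussed here, in PostGIS syntax with a hypothetical states table; two states that share only a border come back true for ST_Intersects and ST_Touches but false for ST_Overlaps and ST_Within:

```sql
SELECT b.name,
       ST_Intersects(a.geom, b.geom) AS intersects,  -- true if any point is shared
       ST_Touches(a.geom, b.geom)    AS touches,     -- true if only boundaries meet
       ST_Overlaps(a.geom, b.geom)   AS overlaps,    -- true only if interiors partially overlap
       ST_Within(b.geom, a.geom)     AS within       -- true only if b is entirely inside a
FROM states a, states b
WHERE a.name = 'Texas'
  AND b.name <> a.name;
```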
[00:13:16] Unknown:
And going back to your point about working with geospatial data outside of the database environment: you might have a hundred or more different file formats and data types that represent different utilities, or sometimes the same information but for different use cases, where perhaps you're working with a raster file versus a vector file. I'm wondering if you can talk through some of the work involved in understanding what a useful common representation is for all of these information sources and types, and translating that into a useful representation to store in the database, so that you can scale these analyses across multiple different datasets, and just some of the data engineering and transformation work that goes into having this kind of common representation of the information that you're trying to analyze.
[00:14:07] Unknown:
Yeah. I've seen, I wouldn't say a boom, but a growing trend of more and more people starting to either call themselves geospatial data engineers or post jobs for them. I think it's become a unique challenge. Like I said, there are hundreds of different data types. There are common ones: shapefiles, GeoJSON, of course CSVs. Getting a little bit deeper, you mentioned KMLs and KMZs, but there are so many different types, even weird ones I've never heard of or used. There's the Norwegian SOSI standard vector dataset, and there are Oracle spatial data types.
You could go on and on, so I'll spare everyone the list here. But GDAL, the Geospatial Data Abstraction Library, is the best resource. I would recommend everyone check that out and use it. It's an open source toolkit. You can install it locally on your machine, and there are Python drivers as well as many other drivers to use it. It's built to transform spatial data types: raster to raster, vector to vector, or you can actually go from raster to vector and vector to raster. There are tools for creating and manipulating geometries and handling projections, which is a whole other component of this. So just the challenge of taking all that data and getting it into the right formats or projections or common cases is a difficult one. And how do you scale that, too? When you're talking about planetary-scale imagery, you're talking about pixels that could represent 3 meters or 1 meter on the globe, at a massive scale that's hundreds of millions or billions or even trillions of individual records. So there's a whole cloud architecture behind this as well, in terms of bringing that data into a database. There's a huge effort to do that, but PostGIS has amazing tools for this for both raster and vector data. They have GDAL drivers that you can leverage through some of their command line tools. From a cloud perspective, there's some really interesting stuff we've been doing with Google Cloud and their Dataflow product, which has Apache Beam installed. There's a library that we've worked with called GeoBeam, developed by Travis Webb at Google, which brings some of that GDAL functionality as a file-based source for geospatial data, to scale that in a serverless setting. Once it's there, of course, you've addressed all these problems, and then you have the common ways to query and intersect and interact with the data to draw insights. But there are huge challenges just on that front end of getting the data into the right spot.
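Once data has landed in the database (for example via GDAL's ogr2ogr or PostGIS's shp2pgsql loader), the projection handling mentioned here can continue in SQL. A minimal PostGIS sketch, assuming a hypothetical roads table whose stored SRID is already set:

```sql
-- Reproject a whole column to WGS84 (EPSG:4326) in place.
ALTER TABLE roads
  ALTER COLUMN geom TYPE geometry(LineString, 4326)
  USING ST_Transform(geom, 4326);

-- Confirm the projection code afterwards.
SELECT ST_SRID(geom) FROM roads LIMIT 1;
```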
[00:16:52] Unknown:
And then, in terms of the data modeling that happens once you have transformed this into a common representation, whether that is GeoJSON or KML or whatever you decide is most effective for your use cases: what are some of the considerations that go into segmenting these datasets so that you can join across them and build your analyses? What data modeling concepts are necessary for working with these geographic representations, and how might that differ from purely tabular data sources?
[00:17:25] Unknown:
Yeah. So we faced this challenge when we built our Data Observatory at Carto, which is a data marketplace of sorts. We have challenges ranging from the scale of data, the number of records, to the complexity of geometries. There are very detailed or granular geometries, and those geometries change; the Census Bureau might update the census block polygons every year or so. So there's all of that changing data. There are a few things I would recommend. Number one is segmenting your geometries from the tabular data associated with them and being able to perform effective joins. I saw a post a while ago on how Airbnb segments their data, and they expect a lot of their team to be able to join or aggregate. That's the same process we have in our Data Observatory. Our geometries are separate from the tabular data associated with them, and we join them on a common geo ID, a unique identifier, maybe something like a block group or census tract ID. And then we have a table schema for naming our tables effectively: here's the country it falls within, here's the geometry it's associated with, here's the time span of the data. And then, how can you create those joins effectively and pull that dataset with the geometry plus the tabular data associated with it? The second reason for that is storage. If you have a really complex geometry, those geometry records or rows can get quite large.
So if you only need to store them once, store them once, and then you can do the aggregation and, as the last step, join to that geometry table. That's the biggest recommendation I would have. There are certainly steps you can take to simplify geometries, but you could lose relationships or different pieces of the data; you're sacrificing something as you go along. So my number one recommendation is to leverage the power of SQL, which means you have those joins and those different pieces, and then, when taking data out of the database, if you need to put it into common file types, there are lots of great tools for bringing that data back down to a shapefile or a GeoJSON or things like that. That's how I would set up the infrastructure.
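A minimal sketch of the layout described above: geometries stored once, keyed by a common geo ID, with tabular data in separate tables. All names are illustrative, not Carto's actual schema:

```sql
-- Geometries live in one table, stored once.
CREATE TABLE tract_geoms (
  geoid TEXT PRIMARY KEY,                -- common geo ID, e.g. a census tract ID
  geom  GEOMETRY(MultiPolygon, 4326)
);

-- Tabular attributes live separately, one table per vintage.
CREATE TABLE tract_income_2021 (
  geoid         TEXT REFERENCES tract_geoms (geoid),
  median_income NUMERIC
);

-- Aggregate on the lightweight tabular side; join the geometry last.
SELECT g.geoid, g.geom, t.median_income
FROM tract_income_2021 t
JOIN tract_geoms g USING (geoid);
```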
[00:19:38] Unknown:
So for people who are starting to work with these geospatial data formats and want to build up these analyses, how much background knowledge is necessary? I'm thinking of the principles that go into understanding the features that exist in these geographic representations, some of the complexities that exist because the Earth isn't a perfect sphere, understanding the mapping scales between the different representations, and being able to translate between different units, where maybe the geometry in one table is represented in square meters and in another it's square miles. How do you make sure that you're building your analysis accurately, and is there any common tooling in these SQL extensions that allows you to encapsulate more of that knowledge without necessarily having to have a geospatial
[00:20:32] Unknown:
expert doing everything? Yeah. There are a few ground rules. First is understanding the projection of your data. A projection is the flat representation of geographic data from the Earth's surface. Anytime you project data, you're making a sacrifice, in terms of accuracy, measurement of distance, or the visual representation of the data. Understanding what that projection is, and what it might potentially sacrifice, is important. For most spatial analysis, you want to preserve the relationships: the distances, the measurements, and things like that. That's why there's a common projection for this.
Most spatial databases, if they don't have a projection or only have one, use WGS84, whose EPSG code, if you look in the SQL, is 4326. That preserves some of those relationships that you want to be able to measure and interact with. There are basically two spatial data types if you think of it from a SQL perspective. The first one is a geometry, which is when your data lives in a projected coordinate system, the flat representation of the Earth. The other one is a geography, where your data is represented as spherical coordinates on the Earth. You can even create spherical coordinates in other places, but that's the very high-level difference between those two representations. So in terms of measurement, there are certain things where you're measuring on the flat Earth versus the curvature of the Earth, depending on whether you have a geometry versus a geography. In PostGIS, you might see slight differences there. Those are things to be aware of: the curvature of the Earth is going to have some impact. And specifically, if you're doing really detailed analysis, like a line-of-sight analysis, you want to understand things like elevation and blockages of the line of sight, but eventually the curvature of the Earth is going to play into what you're doing. It's important to understand the differences there. So those are some ground rules. Understand your projection.
Understand which spatial data type, geometry versus geography, you're using, and understand the use case and how accurate those measurements need to be. For a lot of things, if you're working at a local or city level, the differences will be fairly minute. But if you're going to a national or global scale, or even a statewide scale, you're going to see those slight differences. Being aware of that when you're using the spatial data is pretty important.
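The geometry-versus-geography distinction shows up directly in measurements. A PostGIS sketch using approximate coordinates for two cities; the geometry result is in planar degrees for SRID 4326, while the geography result is meters along the spheroid:

```sql
SELECT
  ST_Distance(
    ST_SetSRID(ST_MakePoint(-118.24, 34.05), 4326),              -- Los Angeles
    ST_SetSRID(ST_MakePoint(-96.80, 32.78), 4326)) AS degrees,   -- Dallas
  ST_Distance(
    ST_SetSRID(ST_MakePoint(-118.24, 34.05), 4326)::geography,
    ST_SetSRID(ST_MakePoint(-96.80, 32.78), 4326)::geography) AS meters;
```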
[00:23:01] Unknown:
And as far as the details of the projections, as you were saying, if you're working at maybe the town level, you don't have to worry about that too much. I'm curious what the best practices are for saying, at this particular level of scale you can mostly ignore it, and then what the gradations are as you go up from there to saying, okay, now you absolutely have to make a very informed choice about which projection you're going to use because of the type of analysis you're trying to do. And maybe some of the ways to map between the aggregations or statistical methods that you're applying to these geographies or geometries and the projection that is necessary to support them, and determining what the level of accuracy is across that matrix of choices.
[00:23:52] Unknown:
These are my ground rules, so they're specific to my domain expertise, say, the greater Dallas area or something like that. Generally, you're going to notice only a minute difference in things like distances. There's a subset of use cases, like I said, where it will be important even on a very local or small scale, and I'm talking about distances of miles; very short distances can have some impact. When you start to get to a statewide or national level, that's when you want to be conscious of the projection and the things you want to preserve in your analysis. If you're visualizing data, you want to preserve the visual representation or spatial layout, and there are different projections you can use for that. There are different projections you can use for spatial relationships, for the actual distances or relationships between features. At that point, you can start to study that and consciously pick which ones you want to use, especially if you're measuring distances. If I'm measuring between, say, LA and Dallas, there's going to be a difference if I use a geometry versus a geography; the curvature of the Earth is going to make an impact on how long that distance is. Those things are important to keep in mind. And also, if you're looking at things like elevation, if you're getting into 3D data, maybe there's a z coordinate involved; there are even more layers you start to add in there too.
[00:25:34] Unknown:
As far as the statistical methods: we were discussing some of the expertise that's necessary from the geospatial standpoint to manipulate this data, but from the statistical analysis side, what backgrounds are useful for understanding which statistical methods or mathematical approaches are going to yield the best results for these types of datasets? And given the geographic nature of the information, where you might be joining it with demographic data to do some type of census analysis based on neighborhoods within a city region, how are you able to span those different knowledge domains?
[00:26:21] Unknown:
Yeah. I posted something on LinkedIn, I think last week, looking at the geospatial roles that exist in terms of job function. I think they're starting to segment a little bit. I've seen some posts on how data science has started to segment between a pure data scientist versus a data engineer versus an analyst versus maybe a machine learning ops person, and all these different roles. I would say the same is only just starting in geospatial, and different roles have different domain expertise areas. A geospatial analyst needs to understand geospatial data, spatial relationships and how they interact, and basically how to do some basic joins and interaction with data.
So that's one level; they're not so deep into the statistical methods, but maybe using some of the basics. For someone in spatial data science, having that statistical background and being able to join it with the spatial domain knowledge is really important. There are some different tools out there that make it easier for folks who maybe haven't explored that space. PySAL is one of the most complete pure spatial data science libraries that I would recommend. It's the Python spatial analysis library, and it has great analytical tools for some really detailed methods and analysis types. I definitely recommend checking that out for use cases like network analysis, location allocation, inequity and distribution, lots of different core use cases. And then on the data engineering side, it's understanding data types, relationships, geometry creation, and things like that. So it depends on where you come into the picture with your background: if you have more of a statistical background, that spatial data science piece is probably where you would land; with more of an analytical background, understanding spatial relationships, you would map to a GIS analyst; and for more traditional data problems or data engineering, a geospatial data engineer has some of those core functions too. I think there's an interesting interconnectivity between the pure data science world and the geospatial data science world, and it's all underpinned by the understanding and use of spatial relationships together.
[00:28:43] Unknown:
And as far as the workflow where you have a new dataset, you're trying to explore some of the interesting geographic or map features of a new area, and you want to understand what types of questions you can answer with this data, going from that exploratory approach to saying, okay, now I want to scale this and make it repeatable: I'm wondering what you see as the tipping point between "this makes sense to do on my laptop using PySAL or GeoPandas" and "it makes more sense to load this into my database, do it all in SQL, maybe execute it as part of a pipeline," so that you can do some of the geographic feature engineering you were referring to earlier and turn it into the kind of repeatable flow that the data engineer is likely going to be responsible for?
[00:29:39] Unknown:
Yeah. There are a few areas here. A data science workflow, whether that's cloud or local, with Python libraries and files that exist in a file storage system, works great for doing one-off analysis, where you want to study one phenomenon with one set of data. But let's say you want to start to scale that across multiple geographies or cities and use some of it on a repeatable basis. That's when I would start to look at a database, because you can keep the data there, and you can benefit from fresh or updated data going into that database. You might be able to just interchange a table name, run everything again, and get a new result. You could still do that with your files, but you're on the hook for finding the right file, keeping it in the right location, and having someone update it. There's an interesting observation that geospatial has moved more and more toward the cloud, instead of sitting on someone's machine or on a USB drive with people passing file names back and forth. So more and more, when you start to say, okay, I need to run this thing over multiple geographies or multiple times, or I have regularly updating data, that's when I would start to look at it from a SQL perspective. And also in terms of scale: if you're doing really large spatial joins, say a spatial intersection with points numbering in the millions or billions, then a spatial data warehouse makes a lot of sense. Even PostGIS can scale to reach some of these capacities; it's really well engineered, and you can use it for some of those large-scale spatial intersections. But when you need really high scalability, that's where the warehouse comes in handy. You can scale up to trillions of records, that enterprise-level data that has some location attribute behind it, which you may want to use with your polygons and spatial intersections. So there's no one clear lever. I would say: if something needs to be highly repeatable, or if data is streaming in or changing on a regular basis, that's another key tipping point. And then there's scale, in terms of complexity of data, number of datasets, and number of rows; all those things tend to lend themselves to more of a database-first approach.
[00:32:07] Unknown:
And as far as the time dimension of working with these geographic datasets, where maybe you're dealing with satellite imagery and loading in new raster files every 24 hours for a particular geographic location: I'm wondering how you're able to incorporate data quality considerations into that workflow. For example, being able to say, okay, this new piece of information for this geographical region is within the area of tolerance, where maybe the geography of the snapshot for a particular expected timestamp is off by a meter because of variance in orbital velocities, and that's still within the range of tolerance for the geographical boundary that I'm trying to represent with this raster image. And then saying, no, this other one, because there's been too much orbital decay, is no longer close enough to the same geography, it's actually 50 meters due west, so I'm not going to accept it in this data import or for this particular analysis. What are some of those additional complexities that arise as you're dealing with evolving representations of the same geographies?
[00:33:23] Unknown:
Yeah. From my perspective, one thing the imagery providers have gotten a lot better at is actually handling some of that on their own, and that's where I'm seeing a huge rise in that geospatial data engineer career path: actually handling those types of problems. A lot of the data that makes it to the end consumer has some of those considerations built into it. Of course, there are data checks that you would want to implement and bring in on your own. I think there are a few different ways to do that, but the pipeline-driven approach is the one that makes the most sense. You can check geometries as they stream through, or as they batch through, if you think of it from that perspective. Even GDAL can help with some of that. With SpatiaLite, you can actually run SQL within your GDAL workflow and do some intersections or checks that way: what's the tolerance, is this the same feature intersecting with the same feature from a few days ago, and so on. So there are quality checks you can build into your data pipelines. From a geospatial perspective, we're generally thinking about bits and pieces of these things, but there are whole other worlds we have to understand, from data engineering to database management up to complex SQL queries. There are all these other revolving areas, and I think that's why specialization is really important here: can we build geospatial data engineers who have a robust background in everything around data engineering, but with that focus area as well, to start looking at some of these very specific problems that data engineers have worked on for a long time? To answer the question, it's twofold. One, it's an evolving space, and some of the providers are doing a really good job at this already, but they're doing so because they're bringing in data engineers and helping them become geospatial.
But I think we're also starting to see more people in geospatial gravitate toward those pieces because they have the geospatial expertise. And the things enabling that are the common languages and tools that help the two communicate, whether that's SQL or Python or command line workflows that people can share back and forth. These are no longer locked away the way they were in the past. That makes interoperability really key as things start to scale. But pipelines and geospatial data engineering, I think, are really important pieces that are going to evolve as the space grows.
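A minimal example of the kind of in-database quality gate described here, using PostGIS validity functions against a hypothetical staging table:

```sql
-- Inspect any geometries that fail validation as they land.
SELECT id, ST_IsValidReason(geom)
FROM staging_parcels
WHERE NOT ST_IsValid(geom);

-- Or repair them in place before promoting the batch.
UPDATE staging_parcels
SET geom = ST_MakeValid(geom)
WHERE NOT ST_IsValid(geom);
```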
[00:35:54] Unknown:
StreamSets DataOps Platform is the world's first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor, and manage data pipelines confidently with an end-to-end data integration platform that's built for constant change. Amp up your productivity with an easy-to-navigate interface and hundreds of prebuilt connectors, and get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift, those ongoing and unexpected changes in schema, semantics, and infrastructure.
Finally, one single pane of glass for operating and monitoring all of your data pipelines, the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners that subscribe to StreamSets' Professional Tier will receive 2 months free after their first month. And then, going back to the level of support that exists in some of these different database engines that folks might be working with: I know that PostGIS is kind of the de facto best practice, that if you're working with geographic data in a database, you should use it. Obviously, some people are constrained and can't just swap in Postgres because they decide to add a new data type. But I'm wondering what you have seen as the availability of these geographic analysis capabilities in the SQL dialects across different database engines, any evolutions of that support, and any interesting up-and-coming projects in the space.
[00:37:39] Unknown:
Yeah. I think more and more we're getting parity, and more of the data warehouses are starting to push those features and adopt more of a pure spatial SQL dialect. GeoMesa, for example, is a really interesting project. It adds spatial SQL functionality to Spark and allows you to have that same sort of language: if I want ST_AsText or some of those common functions, I can actually use and implement them in a Spark environment. I think it's becoming commonplace; everyone in geospatial has kind of understood and accepted that. There's also work the open geospatial community is doing to create common spatial file types, whether that's for some of the Spark file types or other things, so that work is happening from a file format perspective as well. But I would say more and more of those platforms are trying to reach parity with PostGIS in terms of the functions that are there. In terms of advancements and interesting projects: at Carto, what we're doing is building what we call our Analytics Toolboxes, which are sets of functions that port net new functionality into databases and data warehouses. If anyone's familiar with Uber's H3 spatial indexing method, that's one that we're using, bringing it in to create an H3 hex index on an individual point or polygon. You can create polyfills.
You can create H3 geometries and do spatial aggregations with them. We're also adding routing functionality; of course, pgRouting has existed for PostGIS for a long time, but we're adding routing and then more advanced pieces, some of the spatial data science functions you would see in PySAL, bringing those into a spatial SQL environment, which I think is really interesting. It removes some of the need to move a ton of data back and forth; you can do it all at the database level. There are even things like creating tilesets in PostGIS: there's a function called ST_AsMVT, which produces a specific tile format for rendering vector data as tiles, a really efficient way to render map data online or in spatial applications. We actually ported some of that functionality over to BigQuery; you can create tilesets and store them directly in BigQuery. So there are a lot of these extra use cases, finding more geospatial workflows that you might be doing in other places and bringing them down to the database level, making it your center point for everything spatial, from the data engineering to the spatial relationship analysis to spatial data science workflows, and even into the actual visualization engines as well.
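The H3 aggregation pattern looks roughly like the following. This is a hypothetical sketch: exact function names and dataset prefixes vary across warehouses and Analytics Toolbox versions (the BigQuery toolbox exposes something like carto.H3_FROMGEOGPOINT), so treat the call and table below as illustrative and check the current docs:

```sql
-- Index each point to an H3 cell at resolution 8, then aggregate per cell.
-- Function name and the 'taxi_trips' table are assumptions for illustration.
SELECT
  `carto-os.carto.H3_FROMGEOGPOINT`(pickup_point, 8) AS h3_cell,
  COUNT(*) AS trips
FROM taxi_trips
GROUP BY h3_cell;
```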
[00:40:27] Unknown:
And in the actual storage layer, I'm curious what you have seen as some of the performance considerations, like saying, okay, I need to convert this vector file into a GeoJSON array or something because it will work better in BigQuery, because they don't have native support in the engine to handle this file type as far as how it maps to their storage format. What are some of the low-level considerations that go into performance engineering around your analytical workflows, and how do they influence the way you approach the data engineering and data processing needed to support these use cases?
[00:41:13] Unknown:
Yeah. Geometry data comes in many formats, and there are a lot of geometry accessor functions if you look in the different databases. There are many tools to create geometries from different data: that could be well-known text, that could be bytes, that could be other representations. You could have numeric lat/longs that you turn into geometries. So there's a long list, first of all, for creating geometries, like taking GeoJSON data and turning it into a geometry in a spatial database. On the flip side, there are a lot of tools to translate that same data back out to different formats, whether that's spitting it out as GeoJSON, rendering it as a tileset type, or as bytes, all these different methods.
I think it starts with understanding how your data is coming in: whether it's stored as well-known text or GeoJSON, which are probably the more common ones, or, if you have a true geospatial file like a shapefile, you're going to have some sort of geometry object. Then understand whether you need to translate it. With BigQuery, for example, if you load data natively with their command line tools, the recommended approach is to turn your geometry into GeoJSON, which can be accomplished once again through SQL, when you're producing the CSV that you push up there. If we use that Dataflow plus GeoBeam workflow, you don't have to account for that as much; it just handles everything from the geometry creation to validation.
If it's a raster file, it vectorizes it, reprojects it, makes sure it's valid as it needs to be, and does all those different things. Like I said, for someone who has a foot in a little bit of everything in the geospatial space, this data engineering piece is very interesting to me because of the scalability and the massive amounts of data, plus you can orchestrate all these things really nicely. All these problems that you might have handled before in one-off places, I can just write into a nice file for Apache Beam, set it up, deploy the container, and then I'm ready to use that again and again. So generally, understand that you're going to get geospatial data in different formats: GeoJSON, a shapefile or the geometry that comes with a shapefile, or well-known text are pretty common.
In terms of how you use that, you might have to validate those geometries. There might be points that are missing; you might need to connect polygon lines. I've seen that happen quite a bit. And then there are even things like GeoPackages, which might contain multiple different geospatial files. So understanding the data that you're looking at is pretty important, as is understanding, specifically if you're in a data warehouse, the different methods you need to get the data in, or the same for PostGIS. There are lots of tools to do that, but find the workflow that works consistently for your data and leverage it. I definitely recommend checking out GDAL. It's the Swiss Army knife: it has a little bit of everything and covers pretty much every single file type you might ever want to work with. And then you can set up some of those workflows to bring the data in, and even, like I said, midstream, use SQL to do some of those transformations with SpatiaLite and things like that. I'll send over some resources, because each tool has its own ingestion process.
BigQuery has command line tools, or you can use Dataflow. There are multiple different tool types for PostGIS. You can even load data by connecting your PostGIS database to QGIS, and QGIS exposes GDAL through a desktop GUI if you want to use it that way. The copy methods are pretty common too, copying the CSV data and streaming that in. So there are lots of different ways to do it. But it always comes back to understanding: okay, I have a geometry in this format; how do I turn it into an actual geometry or geography in the database? And there are lots of functions to do that.
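A short sampler of those constructor and output functions, in PostGIS syntax with hypothetical table names; the same round trips exist under similar names in the major warehouses:

```sql
SELECT ST_GeomFromText('LINESTRING(-71.06 42.35, -71.05 42.36)', 4326);      -- WKT in
SELECT ST_GeomFromGeoJSON('{"type":"Point","coordinates":[-71.06,42.35]}');  -- GeoJSON in
SELECT ST_SetSRID(ST_MakePoint(lon, lat), 4326) FROM raw_events;             -- numeric lat/longs in
SELECT ST_AsGeoJSON(geom) FROM parcels;                                      -- GeoJSON out
SELECT ST_AsBinary(geom)  FROM parcels;                                      -- well-known binary out
```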
[00:44:57] Unknown:
And so, in your experience of working with geographic data in this SQL environment, what are some of the most interesting or innovative or unexpected ways that you've seen those capabilities employed?
[00:45:14] Unknown:
Yeah. In terms of innovative workflows, there were a lot of great talks at PostGIS Day 2021, just at the end of last year. People are loading spatial data into a database from complex formats or a variety of different sources, analyzing and creating interesting insights from the data all in the database, and then rendering tiles to a web app. That entire workflow, up until the point where the tiles actually show up in a web app, is all in the database. Of course, there's an API layer that serves the tiles, but all of it happens through SQL, which is pretty cool when you think about it, because before, the database could serve as the analytical layer, but you might have to pull the data out as a file and create tiles from it separately, and the data load or ingest and some of that transformation might have to happen even earlier. So you've compressed three pretty complex workflows into one place. That's some of what we're trying to create from a data warehouse perspective: bring that same level of functionality into a data warehouse and use all the great things that exist there, like their machine learning workflows and data connectivity. That opens up a lot of interesting possibilities from the geospatial perspective. Those are the coolest use cases I've seen. Some other really interesting ones involve raster data. Raster data, or earth observation, has always been a really unique and interesting field, and in some ways it can feel separate within the geospatial world. People doing analysis on census data or retail analytics might not be touching raster data, while others are using pictometry or earth observation or satellite imagery and creating derivative raster analysis from it. Those two don't always feel interconnected, but I think they're just starting to, because of SQL. PostGIS has a lot of great raster tooling, for things like creating analyses or smoothing raster data. And then, on the data ingest piece, that Dataflow plus GeoBeam workflow I was discussing with BigQuery is great, because we set it up with a few other tools so that I can just put a raster file in a cloud storage bucket, the entire workflow automates, and I have a table in BigQuery shortly thereafter.
So that entire process is automated, and then I can do an intersection on raster data plus my other vector spatial data, all within one SQL query, which is really unique and interesting. I can either ingest the raw data with the color bands associated with it, or I can take the derivative outputs of an analysis and pull those in, which opens up a lot of unique use cases: studying deforestation, for example. We've seen people leverage those types of workflows for agriculture studies, understanding where things are being planted. Even consumer packaged goods companies want to use that data over time to see where they're sourcing materials from, and whether those are actually sustainable farms or practices. Are they replanting?
How often are they replanting? What are they doing to actually preserve the environment around them? All of those types of questions are being asked too, and you can't answer them without a satellite image, or something derived from a satellite image. But of course you have that plus other vector data, and then there's the question of how to make it easy to serve up and bring into an application, with your tile layers and your visualization, which can also be run from the warehouse or the database. All those possibilities make life very easy in places where it was potentially complex before, plus you crack into a wealth of other enterprise data that might live in those locations.
All of those open up a lot of really interesting possibilities.
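As a sketch of what that raster-plus-vector intersection can look like in PostGIS (the `ndvi` and `farms` tables and their columns are hypothetical, and the imagery is assumed to have been loaded as a tiled raster with a tool like raster2pgsql):

```sql
-- Per-farm mean NDVI in one query: clip the raster tiles that intersect
-- each farm polygon, merge the clipped pieces, and compute summary stats.
SELECT
  f.id,
  (ST_SummaryStats(ST_Union(ST_Clip(n.rast, f.geom)))).mean AS mean_ndvi
FROM farms f
JOIN ndvi n
  ON ST_Intersects(n.rast, f.geom)
GROUP BY f.id;
```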
[00:48:52] Unknown:
Alright. And in your experience of working with spatial data in a SQL environment for your own work at CARTO, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:49:05] Unknown:
Challenging lessons. For me, everyone learns somewhere, and that's what I tell everyone, especially someone coming from a traditional GIS background where things are more file based, or the kind of workflow you described: you're using QGIS or maybe a different desktop tool, pulling in some different files and translating them, and then maybe pulling them into a Python workflow. Everyone starts somewhere. I had a nominal understanding of a SQL query, and building upon those lessons and continuing to go deeper has been really great. Unexpectedly, I've been finding new things that are on the Postgres side of the house, not purely spatial SQL. Spatial SQL is great, but I love Postgres triggers, especially for data that's being inserted into a database. Let's say I have a point that drops in from an application or a form somewhere.
I can detect that on the table, I can geocode that point, I can enrich it and see, say, the population of the polygon it falls within, and I can extract a bunch of other spatial data, all without ever touching it. It's like those really bad infomercials: you set it and forget it, and it just happens. That's one of those unexpected discoveries that are really great. Plus, when new releases come out, there's the ability to speed things up, like the change to ST_Intersects: they took a capability you could already sort of get with different functions and put it into the core function, so spatial intersections and things like that are now much, much faster. Those are some of the unexpected pieces. And as you go deeper, pgRouting is really awesome. You can make your own custom routing engine using your own data and really do some complex stuff with that. We've done some really advanced logistics analysis with it, because you have to create OD matrices for multiple depots and potentially hundreds of thousands of deliveries every single day. You can't do that via an API endpoint; there's too much data. So all of those different pieces are really amazing. And then there are the day to day things, where you discover something, or change something, or run a query just a little bit differently, and you go from something that might take minutes to seconds, or you find a new way to use a subquery that makes life a lot faster. It's those everyday things that validate why you put some time and energy into using this.
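A rough sketch of that set-it-and-forget-it trigger pattern; every table, column, and function name here is hypothetical:

```sql
-- Whenever a point lands in `submissions`, stamp the row with the id and
-- population of the census polygon it falls in, before the insert commits.
CREATE OR REPLACE FUNCTION enrich_submission() RETURNS trigger AS $$
BEGIN
  SELECT b.geoid, b.population
    INTO NEW.geoid, NEW.population
    FROM census_blocks b
   WHERE ST_Contains(b.geom, NEW.geom)
   LIMIT 1;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER submissions_enrich
  BEFORE INSERT ON submissions
  FOR EACH ROW EXECUTE FUNCTION enrich_submission();  -- Postgres 11+ syntax
```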
[00:51:36] Unknown:
And for people who are working with geodata and trying to decide what the best workflow is for them, what are the cases where putting it into a SQL database and using these SQL extensions might be the wrong choice, and they're better suited using some of these desktop utilities, or Python scripting, or putting it into a Spark cluster or something like that? I would say, if you're doing a lot of SELECT * statements, it's kind of like, okay, you have it in a database and that's great. But
[00:52:01] Unknown:
if you're not leveraging the spatial relationship features, if you're just pulling the data down and then doing the spatial join in Python or in a desktop workflow, I would definitely recommend taking a step back and asking: can I do that on the database, and will it be faster? That's one. The other piece is when you're just doing a join to a lot of other enterprise data. A spatial join is the unique joining of the relationship between two geometries; if you just need to aggregate some data and then join it to something else, you may not need the spatial component at all. You can join by ID or some other key. So it's knowing when you have a true spatial relationship that makes it sensible to go that route. Those are some areas I'd point to, and obviously, as I said, the scale of the data, the volume of data that's streaming in or changing, and the complexity of that data all lend themselves to those use cases. If you're doing one-offs, or you want to move very quickly and can use files, or, like I said, can just use a Spark cluster or something like that, those are great choices. It's when you start to need to scale up and go from there that I think those are some logical points to be watching for when you're working with this data.
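To illustrate that distinction (table names hypothetical), a true spatial join relates rows by a geometric predicate, while a plain attribute join needs no spatial support at all:

```sql
-- Spatial join: count stores per neighborhood by geometric containment.
SELECT n.name, count(*) AS store_count
FROM stores s
JOIN neighborhoods n
  ON ST_Intersects(s.geom, n.geom)
GROUP BY n.name;

-- Attribute join: if the tables already share a key, no spatial
-- predicate (and no spatial database) is required.
SELECT n.name, count(*) AS store_count
FROM stores s
JOIN neighborhoods n
  ON s.neighborhood_id = n.id
GROUP BY n.name;
```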
[00:53:17] Unknown:
work with these SQL extensions in your work at CARTO, I'm curious what you have planned for the near to medium term, as far as spatial analytics support, or additional capabilities you want to push into these user defined functions for the other data warehouse engines, or just other projects that you're excited to dig into? Yeah, I'm excited to keep bringing a lot of the
[00:53:41] Unknown:
stuff that I previously did in Python into the database or the data warehouse. So, for everything you look at in PySAL: how can I do that in a data warehouse or a database? Location allocation is one of the coolest ones I've found. Logically, you're given a network, maybe a road network, and a set of demand, either individuals, the people living there, or customers, and you have some service that needs to serve them, whether that's a polling place or a school or a fire department, or, like in New York City, these 15 minute grocery delivery services we now have. Given a set of, let's say, 30 places I need to decide among, which are the 10 best that can serve that demand, based on whether I want complete coverage or things like that? There's a great library called spopt, a subpackage of PySAL, that addresses some of this. You can do it all in Python now, but I can see us asking: can we generate the network faster, or can we compute the OD matrices faster, in SQL? Then the last mile of that we can compute with the location allocation functions, which actually tell you: okay, here are your most effective places, your coverage is 86%, and here's how you can do it effectively. Those are really common questions we get all the time, and there's lots of data that could be driving those decisions, in terms of where people are moving to or from, or mobility data, which numbers in the millions or billions of pings and things like that.
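For the OD matrix piece, a hedged sketch of what that can look like with pgRouting, assuming a routable `edges` table (as produced by a tool like osm2pgrouting) and hypothetical vertex ids for the depots and delivery points:

```sql
-- One shot: shortest-path cost between every pair of the listed vertices.
SELECT start_vid AS origin, end_vid AS destination, agg_cost AS travel_cost
FROM pgr_dijkstraCostMatrix(
  'SELECT id, source, target, cost, reverse_cost FROM edges',
  ARRAY[101, 102, 103, 2001, 2002, 2003],  -- depot and delivery vertex ids
  true                                     -- treat the network as directed
);
```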
[00:55:12] Unknown:
So it gives you a lot of flexibility to do those different things. Are there any other aspects of the work that you're doing at CARTO, or applications of SQL extensions to this arena of geospatial analysis, or any of the related considerations
[00:55:29] Unknown:
that we didn't discuss yet that you'd like to cover before we close out the show? I think I covered a lot of the core ones. There's such a boom in interest in this data, not just because more and more people are looking at geospatial or modern GIS workflows, and because of the growth of data science over the last couple of years, but because of the problems we're starting to see and solve: everything from how access was studied at the beginning of COVID, who can access a health care facility, where's your nearest pharmacy, all the way through things like supply chain or equitable distribution of vaccines, and even how people moved during the pandemic. There's just been upheaval in the way things happen in this world, climate change probably being the biggest one, and there is a fundamental need for geospatial data, and that data is only getting bigger and more complex.
So we're finding that there's a push and a need for these new tools, these new analyses, and new ways to surface those insights, not just as a one-off where you pull some data, create a dashboard, and it lives on the web forever, but on a regular, daily or hourly basis: once new data comes in, how can I show those insights? Spatial SQL gives you, like I said, that complete workflow, from the data engineering and ingestion and ETL, to the storage and management and querying, to the spatial relationships, and then to the actual rendering of the visual outputs that can be pulled into an application or a dashboard that someone can use and see. That all lives there now, and it's really exciting. That's where I get really excited: seeing how you can take it to the next level.
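As a concrete illustration of that final rendering step, here is a minimal sketch of serving a vector tile straight out of PostGIS; the `stores` table is hypothetical, and the hard-coded tile coordinates stand in for z/x/y values that would normally be supplied by the API layer:

```sql
-- Build one Mapbox Vector Tile for tile (z=12, x=1205, y=1539).
WITH bounds AS (
  SELECT ST_TileEnvelope(12, 1205, 1539) AS geom
),
mvtgeom AS (
  SELECT
    ST_AsMVTGeom(ST_Transform(s.geom, 3857), b.geom) AS geom,
    s.name
  FROM stores s, bounds b
  WHERE ST_Transform(s.geom, 3857) && b.geom
)
SELECT ST_AsMVT(mvtgeom.*, 'stores') AS tile
FROM mvtgeom;
```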
[00:57:06] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:25] Unknown:
I'll talk about it from my lens, which of course is the geo side. In terms of data management, I think creating easy ways for discovery, and for using other datasets alongside your own data, is a big gap, and it's a challenge we've tried to address at CARTO. If I have a dataset in a specific city, what other data exists around there? What data points can I use? How can I enrich it with other data? We've created some functions to start to address that, to enable easy enrichment across multiple different datasets. If I want to enrich with census data, or maybe a premium dataset, or spend data, or road traffic, doing that in one single shot has always been complicated: first you have to go get all that data, bring it into a single location, and then run that enrichment and do however many joins you need to get to that insight. We try to make that easy and orchestratable behind the scenes with the Data Observatory, which serves as a discovery method for data, a marketplace, and an enrichment function. So that's probably the biggest gap, and then there's one that's sort of unexplored.
[00:58:33] Unknown:
Spatial feature engineering is the other, but that's a whole other topic. So, yeah. Alright. Well, thank you very much for taking the time today to join me and share your experience of working with all of these geographic datasets and the application of SQL dialects to them. It's definitely a very interesting problem domain, and, as you said, one that is expanding in terms of use and importance. So I appreciate all of the time and energy that you and the folks at CARTO are putting into that, and I hope you enjoy the rest of your day. Same here. Thanks for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Matt Forrest: Introduction and Background
Use Cases and Applications of Spatial SQL
Spatial SQL Functionality and Standards
Data Transformation and Common Representations
Data Modeling and Joining Across Datasets
Projections and Accuracy in Spatial Analysis
Statistical Methods and Spatial Data Science
Scaling and Automating Spatial Workflows
Data Quality and Evolving Representations
Database Support for Spatial Analysis
Performance Engineering and Data Processing
Innovative Uses of Spatial SQL
Choosing the Right Workflow for GeoData
Future Plans and Exciting Projects at CARTO
Closing Remarks and Contact Information