Summary
The global climate impacts everyone, and the rate of change introduces many questions that businesses need to consider. Getting answers to those questions is challenging, because the climate is a multidimensional and constantly evolving system. Sust Global was created to provide curated data sets that let organizations analyze climate information in the context of their business needs. In this episode Gopal Erinjippurath discusses the data engineering challenges of building and serving those data sets, and how they are distilling complex climate information into consumable facts so you don’t have to be an expert to understand it.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
- The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Gopal Erinjippurath about his work at Sust Global building data sets from geospatial and satellite information to power climate analytics
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Sust Global is and the story behind it?
- What audience(s) are you focused on?
- Climate change is obviously a huge topic in the zeitgeist and has been growing in importance. What are the data sources that you are working with to derive climate information?
- What role do you view Sust Global having in addressing climate change?
- How are organizations using your climate information assets to inform their analytics and business operations?
- What are the types of questions that they are asking about the role of climate (present and future) for their business activities?
- How can they use the climate information that you provide to understand their impact on the planet?
- What are some of the educational efforts that you need to undertake to ensure that your end-users understand the context and appropriate semantics of the data that you are providing? (e.g. concepts around climate science, statistically meaningful interpretations of aggregations, etc.)
- Can you describe how you have architected the Sust Global platform?
- What are some examples of the types of data workflows and transformations that are necessary to maintain your customer-facing services?
- How have you approached the question of modeling for the data that you provide to end-users to make it straightforward to integrate and analyze the information?
- What is your process for determining relevant granularities of data and normalizing scales? (e.g. time and distance)
- What is involved in integrating with the Sust Global platform and how does it fit into the workflow of data engineers/analysts/data scientists at your customer organizations?
- Any analytical task is an exercise in story-telling. What are some of the techniques that you and your customers have found useful to make climate data relatable and understandable?
- What are some of the challenges involved in mapping between micro and macro level insights and translating them effectively for the consumer?
- How does the increasing sensor capabilities and scale of coverage manifest in your data?
- How do you account for increasing coverage when analyzing across longer historical time scales?
- How do you balance the need to build a sustainable business with the importance of access to the information that you are working with?
- What are the most interesting, innovative, or unexpected ways that you have seen Sust Global used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Sust Global?
- When is Sust the wrong choice?
- What do you have planned for the future of Sust Global?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a t l a n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Gopal Erinjippurath about his work at Sust Global, building datasets from geospatial and satellite information to power climate analytics. So, Gopal, can you start by introducing yourself?
[00:01:40] Unknown:
Tobias, great meeting you today, and thank you so much for having me here. Everyone, thank you for listening. This is Gopal. Quick introduction. I'm an electrical engineer turned geospatial data scientist. My passion and expertise have been in geospatial data over the last several years. I started the early days of my career working on multimedia and large scale imagery datasets. And along the way, I got really interested in environmental data, got fascinated by remote sensing datasets, and made the transition from large scale imagery to large scale geospatial datasets.
And through a serendipitous sequence of events, ended up leading analytics and insights at Planet Labs, where I spent 3 years prior to founding Sust Global. While I was at Planet, in addition to exploring remote sensing imagery, I was also curious about what kind of datasets exist on the changing climate. And I essentially saw, like, the emergence of frontier climate models, and that got me interested in climate modeling. And a couple of years back, I stepped in full steam ahead with founding Sust Global and exploring large scale climate datasets and climate models. So Sust Global was founded with the mission to develop data driven products that enable every decision to be climate informed so that humanity can thrive on a changing planet.
And I would say last several years of my career, I've been focused on product engineering and research commercialization on data intensive applications.
[00:03:21] Unknown:
And do you remember how you first got started working in the area of data?
[00:03:25] Unknown:
Yeah. So my first experience working with data was looking at imagery when I was at Dolby in 2007, 2008. Yeah, I'm not skipping that decade. It was a long time ago. Dolby decided to make this transition from being the world's foremost and pioneering audio technology company to being the world's pioneering multimedia technology company and decided to invest in an imaging initiative, which led to a focused effort around building the world's first flat panel display for cinematic imagery calibration in studios and postproduction houses. So that's when I started dealing with large collections of imagery data for cinematic reproduction.
And as you can imagine, a movie file has, like, many images. And if you're looking for products that scale across the gamut of cinema production, then we're dealing with a lot of imagery for calibration and for tuning. So that's when I first got started with data-heavy capabilities.
[00:04:32] Unknown:
And you've mentioned a bit about what Sust Global is and some of the story behind how you got involved in it. And I'm wondering if you can talk a bit about the type of audiences that you're focused on as far as the products that you're building and the data that you're working with?
[00:04:47] Unknown:
Yeah. Yeah. We're primarily serving 2 camps of very motivated climate aware personas at the moment. So 1 of them is around ESG assessment and reporting. As you're all aware, there's been increasing interest in, as well as increasing awareness of, environmental, social, and governance indicators. And oftentimes, when you're doing those assessments, you need the best validated, cleanest climate data that you have access to. So reporting on the risk to a company from the environment is 1 of the avenues in which we help our customers. Secondly, we've been enabling the creation of new sustainable finance instruments in the market intelligence space.
So that's the 2nd camp of analysts and data scientists who we're enabling with clean, validated climate data. And then I would say there's this emergent 3rd camp, which is rich sets of climate aware dashboards and insights platforms that are trying to surface climate data in interesting ways. So that's a little bit more nascent of a camp, but quickly emerging and 1 that we are equally excited about.
[00:06:03] Unknown:
Climate is definitely part of the zeitgeist right now. There are a lot of conversations happening around it, but the sort of concrete aspects of it are being able to understand what it is, what are the changes that are actually happening, what are the impacts that they're having on, you know, business, on communities, on geographical elements. I'm wondering if you can talk to the types of data sources that you're working with and some of the kind of scale that you're dealing with, some of the, maybe, heterogeneity that you have to deal with, and some of the ways that you're processing that data to be able to build some of these derived data products that are more consumable for your customers?
[00:06:46] Unknown:
That's, like, a nontrivial exercise. So when we started out, we were looking at physical climate risk data, and this is a distinction worth making clear. You know, when you think about data in the context of climate, it's like an ocean out there. You're looking at social data, which is the human footprint and how that affects climate. There is economic data, which is, you know, how the changing environment and the changing climate have created losses, and how that even loosely connects into the environmental footprint that is changing, that is enabling new schemes for carbon capture or reduced carbon emissions. So that connects loosely into the carbon offsetting space. And then you have physical climate risk, and that's the space we operate in. There, it's an understanding of what the environment is doing to humans, to property, to the land ecosystem, to the oceanic ecosystem.
So over there, we are looking at assessments across physical hazards, which are both acute and chronic. So think of acute hazards as wildfires, riverine flooding, coastal flooding, flash flooding, and cyclones, and then chronic hazards being heat stress, water stress, and sea level rise-like impacts. So in order to do each 1 of these hazards correctly, you have, like, you know, a sub-collection of different datasets that we need to bring together. And what is the common ground there? So we can start there, which is that climate is inherently a geospatially non-stationary process. So different parts of the land ecosystem are affected very differently by very different hazards.
So the datasets are inherently geospatial and have spatial context. The second piece is they have a temporal element because there is a different incidence of these events over the course of time and also projected differently over the course of time. And together, there's no 1 model of, like, the entire earth system. There are so many different subsystems to be taken into account. So bringing those together requires geospatial data transformations, and that's what we've been focused on building. So we tap into earth observation datasets primarily because they are good ground truth for what is happening on the ground in the context of these physical hazards.
And then we tap into frontier climate models that project into the future on what could be potential impacts and exposure across different climate scenarios, which are kind of loosely defined by the IPCC based on their most recent reports. So if you look at projected capability and couple that with observational datasets, you can then look at hindcasts that help calibrate models against each other. So that's kind of the basic framework we are starting with. And built on top of that, we have built a more sophisticated geospatial data processing stack that brings into account land layers and machine learning models so that you can enable asset level assessments of physical climate risk. So going from refining climate data to physical climate data, to existing datasets, to what is meaningful is a journey.
And I would say the primary gap that we're trying to fill is this. When you look at frontier climate models, especially the ones that are global, and we think climate is a global problem, we wanna start with assessments at a global scale. Global climate models inherently are meant for understanding fundamental variables and impacts at a global scale, but they're very coarse resolution. And when it comes to the businesses that we are serving, they don't care so much about whether a country is at risk or, like, a state is at risk. They care a lot more about what's the impact to my holdings, to my assets on the ground, my physical footprint, or my physical supply chain. So bringing it down to asset level and regional level on demand is the capability that we need to stand up, and that's where you need to use these data transformations and machine learning techniques.
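The downscaling step described here, going from a coarse global-model grid to an asset-level number, can be sketched roughly as bilinear interpolation of the grid at an asset's coordinates. This is an illustrative toy, not Sust Global's actual method; the function name, grid values, and cell size are all assumed.

```python
import numpy as np

def sample_hazard(grid, lat0, lon0, cell, lat, lon):
    """Bilinearly interpolate a coarse hazard grid at an asset location.

    grid[i, j] is the value of the cell centred at
    (lat0 + i * cell, lon0 + j * cell).
    """
    # Fractional index of the query point within the grid
    fi = (lat - lat0) / cell
    fj = (lon - lon0) / cell
    i, j = int(fi), int(fj)
    di, dj = fi - i, fj - j
    # Weighted average of the four surrounding cells
    return (grid[i, j] * (1 - di) * (1 - dj)
            + grid[i + 1, j] * di * (1 - dj)
            + grid[i, j + 1] * (1 - di) * dj
            + grid[i + 1, j + 1] * di * dj)

# Toy 1-degree grid of hazard scores (hypothetical values)
grid = np.array([[0.1, 0.2],
                 [0.3, 0.4]])
# Asset sitting halfway between the four cell centres
score = sample_hazard(grid, lat0=37.0, lon0=-123.0, cell=1.0,
                      lat=37.5, lon=-122.5)
```

A real pipeline would add reprojection, hazard-specific bias correction, and the machine-learning layers mentioned above; the point of the sketch is only that asset-level queries need more than a nearest-cell lookup into the coarse model output.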
[00:11:18] Unknown:
Yeah. I can definitely imagine a whole host of challenges and potential approaches that you could take to being able to do things. There are whole businesses that are focused on some of these subcategories of problems of, you know, saying, okay. Based on public records data, I'm able to determine that this, you know, bounding box at these latitude, longitude key pairs are a physical kind of business that a company owns and being able to identify that company and the entity resolution aspects. Like, talk to people where that's their whole business. It's just that piece of it. I've talked to people whose whole business is just being able to say, I can handle this mapping layer and be able to overlay some of this kind of, you know, social and civic data on top of these, you know, raw map tiles to be able to give you some ability to analyze some of this information. That's a whole business right there. And you're saying, okay. We'll take all of that and add another layer on top where we also wanna be able to understand what are the actual weather events, what are the long term trends, how does that map from, you know, discrete weather events into the overarching climate challenges.
So definitely interested in understanding kind of how much of the technology stack and the data manipulations that you're doing are things that you're able to pull off the shelf because other people have kind of treaded some of that ground for you, and how much of it is just, we have to build all of this ourselves and invent new technologies because what you're trying to do is so kind of vertically integrated and complex and inherently needs to be able to address all of those concerns at the same time? That is a trade off, and I feel in this particular context,
[00:12:52] Unknown:
we are building on top of decades of research and work. So, you know, climate models that exist today are kind of Nth generation, as I like to call it. IPCC calls them CMIP6, or the Coupled Model Intercomparison Project version 6, but they're built on top of generations of climate models, some of them written, like, many years ago, and there's a lot of science and gifted thinking that goes into them. So rather than reinvent that, we've been focusing the problem more on how do you make that data meaningful to a business on the ground? Same thing with remote sensing. You know? Like, if you look at Earth observation datasets, there has been an explosion in the context of, like, the amount of data generated from satellites in space.
So finding the right datasets, the right sensors that can be used most effectively, is kind of where we're in a very opportune time to be able to make those choices. So that's kind of where we've focused our time. In terms of the
[00:13:59] Unknown:
ways that organizations are using these data products that you're building, you mentioned that they wanna understand, okay, what is the overall impact of climate on my assets, my physical holdings? I'm just wondering if you can talk to the types of questions that they're asking and some of the ways that you think about structuring your derived data products in a way that allows them to be able to explore those answers?
[00:14:23] Unknown:
My role at Sust Global is a deliberately crafted, unique 1. I head up both the tech and the product functions. And primarily, it comes down to, okay, let's build the right tech and invest in the right technologies necessary to answer very, very specific and very meaningful sets of questions. So I would broadly segment the questions, the query set, if you might, for climate data or physical climate data today into 4 camps. So the first set of questions is around localization and reporting. So where is there a clear concentration of climate exposure, and what is the expected severity? What is the estimated impact to my set of assets? So that's 1 set of questions you can answer, which is, given a large collection of globally dispersed assets, where is the concentration, where is the exposure, and what's the estimated impact?
The second 1 I would call relation, which is how does an exposure from the physical climate lead to impacts in either the distribution, the sourcing, the supply chain, or even the allocation of capital. So that's the relation side of questions. The 3rd bit, I would say, or the 3rd camp, would be what I like to think of as optimization. So if you were to deploy capital across a set of instruments or a set of assets in a climate aware fashion, how would you do it? If you were to deploy capital towards building new infrastructure, which is gonna happen over the next few years, and make that new infrastructure informed by the changing climate, how would you do that, and how would you make those climate investments?
That's the optimization set of questions. And then lastly, I like to think of the camp of adaptation, which is how could teams of users and our customers introduce adaptation measures in a capital efficient way to protect structures, species, land, and natural capital for the decades and generations to come. So at Sust Global, we're at the early days of helping our customers build that understanding and awareness, and that's largely the reporting. That's where the market's at today. But longer term, we are building the software of climate adaptation.
We're building the pathway from answering the basic questions, which is localization and reporting, to relation and optimization that help you connect to larger scale problem sets and larger scale business impacts, and eventually to adaptation, which is kind of where the world needs to be in the years to come if you are to be serious about adapting to the changing climate, and that's the software we build.
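The localization and reporting camp boils down to aggregations over an asset-level exposure table. A minimal sketch, assuming a hypothetical table of per-asset hazard scores (the column names, regions, and values are invented for illustration, not Sust Global's schema):

```python
import pandas as pd

# Hypothetical asset-level exposure table: one row per asset, with a
# 0-1 hazard score per peril.
assets = pd.DataFrame({
    "asset_id": ["a1", "a2", "a3", "a4"],
    "region": ["US-West", "US-West", "EU-South", "EU-South"],
    "wildfire": [0.8, 0.6, 0.1, 0.2],
    "flood": [0.1, 0.2, 0.5, 0.7],
})

# Localization: where is exposure concentrated?  Mean hazard per region,
# then the region with the highest wildfire exposure.
concentration = assets.groupby("region")[["wildfire", "flood"]].mean()
hotspot = concentration["wildfire"].idxmax()
```

The relation, optimization, and adaptation camps layer business data (supply chains, capital allocation) on top of the same table, but the starting primitive is this kind of join-and-aggregate over globally dispersed assets.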
[00:17:24] Unknown:
1 of the interesting aspects of this too is the question of education, where, because climate modeling is a very sophisticated undertaking, it does require a lot of contextual and domain knowledge to be able to interpret all of the ramifications of the information that you're generating. I'm wondering how you think about the customer education aspect of the product that you're building and being able to say, okay. This is the information that we're producing. These are some of the ways that you can and should be thinking about aggregating it because, as with anything that you're doing with analytics, there are certain transformations that are legitimate and actually do provide additional kind of meaning and nuance to a problem. And then there are other ways that you can run an aggregation that, well, it will produce a result, but doesn't actually make any sense.
And I'm curious how you've approached that challenge when working with your customers to be able to say, okay. You know, these are the datasets. These are kind of the contextual semantics around it and being able to bring that into their own analytical stacks and their own data products to be able to say, okay. You've combined our climate information with your information about the amount of capital that you have invested in a building or the amount of inventory that you're storing there. You know, these are the ways that you can combine them to be able to get a useful answer and useful insights out of it. But if you do it this other way, it's actually just gonna give you a number that has no meaning to anybody.
[00:18:46] Unknown:
Yeah. Yeah. And that's a nontrivial problem, especially right now with so many misconceptions around the changing climate. Many of the times, people mistake weather for climate. Many of the times, people misunderstand hazards. Many of the times, people misunderstand causality. So I would say the way we have looked at it, and we're still in the early days of doing this and haven't cracked it 100 percent yet, is we've tried to work on what I think of as a common domain language, which is, okay, what are the basic set of primitives on which we can start conversing on this fairly sophisticated topic without misunderstanding each other? And I would say that starts with data literacy.
And in this context, it's climate data literacy. And I've seen what Valerie Logan and The Data Lodge have tried to do in terms of going through digital transformation and data transformations and enabling companies to do that as, like, a pioneer in that space. And I try to borrow some of those methodologies into what I like to think of as climate data literacy and climate data as a second language. So I take on what are the primary sets of things people need to have a common understanding of and distill that into workshops, hackathons, and deep dives we organize with our customers, because this is not an established domain. Like, outside of catastrophe modeling and the reinsurance and insurance markets, people don't really know what to do with this data, but they have the right motivations.
And I don't wanna trivialize that. I feel like I shouldn't say people don't know how to. I think it's a function of needing an increasing awareness of the capabilities to be able to shape these solutions. Right? And towards that end, you know, we create venues where data scientists and data engineers from the customer side are brought in with solution engineers and climate data scientists from our side to create that shared knowledge base. And once you've done it a few times, you see the replicable pattern, and you can then create the knowledge base that helps make it easier to do it in the future. And to a great extent, almost any technology, like, even in data engineering, be it, you know, some cloud native capability or a new open source package, has kind of evolved that way, where 1 use case has been clear, it's been documented, and then you suddenly have a community of people relating to that same problem set and then going from there.
[00:21:29] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t shirt for being a listener of the data engineering podcast at dataengineeringpodcast.com/rudder. Digging into the technology stack of how you're designing and building your platform, I'm wondering if you can talk to some of the architectural elements there and some of the unique challenges that you've had to address, some of the, you know, engineering efforts that have been maybe most valuable, but also most kind of challenging or frustrating?
[00:22:25] Unknown:
Yeah. Yeah. I see the exciting as well as challenging elements lying a lot in, like, building this climate data warehouse where you have the most relevant datasets, and I would say all the cloud providers are kind of going in that direction, whether it's the Planetary Computer system from Azure slash Microsoft or, like, what's in the Google Earth Engine datasets today. But they're not quite there in terms of all the capabilities and the stack you need. So we've been focused on building on top of those capabilities towards a climate data warehouse, which is very specific to what we need, and keeping that fresh is a nontrivial challenge.
Keeping that well maintained and organized is a nontrivial challenge. And then enabling a feedback structure where on-the-ground intelligence and intelligence from our customers can help us learn to keep that warehouse focused and update that warehouse effectively is another nontrivial challenge. We've also focused a fair bit on building a cloud native workflow. So we have built a distributed data processing system in Golang, and we have spent a fair bit of time on building out data pipelines that can process global scale or planetary scale data for large collections. If you're studying the supply chain of a commodity, that's, like, thousands or hundreds of thousands of assets. So being able to process that is something we've been doing.
And I would say we kind of tuned our system a little bit more for batch mode operations. So I would say now the challenge we're looking at is, okay, how do we also enable rapid turnaround and being able to focus more on caching and querying, because the use cases are expanding and the customer sets we are serving are expanding. So there are some new patterns that we need to support. And I would say in terms of the technology, the most exciting thing that I've discovered through this journey of building out the platform has been building the operational efficiencies where you can have these live streams, be it from satellite or from on-the-ground sensing, get plugged into the warehouse very efficiently and retrain models on demand based on certain triggers you set. So we've architected a system, I would say, the beginnings of a system, that does that effectively, and we intend to, like, build that out over the years to come.
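The trigger-driven retraining described here could be reduced to a freshness-or-volume policy like the sketch below. The function name and thresholds are assumptions for illustration, not Sust Global's actual trigger logic:

```python
import time

def should_retrain(last_trained, new_observations,
                   min_batch=100, max_age_s=86400):
    """Decide whether to kick off a model refresh.

    Retrain when enough new ground observations have accumulated since the
    last run, or when the current model is older than a freshness
    threshold (1 day here, purely illustrative).
    """
    stale = (time.time() - last_trained) > max_age_s
    enough_data = new_observations >= min_batch
    return stale or enough_data
```

In a real system this check would sit behind the warehouse's ingestion events rather than a clock poll, but the shape of the decision, stream freshness versus accumulated batch size, is the same.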
[00:24:56] Unknown:
To that question of sensing, 1 of the other interesting problems in this space is the fact that there has been so much rapid innovation and evolution of the sensing capabilities where it went from you had to rely on government owned satellite imagery that had maybe a daily update cycle if you were lucky to having commercially owned and rapidly expanding satellite imagery capabilities with, you know, different, maybe, wavelengths of light that they're dealing with or, you know, dealing with not just light based imagery, but radio sensing or radar, things like that that maybe have hourly update cycles in some cases, at least over certain geographies.
Another aspect is the fact that there is heterogeneity in terms of the update cycles for those different geographies: maybe North America has a rapid update cycle, while Antarctica is obviously going to be very under-covered because it's such a hard orbital pattern to maintain. And I'm just curious how much of that is something that you've had to engineer around or have to worry about, or are there ways that you're able to normalize the problem so that it is pushed to the boundaries and you don't have to worry about it all the way through to the center of your systems?
[00:26:10] Unknown:
Yeah. So you raise a very interesting point, which is when you have very sophisticated imaging capability, you have to deal with all these boundary issues. And the big point to take away here is that raw satellite data is often not very useful in this context. So in the context of satellite imagery, we often talk about these processing levels. Natively, what a sensor captures is what people like to call L1, and then there's a process to orthorectify that, tying it down to the ground, which gives you what they call L1B. Then there is a mosaicked version, which is kind of what you see on Google Earth, which is your L2. And L3 is a stacked version of it, which adds the time dimension. Imagine Google Earth changing every day; one can think of L3 imagery as something closer to that.
When it gets to that L3 scale, the data becomes useful for the kind of spatiotemporal analysis that we are talking about, which is primarily what you need for geospatial analytics related to climate. We've specifically gone with datasets where that's either already available or very easy to derive. So that's kind of the qualifying criteria, which immediately shrinks your dataset space to a very select few. And then you add on top of that that you want global coverage, and it becomes even simpler for you to make the choice. So we've tried to push the problem more upstream and focus more on the application layer, because that's the thing we uniquely know how to do.
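The processing-level taxonomy described here can be sketched in a few lines. This is a toy illustration, not Sust Global's pipeline: the grids, dates, and function names are invented, and real L2 products are far larger rasters with projection metadata.

```python
from datetime import date

# Toy "L2" mosaics: each is a 2x2 grid of surface temperature (degC)
# for the same geographic tile on a given date. Values are illustrative.
l2_mosaics = {
    date(2022, 6, 1): [[21.0, 22.5], [20.0, 23.0]],
    date(2022, 6, 2): [[22.0, 23.5], [21.0, 24.0]],
    date(2022, 6, 3): [[23.0, 24.5], [22.0, 25.0]],
}

def stack_to_l3(mosaics):
    """Stack per-date L2 grids into an L3-style cube: time x row x col."""
    times = sorted(mosaics)
    return times, [mosaics[t] for t in times]

def pixel_time_series(times, cube, row, col):
    """Extract the time series for one pixel: the access pattern that
    makes L3 data useful for spatiotemporal climate analytics."""
    return [(t, cube[i][row][col]) for i, t in enumerate(times)]

times, cube = stack_to_l3(l2_mosaics)
series = pixel_time_series(times, cube, 0, 0)
trend = series[-1][1] - series[0][1]  # warming over the window
```

The useful property of the L3 form is exactly the one exercised by `pixel_time_series`: any location can be read as a time series, which is what trend and hazard analysis needs.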
[00:27:52] Unknown:
In terms of the ways that the data is presented to your end users, I'm curious what types of data assets you're dealing with, the shape that the data is taking. Are you delivering image data with overlays? Are you delivering tabular or semi structured data? I'm just curious kind of what are the assets that your end users are actually interacting with to be able to then form their own analysis?
[00:28:16] Unknown:
Yeah. So we currently serve 2 products. We have a data API product that provides a base structure. It is not imagery driven; it's largely asset driven, so it's tabular. By asset, I mean physical-location driven, which conforms to the requests that customers make to our API. And then the second product capability is a dashboard, which you can think of as a thin client built on top of the data API, which enables you to visualize the climate-related insights and serves the analyst, or the non-developer, audience.
So the data API basically allows the collection and serving of insights, which are pretty structured. We've looked at tiled datasets and essentially connecting this to the spatial web, things like web feature services, web mapping services, or web tile services, but those are things we have on the road map for the future. Primarily, the data API is serving the analyst and data scientist personas very effectively at the moment, because the insights carry geospatial metadata, so it's very easy to put them on a map and visualize as needed.
So the way we're trying to focus on that is, on top of the API, we've stood up a Python client and a collection of notebooks to make it very easy for people to visualize the data in their own workflows, and that provides a quick start for any Pythonic development workflow. And for others, we can easily solution engineer and stand those up too.
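As a rough illustration of the asset-driven, tabular shape described above: the field names, payload, and threshold here are invented for the sketch and are not Sust Global's actual API schema.

```python
import json

# Hypothetical response for an asset-driven climate risk API: the caller
# submits physical locations, the service returns per-asset risk rows.
raw_response = json.dumps([
    {"asset_id": "plant-001", "lat": 29.76, "lon": -95.37,
     "hazard": "flood", "scenario": "ssp585", "risk_score": 0.82},
    {"asset_id": "plant-002", "lat": 41.88, "lon": -87.63,
     "hazard": "flood", "scenario": "ssp585", "risk_score": 0.31},
])

def parse_asset_risk(payload):
    """Decode the JSON payload into per-asset rows keyed by asset_id."""
    return {row["asset_id"]: row for row in json.loads(payload)}

rows = parse_asset_risk(raw_response)
# Downstream analysis is ordinary tabular work, e.g. flagging hot spots:
high_risk = [a for a, r in rows.items() if r["risk_score"] > 0.5]
```

A real client would issue an authenticated HTTP request; the parsing and filtering pattern would look much the same.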
[00:30:01] Unknown:
Another interesting area to explore is that, in performing analyses, there's always an element of forming a shared understanding, building a story around the insight that you're trying to provide to the people who are interacting with that output. I'm curious what you and your customers have found to be useful patterns or lessons for bringing some of that shared understanding to the end consumers of that analysis, because it relates to climate data, which a lot of people have a complicated relationship with, where it's something that is both inherently local as well as global. Everybody has a part to play in it, but everybody has their own concepts about how they think about that broader global impact and global network of involvement.
[00:30:57] Unknown:
I would say for our customers, the problem set is very clear based on the camps or the questions that I mentioned earlier, so I think the confusion is a little less on that aggregation. If you provide data at the asset or property level, then it is very easy to group, sort, or aggregate the risk to the macro scale. Right? So if that's, say, the issuer of a security, then they can look at their footprint and come up with a consolidation. We work with them and enable them to come up with that logic, and we have some patterns which we can obviously get them started with. So I think the confusion there is less. I think it comes down to this: based on your on-ground footprint or tangible asset footprint, your profile can look very different.
There's also this element of background or benchmarks. Right? So you might have an asset or a site or a property in an area which has very high exposure in the background, and you might not be much worse off or much different from the background. But that might still be a highly elevated risk or exposure profile compared to another area or another site. So that context is often very important, especially in use cases around site selection, where you're confined to a region but you're looking for the lowest level of physical risk exposure in that region. That's inverting the problem towards saying, okay, I know this is the region, but if I were to pick a parcel of land for a specific opportunity, where would I do that? That's different from: I know this is the parcel of land I care about; what's the risk there? So you're looking at both sides of that. I would say that's the macro versus the micro.
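The macro-versus-micro framing above can be sketched as two small functions over the same asset table. Site names, regions, and exposure scores are invented for illustration:

```python
# Macro view: roll asset-level exposure up to a portfolio figure.
# Micro view: compare each site against its regional background.
assets = [
    {"site": "warehouse-a", "region": "gulf-coast", "exposure": 0.70},
    {"site": "warehouse-b", "region": "gulf-coast", "exposure": 0.75},
    {"site": "office-c",   "region": "midwest",    "exposure": 0.20},
]
background = {"gulf-coast": 0.72, "midwest": 0.10}  # regional baselines

def portfolio_exposure(assets):
    """Macro view: simple average across the asset footprint."""
    return sum(a["exposure"] for a in assets) / len(assets)

def vs_background(asset):
    """Micro view: a site can sit near its regional background yet still
    be riskier in absolute terms than a site in a calmer region."""
    return asset["exposure"] - background[asset["region"]]

macro = portfolio_exposure(assets)
deltas = {a["site"]: round(vs_background(a), 2) for a in assets}
# Site selection inverts the question: lowest exposure within a region.
best_gulf = min((a for a in assets if a["region"] == "gulf-coast"),
                key=lambda a: a["exposure"])["site"]
```

Note that `office-c` has the lowest absolute exposure but sits well above its regional background, while the gulf-coast sites sit near theirs: both readings matter, which is the point being made here.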
[00:32:50] Unknown:
Another layer that people are, I'm sure, curious about is this question of kind of time scales where you're looking mostly, I'm sure, at what is my risk right now? What is my risk projected into the near term future or as far out as I could reasonably predict? But there's also value in understanding, okay. I have this information about now, maybe near into the future, but what if I look backwards and I'm looking at annual time scales of what is the impact of climate on these locations and being able to then build up that view of how have changes in the climate progressed from time into the past to where we are now so that I can then get a more nuanced understanding of what these future projections actually mean for me because I can build up this trend of analysis to be able to say from the past through to now into the future.
I'm wondering what you see as ways that that factors into some of these questions and the decision making process.
[00:33:48] Unknown:
Yeah. I feel like that definitely helps shape a deeper understanding, because what has happened is something people can relate to. What could happen in the future is still unknown, and no model can say that with absolute certainty. So being able to surface the data on the historic dimension, and to surface it at a cadence that can be correlated with other outputs and benchmarks, is definitely very valuable towards developing a deeper understanding. Enabling insights and indicators that are relatable in history helps make them even more relatable as forward-looking projections. And forward-looking projection is just hard, primarily because when you look at the climate with a 30-year horizon, so many things could change in that time frame. That's the reason why, with the Sixth Assessment Report, the IPCC documented sets of bundled assumptions, what they call shared socioeconomic pathways, the SSPs.
And that's basically saying, okay, if you make these assumptions, this is how you can enable a projection. So that ties into all the different warming scenarios indirectly. That's kind of what you see in the current literature.
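As a toy illustration of scenario-conditioned projection: the per-decade warming rates below are invented for the sketch and are not IPCC numbers, and real SSP-driven projections come from ensembles of climate models, not a linear rule.

```python
# Each SSP bundles socioeconomic assumptions that imply a different
# warming trajectory; a projection is always conditioned on a scenario.
warming_per_decade = {"ssp126": 0.10, "ssp245": 0.20, "ssp585": 0.35}

def project_anomaly(scenario, decades, baseline=1.1):
    """Linear toy projection of the warming anomaly (degC) on top of a
    present-day baseline, conditioned on one scenario's assumptions."""
    return baseline + warming_per_decade[scenario] * decades

# A 30-year horizon (3 decades) under two different pathways:
low = project_anomaly("ssp126", 3)
high = project_anomaly("ssp585", 3)
```

The spread between `low` and `high` is the point: the 30-year answer depends entirely on which bundle of assumptions you condition on.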
[00:35:17] Unknown:
Bigeye is an industry-leading data observability platform that gives data engineering and data science teams the tools they need to ensure their data is always fresh, accurate, and reliable. Companies like Instacart, Clubhouse, and Udacity use Bigeye's automated data quality monitoring, ML-powered anomaly detection, and granular root cause analysis to proactively detect and resolve issues before they impact the business. Go to dataengineeringpodcast.com/bigeye today to learn more and keep an eye on your data. And when people are building these kinds of aggregate, layered historical views projecting into the future, are there certain key indicators of the climate that they're typically looking at? Like, are they looking at what is the average seasonal temperature and its variance over these years? Are they looking at rainfall?
Like, what are the types of kind of climate information that they're looking at, and how do you help to prevent the pitfall of looking at specific weather events in that process of building the analysis?
[00:36:24] Unknown:
Yeah. Great question. So I feel like oftentimes the questions start with temperature and precipitation, right? These are the more fundamental variables, but now we are seeing increasing awareness around hazards, like the acute hazards, because they're immediately loss making, for example because of increased cyclone risk in certain areas. People in the Gulf of Mexico and in Florida are getting used to the fact that during hurricane season they may have to live somewhere else. So I almost feel that awareness is there. We've seen that expand to other hazard dimensions, for example heat stress and water stress. As you probably know, there's been increasing water stress and drought across many parts of the US, and that's expected to just get worse as there's more warming.
So what does that impact look like? One part of it is the absence of water, but then what about the businesses that are heavily dependent on water, like a mine or a bottling factory? What does that look like across these scenarios, across these longer time horizons? And if you were investing in those businesses, how would you think about the operational burden from drought and water stress in the future? Those are the kinds of problems we're looking more closely into and enabling teams to answer. In terms of weather, that's a lot more near term, and there is a different set of solutions related to that, which we introduce our customers to. We're not trying to solve that problem; we're not trying to do better seasonal prediction, because that requires a different set of expertise.
[00:38:09] Unknown:
Another interesting way to think about interacting with this climate information, particularly since you're looking at near-future projections, is the question of being able to play with some of the outcomes, where you say: based on this action, or the assumption that I'm going to act in this way and that, in aggregate, people in my industry are going to behave similarly, what impact might that have on future climate projections? Particularly if you're looking at industrial-scale operations, where you have billions of dollars worth of economic activity happening.
If those players are saying, okay, based on the data that I have and my assumptions about future climate impact, I am going to make this investment in reducing my emissions by this percentage, assuming that everybody else in the industry is going to do the same, how much of a meaningful insight will that provide? And what are the levels of scale necessary to actually understand what type of meaningful impact that might have, given this set of assumptions?
[00:39:14] Unknown:
Yeah. I feel like, to some extent, the scenario you just laid out is the broad narrative for offsets. Many times there is the understanding that when you have heavy emissions, you're indirectly doing harm to the environment and triggering accelerated warming. So one thing you could do is stop that; the second thing you could do is offset it. And that's where I feel like the credit schemes indirectly enable businesses to contribute back to the holistic system.
But the big challenge in this ecosystem of assessments is the time frame. If you're having heavy emissions today, say you emitted n tons of CO2 equivalent into the environment (it might not be exactly carbon dioxide, or you were involved in some wasteful process), nothing changes immediately. Right? The impacts from those are not apparent right away. They're apparent years from now, globally, and might not even impact your specific operation. So I feel, for having a system of adaptation there, one, you need the climate mitigation measures, which are the common ones: offsetting, reforestation, sequestration.
And on the other side, you need the adaptation measures, where you assume that the incidence of these events is going to increase over time. Balancing those out is the complex bit, and the unknown is that when you hit certain tipping points, the impacts can be a lot worse than right now. The other unknown is what level of emissions triggers those tipping points.
[00:41:17] Unknown:
In terms of being able to get started using the products that you're building at SUST, I'm wondering if you can talk to some of the assumptions that you have about the kind of analytical or data platform capabilities that your customers are working with or the types of technologies or interfaces that you're able to provide and just some of the overall steps involved in being able to get onboarded as a customer and be able to start taking advantage of the kind of derived data products that you're building to be able to incorporate into some of these business questions and being able to help feed that into the decision making processes?
[00:41:51] Unknown:
Yes. I would say, especially if you're looking at things from an API-first standpoint, the assumptions are often about a shared understanding of the data interface. Like, how does data enter your system? How is the data that you are serving back to the customer used in their environments? We've focused a fair bit on that, and we've worked through some early assumptions, and some false assumptions that we've since corrected, into the capability we currently sell. I would also say the way we have been able to work through these assumptions is, just like most commercial APIs, providing effective documentation: a solid API reference that someone can actually pick up, read, and implement against has been very useful. Secondly, providing some good examples of how the data can be used downstream and integrated into workflows is another very valuable thing. So we've gone in with certain assumptions, and in cases where those assumptions are not 100% valid, we try to showcase how we can solution engineer around that rather than serve a bunch of features that we've got to maintain over time. I'm sure that's going to expand over time. We're going to have to build more, but we're trying to be very deliberate around the data capabilities we are serving, because we want to serve them for the long haul.
And for the customers onboarding today, that's what they're getting used to in a production setting this year. Once these go into production, ideally you don't want to change things unless the use case has evolved.
[00:43:33] Unknown:
In your work of building this platform and working with your customers, what are some of the most interesting or innovative or unexpected ways that you've seen your products used?
[00:43:43] Unknown:
It was a pleasant surprise when we had a few teams looking at municipality-level insights when it comes to city planning. Originally, we didn't think of that as a use case, but it was an emergent capability. The second bit: you just touched briefly on carbon offsets, and one of the capabilities we've spent a fair bit of time fleshing out is a wildfire projection capability. As you know, one of the examples of carbon offsetting is planting trees and planting forests. So if there is an afforestation or a reforestation project that is getting commissioned and funded, there's the assumption that when that forest matures many years from now, it's going to pull carbon from the atmosphere, and based on its capacity, you're already providing offsets against that project.
Now, that whole assumption, that whole workflow, assumes the forest is untouched and thriving, but if you have forest fires, that's no longer true. This is where the interaction between the wildland-urban interface and wildfires directly impacts the efficacy of a carbon credit. And that's a use case that came from our customers, which is very innovative and which we really want to support, because it pushes the carbon credit ecosystem towards the harsh reality of what things will look like in the next 20 to 30 years.
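The interaction described here, wildfire risk eroding the expected value of a forestry offset, can be sketched with a simple survival model. The burn probability and tonnage are invented, and real credit methodologies use far more sophisticated risk buffers:

```python
def expected_sequestration(tons_if_intact, annual_burn_prob, years):
    """Expected carbon captured if the forest must survive `years`
    seasons, each with an independent chance of burning down."""
    survival = (1.0 - annual_burn_prob) ** years
    return tons_if_intact * survival

nominal = 100_000.0  # tons CO2e credited if the forest stays untouched
adjusted = expected_sequestration(nominal, annual_burn_prob=0.02, years=30)
haircut = 1.0 - adjusted / nominal  # share of the credit at risk
```

Even a modest 2% annual burn probability compounds over a 30-year horizon into a large haircut on the nominal credit, which is why a wildfire projection changes the economics of the offset.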
[00:45:19] Unknown:
Yeah. It's definitely interesting and great that the information you're providing is able to feed back into not just corporate decision making, but also a better understanding of the programs that have been put in place to account for climate change, the ways that people think about mitigating risk or offsetting their own activities, and digging into those assumptions and making sure that they are actually accurate.
[00:45:49] Unknown:
Absolutely. And, you know, testing those assumptions and holding the whole process for credits accountable. I think that's largely what that ecosystem needs, because outside of a few standardization bodies, it's kind of the Wild West.
[00:46:06] Unknown:
And in your own work of building this product, building this business, working in this ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:46:16] Unknown:
So I would say one unexpected lesson is, when you're starting off with a very small team, or just the team of founders, the first set of customers you onboard really shapes the trajectory of your first few months, or sometimes even your first few years. Most people talk about picking cofounders or investors when it comes to entrepreneurship, but I like to invert that and say what you care about the most is picking the customers you go after, because that shapes the narrative of whether you want to be self-funded or venture-funded and so on.
The second bit: since founding Sust Global, I've been pleasantly surprised by the kind of learnings across remote sensing, data engineering, machine learning, and climate modeling, all of which are related. Data is the common thread among all these domains, but they're each so rich in terms of the context of the data that it's been a constant learning journey for myself and for my team. And building an interdisciplinary set of individuals who come together on this shared mission has been a very exciting journey for me. I'm also pleasantly surprised at what a small team of engineers can accomplish.
And the other learning has been looking at things with a fresh eye and not being biased by what has existed before, because we're operating in a different environment.
[00:47:51] Unknown:
For people who are interested in accounting for climate information in the work that they're doing and in the decisions that they're making, what are the cases where Sust Global is the wrong choice?
[00:48:04] Unknown:
I would say, like I touched on a little earlier, if you care about the very near term, days or weeks rather than months or years, then we're not the right solution. And ESG is very broad, so if you care about the social and governance angles, we're probably not the right solution. We're very focused on the physical environment and physical risk at the moment.
[00:48:19] Unknown:
As you continue to build and iterate on the products that you're building, exploring more of the space, and understanding the questions that are being asked and the ways that you can address them, what are some of the things you have planned for the near to medium term?
[00:48:39] Unknown:
You know, our near-term focus is to get more developer friendly. We're seeing a remarkable set of opportunities in terms of serving not just analyst and data scientist personas, but also developers who want to integrate our capabilities into new dashboards and new products that they are building. So more work on new sustainability applications powered by Sust Global's capabilities is one of the things we are planning for the future. And then, going from hazard exposure to advanced adaptation measures: we're building the software of climate adaptation, and we're just in the early innings of that. So a lot more to follow for that evolution.
[00:49:25] Unknown:
Are there any other aspects of the work that you're doing at SUST Global and this overall space of climate analytics and climate data that we didn't discuss yet that you'd like to cover before we close out the show?
[00:49:36] Unknown:
I would say, you know, climate impacts everybody and all aspects of business. So while we are being very focused on the commercial opportunities, we also live our mission of enabling humanity to thrive on a changing planet with an evolving climate, and we prioritize decisions that minimize even our own carbon footprint. Even when we pick external collaborations, we try to see: are there opportunities for us to collaborate with, say, nonprofits and enable them to have access to our data? It doesn't cost us much when we're already operating the same data stack, and it enables a net positive impact.
So we've been very focused on that, and I feel like that's baked into our culture as a team. As we onboard new individuals onto our team, as employees, as partners, as advisors, we look for that, and that's helped shape our DNA.
[00:50:36] Unknown:
And as a meta note too, if you wanted to call out your kind of hiring efforts, this is a good time for you to do that. Yeah. Absolutely. So we are hiring for
[00:50:47] Unknown:
a senior back-end engineer as well as a lead platform engineer to help us bolster more of our platform. If folks listening have enjoyed what they're hearing, I would love to hear from you, and if you're interested in learning more about these roles, visit the careers page at susglobal.com. Also, for anyone working on large-scale imagery or environmental or climate data in the context of data warehousing and data processing: if the problem sets I've described are interesting to you, whether you're looking for new opportunities or not, I would love to stay in touch and talk more. So feel free to reach out. LinkedIn is the best place to find me, so direct message me there.
[00:51:32] Unknown:
Alright. Well, for anybody who does want to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:51:47] Unknown:
That's a biggie. I would say, when it comes to tooling and data management, people are all looking at it from the lens of the tools they currently use, and thereby they go natively to the cloud provider and the tool stacks that exist there. From our perspective, we're familiar with the kinds of tools and capabilities that exist, but the biggest gap is looking across those tools, some of which are built within the cloud provider's environment and some of which are not. For example, if you're using Mixpanel for evaluating product metrics and Datadog for incident management and incident response, you need to manage different accounts for each.
So I would say having a holistic system, where data management as well as data metrics and insights management are all easy to do together, and which is widely known and used, is one gap I'm seeing. I'm sure there is some solution out there that tries to solve that, but it hasn't become mainstream anywhere I've seen it deployed. So I'd love to hear more about that too.
[00:53:05] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Sust Global and your efforts to help bring more visibility to information about the climate, the ways that it impacts businesses and their operations, and the ways that they make decisions. I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Yeah. Thank you so much. It's been a pleasure joining you, and thank you so much for having me.
[00:53:38] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Gopal Erinjippurath and Sust Global
Gopal's Journey into Data and Climate Analytics
Types of Data and Challenges in Climate Analytics
Building the Technology Stack for Climate Data
Customer Education and Data Literacy in Climate Analytics
Architectural Elements and Engineering Challenges
Data Presentation and User Interaction
Scenario Analysis and Climate Impact Projections
Getting Started with SUST Global's Products
Lessons Learned and Customer Use Cases
Future Plans and Hiring at SUST Global