Summary
Working with unstructured data has typically been a motivation for a data lake. The challenge is imposing enough order on the platform to make it useful. Kirk Marple has spent years working with data systems and the media industry, which inspired him to build a platform for automatically organizing your unstructured assets to make them more valuable. In this episode he shares the goals of the Unstruk Data Warehouse, how it is architected to extract asset metadata and build a searchable knowledge graph from the information, and the myriad ways that the system can be used. If you are wondering how to deal with all of the information that doesn’t fit in your databases or data warehouses, then this episode is for you.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Your host is Tobias Macey and today I’m interviewing Kirk Marple about Unstruk Data, a company that is building a data warehouse for unstructured data that offers automated data preparation via metadata enrichment, integrated compute, and graph-based search
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Unstruk Data is and the story behind it?
- What would you classify as "unstructured data"?
- What are some examples of industries that rely on large or varied sets of unstructured data?
- What are the challenges for analytics that are posed by the different categories of unstructured data?
- What is the current state of the industry for working with unstructured data?
- What are the unique capabilities that Unstruk provides and how does it integrate with the rest of the ecosystem?
- Where does it sit in the overall landscape of data tools?
- Can you describe how the Unstruk data warehouse is implemented?
- What are the assumptions that you had at the start of this project that have been challenged as you started working through the technical implementation and customer trials?
- How has the design and architecture evolved or changed since you began working on it?
- How do you handle versioning of data, given the potential for individual files to be quite large?
- What are some of the considerations that users should have in mind when modeling their data in the warehouse?
- Can you talk through the workflow of ingesting and analyzing data with Unstruk?
- How do you manage data enrichment/integration with structured data sources?
- What are the most interesting, innovative, or unexpected ways that you have seen the technology of Unstruk used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with the Unstruk platform?
- When is Unstruk the wrong choice?
- What do you have planned for the future of Unstruk?
Contact Info
- @KirkMarple on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Unstruk Data
- TIFF
- ROSBag
- HDF5
- Media/Digital Asset Management
- Data Mesh
- SAN
- NAS
- Knowledge Graph
- Entity Extraction
- OCR (Optical Character Recognition)
- Cloud Native
- Cosmos DB
- Azure Functions
- Azure EventHub
- Azure Cognitive Search
- GraphQL
- Knative
- Schema.org
- Pinecone Vector Database
- Dublin Core Metadata Initiative
- Knowledge Management
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Kirk Marple about Unstruk Data, a company that is building a data warehouse for unstructured data that offers automated data preparation via metadata enrichment, integrated compute, and graph-based search. So, Kirk, can you start by introducing yourself? Kirk Marple, CEO of Unstruk Data. Been a longtime engineer and been in this space for a good long time, starting several companies. And do you remember how you first got involved in the area of data management?
[00:02:20] Unknown:
Yeah. I mean, it's really even back to my first job, first couple of jobs out of college, I ended up dealing with file format libraries. I mean, TIFF image libraries and things like that. So I really got an early start in this kind of area of what we now consider unstructured data. And so before we get too much into what you're building at Unstruk, I'm wondering if you can just give a description about what you classify as unstructured data and some of the types of industries
[00:02:46] Unknown:
or use cases that are going to rely heavily on those particular data formats?
[00:02:52] Unknown:
For sure. So all my career I've kind of been in really the media management space, you could say, as a generic term, more in the media entertainment world for a number of years, and the last few years I've really been seeing the applications of what we consider unstructured data to be for industries. And so I would define it really as anything that is a file based format. So, images, video, 3D geometry, even documents, anything that is kind of the opposite of your typical tabular kind of database formats. So we really see it as a broad term. It always is captured as a file. We always have metadata that is inside the data, technical metadata we can extract, and then we manage it as a file based workflow.
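As an illustration of that embedded technical metadata, here is a minimal sketch (not Unstruk's code) of pulling EXIF tags out of an image with the Pillow library; the filename is hypothetical.

```python
# A minimal sketch of extracting the "technical metadata" embedded in a file,
# using Pillow's EXIF support. Field names are standard EXIF, not Unstruk's schema.
from PIL import Image
from PIL.ExifTags import TAGS

def extract_technical_metadata(path: str) -> dict:
    """Return embedded EXIF metadata (timestamp, camera model, GPS, ...) from an image."""
    with Image.open(path) as img:
        exif = img.getexif()
        # Map numeric EXIF tag IDs to human-readable names.
        return {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

print(extract_technical_metadata("site_photo.jpg"))  # hypothetical file
```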
[00:03:36] Unknown:
And I'm wondering if you have dealt with any of the sort of scientific file formats, so things like HDF5 or, I'm sure there are a whole bunch of them that I'm gonna completely forget about, but just sort of the things that come about in, like, the hard sciences for things like molecular structures
[00:03:52] Unknown:
or, you know, I know genomes are usually text based, but some things like that. Yeah. I mean, I personally haven't. The one thing I have is, say, like, a ROS bag file for robotics, maybe the closest thing. And so I started to see a lot of the similarities in what I call sort of a muxed data format, sort of a multi track data format. And that was very common with, like, closed captioning and audio and video back in the broadcast space. And when I started to get more into things like robotics, I could see that essentially that's what a ROS bag format is. It's a time synchronized multitrack data file, and that was kinda where I started to see the parallels of all the types of workflows and systems that we built back in the day really applied to industry as well. So no, I haven't gotten into anything specific in the hard sciences, unfortunately.
[00:04:42] Unknown:
And in terms of what you're building at Unstruk, I'm wondering if you can just give a bit of an overview of the platform and the general intent for the technology and some of the story behind how you ended up building this company and the platform?
[00:04:54] Unknown:
So I started off, I had a video transcoding and a media management company for about a dozen years. This is back in the early days of web video, and I kinda took it right to the edge of when cloud services started to take over. So we were all on prem. We were doing transcoding for the major studios and broadcasters. And after I ended up selling that company and wanted to diversify a bit, I ended up taking a job at one of the major auto manufacturers and started to realize, I mean, the types of data they're using for imagery off the cars, time based telemetry, there's a lot of parallels between, like, what I was doing with closed captioning, you could say, is time based telemetry on video.
And the last 5 years or so, I've been an executive at a bunch of different companies in what you can consider sort of a visual data analytics space, for sports, for drone image analytics, autonomous vehicles. And I started to see, like, wait, a lot of these problems I'm solving at all these different companies are really very similar to the problems I solved back in the media entertainment days. In the back of my mind, I started thinking maybe there's a company starting here. And with COVID and everything in that kinda time frame the last year, it ended up being the right time to start a company. I was able to find funding and get this off the ground.
[00:06:12] Unknown:
And as far as the current state of the ecosystem for being able to deal with unstructured data, a lot of people who are working with this will end up gravitating towards a data lake approach. And I'm wondering if you can just give a bit of an overview of the shortcomings of the data lake, how Unstruk relates to a data lake approach, and maybe some of the ways that the ecosystem of technologies that have been used to deal with this type of data relate to what you're building with Unstruk and just sort of the integration points there?
[00:06:43] Unknown:
For sure. So I kinda came up not from the data science perspective, but really kinda came at this from a different direction. I mean, what we call a MAM, a media asset management system, or a DAM, a digital asset management system, was kinda the world I came from, and I kinda came towards data science and started to learn about data lakes and data warehouses, data meshes, things like that. So there's a lot of similarity. I mean, a data lake to me is mostly just a file system. I mean, it's like an object based file system typically. We had dealt with these back in the days of, like, a SAN or a NAS system for people to have storage, but you lose the metadata part of it. And so there always has to be a metadata layer. How do you search? How do you filter? How do you organize your data?
And what we had started to see was customers that have data sitting in an S3 bucket or a blob store or even local on prem had very poor tooling to find and organize their data. And that's kinda how we see ourselves, really as a layer that sits on top of object storage or a data lake, in most cases, that provides basically that search and filter organizational layer as a knowledge graph. So we pull the metadata. We do entity extraction, entity enrichment, and actually create a true knowledge graph based off of the metadata that is stored in their object storage or data lake. That's actually the first part of it, ingestion, and then there's all of the kind of enrichment and analytics and integrations that we can do on top of that. In terms of the enrichment piece, I'm interested in any sort of experience you've had or experimentation you've done of being able to do things like,
[00:08:20] Unknown:
you know, maybe image recognition or, you know, object tracking within images or video and then being able to feed that back into the metadata layer to be able to add additional information to the entity graph. So being able to do things like identify, you know, if you have a major motion picture, okay, this is the actor who's in this scene, and be able to tag that and add that to the metadata graph to be able to enhance the searching to say, give me all the assets that have, you know, Harrison Ford in them. Exactly. And we're working in an industrial use case. So our case would be, I mean, we have a power plant and say there's
[00:08:57] Unknown:
documents, PDF documents that refer to IDs of equipment. And those IDs might actually be visually recognizable, and we could use OCR to pull those out of the images. We do all these things today. So we do image recognition, we do OCR, we do text extraction and entity extraction off the text, and then we also are doing audio transcription and entity extraction off the audio. We really wanna correlate all those different mediums together. So if you say the word conveyor belt, you could know where that was found in a document. You can know where it was found in a voice memo. You can know where it was on a sign and even where a computer vision algorithm recognized that from being trained to look for a conveyor belt. And I'm wondering what are some of the particular
[00:09:42] Unknown:
use cases that you're targeting or end users that you are orienting your initial product design around and how that informs the particular features and capabilities that you're building out in the initial approach to building Unstruk? We kinda call it visual analytics at its core, but anything with sort of a visual type of inspection in the real world would've been our core thesis. We have people from
[00:10:06] Unknown:
ports to chemical companies to manufacturing plants that are just taking masses of media. And it's not just hundreds of images. It's, I mean, tens of thousands, hundreds of thousands of images and videos, and we've classically heard they're kind of drowning in their data. They've solved the capture problem. They're smart enough to know that they have to take pictures of things, but then they start really getting lost in the volume. And so it becomes a scalability problem where an iPhoto or Google Photos type interface for data that, I mean, you're capturing hundreds of terabytes of imagery per year just isn't gonna work. And so that's really the first place we see people having problems and starting to look for solutions.
[00:10:49] Unknown:
As you're talking about, you know, image capture, particularly in industrial contexts, and maybe doing things like pulling out the asset tag from an image based on the barcode and the serial number that's printed on it, it brings up the question of data quality and image quality, and I'm curious how you've been able to think about factoring that into the Unstruk platform and maybe giving some feedback at the point of capture to be able to say, you know, this is unrecognizable, we need you to retake the image, or, you know, closing the loop on the data capture and data quality piece.
[00:11:21] Unknown:
That's a really good point. And we actually are working on a mobile application to ease the data capture side of this. So this year, I mean, obviously, you could always just take pictures with your iPhone, go back to your desk, sync them up. But the ability to know where you are in a geospatial domain, to know where content has been already taken. So let's say you're walking around a plant. You could know that this side of this wall doesn't have a lot of pictures, but the other side does. So if you call that data quality, I mean, being able to flesh out a more robust sort of capture path is very important. And also we look at it over time. You may not know that this side of your plant hasn't been captured in the last 12 months. And so you can know where to organize, because there's never enough humans to do the job. So we're really about augmenting that process for data capture. Initially, it'll be images, video, audio notes, and then things like adding metadata, adding comments, adding tags, sort of a data collaboration tool, but then even looking at things like 3D point cloud scans and adding those in, and being able to sort of correlate the media that they're taking to the real world in 3D space as well. And in terms of
[00:12:28] Unknown:
the actual technical implementation of this platform, I'm wondering if you can just talk through some of the architectural aspects of being able to manage the metadata layer, manage the actual storage and tracking of the assets, and just the overall kind of complexities that you're dealing with and trying to build the system for managing these unstructured data flows and then be able to automatically build the data extraction and data pipelining into the platform?
[00:12:56] Unknown:
We see it really as two parts. There's the ingestion portion, where it's just a hard job of how do you get thousands or millions of files into a system through basically a cloud native architecture. That's the first pass. And so we are an event driven system. We're cloud native. We actually run on Azure today. So we make use of cloud native services like Azure Cosmos DB, Azure Functions, Azure Event Hub. So we really are an almost infinitely scalable system from that event driven model. And then the database, we're using Cosmos DB both as a graph database and as a document database. So the graph is essentially an index onto the document store, and so we can use it in a hybrid model, and with Azure Cognitive Search as well to provide a full text search interface on top of it. And all of these kind of work together to fulfill a GraphQL API.
Really, once you get the data ingested, the extraction pass is all asynchronous. So everything is done in parallel. Everything is done basically in a queued model through Event Hub and then basically federates back into enriching those nodes in the graph database.
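To make that queued, fan-out enrichment model concrete, here is a conceptual sketch, not Unstruk's code: each ingest event fans out to independent extractors, which run in parallel and federate their results back onto the asset's node in a dict standing in for the graph store.

```python
# Conceptual sketch of the event-driven enrichment model (illustrative only):
# each ingested asset is published as an event to every extractor's queue,
# extractors run in parallel, and each patches its result back onto the
# asset's node in a stand-in graph/document store.
import asyncio

GRAPH: dict[str, dict] = {}  # stand-in for the Cosmos DB graph/document store

async def extractor(name: str, queue: asyncio.Queue) -> None:
    while True:
        asset_id = await queue.get()
        await asyncio.sleep(0.1)  # pretend to run OCR / labeling / transcription
        GRAPH.setdefault(asset_id, {}).setdefault("enrichments", []).append(name)
        queue.task_done()

async def main() -> None:
    names = ("ocr", "vision_labels", "transcription")
    queues = {name: asyncio.Queue() for name in names}  # one event stream per extractor
    # Keep references so the worker tasks aren't garbage collected.
    workers = [asyncio.create_task(extractor(name, queues[name])) for name in names]
    for asset_id in ("img-001", "video-002"):           # simulated ingest events
        for queue in queues.values():
            queue.put_nowait(asset_id)
    await asyncio.gather(*(queue.join() for queue in queues.values()))
    print(GRAPH)

asyncio.run(main())
```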
[00:14:05] Unknown:
As you have set out on the path of trying to build this platform and be able to provide this sort of metadata intelligence about these potentially large files. I'm curious what are some of the design constraints that you've run up against and some of the architectural changes
[00:14:23] Unknown:
or sort of reimaginings that you've had to do in the process from, I have this idea of this company that I'm going to build, to, I actually have this product that other people can start using? It's something I've been working on for several years. I had built most of the back end over the last 5 years, just kind of nights and weekends, and had this vision for something in this realm, and then it started to come together in this last year. The major thing is that we took advantage of Azure and managed services; that's probably the number one kind of assumption. We do have connectors to AWS and GCP so that we can pull data from other systems, but running natively either on prem or on another cloud host is not in the v1 stack today. But it's something that we've already started looking at, kind of wrapping this inside of a Kubernetes ecosystem with Knative and different things like that, keeping an event driven, kind of function based model.
But that's something that we've started to hear glimmers of, people wanting this purely on prem for security reasons. That's a big one. The other is the lack of GPUs in serverless architectures. The more we wanna do 3D work and do server side rendering, Azure Functions and most serverless architectures don't have GPU access. So there's no NVIDIA driver there to do, like, a 3D render. I think that's an interesting one I hadn't necessarily thought about day 1 and kinda realized, like, oh, yeah, we could do sort of little previews of 3D geometry that we bring in. We could do it in the browser. We just can't do it server side. So there's some little things like that that they're not blockers, but ahead of time, I hadn't really thought through the limitations of GPU versus CPU in serverless.
[00:15:59] Unknown:
Because of the fact that some of these files could potentially be gigabytes in size just for an individual object, I'm wondering how you handle things like, you know, versioning of it, being able to manage any sort of transfers, optimizing the processing and extraction of information from these files so that it doesn't explode your usage bill and end up causing you to be sort of upside down in terms of your profit model and just sort of the overall complexities of dealing with these large and complex file objects.
[00:16:30] Unknown:
And so even if we're taking data from, say, an S3 bucket, I mean, we do have to at least read it to index it and to create thumbnails and things like that. So today we're actually using a caching model where we do bring it in. We archive it for a short amount of time, and then we have storage policies where we can either archive it permanently, put it into cold storage, or just throw it away after we're done processing. I mean, you do have to touch the data, in a read only pass, to index it. Everything is really managed in a streaming model, so we don't bring whole files into memory anywhere. That's a real key architectural choice, I think, to your point of not blowing up. We can deal with very large, gigabyte scale files. And my background in the media world is, yeah, you're dealing with terabyte files sometimes. We don't see those file sizes in the industrial use case as much, but we have had to rearchitect around, say, millions of points in a point cloud. Very large point cloud structures have been a critical design point that we've had to put a bit more thought into. And it's good, I have a great front end team that thinks through, okay, like, when are we gonna break our browser? Like, how much can we put into a browser, and think about the memory side of it? I'm more of a back end dev, and so it's good to have that balance of people thinking about it from both sides of where your limitations are.
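Here is a minimal sketch of that streaming constraint, assuming nothing about Unstruk's internals: index a file of any size in fixed-size chunks so memory use stays constant; the hash and byte count stand in for real indexing work like thumbnailing or header parsing.

```python
# Minimal sketch of streaming-style indexing: read fixed-size chunks so that
# memory use stays constant whether the file is megabytes or terabytes.
import hashlib

CHUNK_BYTES = 8 * 1024 * 1024  # 8 MiB per read

def index_file_streaming(path: str) -> dict:
    digest = hashlib.sha256()
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_BYTES):  # at most one chunk in memory
            digest.update(chunk)
            total += len(chunk)
    return {"bytes": total, "sha256": digest.hexdigest()}

print(index_file_streaming("large_capture.mp4"))  # hypothetical file
```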
[00:17:48] Unknown:
Given your current focus on industrial applications, I'm wondering how you have approached the sort of extensibility of the platform to be able to work with additional file formats, additional domains that have any sort of specific formats that they need to work with and different metadata structures and ways to extract
[00:18:08] Unknown:
that. A lot of people think unstructured data is kind of the wild west, that there's no structure to it, but I kinda think there really is a bit of canonicalization there. I mean, we have about 10 or 12 of what we call file types, from images to video to point cloud, that we can bucket everything into. And, I mean, that gets you pretty far down the road. That's probably 95 to 98% of the data or more that we're gonna get. We basically have parsers for each of those types to get metadata out. We have what we call media processors so we can create, like, a visual preview of each of those files. And then the data storage is pretty canonical. I mean, we have a concept of assets and metadata, and files can live anywhere. We have the concept of a site. So you could have a file that lives on premises, but you could have a thumbnail for it that lives in the cloud.
So there is a lot of canonicalization of the schema, and we do support schema.org as a canonical metadata standard internally. But, really, it comes down to having that data in a way that we can use it through our applications, but then exposing that to customers later so, you know, they can build their own applications on top of it. And so you do have to get into some sort of structural schema for all those solutions.
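A rough sketch of that bucketing step, with illustrative buckets and extensions rather than Unstruk's actual set: map any incoming file to one of a small number of canonical file types before choosing a parser and preview path.

```python
# Illustrative canonicalization: bucket an incoming file into a small set of
# canonical file types. The buckets and extension lists here are examples,
# not Unstruk's actual taxonomy.
import mimetypes
from pathlib import Path

EXTENSION_BUCKETS = {
    ".las": "point_cloud", ".laz": "point_cloud", ".ply": "point_cloud",
    ".obj": "geometry", ".glb": "geometry",
    ".bag": "muxed",  # e.g., a ROS bag multi-track capture
}

def canonical_file_type(path: str) -> str:
    ext = Path(path).suffix.lower()
    if ext in EXTENSION_BUCKETS:
        return EXTENSION_BUCKETS[ext]
    mime, _ = mimetypes.guess_type(path)
    if mime:
        major = mime.split("/")[0]  # image/*, video/*, audio/*, ...
        if major in ("image", "video", "audio"):
            return major
        if mime == "application/pdf" or major == "text":
            return "document"
    return "unknown"

print(canonical_file_type("scan_42.las"))      # point_cloud
print(canonical_file_type("walkthrough.mp4"))  # video
```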
[00:19:23] Unknown:
You also mentioned that you have concepts of time and geospatial domains built into the platform for being able to work with this data across location and, you know, chronology. I'm wondering if there are any other domains of representation that you've had to deal with because of the specifics of a particular file format or the way that's being used?
[00:19:44] Unknown:
Yeah. I mean, time, geospatial, and metadata are really the three that we call our kind of triad today, and we aren't a real time system. And so that's really the one case, at least in v1: we are more batch oriented. I mean, it's prerecorded content, you could say. But everything has a time axis, and optionally has a location, like a lat/long within a geofence. And then the metadata, we can usually map to some structured metadata schema. We do wanna head more into a real time concept where we can essentially take incoming streams of data, so the time axis is real time, per se. That's one area. And then, really, as we kinda get into other methods of the geospatial, like, someone's walking around or a car or a robot is driving around, being able to track those sessions, those sort of tracks of data, and connect the dots between the different data elements. Those are things that we're gonna extend to probably later
[00:20:40] Unknown:
this year. And I'm interested in digging a bit further into the actual data modeling aspects. And for industrial use cases, you might have some ontological concepts of, you know, I have a physical location. I have a number of workers. I have equipment. And I'm wondering if you can just talk through some of the data and domain modeling that goes into
[00:21:01] Unknown:
the Unstruk platform and how you think about the extensibility of that and the workflow of specifying a sort of custom ontology for building this entity graph. Yeah. This is really interesting, and this is where a lot of my background in the media space comes in. I did a lot with audio metadata, pulling in data from all the different broadcast studios, trying to kind of correlate all the different music and album and all those different things together. And I started to apply similar methods, but you're right. I mean, it's not as formalized in this domain as you might get, like, in an IMDB structure or an audio, more Spotify type music structure. We've taken our approach, and this is a sort of prescriptive approach: there's a concept of a tag. And so we kinda kept it simple, that most everything maps to a tag in the system. And so there can be user generated tags that a human assigned to a piece of content, or machine generated tags that come from, say, a machine learning or computer vision algorithm, or that we get through entity extraction.
And so, initially, we've tried to map everything to basically a data model that is related to the schema.org model. So we have, like, people, places, things, and so on, and then we have the tagging concept as kind of the extensible side of the ontology. We're starting with that. We're trying to follow a bit more of an 80/20 rule there, because we have seen in the past where every customer needs sales engineering and customization, and we're kinda leaning more on the, okay, let's try and fit the 80% well in a very flexible, easy to use use case and then let customers build on top of it. So if they wanna build their own data models eventually or extend with their own metadata, they can via API. But out of the box, they get a really solid, robust, extensible data model that's somewhat generic in a sense, with this tagging model.
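A rough sketch of that tagging model, with illustrative field names rather than Unstruk's schema: every enrichment, human or machine, lands on an asset as a tag with provenance and an optional schema.org type.

```python
# Illustrative tagging data model: human- and machine-generated tags on an
# asset, with provenance, confidence, and an optional schema.org type.
# Field names are assumptions for illustration, not Unstruk's schema.
from dataclasses import dataclass, field

@dataclass
class Tag:
    name: str                            # e.g., "conveyor belt"
    source: str                          # "human" | "vision_model" | "ocr" | "entity_extraction"
    confidence: float = 1.0              # human tags default to full confidence
    schema_org_type: str | None = None   # e.g., "Person", "Place", "Product"

@dataclass
class Asset:
    uri: str
    tags: list[Tag] = field(default_factory=list)

asset = Asset(uri="s3://plant-photos/boiler-room/0001.jpg")  # hypothetical URI
asset.tags.append(Tag("conveyor belt", source="vision_model", confidence=0.87,
                      schema_org_type="Product"))
asset.tags.append(Tag("needs inspection", source="human"))
```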
[00:22:54] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. And can you talk through the overall workflow of actually getting started with Unstruk, setting it up, integrating your data flow into it, you know, managing the processing, figuring out, you know, what types of information you want to extract,
[00:23:42] Unknown:
the particular focus of the entity graph that you're trying to build, and then maybe building some analyses on top of the metadata layer? For sure. So we have the concept of a site. You can either upload content directly through our web application or you can connect a site, say it's an S3 bucket. Say you have a thousand files in your S3 bucket. You basically create this folder through a user interface or through the GraphQL API that automatically starts synchronizing data in. We can enumerate all the files in that site and start ingesting it. We bring the content into our object storage and start indexing data out of that. We first start with indexing the technical metadata. So we know, okay, this is an image, this is a video. We can get to another level of granularity of, is this a very large image? And then we prepare it differently. So we create tiled versions of images, or we create geometry previews. So we do look at the metadata and use that to auto decide our workflow for the presentation and visualization side. But then we also go through an enrichment process where we say, this is an image, we know we run it through computer vision algorithms to do labeling. And so, essentially, there's a dynamic workflow graph that's created, and we use the serverless functions to asynchronously create operations, essentially workers for processing each of these files, and then get an asynchronous result when the file is completed that goes on to essentially a state machine that says, I'll continue processing this file until it's been fully enriched.
And so there's really no, quote, waiting for any single file to be done. It's all asynchronous. It's all sort of this state machine model. Once the content is fully enriched, now you have all that data in the graph. And then it can be used by our user interface. It can be used from an API or whatever. But then there's also the notification. We have an event mechanism where we can send events when assets have been completely enriched. And those events, today, we push back through SignalR, since we're in an Azure ecosystem, so the browser can get a notification saying, hey, this asset has been fully enriched and fully ingested. We also have webhook integrations. We have Kafka integrations. We can actually send events out to external systems very easily. And, really, it's very flexible from that point of view, where we can keep that event driven architecture on the integration side outbound.
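A minimal sketch of that "connect a site" step, assuming boto3 and a hypothetical emit_ingest_event helper standing in for the event pipeline: enumerate every object in a connected S3 bucket and emit one ingest event per file.

```python
# Minimal sketch of connecting an S3 "site": enumerate the bucket with boto3
# and emit one ingest event per file. emit_ingest_event is a hypothetical
# placeholder for publishing into the event-driven pipeline.
import boto3

def emit_ingest_event(event: dict) -> None:
    print("ingest:", event)  # placeholder; a real system would queue this

def sync_site(bucket: str, prefix: str = "") -> int:
    s3 = boto3.client("s3")
    count = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            emit_ingest_event({"uri": f"s3://{bucket}/{obj['Key']}", "bytes": obj["Size"]})
            count += 1
    return count

print(sync_site("plant-photos"), "files queued")  # hypothetical bucket name
```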
[00:26:11] Unknown:
And that bidirectional event flow and integration definitely sounds like it gives a lot of flexibility. I'm wondering if you can just talk through some of the ways that somebody would maybe add their own custom system for being able to provide data enrichment where maybe they have some internal API that they can use to be able to pull internal data to determine, you know, asset tagging or, you know, employee records or things like that to be able to populate into the assets that are being processed or just being able to also feed that into maybe a machine learning or predictive model based on the metadata that's being extracted?
[00:26:51] Unknown:
Yeah. I mean, two really good examples. One is, say you've ingested an asset and there's an ID that's created that you wanna enrich: say, look up this ID in the database. A really simple example. I mean, this happens a lot with, like, equipment IDs. So we can essentially send a webhook out to a system that the customer is running, give them formatted JSON that has: here's the asset, here's the tags we've extracted, here's all the metadata essentially that we know about. You go look it up in your database and essentially do an upsert, a patch, back to our database. So they could either just call the GraphQL API back or figure out some other method to send us an event back. I mean, we're essentially doing that internally for how some of our computer vision algorithms work, and that's something we can just, through a simple webhook model, expose to customers.
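That loop might look something like the following sketch. The payload shape, endpoint URL, and updateAsset mutation are all hypothetical stand-ins; only the flow (receive asset metadata, look up the equipment ID internally, patch tags back via GraphQL) comes from the description above.

```python
# Hedged sketch of a customer-side webhook enricher using Flask. The payload
# fields, GraphQL endpoint, and updateAsset mutation are hypothetical; the
# flow (receive asset JSON, look up the ID, patch tags back) is the point.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
GRAPHQL_URL = "https://example.invalid/graphql"  # hypothetical endpoint

def lookup_equipment(equipment_id: str) -> dict:
    # Placeholder for the customer's own internal database lookup.
    return {"site": "plant-7", "manufacturer": "Acme"}

@app.post("/enrich")
def enrich():
    payload = request.get_json()                 # asset + extracted tags
    record = lookup_equipment(payload["equipment_id"])
    mutation = """
      mutation($id: ID!, $tags: [String!]!) {
        updateAsset(id: $id, tags: $tags) { id }   # hypothetical mutation
      }"""
    requests.post(GRAPHQL_URL, json={
        "query": mutation,
        "variables": {"id": payload["asset_id"],
                      "tags": [record["site"], record["manufacturer"]]},
    }, timeout=10)
    return jsonify({"status": "enriched"})

if __name__ == "__main__":
    app.run(port=8080)
```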
And as long as they're authenticated to basically patch that data back into the graph, they can enrich with whatever data they want from external databases. That's one case. And then the other case: say we actually wanna send them, upon ingestion, here's the URL to this master file, here's the image that we've generated. And if you wanna do your own labeling on it, you can run it in your own SageMaker model or run it on some public computer vision API and then give us the, like, COCO JSON back. So we have a couple machine learning partners who just build models today, and we're looking to partner with them to essentially give them those hooks, since they have bespoke models that are tuned for, like, rust or corrosion or specific things in the industrial use case, but not the data management side. We kinda call it bring your own compute, and that's really what we're incorporating, where people have bespoke algorithms and they wanna basically easily just plug them in and get the results back. And so we're looking for that kinda easy button for ML model serving and potentially even a model marketplace, where customers may just wanna plug in their own models and put that metadata back into our knowledge graph. Yeah. The model marketplace aspect definitely sounds interesting of being able to
[00:29:02] Unknown:
create, you know, a commons of everybody being able to contribute to leveling up each other's capabilities of, you know, I have this model that's been well trained on, like you pointed out, you know, rust detection for metal based equipment, or, you know, I need to be able to detect manufacturing defects based on these sorts of attributes, or, you know, being able to do sort of transfer learning from those neural nets built on these images would definitely be valuable. So that's an interesting aspect of the kind of unstructured data
[00:29:36] Unknown:
layer that you're building, turning it into a marketplace as well. Yeah. I should have mentioned, we're on a consumption billing model. So, really, it just becomes sort of pennies on the dollar. We're all based on scalability of how much data you process. So doing a model marketplace or having those extra hooks is very easy. It ends up just being an additional little margin on top for the customer, depending on how many models they wanna run. Or maybe they wanna run four in parallel and see which seems best. I mean, almost automatic ensembling.
[00:30:03] Unknown:
I'm wondering if you can share some of the most interesting or innovative or unexpected ways that you've seen the technology that you're building at Unstruk used and maybe some of the particular industries or data domains that you maybe didn't expect it to be useful for?
[00:30:19] Unknown:
It's fun hearing from some of these folks that there's still a lot of very manual workflows happening in industrial use cases, even, like, the ports. And I never would have thought that they would be such a good use case, but there's a ton of visual analysis that happens, because if you think about it, they're changing daily. There's always something new happening, and it's geospatial in that realm where it's a pretty wide space that all this is happening in, and there's constant inspection. There are people going around taking pictures, images. And so we had seen people doing a very manual workflow, literally putting printouts on a wall to organize data, or having to do very classic, kind of old school solutions that worked but didn't scale well. And so this is an area that I really feel we can lean into very easily, and it's just a classic digitalization kind of problem of a more manual workflow. So that's one area I'm really excited about, and we're seeing the common problems as we talk to more and more of these similar customers.
Manufacturing plants are another one where, I mean, like, a chicken that goes on an assembly line has a camera on it. I learned that and didn't really realize that there's real time analytics happening in food manufacturing that is beyond what I'd really expected. So there's a kinda historical use case, but there's also a real time use case of, like, slip and fall handling and security and all these other things. And one of our customers or investors is an energy company, and the idea of somebody leaving a helmet on the deck of an oil rig is something where they might wanna go back and reprocess historical data, but also have a real time aspect to it too. So we really see kind of all these different sides from the time and geospatial aspect and the metadata. And really getting into enabling data collaboration is the one thing we believe we can bring to this. It's really an untapped resource. I mean, it happens very manually, sharing reports and copying, here's a URL to a spreadsheet or something like that, but the ability to kinda integrate kinda Slack like collaboration
[00:32:23] Unknown:
with this extracted knowledge graph is really what we're trying to get to and help customers out with that. In your experience of building out the technology and building the company, I'm curious, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:32:38] Unknown:
The big one is, I mean, you always have to make assumptions. And I was a bit surprised that there were as many people that actually wanted to see a solution like this on prem. I mean, it's still in the minority, but there's always a security aspect. I mean, there's always some level of that. It's maybe kinda 80/20 right now, cloud native versus on prem, from what we're hearing from customers. That's a bit higher than I would have thought, but I think that's an area that we're gonna have to look at. I mean, maybe a bit of rearchitecture, a bit of flexibility around kind of packaging this up for other ecosystems. So I think that was probably the most surprising thing so far that we've seen.
[00:33:12] Unknown:
And in terms of the building blocks and technologies that you've been able to use, I'm curious, you know, given your history in the industry, what your thoughts are on the ecosystem effects that have made something like this possible and some of the pieces that you think are still missing in the overall space of dealing with unstructured data and being able to do, you know, analysis and metadata extraction and metadata management for these types of files?
[00:33:40] Unknown:
It's actually not that unexpected to see a solution like this in the media entertainment space. I mean, I was building solutions like this 10, 15 years ago. Video editors have them, video archives have them; metadata is pretty common. I mean, we have conferences about metadata standards in that world. But I just think it's been a bit overlooked in the industrial area, and a lot of it is still very DIY. I mean, to solve the problem we're solving, you'd have to be a Python programmer, or there's some open source tooling you could put together. But as a SaaS service that gives you kinda 80% of what you need out of the box, and it's just kind of more of a prosumer, easy to use architecture, that's where we really saw the value. And I think we've really seen that there's a lot of learnings that you could take from other industries and hopefully accelerate the kind of data management side of the world. And we're really just trying to be an easy button to get people a bit higher and not have to build these layers themselves. We're not gonna solve every problem, and I'm not gonna say we're gonna build every model in the world or every piece of tooling in the world, but we kinda wanna be that platform in the middle that gets people in to do their job faster. For people who are dealing with unstructured data, maybe they have an existing data lake, or they just have a pile of files on a network drive somewhere. What are the cases where Unstruk is the wrong choice?
I think it's scale. I mean, if you only have a hundred images, we're not gonna be cost efficient. It'll still work, but you could probably do it a bit manually, even just in your browser or in your OS viewers and things like that. But it's really once you start to get into the thousands and more that's where it starts to be effective. The ability to literally connect up an S3 bucket with 100,000 random images or videos and have it searchable and also have an API to that data, if that's valuable to you, then we're the right choice. But there's gonna be definitely cases where there's just not enough data and we're overkill.
[00:35:33] Unknown:
Tying back to the data quality aspect, you know, beyond just issues with the actual source files themselves, there's the metadata question of how you handle things like missing or inconsistent or incomplete metadata or just completely wrong metadata where maybe your GPS is miscalibrated or the time is set wrong on your camera or things like that.
[00:35:54] Unknown:
Yeah. And that's one of the, I mean, you could call it sort of anomaly detection. That's an area that we're really gonna double down into a bit more, and we're actually talking to different companies about, like, similarity search, like Pinecone. I'll give them a shout out. I think you've had them on here. But, I mean, there's a lot of technologies out there that really give some good functionality that we can leverage to look both for similar data but also dissimilar data, and help with that data quality side of things. So we're definitely talking to different partners and different folks. That whole data quality of finding the needle in a haystack for customers is important, but also, where are the rocks in the haystack that they don't want? That's an area that we'll push into a bit more.
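To make the "dissimilar data" idea concrete, here is a small sketch under stated assumptions: given embedding vectors for assets (from any image or text model, or served out of a vector database like Pinecone), flag the assets farthest from the collection's centroid as candidate quality outliers.

```python
# Illustrative dissimilarity check: flag the assets whose embeddings are
# least similar (cosine) to the collection's centroid. The embeddings here
# are random stand-ins; a real system would use model-generated vectors.
import numpy as np

def flag_outliers(embeddings: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Return indices of the top_k most dissimilar assets."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    similarity = unit @ centroid           # cosine similarity to the centroid
    return np.argsort(similarity)[:top_k]  # least similar first

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 128))     # stand-in embeddings
print(flag_outliers(vectors))
```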
[00:36:36] Unknown:
And as you continue to build out the platform and the business for Unstruk, what are some of the things that you have planned for the near to medium term, and what are some of the projects that you're particularly excited
[00:36:47] Unknown:
for? Yeah. I mean, the rest of this year, we're launching this quarter. In about 6 weeks, we're gonna get out in the market. And then in Q3, we're focused on data collaboration, so the ability to comment and annotate and discuss with your team around your data. We're building a mobile application, which I'm really excited about, because that really makes the capture loop more efficient. We've heard from customers that they're not at their desk very often; they're walking around six hours a day in their facility.
So that kind of tool allows them to have a much more effective layer of metadata and also be able to see what's around them in a way that they can get more robust data capture. So those are some really big things I'm excited about for the next quarter. And then really past that, real time data integration is key, and then the big bring your own compute model, being able to have anybody build models around this data and being able to integrate with all the larger players in that ecosystem, from annotation to model training to MLOps.
[00:37:48] Unknown:
Are there any other aspects of the work that you're doing at Unstruk or the overall space of unstructured data and metadata management that we didn't discuss yet that you'd like to cover before we close out the show? I mean, I think just standardization is an interesting area. I mean, there has been work done, like, for schema.org
[00:38:03] Unknown:
and Dublin Core and kind of in the knowledge management space. And I think for integrating with different systems, I mean, when you have different databases and data warehouses, especially from the larger players, that lack of standardization, I think, is an area that I would love to double down on later this year and start to see, I mean, how can we share data between platforms and between data warehouses? And, I mean, there's always gonna be some uniqueness,
[00:38:27] Unknown:
but coming up with a common object model, a common data model, is something I'd like to push harder on later this year. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:38:47] Unknown:
I think the biggest thing is it really always comes down to: if you're a developer or have access to developers, it's maybe not that hard. But for the, quote, citizen data manager or data analyst, there isn't good tooling that's as easy to use as an iPhoto or Google Photos or things like that. And what we found is that market is not small, and there's a lot of people that are capturing this type of data that aren't developers or don't have access to developers. And so that's really what we saw as the limitation and the gap. But you also wanna be able to bring in developers someday. A lot of these companies may wanna build an ML team next year. So we wanna have a way to work both in that DIY model and in the kind of self help, citizen data analyst model. So that's kinda how we see it. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Unstruk. It's definitely a very interesting product and an interesting problem domain, so it's exciting to see somebody starting to tackle it in
[00:39:44] Unknown:
a sort of standardized way. Excited to see where you take it. So definitely best of luck on continuing with the product launch, and thank you again for your time, and I hope you have a good rest of your day. Yeah. Thank you so much. Appreciate it. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python community and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Kirk Marple: Unstruk Data Overview
Defining Unstructured Data and Use Cases
Building Unstruk: Company and Platform Background
Data Lakes vs. Unstruk: Integration and Ecosystem
Metadata Enrichment and Image Recognition
Target Use Cases and Initial Product Design
Technical Implementation and Architecture
Design Constraints and Architectural Challenges
Handling Large Files and Data Quality
Extensibility and Domain Modeling
Getting Started with Unstruk
Custom Enrichment and Model Integration
Innovative Use Cases and Industry Applications
Lessons Learned and Ecosystem Effects
When Unstruk is the Wrong Choice
Data Quality and Metadata Challenges
Future Plans and Exciting Projects
Standardization and Data Sharing
Biggest Gap in Data Management Tooling