Summary
We have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are working to impart a sense of hearing to our myriad devices with their sound recognition technology. In this episode Dr. Chris Mitchell and Dr. Thomas le Cornu describe the challenges they face in collecting and labelling high quality data to make this possible, including the lack of a publicly available collection of audio samples to work from, the need for custom metadata throughout the processing pipeline, and the need for customized data processing tools for working with sound data. This was a great conversation about the complexities of working in a niche domain of data analysis and how to build a pipeline of high quality data from collection to analysis.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Dr. Chris Mitchell and Dr. Thomas le Cornu about Audio Analytic, a company that is building sound recognition technology that is giving machines a sense of hearing beyond speech and music
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you are building at Audio Analytic?
- What was your motivation for building an AI platform for sound recognition?
- What are some of the ways that your platform is being used?
- What are the unique challenges that you have faced in working with arbitrary sound data?
- How do you handle the collection and labelling of the source data that you rely on for building your models?
- Beyond just collection and storage, what is your process for defining a taxonomy of the audio data that you are working with?
- How has the taxonomy had to evolve, and what assumptions have had to change, as you progressed in building the data set and the resulting models?
- challenges of building an embeddable AI model
- update cycle
- difficulty of identifying relevant audio and dealing with literal noise in the input data
- rights and ownership challenges in collection of source data
- What was your design process for constructing a pipeline for the audio data that you need to process?
- Can you describe how your overall data management system is architected?
- How has that architecture evolved since you first began building and using it?
- A majority of data tools are oriented around, and optimized for, collection and processing of textual data. How much off-the-shelf technology have you been able to use for working with audio?
- What are some of the assumptions that you made at the start which have been shown to be inaccurate or in need of reconsidering?
- How do you address variability in the duration of source samples in the processing pipeline?
- How much of an issue do you face as a result of the variable quality of microphones in the embedded devices where the model is being run?
- What are the limitations of the model in dealing with complex and layered audio environments?
- How has the testing and evaluation of your model fed back into your strategies for collecting source data?
- What are some of the weirdest or most unusual sounds that you have worked with?
- What have been the most interesting, unexpected, or challenging lessons that you have learned in the process of building the technology and business of Audio Analytic?
- What do you have planned for the future of the company?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Audio Analytic
- Anechoic Chamber
- EXIF Data
- ID3 Tags
- Polyphonic Sound Detection Score
- ICASSP
- CES
- M0+ ARM Processor
- Context Systems Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things. That's the numbers 9 7, and then "things", to add your voice and share your hard earned expertise. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host is Tobias Macey, and today I'm interviewing Dr. Chris Mitchell and Dr. Thomas le Cornu about Audio Analytic, a company that is building sound recognition technology that is giving machines a sense of hearing beyond speech and music. So, Chris, can you start by introducing yourself?
[00:01:56] Unknown:
Yeah. Sure. So, Chris, CEO and founder of Audio Analytic.
[00:02:02] Unknown:
And Thomas, about you?
[00:02:04] Unknown:
Hi. Yeah. So, Tom, data engineering lead at Audio Analytic.
[00:02:09] Unknown:
And so going back to you, Chris, do you remember how you first got involved in the area of data management?
[00:02:14] Unknown:
Well, yes. So I did a PhD in audio classification, so I suppose that's the place where you could say I got my start in it, largely dealing with all of the fun constraints of academic research, which is sort of smaller data sets than you'd like but still larger than you were used to dealing with, as well as all of the fun challenges of building the technology and doing the fundamental science as well. So that's where I got my start. And, Tom, do you remember how you first got involved in data management? Yeah. Similar thing to Chris. It was during
[00:02:47] Unknown:
my PhD. I was working with different datasets and, you know, just dealing with them more on disk and stuff, and then moving to work at a biology institute and working more with computer vision and realizing that kind of having massive datasets
[00:03:01] Unknown:
just on the file system is not great, and then moving to Audio Analytic, where you see a new way of doing things. Yeah. And so in terms of Audio Analytic, can you give a bit of a description about what it is that you're building there? And what was your motivation for building an AI platform for sound recognition and getting the business started? Yeah. Sure. So, as I sort of said, I
[00:03:21] Unknown:
did some research in the field and found that people weren't tackling sounds. There was a lot of work going on in the speech field, there was a lot of work going on in the music field, and, obviously, the broader classification fields, so image and text, etcetera. But sound itself has its own set of unique challenges. In comparison to, say, speech, you don't have language models to work with, so you can't constrain the acoustic patterns you're looking for in that sort of way, and you have very large, open set data sort of problems. So, obviously, the sounds that you're looking to detect, you also try and differentiate them from the large number of other sounds that can happen in the world, that can happen at any point. Obviously, sounds are relatively random in that respect.
So what I was interested in is, could you make a sound recognition system that could capture a broad sense of hearing? That's normally around a range of target sounds to be detected. So whether it be safety and security target sounds such as glass break or smoke and CO alarms going off, whether they be sort of health and well-being sounds of coughing, sneezing, that sort of thing, or whether they be entertainment sounds or communication related sounds, you can start looking at this world of sounds, and then you can imagine what you could do from a product design perspective if products have a sense of hearing, whether that be mobile phones, headphones, smart speakers, or smart home. Giving them that sense of hearing means that those devices can respond more naturally, in the way you and I would do if those things were happening around us, and then they can take intelligent action. So that was the sort of the motivation for it. At a personal level, the motivation for it is, I just like machines that make strange noises, so it's quite a natural extension for me to like machines that could classify those noises into various different
[00:05:30] Unknown:
classes and then give the outcome. So quite a sort of visceral personal love of sound. And you touched a little bit on some of the contexts in which your product is being used, but can you give a bit of a taste of the types of use cases that it's intended to empower and some of the ways that it's actually being employed? Yeah. So let's do it by device; that's probably the easiest way. So if we take a device like a smart speaker
[00:05:54] Unknown:
and you want to be able to turn it into a sort of home security device, and you wanna know if somebody's breaking into the house from the sound of the windows being broken as somebody enters the property, then you're listening out for that sound. There's, what, four different major types of glass: laminated, plate, wired, tempered; different sizes, different thicknesses, and, obviously, breaking with different implements. So you're very quickly into a large data management problem, and that's just dealing with the target sound, let alone all the non target sounds, such as, I don't know, if you've got cats, then knocking things off the work surfaces in kitchens that may be confusable with the types of sound you're trying to detect.
So that's the sort of smart speaker side of what's called audio event detection. That's detecting specific sounds, in that case glass windows being broken. If we move on from event detection to something like scene detection, this isn't a single sound; this is a sort of combination of sounds, or soundscape detection. That might be around detecting whether it sounds like somebody's at a train station or, you know, a coffee shop, whether it's a physical scene in that case or whether it's an acoustic scene: so whether it sounds calm, whether it sounds lively or not, or indeed whether it sounds like it's inside or outside. Those would be examples of acoustic and physical scene detection. And both of those sit under what's called sound recognition, which is the field in which the company leads.
[00:07:33] Unknown:
And it seems that at least the majority of the use cases that you're discussing now are more consumer oriented, for people to be able to take advantage of some of this intelligence to enhance their sense of well-being or get some sort of feedback about their environment. I'm wondering if you've also experimented at all with use in industrial contexts, where particular types of sound might be indicative of some type of imminent failure in terms of structural or manufacturing issues, or, you know, maybe in mining, where certain sounds might be indicators of some type of physical risk. I'm wondering if that's something that you've looked at at all or something that you're intending to branch out into. We obviously looked at, when we started, a whole range of different applications of the technology,
[00:08:18] Unknown:
given it's sort of a foundational technology in that respect. Yes, we looked at what the area you described might be; for me it would be called predictive maintenance or something of that nature. The commercial activity of the company is largely focused towards consumer electronics. It's where we've had the most success commercially, wide scale, you know, adoption. So that's the bulk of the commercial effort, and that then obviously translates into the thrust of the sounds we're detecting. Most of this world can be described in terms of breaking it down by the type of sounds.
So, obviously, the sounds you'd get in a production plant, I think is the example you'd use, would be very different than the sorts of sounds you and I would care about in our house or if we're out and about on the street. They do end up being very different sounds, and we capture that in terms of the taxonomy that we use to structure our data. And my understanding
[00:09:16] Unknown:
of the way that you actually deploy your product is that it's an embeddable AI model that other companies can license and include within their own products. So I'm wondering what types of challenges that poses in terms of the deployment mechanism and the types of interfaces that you provide to those companies to be able to take advantage of your technology and just issues in terms of updating the model definition if there are any changes or enhancements that you make to it?
[00:09:44] Unknown:
So, yeah, for sound recognition, because sound inherently goes hand in hand with privacy concerns for obvious reasons, we believe that sound recognition under a large range of use cases is best done at the edge of the network. Obviously, all the audio data can stay there for the means of classification. You don't need to transmit it off the device, which gives you cost and other economic benefits and scalability benefits. More on the con side of that approach, clearly, you don't get to update the models as quickly as you would if it was a, you know, SaaS based model or something like that. We don't tend to see that being much of a commercial issue. Most of the firmware now on consumer electronics devices is updated reasonably regularly, and we also know that when the customers do want to get that updated, it's something they can easily push out to the end users.
On the general point of the challenges it faces, it means that you need to know quite a lot about the subject matter variability that you're trying to detect, so that's where the quality of the datasets comes in. But, generally, consumers don't tolerate lots of failures out of classification systems, and especially not around the fault tolerant sort of, I don't know, aspects of security or safety. Where, you know, if I told you, Tobias, your house was being broken into now because I heard a window being broken, if you rushed home, if you're not already there, which is highly likely you are given the circumstances, but if you did rush home and found nothing, you're not gonna tolerate many of those as false alarms. So, actually, you want the models to be pretty well structured and understanding most of the variability they're gonna come in contact with. Otherwise, the overall value proposition doesn't work very well. So that sort of aligns with this notion of being edge based in the large number of use cases that are applied to sound recognition. And given the fact that you're building these AI models and everyone knows that it's garbage in, garbage out, and you
[00:11:55] Unknown:
highlighted the fact that you have to ensure a high amount of quality in the input data. And I'm wondering what are some of the unique challenges that you are facing in terms of being able to collect and label and create a taxonomy around these arbitrary sounds and being able to ensure that you can correlate them with some sort of meaningful event? It's a great question. Let me just give you the broad brush strokes, and then Tom deals with this on filling out that taxonomy. Obviously, part of the great job he does is around sort of doing what we call sound expansion.
[00:12:30] Unknown:
So on the taxonomy side, we break things down at the top level into three parts: there's anthrophony, geophony, and biophony, but there's sort of 700 label types that we're dealing with on a daily basis inside the system. A label type would be something like a glass window being broken or a smoke or CO alarm going off. So this is a large set of classes you're dealing with on the target and non target sound side. Just before Tom answers the sort of practical problem, on the AI side there's also a range of specialized things you need to do. Clearly, if you take, I don't know, an off the shelf speech recognition system, the acoustic model is designed for our voice box, and, clearly, a large number of the sounds we deal with are produced by humans, and an even larger number aren't produced by humans using their mouths. So, you know, there's quite a lot of issues there on the acoustic model side. And then, as I said earlier on, that language model that the speech recognition companies rely on very heavily does quite a lot of the heavy lifting in correcting the errors made by the acoustic model. Clearly, when somebody breaks your window, in the example I'm using, it's not trying to speak to you in any structured way, so you don't have that language model. So there's also fundamental AI things you need to solve before you even start, which you can only do with good quality data. So you need to both get the garbage in, garbage out principle sorted, and then you can tackle the AI side off the back of that, and sort of the out of the box techniques don't work.
[00:14:08] Unknown:
In terms of the day to day stuff, Tom, you're best placed to explain some of the challenges we face there and the tools we use to overcome them. Yeah. Sure. So for most of the stuff we work on, like a data collection, you can consider it as, you know, a project type approach. We'll be given some notion of a sound that we want to work on, and we'll attack it in different stages. The first part will be considering the spec that we want to develop for it. Perhaps for certain sounds we're more strict around certain criteria, and for other sounds less so. And that spec's really gonna be important in guiding us through the rest of the process, in terms of how we go about the data collection, right through to the labeling and the metadata associated with it. So consider, for example, something like a dog bark sound. You can consider all of the different degrees of variation that dogs exhibit, if you like. You have small dogs, you have big dogs, different breeds. Perhaps the age will affect the sound the dog will make when it barks, and other sorts of factors. So for each sound that we work on, we have to do this massive brainstorming exercise to make sure that the data we're gonna gather is gonna be valuable. It's similar to speech recognition when you're designing datasets, right? You have to make sure that they have all of the different phonemes or whatever it may be that you're interested in. So then we'll have developed this plan for the data that we actually want to collect, and then there'll be a stage where we have to essentially source this data. Unlike some sounds, like perhaps if you're doing a smoke alarm type sound where you can buy the sensors, for a dog, you know, you can't just get a dog off a shelf, as it were. So you have to source these dogs in some way. And so we'll reach out to the volunteers that we have. We have a global volunteer network, and so we can draw on those people to provide us the sources for these sounds. We're able to just email these people, and, yeah, for some amount of money in return, they'll be able to provide us with a dog in this situation.
So we'll line up a load of people to help do the recordings, and then a situation like that will probably go into our anechoic or semi-anechoic chambers at our sound labs, and we make sure we record the sounds, you know, to a really good standard, as you said, about the garbage in, garbage out. But, of course, there's a few more bits of variation there that you've really got to focus on, and one of those areas will be the different channels, or the microphones, if you like. You can also consider, for example, something like a dog, you know, that's gonna be running around and moving around, so that presents its own challenges. And I have to say, in my experience working at the company, no one sound is really the same. There's always challenges that you don't expect when you first start working on these things, that you have to overcome to be able to do a good job with those sorts of things. And then, as you say, we've spent a load of time gathering a load of really fantastic data, and then you have to process this data and label this data. The labeling is a really important aspect of the sound recognition problem: just deciding on how you want to do the labeling. Do you make it extremely fine grained, or can it be more coarse, or can you use, like, weak labels and stuff? And once we've gone through that stage, then we can make sure we get it into Alexandria, our data platform, for storing all of this data. And I think every stage of just the data aspect of the pipeline has its own challenges, and you have to get a lot of things right to make sure that the data you then present onwards to the machine learning teams is of the best standard that you can really get.
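To make the "spec" idea concrete, here is a minimal sketch of how the variation dimensions for a dog bark collection project might be written down before any recording happens. The field names and values are invented for illustration and are not Audio Analytic's actual schema.

```python
# Illustrative only: a made-up collection spec for a "dog bark" project,
# capturing the subject and channel variation discussed above before recording.
from dataclasses import dataclass, field


@dataclass
class CollectionSpec:
    label_type: str                                            # taxonomy label this project targets
    subject_dimensions: dict = field(default_factory=dict)     # subject-matter variability to cover
    channel_dimensions: dict = field(default_factory=dict)     # device/channel variability to cover
    labelling_granularity: str = "event"                       # "event" (onset/offset) vs "weak" (clip level)


dog_bark_spec = CollectionSpec(
    label_type="dog_bark",
    subject_dimensions={
        "breed": ["terrier", "labrador", "german_shepherd"],
        "size": ["small", "medium", "large"],
        "age": ["puppy", "adult", "senior"],
    },
    channel_dimensions={
        "environment": ["semi_anechoic_chamber", "living_room", "garden"],
        "device": ["reference_mic", "smart_speaker_a", "phone_b"],
    },
    labelling_granularity="event",
)
```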
[00:18:08] Unknown:
So, because I know, Tom, you've got, what, some 15,000,000 pieces of audio data in Alexandria, 200,000,000 pieces of metadata, and 700 label types also. I think, Tobias, one of the interesting things to realize is that labeling audio data, as opposed to speech, presents its own set of challenges, as Tom talked about. So even if you take something as basic as baby cry, and, Tobias, I don't know if you're a dad or you've got kids or anything, but if you sit down a bunch of people and you say, when you hear the recording, tell me when the baby starts crying and when it stops, most people will disagree with each other. They'll agree in the main, but at the edges, when you're trying to do, I don't know, 10 millisecond labeling accuracy around when a baby started crying, you'll get into debates like, is it crying? Is it grumbling?
You know, all the things that you would do as a parent start to come out, because that audio is just less exact. Whereas in a speech world, it's a lot more exact. You can take a typical person off the street and say, label the words and when they started, and there'll be very little disagreement in comparison. And that gives Tom and his team some fundamental challenges with labeling up these sorts of things, which I know they spend a whole bunch of time just figuring out. How do you label a new type of sound? What is that sound?
[00:19:25] Unknown:
And when does it start and when does it stop? And the metadata aspect too is interesting, because with things like textual records or structured data, it's easy to associate the metadata with the record at the time that it's being created. And with image data, there's the standard of EXIF tags. And I know that, for instance, with MP3, they've got ID3 tags. But I'm wondering if there's any sort of useful standard that you can use for embedding the metadata with the records, or what your approach is for being able to effectively associate that information with the actual audio segment and ensure that they propagate through your system in conjunction with each other so that they're easy to relate to one another? So we have a whole subject. We call it audio provenance,
[00:20:07] Unknown:
and it's a whole subject matter for us internally. Of the two examples you've used, let's take image data. If I showed you three pictures of a toy dog and one of a real dog, you'd very quickly be able to identify, with no prior information, which is the toy dog and which are the real dogs. Audio is much more complicated than that. We're very much attuned as humans to sort of fill in the blanks. And so, you know, I could play you three recordings of smoke alarms and say one of those is a fake smoke alarm, and I guarantee you'd be very bad at telling me which one was the fake and which one was the real one. So if, for example, you scraped a bunch of audio files off the Internet, you'd be straight into that garbage in, garbage out principle. By doing that high quality data collection inside those semi-anechoic environments, it means we're there when the subject matter variability is explored.
And then you're right. That sort of chain of evidence, if you will, has to be passed all the way through the pipeline, right through the data collection, processing, labeling, augmentation, training, evaluation, you know, even sometimes down to the compression and deployment levels, so that you know that it's doing a good job. In terms of frameworks for doing that, there are no off-the-shelf frameworks. That's a completely new area itself.
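Since no off-the-shelf framework exists, here is one way such a chain of evidence could be represented: a provenance record that travels with an audio asset, with every pipeline stage appending an entry. This is a hedged sketch under assumed field names, not a description of Audio Analytic's internal system.

```python
# Illustrative "audio provenance" record: each pipeline stage appends an entry,
# so any training example can be traced back to its original recording.
import hashlib
import json
from datetime import datetime, timezone


def file_digest(path: str) -> str:
    """Content hash of the audio file, so later copies can be verified."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def new_provenance(path: str, collection_session: str) -> dict:
    return {"sha256": file_digest(path), "collection_session": collection_session, "history": []}


def record_stage(provenance: dict, stage: str, **details) -> dict:
    provenance["history"].append({
        "stage": stage,  # e.g. "labelling", "augmentation", "training"
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **details,
    })
    return provenance


# Usage: create a toy file so the example runs end to end.
with open("device_07.wav", "wb") as f:
    f.write(b"\x00" * 16)

prov = new_provenance("device_07.wav", collection_session="session_042")
record_stage(prov, "labelling", labeller="annotator_3", label_type="smoke_alarm")
record_stage(prov, "augmentation", method="reverb", impulse_response="hallway_ir_12")
print(json.dumps(prov, indent=2))
```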
[00:21:31] Unknown:
And, yeah, with audio data, there's definitely a huge degree of variability along a number of different axes. Whereas, as you said, you've got your anechoic chambers for being able to isolate the sound to the specific piece that you're trying to collect, out in the real world that's often going to be overlaid with whatever the other background noise is, whether it's the sound of your washing machine in the next room or the sounds of engines going by outside, and then being able to isolate that sound. And then for your volunteers who are contributing the audio that you're using for this collection process, I imagine that there's variability in terms of the quality of the microphones that they're using, the sample rates that they're collecting the audio in, the specifics of the audio format that it's being collected in, the lengths of the segments. I'm wondering how you approach being able to try and encapsulate all of that variability and be able to standardize
[00:22:23] Unknown:
it in some way for being able to feed it through your model training process. In general, you're right. We think about it in terms of subject matter variability and channel variability, with channel variability split into two parts. There's sort of acoustic coupling variability, which would be the environment you're in, the acoustics of it. Is it a reverberant environment? Is it in the bathroom, or is it, you know, sort of in the hallway? And then you've got the actual device channel variability, which includes the microphone.
It includes all of the various parts of the audio subsystem before the input audio is received at ai3, which is the inference engine that we run to do the high quality sound recognition we do on the device. Tom, in terms of,
[00:23:09] Unknown:
the challenges with all that, if you wanna pick that piece up. Yeah. Sure. So, I mean, yes, you're absolutely right in terms of, like, a massive amount of data to try and encapsulate. Consider all the things you've talked about and then sort of draw it back to just one single file. For one single file, as you said, you'll have the particular sampling rate, the particular bit depth, the particular channel it was recorded on, a bunch of settings around the device you were using to record or the sound card. Then you'd have stuff like the room it was in, and perhaps, like you say, is there some sound in the background that's going on at the same time? Obviously, that will be in situations where we're recording in situ versus, say, in the anechoic chamber. And then there's the source variation, the gender of the dog, the age of the dog, all these sorts of things. In terms of capturing that data, we've got tools that help us structure the collection around this, but it's still quite a manual effort: you still have to match things up. When a volunteer comes in, you have to put the name of the dog in a particular record and make sure that that's stored along with the correct file. And, going back to the dog example, if you record one particular dog barking, you might be recording it with several tens of devices at the same time. So you need to make sure that the information around the dog is propagated to all those, say, 50-odd devices. But then the information about the individual devices is kept specific to the devices.
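As a rough illustration of the pairing Tom goes on to describe, the session-level metadata about the source (the dog) can be fanned out to every device's recording, while the device-specific details stay attached to each individual file. The structure and field names here are invented for the example.

```python
# Illustrative: copy session-level (source) metadata to every device recording
# from the same session, keeping per-device details specific to each file.
def pair_metadata(session_meta: dict, device_recordings: list[dict]) -> list[dict]:
    paired = []
    for rec in device_recordings:
        paired.append({
            "file": rec["file"],
            "source": dict(session_meta),   # shared: breed, age, sex of the dog, ...
            "channel": {                    # specific: device, sample rate, bit depth, ...
                "device": rec["device"],
                "sample_rate_hz": rec["sample_rate_hz"],
                "bit_depth": rec["bit_depth"],
            },
        })
    return paired


session = {"label_type": "dog_bark", "breed": "labrador", "age_years": 4, "sex": "male"}
recordings = [
    {"file": "s042/dev01.wav", "device": "reference_mic", "sample_rate_hz": 48000, "bit_depth": 24},
    {"file": "s042/dev02.wav", "device": "smart_speaker_a", "sample_rate_hz": 16000, "bit_depth": 16},
]
records = pair_metadata(session, recordings)
```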
And so, you know, we do have a pairing of a chunk of metadata with each individual sound file, and you can imagine that those numbers grow pretty vast in terms of,
[00:25:03] Unknown:
yeah, how many kind of elements of metadata we have a record of. And in terms of the taxonomy that you're building for being able to track and categorize these different audio segments, what was your approach for structuring the initial taxonomy? And how has it had to evolve? And what are the assumptions that have been challenged in the process of building and growing that taxonomy for being able to make that information useful in some sort of structural or hierarchical way? So,
[00:25:35] Unknown:
the taxonomy is structured on what's called an actor principle, which is why anthrophony, biophony, and geophony are the top level things. So, obviously, caused by humans, caused by geography, if that makes sense, and caused by biology. And then it cascades down from there. The actor principle is a fundamental one. It was a specific taxonomy principle we came up with because, obviously, something needs to cause those sounds in the environment. So using that as a fundamental building block means that you're not gonna go far wrong. Skipping to your last question, in terms of things that we've effectively learned that we didn't know we were gonna have to learn, one of my favorite examples is not realizing that sometimes the world conspires for you and sometimes it conspires somewhat against you. So there is a smoke alarm that's, I think, the 3rd or 4th most popular selling smoke alarm in North America, and it sounds identical to a bird species in the south of France.
Now, I'm pretty sure that that bird species hasn't evolved to mimic the smoke alarm, but that sort of thing is then presented to the machine learning engineers, saying, well, these things sound pretty much identical to humans, but you need to separate them out. Otherwise, people are being told that the smoke alarms in their house are going off when in fact it's just the bird that they keep in their living room tweeting away, and it happens to sound identical to this North American smoke alarm. Which the engineers solved. But those sorts of interesting quirks of, I suppose, fate, if you will, are fascinating to experience.
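To picture the actor principle, here is a toy slice of what such a taxonomy tree could look like. Only the three top-level classes and the rough scale of around 700 label types come from the conversation; the intermediate levels and example leaves are invented for illustration.

```python
# Toy slice of an actor-based sound taxonomy: top level by what causes the sound.
# Intermediate nodes and leaves are invented; only the top level is from the episode.
TAXONOMY = {
    "anthrophony": {                      # caused by humans and their artefacts
        "alarms": ["smoke_alarm", "co_alarm", "car_alarm"],
        "impacts": ["glass_break", "door_knock"],
        "vocal": ["baby_cry", "cough", "sneeze"],
    },
    "biophony": {                         # caused by other living things
        "domestic_animals": ["dog_bark", "cat_meow"],
        "birds": ["bird_song"],
    },
    "geophony": {                         # caused by the physical environment
        "weather": ["rain", "wind", "thunder"],
    },
}


def leaf_labels(node):
    """Walk the tree and yield every leaf label type."""
    if isinstance(node, list):
        yield from node
    else:
        for child in node.values():
            yield from leaf_labels(child)


print(sorted(leaf_labels(TAXONOMY)))      # the real system tracks roughly 700 of these
```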
Although they do give Tom and the rest of the team, I'm sure,
[00:27:30] Unknown:
sleepless nights of worry as they try and figure out how to best collect the data and best separate it out. And another element of this problem space is that a lot of the tools that have been built for being able to work with and process large volumes of data are generally oriented around textual and structured data. So I'm wondering what you've been able to use in terms of off the shelf components for being able to actually process and build these models, and how much of it you've had to custom build in house specific to your use case and your problem domain?
[00:28:03] Unknown:
Sure. Yeah. So, I mean, you're absolutely right. A lot of the tools, when it comes to the audio world, say audio and then bridging a bit into machine learning, are really for speech recognition type applications or kind of music oriented stuff. So you might have, for example, some tools around transcription, aiding with transcriptions for speech recognition type problems. But, obviously, as Chris has touched on a few times, sound isn't speech. And the other way you can consider it, there'll be, you know, even software like Audacity, absolutely fantastic at doing the job it does, but, again, it's specific to music, in this case recording and music production and that sort of thing. And so often the problem that we're trying to solve is quite specific and is, you know, difficult. There is an element of having to roll your own, or enhance if you can. Of course, it's easier and more time beneficial to do it that way, to enhance something. But if you have to roll your own, then for a lot of the stuff that we work on, we do kind of have to roll our own. I mean, consider the example of recording with many devices at one time. There isn't a magical start button that allows every single device to be started at the same time and stopped at the same time, because, well, actually, if there was, we'd like to hear about it, because that'll be extremely useful. But you then have to make sure that, if there are different offsets at the beginning of the files, the sounds occur in the same part of the audio across the many different devices. And so we've developed techniques to handle that problem ourselves.
And, yeah, I guess, you know, right through the pipeline, we have got a lot of stuff that is bespoke to us, to Audio Analytic, to just help solve those problems that aren't quite the same as, you know, in other areas.
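The episode doesn't say how Audio Analytic actually align multi-device recordings, but one standard way to solve the "no magic start button" problem is to estimate the relative offset between two recordings by cross-correlation and then trim or pad the lead-in. A minimal NumPy sketch of that generic technique, assuming both signals are mono and already at the same sample rate:

```python
# A standard (not necessarily Audio Analytic's) approach to aligning two
# recordings of the same event that were started at different times:
# find the lag that maximises their cross-correlation, then trim or pad.
import numpy as np


def estimate_offset(reference: np.ndarray, other: np.ndarray) -> int:
    """Positive result: `other` started recording earlier and has extra lead-in samples."""
    corr = np.correlate(other, reference, mode="full")
    return int(np.argmax(corr)) - (len(reference) - 1)


def align(reference: np.ndarray, other: np.ndarray) -> np.ndarray:
    lag = estimate_offset(reference, other)
    if lag > 0:
        other = other[lag:]                 # drop the extra lead-in
    else:
        other = np.pad(other, (-lag, 0))    # started late: pad the front with silence
    return other[: len(reference)]


# Usage with two synthetic signals, one delayed by 500 samples.
rng = np.random.default_rng(0)
sig = rng.standard_normal(8000)
delayed = np.concatenate([np.zeros(500), sig])[:8000]
aligned = align(sig, delayed)
```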
[00:29:58] Unknown:
And can you dig into a bit of how the actual data pipeline and data management is architected and the ways that you work with it for being able to train and build and deploy the models that you're working on?
[00:30:10] Unknown:
Yeah. Sure. So in terms of, like, you know, the whole pipeline of what we do at Audio Analytic, you can consider it as a standard machine learning delivery pipeline. We go through from the data collection right at the beginning through to the deploying of the models at the end, and there are many, many stages in between. So, focused around Alexandria, which is our, you know, massive database, the stages of the pipeline at the beginning are around the data collection side, as I've touched on, and the processing.
Like with all of these, you've got many different devices and many different formats that the audio comes in. Some devices will give it to you as one format, and others will give it to you as another format. The next stage will be the labeling part, and that's where you kind of marry up all of the metadata and the labels and the sound and get it into Alexandria so it can then be made available for the machine learning teams. And then there's a stage in between there you've got to consider as well. It's not just a bunch of audio data in a dataset, or, you know, from a dataset you download; on its own, it doesn't necessarily give you that much. You've then got to say, right, let's split it into those machine learning sets, where we need to make sure that, ultimately, the stuff that goes in that testing set is representative of what the real world situation is gonna be, and then you need to make sure you've got the same kind of distributions throughout. And so that's another very important aspect: making sure that you gather enough variation all across the board so that when you chop it up into these little sets, you've got it in the right places. And that's the point where the machine learning team will take this stuff, and they'll apply techniques like data augmentation, you know, to further increase the volumes. I mean, machine learning models nowadays are so data hungry that you have to apply these techniques.
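One common way to get the right distributions into the right sets is to split on the source or recording-session metadata rather than on individual clips, so the same dog or session never appears in both training and test data. A rough sketch of that general idea, not the company's actual tooling:

```python
# Rough sketch: group clips by their source (e.g. the individual dog or recording
# session) before splitting, so no source leaks across train/dev/test.
import random
from collections import defaultdict


def split_by_source(clips: list[dict], ratios=(0.7, 0.15, 0.15), seed=0) -> dict:
    by_source = defaultdict(list)
    for clip in clips:
        by_source[clip["source_id"]].append(clip)

    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)

    n_train = int(len(sources) * ratios[0])
    n_dev = int(len(sources) * ratios[1])
    groups = {
        "train": sources[:n_train],
        "dev": sources[n_train:n_train + n_dev],
        "test": sources[n_train + n_dev:],
    }
    return {split: [c for s in srcs for c in by_source[s]] for split, srcs in groups.items()}


clips = [
    {"file": "dog01_take1.wav", "source_id": "dog01", "label": "dog_bark"},
    {"file": "dog01_take2.wav", "source_id": "dog01", "label": "dog_bark"},
    {"file": "dog02_take1.wav", "source_id": "dog02", "label": "dog_bark"},
]
splits = split_by_source(clips)
```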
And then it goes into the training. You've got lots of different model evaluation type procedures going on, and then you have a more formal evaluation stage, then kind of deciding on what sort of models you're interested in, or what trade offs you're making, I guess, would be a better way to structure that. You'll then look at trying to compress these things. As Chris said, we run on the edge. We can't have massive models that take tons of computation; they need to be really small, they need to be really, really fast. We're not the only stuff on these devices, right? There'll be other functionality that these devices will have, and we're just one part of the stack, if you like. Yeah. And then that kind of leads into the deployment side of things. In terms
[00:32:42] Unknown:
of optimizing that pipeline, in that sort of evaluation stage, one of the interesting things we've done recently is around something called the polyphonic sound detection score. We found that, like with any challenge, well, any machine learning challenge and probably more broadly, you need to optimize for the right criteria. And the generic methods that were being used, just borrowed from machine learning, were not optimizing the systems appropriately, and the pipelines and everything else. So we released a bunch of GitHub code for this Polyphonic Sound Detection Score.
That's now being used. I think it was published at ICASSP this year, and it is being used as part of the DCASE competition, which is sort of the benchmarking world, the community standards, if you will, of sound recognition. So it's great to see the discipline grow out of its infancy into those more developed areas, and moving into what we start to call 2nd generation sound recognition.
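For reference, the code mentioned here is the open source psds_eval package (github.com/audioanalytic/psds_eval). The sketch below is recalled from that project's README rather than verified against the current release, so the constructor arguments and column names should be treated as assumptions and checked before use.

```python
# Hedged sketch of scoring a sound event detector with psds_eval.
# The exact API (class name, arguments, column names) is recalled from the
# project's README and may differ in the current release.
import pandas as pd
from psds_eval import PSDSEval

# Event tables: one row per sound event, times in seconds.
ground_truth = pd.DataFrame(
    [{"filename": "clip_001.wav", "onset": 1.2, "offset": 2.0, "event_label": "dog_bark"}]
)
metadata = pd.DataFrame([{"filename": "clip_001.wav", "duration": 10.0}])
detections = pd.DataFrame(
    [{"filename": "clip_001.wav", "onset": 1.3, "offset": 1.9, "event_label": "dog_bark"}]
)

psds_eval = PSDSEval(
    dtc_threshold=0.5,      # detection tolerance criterion
    gtc_threshold=0.5,      # ground truth intersection criterion
    cttc_threshold=0.3,     # cross-trigger tolerance criterion
    ground_truth=ground_truth,
    metadata=metadata,
)
psds_eval.add_operating_point(detections)   # one operating point per system threshold
result = psds_eval.psds(alpha_ct=0.0, alpha_st=0.0, max_efpr=100)
print(result.value)
```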
[00:33:58] Unknown:
And in terms of the models and the deployment of it, I'm wondering if you just deploy one model that works generically across all the different sound categories that you have collected, or if you train these models for specific deployment targets, where you have one that's specifically focused around security, where it has things like the glass break or sounds of, you know, a door being hammered on, and then you have a different model that you deploy that's focused on things like detecting a cough and sneeze and sniffles for health related environments?
[00:34:30] Unknown:
So it's a great question. We tend to think of the sound profiles, as we call them, which is one or a collection of sounds, broken down per device, because the value propositions tend to align at the device level. And then, of course, you can run multiple sound profiles together. In terms of those configurations, or of those underlying models, it really comes down to the individual use cases that are applicable on the devices. But we try and make sure that the end set of sound profiles we're delivering is optimized for that set of use cases for that set of devices.
[00:35:10] Unknown:
I imagine that helps too with restricting the size of the actual deployed model, rather than trying to fit everything into one thing and then also compress that down to a size that's runnable on an embedded device, versus just training the model for a specific use case and then reducing the overall scope of what it needs to be able to handle and thereby the size of the model that's being deployed?
[00:35:31] Unknown:
Yes. Size is somewhat proportional to the number of sounds that you add, especially when you're dealing with a smaller number of sounds; as it grows, that sort of relationship becomes less distinct. Of course, if there are sounds that you will never come across and never need to detect on a device category, it would seem a bit silly from a computational point of view to burden that device with looking for those sounds. So, yes, it definitely helps. This is about structuring a sense of hearing in line with what needs to be heard for those device categories, a sort of just-enough type approach. That means that you have that good trade off between computational smallness, or sort of resource size, and that sense of hearing. In fact, we did a recent set of demos at the Consumer Electronics Show this year in Las Vegas, in our private demo suites, which showed that we were capable of running a sense of hearing right down on an M0+ processor, which is the smallest grade of processor that ARM do. So really showing that you can push that sense of hearing down onto incredibly small processors.
[00:36:48] Unknown:
And then in terms of being able to actually evaluate and test the models that you're working with, I'm wondering what you have found as far as their capability of operating in noisy and complex and layered audio environments
[00:37:03] Unknown:
and how the overall testing of the model has fed back into your strategies for collecting the source data? So if I take the top piece, then, Tom, if you can relate it back to the source data piece. In terms of that feedback, generally, because we've got the world's largest collection of data for this area, we have a high degree of certainty with the models we're providing to the marketplace already. You know, we have large amounts of, say, 24/7 recordings, large amounts of environment recordings, and obviously large amounts of target sound recordings. So we typically find that we're, you know, pretty good in our guesses of what the performance will be for a new sound profile that we're producing. In terms of the sort of things we learn, going back to that example of things that you just can't predict, using the example of the bird in the south of France and the North American smoke alarm, that is something beyond the wit of man to sit in a room and figure out. You're only going to get that sort of insight from the actual field deployments.
Our technology's deployed in something like 160 countries worldwide, so we've got a very good sense of the sort of problems that are faced on a worldwide scale. In terms of how that feeds back, obviously, it feeds back into, do we need more data in a certain area? But, Tom, you're probably best placed to pick back up that full loop back to the beginning of the pipeline, the data collection piece. Yeah. I mean, well, like you say, it's beyond the wit of man to try and think of all these things. I mean, when we start these projects, I make sure we sit in a room and think about all the stuff that's gonna try and bite us in the backside.
[00:38:46] Unknown:
And you do just never think of everything. There are just too many bizarre things, and your head is so focused on one aspect of looking at the problem that, you know, you'll be completely blindsided by another even though you spent all that time trying to figure it out. And so, like, yeah, the delivering of these products is very iterative. You develop something and, you know, get it deployed, and then you realize you have these sorts of issues. And often, you know, going back to data collection is one of the important ways to address these problems. You can consider as well, for example, talking about how you actually do the evaluations: as Chris touched on, the 24/7 sort of use case for a lot of our products. We just have absolutely tons and tons of data that is just that kind of use case: it's like a product in a room, and it's just recording all that time, and we'll have, you know, many different examples of that. And so hopefully you try and identify some problems early on in your evaluation, in your product development cycle, by doing tests like that. And in terms of the actual problems that you make, obviously, that's then focusing more on the false positive area of the evaluation. And then in terms of focusing more on how well we actually do at detecting the sounds we're interested in, we mentioned the anechoic chamber earlier, and that really allows us to have a sort of green screen for sound. You know, like you said, you get this layering of background sounds and so on. And really, a good way of going about it is, again, let's go back to the dog bark analogy. If you want to know whether a particular model is going to detect a dog bark in a particular room with, say, tile flooring, and it's a really reverberant environment, and someone's hoovering in the background, and, you know, whatever else, maybe the device is 5 meters away or something, you have to literally test for that exact scenario. And so, by having these green screens, we can sort of test that. And, again, it's looking at this stuff iteratively, evaluating often, and then feeding it back in and seeing where you can make improvements.
[00:41:11] Unknown:
What we find, though, I think, is that the experience we now have with doing these things means that internal iteration speeds up and speeds up and speeds up. So we're producing sounds at an increasing rate, and those sounds are being produced in their first iterations internally at higher quality, just because, obviously, we're starting from a higher place. We've learned a lot, and that iteration cycle means that we now iterate internally very quickly, so that when we do release that product out into the marketplace, the customer can have high assurance that it's already working from a sound recognition perspective, and not thinking, I'm sort of getting a bit of a product, but I'm gonna have to feed back data to improve it, because clearly that's not gonna be acceptable to their customers. And in terms of the
[00:41:58] Unknown:
audio that you're working with, what have been some of the most interesting or unusual or strange sounds that you've had to try and collect and categorize?
[00:42:07] Unknown:
Well, I'll do three stories. There's strange in just sort of strange experiences of capturing the data, so I'll do that one first. We did some gunshot recordings. We, oddly enough, chose to do them in the UK, and machine guns anywhere in the world are not easy to come by, but in the UK they're particularly challenging to come by. We managed to arrange a set of machine guns to be recorded; one of them was an Uzi submachine gun. There are only two civilian sort of automatic ranges in the UK, and we had a rental van to go down there. I remember sitting there with the guy, and he laid out the guns and explained that we had to move our rental van. And I said, well, why is that? It seems to be nowhere near the targets. And he said, well, the Uzi doesn't so much aim bullets as sort of direct them vaguely in that area, so you probably wanna move it before you get a bunch of bullets in the van. And I'm pretty sure that would have caused us a whole range of fun, from at least the deposit on the rental van, let alone the explanation to the police. So that's the sort of strange data collection experience.
In terms of strange sounds, oh, that's a good one. Tom, what's your favorite strange sound that you've done the data collection for so far, that you're able to talk about, obviously? Yeah. For sure. Well, yeah. Again, I guess this is
[00:43:35] Unknown:
trying to think of individual sounds. I mean, I guess you come across sounds when you've got 24/7 data; you might come across stuff like geese honking, which is always quite an interesting one. In terms of weirdness of collection for me, I guess it'll be the outdoor glass break collection we did, where, perhaps similar weirdness to Chris's thing, but, like, you know, essentially we were tasked with just being on the street and smashing people's windows, of course with permission and everything else.
And we were paying people. We were saying, you know, we'll give you a good amount of money: you can replace your windows that are 20 years old, but we'll break them for you. And it was a really, really bizarre experience. I mean, you're literally in, like, a residential area, and, well, we'd put on fluorescent jackets, so we kind of looked like, you know, we were meant to be there, with all these different microphones and all these booms and stuff and all these wires going around, and then someone's smashing these windows with a sledgehammer and this sort of stuff. Yeah. That's how you know you kinda work for a very interesting company, when you're tasked with something like that, but, yeah, that's definitely weird.
[00:44:51] Unknown:
The one I'd add: we did a car alarm data collection exercise, and we'd done the sort of in-car model fits, but we were looking at the retrofit car alarms. Obviously, we wanted to know what they'd sound like when they're fitted into the car, so we had a board in the back of a car with these car alarms on it. We collected most of our data, but there was one challenge left, which was to go to a very reverberant real environment and collect them in there, because we just wanted some, you know, sort of high quality real reverberant recordings. So I remember sitting in a multi story car park with a car, and these things are very loud as well, so you wanna go a bit of a distance away, because, obviously, the car alarm is unlikely to go off when people are physically sitting in the car, which would dampen the sound. So we're sitting away from this car, with a wire sort of trailing from the car, with a big button that sets off all the car alarms.
And then somebody pointed out this might look a bit strange to the security guards watching us on the inevitable CCTV cameras that would be watching these things, in sort of some bizarre, I don't know, Italian Job kind of blow-the-doors-off type piece, or something, it's not entirely clear. And then realizing we'd better finish our experiments up rather quickly before we caused anybody a whole bunch of hassle. I think on that one, it gets interesting because the recording of audio itself presents a whole range of challenges from a data privacy, data ethics perspective.
So if you're trying to do audio recordings in public, there's a whole set of rules that govern that. Obviously, there are increasing rules and a direction of travel around what data you should be able to use in machine learning models, both legally and ethically, you know, from GDPR and the things that are going through the various stages of legislation. And even if you do something basic, like you want to record a park, when you try and square that with the various laws, you need to get permission to do that. But it's a public park. Who do you get permission from? You know, so you're down into really quite fundamental challenges if you want to make sure that not only is your data good to be training off today, but, as the rules change and evolve and society understands what it wants to do with machine learning, that data you based your models on doesn't start to be eaten away at, and you can't include it, or it's decided that you didn't have the right permissions, or the traceability around the data and what's gone into it isn't clear. You know, you don't wanna fall foul of any of those things. So the effort we go to to make sure that we've got that complete chain of evidence around our data is quite extreme, and that's been ingrained in us from day one.
It's obviously been expensive and time consuming to do, but that choice has paid off quite substantially with the direction that the machine learning world has gone in.
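To make that traceability point concrete, here is a minimal sketch of what a chain-of-evidence record for a single recording might look like. This is purely illustrative Python, not Audio Analytic's actual schema; every field name, file path, and consent reference below is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib
import json


@dataclass
class ProvenanceRecord:
    """Hypothetical chain-of-evidence entry for one recorded audio file."""
    file_path: str
    sha256: str          # content hash ties the record to the exact audio bytes
    recorded_at: str     # ISO-8601 timestamp of the recording session
    location: str        # where the capture took place
    consent_ref: str     # pointer to the signed consent/permission document
    licence_terms: str   # terms under which the audio may be used for training
    labels: list = field(default_factory=list)


def build_record(path: str, location: str, consent_ref: str, licence_terms: str) -> ProvenanceRecord:
    # Hash the raw bytes so any later question about "which file was trained on"
    # can be answered unambiguously.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return ProvenanceRecord(
        file_path=path,
        sha256=digest,
        recorded_at=datetime.now(timezone.utc).isoformat(),
        location=location,
        consent_ref=consent_ref,
        licence_terms=licence_terms,
    )


if __name__ == "__main__":
    # The path and consent reference here are invented for illustration.
    record = build_record(
        "captures/window_break_take3.wav",
        location="residential test site",
        consent_ref="consent/homeowner-042.pdf",
        licence_terms="internal-training-only",
    )
    print(json.dumps(record.__dict__, indent=2))
```

Keeping a record like this alongside every capture is one way to preserve the permissions trail as the rules around training data evolve.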
[00:47:43] Unknown:
And what are some of the other interesting or unexpected or challenging lessons that you've learned in the process of building out the technology and business aspects of Audio Analytic?
[00:47:56] Unknown:
On the business side, we license per sound, per device. The closest thing in the speech world would presumably be licensing either wake words, if that were the equivalent, or languages, so there weren't any directly comparable models, because those two are clearly different in various ways. We've had to go out and explain to people why sound detection is extremely valuable if you're designing products, and why it gives that very intuitive sense of the device behaving as you would expect it to. But that's been done from the ground up, and it was a new area for product owners to experience, which means that how they judge how we should license it, how they judge the commercial value and where the value is realized, and things like that, has been part of that journey. And it's been fascinating to experience how people perceive and how they think the world of sound should be monetized, I suppose.
[00:49:02] Unknown:
And so as you continue to build out the technical and business capacity of the company, what are some of the plans that you have for the future, either in terms of new problem spaces or use cases that you're looking to reach into, or enhancements to your existing processes of data collection and machine learning?
[00:49:24] Unknown:
Our road map is largely punctuated by which sounds for which use cases, is the way to think about it. We constantly add to that as we expand our ability to differentiate sounds and, obviously, as the target devices' capabilities themselves expand. In terms of future stuff, I think we did a blog post recently on our website, if you go to audioanalytic.com, on what we'd call context systems. Context systems look to not only detect individual sounds, but draw larger inferences off them. The example that was used in that blog post was: it sounds like you're leaving the house. So, you know, Tobias, if someone is leaving the house and you know that you've not reminded them to get bin bags, or whatever other small item you forgot to remind them about, you will know that. You'll sit in your living room, you'll hear it if it's in earshot, and you'll know what leaving the house sounds like and what preparing to leave the house sounds like. But it is not a single sound. It's a collection of sounds in a certain sequence, put into broader context.
And that sort of higher level reasoning requires even more sophisticated approaches to sound recognition, and it forms some of the foundations of what we mean by second generation. First generation sound recognition systems were built around a small range of sounds, typically safety and security applications, and typically triggered events through to the end consumer. So there might be push events through to a mobile phone: I've heard somebody breaking into your house, I've heard a smoke or CO alarm going off, that sort of thing. Second generation is about many more sounds, covering not just safety and security but entertainment, health, and well-being.
And it's the starting introduction of these context systems, which use that fundamental understanding of the individual sounds or scenes the device is hearing and build higher level inferences on top of them. That's an exciting world to see unfold, as that sense of hearing becomes more and more real, in line with what you and I would call the sense of hearing, so these devices can be more and more intelligent.
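As a rough illustration of that kind of higher level inference, here is a small, hypothetical Python sketch that takes a stream of already-detected sound events and infers a "leaving the house" context when the right sounds occur in sequence within a short window. The labels, the sequence, and the matching logic are invented for illustration; this is not Audio Analytic's actual approach or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SoundEvent:
    label: str        # hypothetical label emitted by a sound recognition model
    timestamp: float  # seconds since the start of the audio stream


# An invented higher-level pattern: these sounds, in this order and within a
# short window, suggest that someone is leaving the house.
LEAVING_HOUSE_SEQUENCE = ["keys_jingling", "footsteps", "door_open", "door_close"]


def infer_leaving_house(events: List[SoundEvent], window_s: float = 60.0) -> bool:
    """Return True if the sequence appears in order and spans no more than window_s."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    matched_times = []
    idx = 0
    for event in ordered:
        # Greedily match the next expected label in the sequence.
        if idx < len(LEAVING_HOUSE_SEQUENCE) and event.label == LEAVING_HOUSE_SEQUENCE[idx]:
            matched_times.append(event.timestamp)
            idx += 1
    if idx < len(LEAVING_HOUSE_SEQUENCE):
        return False
    return matched_times[-1] - matched_times[0] <= window_s


if __name__ == "__main__":
    stream = [
        SoundEvent("dog_bark", 1.0),
        SoundEvent("keys_jingling", 3.2),
        SoundEvent("footsteps", 7.8),
        SoundEvent("door_open", 11.5),
        SoundEvent("door_close", 14.1),
    ]
    print(infer_leaving_house(stream))  # True for this example
```

A real system would reason probabilistically over many overlapping sounds and scenes, but the sketch shows the basic idea of turning individual detections into a broader context.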
[00:51:53] Unknown:
And I'm also wondering, given the fact that you have collected this large volume of data, and it's well labeled and highly valuable for building your models, which is what you're licensing, whether you've also gone down the path of looking to license out the actual source data for other people who want to take advantage of it for other use cases. No. So typically,
[00:52:20] Unknown:
commercially, people license the model. We've got, as you brought up, Alexandria, which is the world's largest collection of this kind of data: about 15 million audio files, 700 label types, and about 200 million pieces of metadata. That helps steer where you put your research from an AI perspective. Coming back to those points I mentioned, there are fundamental architectural blocks from a machine learning point of view that you just can't use, like a language model or the speech acoustic model, and that materially impacts your performance rates. It means that if you just took an off-the-shelf machine learning technique, your performance rates wouldn't be good enough for the vast majority of sound recognition tasks.
So we've used that data to steer our research and come up with our own inference engines that are specifically designed for sound recognition. Those two pieces tend to go hand in hand, because even if you have the data, you'd still need to direct all of that research effort to get to a point where you have a well functioning second generation sound recognition system. So, generally, people want to license the output of that and take advantage of both of those pieces.
[00:53:38] Unknown:
Are there any other aspects of the work that you're doing at Audio Analytic, or the overall space of sound detection, that we didn't discuss that you'd like to cover before we close out the show?
[00:53:52] Unknown:
No, I think you're good, apart from that I'd love to send you over some bizarre audio files. We'll see if we can find you some and send them over, just because they're always fun.
[00:54:03] Unknown:
Well, maybe we can intersperse some of those throughout this conversation, just to give people something to entertain themselves with. For anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. So for data management,
[00:54:29] Unknown:
obviously, my answer is going to be biased, because we're solely focused on audio, so most of my concerns are going to be around the audio side. The machine learning frameworks have come on, the data pipeline side has come on, and the data management side has come on generically. But now we're into the specialisms, and the specialisms require us to extend what are, hopefully, extensible frameworks. I'd love to see more of those frameworks becoming more extensible and more modular, so we only have to roll our own pieces in the areas where that makes sense.
A lot of the time, we come across tools that just don't work well together. Even though there are good bits of functionality in tool A and good bits of functionality in tool B, you still need to specialize them to the task at hand, and then those two tools aren't as flexible as you'd like them to be. So if there is one area, it's recognizing that each individual task within machine learning has its own set of challenges, and baking that into the architecture of tools that could go across the industry would be incredibly valuable. And, Thomas, how about you?
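As a loose sketch of the kind of extensibility being described, here is a hypothetical plugin-style interface in Python: a generic pipeline framework exposes a small stage interface, and the audio-specific logic is rolled in as a custom stage rather than by forking the tool. The class names, fields, and threshold are made up for illustration and do not refer to any real framework.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List


class PipelineStage(ABC):
    """Minimal extension point: a framework exposing this interface lets
    domain teams plug in their own processing steps without modifying the tool."""

    @abstractmethod
    def process(self, item: Dict) -> Dict:
        ...


class LoudnessFilter(PipelineStage):
    """A hypothetical audio-specific stage: flag clips that are too quiet to label."""

    def __init__(self, min_rms: float):
        self.min_rms = min_rms

    def process(self, item: Dict) -> Dict:
        item["keep"] = item.get("rms", 0.0) >= self.min_rms
        return item


def run_pipeline(items: Iterable[Dict], stages: List[PipelineStage]) -> Iterator[Dict]:
    # The framework owns the orchestration; the stages own the domain logic.
    for item in items:
        for stage in stages:
            item = stage.process(item)
        yield item


if __name__ == "__main__":
    clips = [{"path": "a.wav", "rms": 0.02}, {"path": "b.wav", "rms": 0.2}]
    for result in run_pipeline(clips, [LoudnessFilter(min_rms=0.05)]):
        print(result)
```

The point is not the filter itself but the shape of the seam: a modular framework only needs to expose a stable interface like this for niche domains to specialize it.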
[00:55:44] Unknown:
Yeah, I'd echo Chris's answer, to be honest with you. As soon as you get into something that's a bit more niche than the mainstream applications for a lot of these things, you just have to start rolling your own, and I'd say that applies right across the spectrum of wherever you apply machine learning. That's been my experience before as well. Alright. Well, thank you both very much for taking the time today to join me and discuss the work that you're doing with Audio Analytic and sharing some of your interesting stories of collecting this audio data.
[00:56:15] Unknown:
It's definitely a very interesting use case and an interesting problem domain that you're working in, so I appreciate all of the time and effort you've put into it, and the time that you spent sharing your experiences with me, and I hope you enjoy the rest of your day. Great, Tobias. Thanks for having us on the show. It's been great.
[00:56:30] Unknown:
Looking forward to listening to future episodes as you go forward. Yeah, cheers, Tobias. It was great to speak to you. Thanks so much.
[00:56:42] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Audio Analytic
Chris and Thomas' Backgrounds
Building Sound Recognition Technology
Use Cases and Applications
Challenges in Deployment and Data Collection
Labeling and Metadata Management
Custom Tools and Data Pipeline
Model Evaluation and Real-World Testing
Interesting and Unusual Sound Collections
Business and Technical Challenges
Future Plans and Innovations
Closing Remarks and Contact Information