Summary
We have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are working to impart a sense of hearing to our myriad devices with their sound recognition technology. In this episode Dr. Chris Mitchell and Dr. Thomas le Cornu describe the challenges they face in collecting and labelling high quality data to make this possible, including the lack of a publicly available collection of audio samples to work from, the need for custom metadata throughout the processing pipeline, and the need for customized data processing tools for working with sound data. This was a great conversation about the complexities of working in a niche domain of data analysis and how to build a pipeline of high quality data from collection to analysis.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Dr. Chris Mitchell and Dr. Thomas le Cornu about Audio Analytic, a company that is building sound recognition technology that is giving machines a sense of hearing beyond speech and music
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you are building at Audio Analytic?
- What was your motivation for building an AI platform for sound recognition?
- What are some of the ways that your platform is being used?
- What are the unique challenges that you have faced in working with arbitrary sound data?
- How do you handle the collection and labelling of the source data that you rely on for building your models?
- Beyond just collection and storage, what is your process for defining a taxonomy of the audio data that you are working with?
- How has the taxonomy had to evolve, and what assumptions have had to change, as you progressed in building the data set and the resulting models?
- challenges of building an embeddable AI model
- update cycle
- difficulty of identifying relevant audio and dealing with literal noise in the input data
- rights and ownership challenges in collection of source data
- What was your design process for constructing a pipeline for the audio data that you need to process?
- Can you describe how your overall data management system is architected?
- How has that architecture evolved since you first began building and using it?
- A majority of data tools are oriented around, and optimized for, collection and processing of textual data. How much off-the-shelf technology have you been able to use for working with audio?
- What are some of the assumptions that you made at the start which have been shown to be inaccurate or in need of reconsidering?
- How do you address variability in the duration of source samples in the processing pipeline?
- How much of an issue do you face as a result of the variable quality of microphones in the embedded devices where the model is being run?
- What are the limitations of the model in dealing with complex and layered audio environments?
- How has the testing and evaluation of your model fed back into your strategies for collecting source data?
- What are some of the weirdest or most unusual sounds that you have worked with?
- What have been the most interesting, unexpected, or challenging lessons that you have learned in the process of building the technology and business of Audio Analytic?
- What do you have planned for the future of the company?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Audio Analytic
- Anechoic Chamber
- EXIF Data
- ID3 Tags
- Polyphonic Sound Detection Score
- ICASSP
- CES
- M0+ ARM Processor
- Context Systems Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things. That's the numbers 9 7, and then "things", to add your voice and share your hard earned expertise. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host is Tobias Macey, and today I'm interviewing Dr. Chris Mitchell and Dr. Thomas le Cornu about Audio Analytic, a company that is building sound recognition technology that is giving machines a sense of hearing beyond speech and music. So, Chris, can you start by introducing yourself?
[00:01:56] Unknown:
Yeah. Sure. So, Chris, CEO and founder of Audio Analytic.
[00:02:02] Unknown:
And Thomas, about you?
[00:02:04] Unknown:
Hi. Yeah. So, Tom, data engineering lead at Audio Analytic.
[00:02:09] Unknown:
And so going back to you, Chris, do you remember how you first got involved in the area of data management?
[00:02:14] Unknown:
Well, yes. So I did a PhD in audio classification, so I suppose that's the place where you could say I got my start in it, largely dealing with all of the fun constraints of academic research, which is sort of smaller data sets than you'd like but still larger than you were used to dealing with, as well as all of the fun challenges of building the technology and doing the fundamental science as well. So that's where I got my start. And, Tom, do you remember how you first got involved in data management? Yeah. Similar thing to Chris. It was during
[00:02:47] Unknown:
my PhD. I was working with different datasets and, you know, just dealing with them more on disk and stuff, and then moving to work at a biology institute and working more with computer vision and realizing that kind of having massive datasets
[00:03:01] Unknown:
just on the file system is not great, and then moving to Audio Analytic, where you see a new way of doing things. Yeah. And so in terms of Audio Analytic, can you give a bit of a description about what it is that you're building there? And what was your motivation for building an AI platform for sound recognition and getting the business started? Yeah. Sure. So, as I sort of said, I
[00:03:21] Unknown:
did some research in the field and found that people weren't tackling sounds. There was a lot of work going on in the speech field, there was a lot of work going on in the music field, and, obviously, the broader classification fields, so image and text, etcetera. But sound itself has its own set of unique challenges. In comparison to, say, speech, you don't have language models to work with, so you can't constrain the acoustic patterns you're looking for in that sort of way, and you have very large, open set data sort of problems. So, obviously, the sounds that you're looking to detect, you also try and differentiate them from the large number of other sounds that can happen in the world, that can happen at any point. Obviously, sounds are relatively random in that respect.
So what I was interested in is, could you make a sound recognition system that could capture a broad sense of hearing? That's normally around a range of target sounds to be detected. So whether it be safety and security target sounds such as glass break or smoke and CO alarms going off, whether they be sort of health and well-being sounds of coughing, sneezing, that sort of thing, or whether they be entertainment sounds or communication related sounds, you can start looking at this world of sounds, and then you can imagine what you could do from a product design perspective if products have a sense of hearing, whether that be mobile phones, headphones, smart speakers, or smart home. Giving them that sense of hearing means that those devices can respond more naturally, in the way you and I would do if those things were happening around us, and then they can take intelligent action. So that was the sort of the motivation for it. At a personal level, the motivation for it is, I just like machines that make strange noises, so it's quite a natural extension for me to like machines that could classify those noises into various different
[00:05:30] Unknown:
classes and then give the outcome. So quite a sort of visceral personal love of sound. And you touched a little bit on some of the contexts in which your product is being used, but can you give a bit of a taste of the types of use cases that it's intended to empower and some of the ways that it's actually being employed? Yeah. So let's do it by device; that's probably the easiest way. So if we take a device like a smart speaker
[00:05:54] Unknown:
and you want to be able to turn it into a sort of home security device, and you wanna know if somebody's breaking into the house from the sound of the windows being broken as somebody enters the property, then you're listening out for that sound. There's, what, four different major types of glass: laminated, plate, wired, tempered; different sizes, different thicknesses, and, obviously, breaking with different implements. So you're very quickly into a large data management problem, and that's just dealing with the target sound, let alone all the non target sounds, such as, I don't know, if you've got cats, then knocking things off the work surfaces in kitchens that may be confusable with the types of sound you're trying to detect.
So that's the sort of smart speaker side of what's called audio event detection. That's detecting specific sounds, in that case glass windows being broken. If we move on from event detection to something like scene detection, this isn't a single sound; this is a sort of combination of sounds, or soundscape detection. That might be around detecting whether it sounds like somebody's at a train station or, you know, a coffee shop, whether it's a physical scene in that case or whether it's an acoustic scene: so whether it sounds calm, whether it sounds lively or not, or indeed whether it sounds like it's inside or outside. Those would be examples of acoustic and physical scene detection. And both of those sit under what's called sound recognition, which is the field in which the company leads.
[00:07:33] Unknown:
And it seems that at least the majority of the use cases that you're discussing now are more consumer oriented, for people to be able to take advantage of some of this intelligence to enhance their sense of well-being or get some sort of feedback about their environment. I'm wondering if you've also experimented at all with use in industrial contexts, where particular types of sound might be indicative of some type of imminent failure in terms of structural or manufacturing issues, or, you know, maybe in mining, where certain sounds might be indicators of some type of physical risk. I'm wondering if that's something that you've looked at at all or something that you're intending to branch out into. We obviously looked at, when we started, a whole range of different applications of the technology,
[00:08:18] Unknown:
given it's sort of a foundational technology in that respect. Yes, we looked at what the area you described might be; for me it would be called predictive maintenance or something of that nature. The commercial activity of the company is largely focused towards consumer electronics. It's where we've had the most success commercially, wide scale, you know, adoption. So that's the bulk of the commercial effort, and that then obviously translates into the thrust of the sounds we're detecting. Most of this world can be described in terms of breaking it down by the type of sounds.
So, obviously, the sounds you'd get in a production plant, I think is the example you'd use, would be very different than the sorts of sounds you and I would care about in our house or if we're out and about on the street. They do end up being very different sounds, and we capture that in terms of the taxonomy that we use to structure our data. And my understanding
[00:09:16] Unknown:
of the way that you actually deploy your product is that it's an embeddable AI model that other companies can license and include within their own products. So I'm wondering what types of challenges that poses in terms of the deployment mechanism and the types of interfaces that you provide to those companies to be able to take advantage of your technology and just issues in terms of updating the model definition if there are any changes or enhancements that you make to it?
[00:09:44] Unknown:
So, yeah, for sound recognition, because sound inherently goes hand in hand with privacy concerns for obvious reasons, we believe that sound recognition under a large range of use cases is best done at the edge of the network. Obviously, all the audio data can stay there for the means of classification. You don't need to transmit it off the device, which gives you cost and other economic benefits and scalability benefits. More on the con side of that approach, clearly, you don't get to update the models as quickly as you would if it was a, you know, SaaS based model or something like that. We don't tend to see that being much of a commercial issue. Most of the firmware now on consumer electronics devices is updated reasonably regularly, and we also know that when the customers do want to get that updated, it's something they can easily push out to the end users.
On the general point of the challenges it faces, it means that you need to know quite a lot about the subject matter variability that you're trying to detect, so that's where the quality of the datasets comes in. But, generally, consumers don't tolerate lots of failures out of classification systems, and especially not around the fault tolerant sort of, I don't know, aspects of security or safety. Where, you know, if I told you, Tobias, your house was being broken into now because I heard a window being broken, if you rushed home, if you're not already there, which is highly likely you are given the circumstances, but if you did rush home and found nothing, you're not gonna tolerate many of those as false alarms. So, actually, you want the models to be pretty well structured and understanding most of the variability they're gonna come in contact with. Otherwise, the overall value proposition doesn't work very well. So that sort of aligns with this notion of being edge based in the large number of use cases that are applied to sound recognition. And given the fact that you're building these AI models and everyone knows that it's garbage in, garbage out, and you
[00:11:55] Unknown:
highlighted the fact that you have to ensure a high amount of quality in the input data. And I'm wondering what are some of the unique challenges that you are facing in terms of being able to collect and label and create a taxonomy around these arbitrary sounds and being able to ensure that you can correlate them with some sort of meaningful event? It's a great question. Let me just give you the broad brush strokes, and then Tom deals with this on filling out that taxonomy. Obviously, part of the great job he does is around sort of doing what we call sound expansion.
[00:12:30] Unknown:
So on the taxonomy side, we break things down at the top level into three parts: there's anthrophony, geophony, and biophony, but there's sort of 700 label types that we're dealing with on a daily basis inside the system. A label type would be something like a glass window being broken or a smoke or CO alarm going off. So this is a large set of classes you're dealing with on the target and non target sound side. Just before Tom answers the sort of practical problem, on the AI side there's also a range of specialized things you need to do. Clearly, if you take, I don't know, an off the shelf speech recognition system, the acoustic model is designed for our voice box, and, clearly, a large number of the sounds we deal with are produced by humans, and an even larger number aren't produced by humans using their mouths. So, you know, there's quite a lot of issues there on the acoustic model side. And then, as I said earlier on, that language model that the speech recognition companies rely on very heavily does quite a lot of the heavy lifting in correcting the errors made by the acoustic model. Clearly, when somebody breaks your window, in the example I'm using, it's not trying to speak to you in any structured way, so you don't have that language model. So there's also fundamental AI things you need to solve before you even start, which you can only do with good quality data. So you need to both get the garbage in, garbage out principle sorted, and then you can tackle the AI side off the back of that, and sort of the out of the box techniques don't work.
[00:14:08] Unknown:
In terms of the day to day stuff, Tom, you're best placed to explain some of the challenges we face there and the tools we use to overcome them. Yeah. Sure. So for most of the stuff we work on, like a data collection, you can consider it as, you know, a project type approach. We'll be given some notion of a sound that we want to work on, and we'll attack it in different stages. The first part will be considering the spec that we want to develop for it. Perhaps for certain sounds we're more strict around certain criteria, and for other sounds less so. And that spec's really gonna be important in guiding us through the rest of the process, in terms of how we go about the data collection, right through to the labeling and the metadata associated with it. So consider, for example, something like a dog bark sound. You can consider all of the different degrees of variation that dogs exhibit, if you like. You have small dogs, you have big dogs, different breeds. Perhaps the age will affect the sound the dog will make when it barks, and other sorts of factors. So for each sound that we work on, we have to do this massive brainstorming exercise to make sure that the data we're gonna gather is gonna be valuable. It's similar to speech recognition when you're designing datasets, right? You have to make sure that they have all of the different phonemes or whatever it may be that you're interested in. So then we'll have developed this plan for the data that we actually want to collect, and then there'll be a stage where we have to essentially source this data. Unlike some sounds, like perhaps if you're doing a smoke alarm type sound where you can buy the sensors, for a dog, you know, you can't just get a dog off a shelf, as it were. So you have to source these dogs in some way. And so we'll reach out to the volunteers that we have. We have a global volunteer network, and so we can draw on those people to provide us the sources for these sounds. We're able to just email these people, and, yeah, for some amount of money in return, they'll be able to provide us with a dog in this situation.
So we'll line up a load of people to help do the recordings, and then a situation like that will probably go into our anechoic or semi-anechoic chambers at our sound labs, and we make sure we record the sounds, you know, to a really good standard, as you said, about the garbage in, garbage out. But, of course, there's a few more bits of variation there that you've really got to focus on, and one of those areas will be the different channels, or the microphones, if you like. You can also consider, for example, something like a dog, you know, that's gonna be running around and moving around, so that presents its own challenges. And I have to say, in my experience working at the company, no one sound is really the same. There's always challenges that you don't expect when you first start working on these things, that you have to overcome to be able to do a good job with those sorts of things. And then, as you say, we've spent a load of time gathering a load of really fantastic data, and then you have to process this data and label this data. The labeling is a really important aspect of the sound recognition problem: just deciding on how you want to do the labeling. Do you make it extremely fine grained, or can it be more coarse, or can you use, like, weak labels and stuff? And once we've gone through that stage, then we can make sure we get it into Alexandria, our data platform, for storing all of this data. And I think every stage of just the data aspect of the pipeline has its own challenges, and you have to get a lot of things right to make sure that the data you then present onwards to the machine learning teams is of the best standard that you can really get.
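To make the "spec" idea concrete, here is a minimal sketch of how the variation dimensions for a dog bark collection project might be written down before any recording happens. The field names and values are invented for illustration and are not Audio Analytic's actual schema.

```python
# Illustrative only: a made-up collection spec for a "dog bark" project,
# capturing the subject and channel variation discussed above before recording.
from dataclasses import dataclass, field


@dataclass
class CollectionSpec:
    label_type: str                                            # taxonomy label this project targets
    subject_dimensions: dict = field(default_factory=dict)     # subject-matter variability to cover
    channel_dimensions: dict = field(default_factory=dict)     # device/channel variability to cover
    labelling_granularity: str = "event"                       # "event" (onset/offset) vs "weak" (clip level)


dog_bark_spec = CollectionSpec(
    label_type="dog_bark",
    subject_dimensions={
        "breed": ["terrier", "labrador", "german_shepherd"],
        "size": ["small", "medium", "large"],
        "age": ["puppy", "adult", "senior"],
    },
    channel_dimensions={
        "environment": ["semi_anechoic_chamber", "living_room", "garden"],
        "device": ["reference_mic", "smart_speaker_a", "phone_b"],
    },
    labelling_granularity="event",
)
```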
[00:18:08] Unknown:
So, because I know, Tom, you've got, what, some 15,000,000 pieces of audio data in Alexandria, 200,000,000 pieces of metadata, and 700 label types also. I think, Tobias, one of the interesting things to realize is that labeling audio data, as opposed to speech, presents its own set of challenges, as Tom talked about. So even if you take something as basic as baby cry, and, Tobias, I don't know if you're a dad or you've got kids or anything, but if you sit down a bunch of people and you say, when you hear the recording, tell me when the baby starts crying and when it stops, most people will disagree with each other. They'll agree in the main, but at the edges, when you're trying to do, I don't know, 10 millisecond labeling accuracy around when a baby started crying, you'll get into debates like, is it crying? Is it grumbling?
You know, all the things that you would do as a parent start to come out, because that audio is just less exact. Whereas in a speech world, it's a lot more exact. You can take a typical person off the street and say, label the words and when they started, and there'll be very little disagreement in comparison. And that gives Tom and his team some fundamental challenges with labeling up these sorts of things, which I know they spend a whole bunch of time just figuring out. How do you label a new type of sound? What is that sound?
[00:19:25] Unknown:
And when does it start and when does it stop? And the metadata aspect too is interesting, because with things like textual records or structured data, it's easy to associate the metadata with the record at the time that it's being created. And with image data, there's the standard of EXIF tags. And I know that, for instance, with MP3, they've got ID3 tags. But I'm wondering if there's any sort of useful standard that you can use for embedding the metadata with the records, or what your approach is for being able to effectively associate that information with the actual audio segment and ensure that they propagate through your system in conjunction with each other so that they're easy to relate to one another? So we have a whole subject. We call it audio provenance,
[00:20:07] Unknown:
and it's a whole subject matter for us internally. Of the two examples you've used, let's take image data. If I showed you three pictures of a toy dog and one of a real dog, you'd very quickly be able to identify, with no prior information, which is the toy dog and which are the real dogs. Audio is much more complicated than that. We're very much attuned as humans to sort of fill in the blanks. And so, you know, I could play you three recordings of smoke alarms and say one of those is a fake smoke alarm, and I guarantee you'd be very bad at telling me which one was the fake and which one was the real one. So if, for example, you scraped a bunch of audio files off the Internet, you'd be straight into that garbage in, garbage out principle. By doing that high quality data collection inside those semi-anechoic environments, it means we're there when the subject matter variability is explored.
And then you're right. That sort of chain of evidence, if you will, has to be passed all the way through the pipeline, right through the data collection, processing, labeling, augmentation, training, evaluation, you know, even sometimes down to the compression and deployment levels, so that you know that it's doing a good job. In terms of frameworks for doing that, there are no off-the-shelf frameworks. That's a completely new area itself.
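Since no off-the-shelf framework exists, here is one way such a chain of evidence could be represented: a provenance record that travels with an audio asset, with every pipeline stage appending an entry. This is a hedged sketch under assumed field names, not a description of Audio Analytic's internal system.

```python
# Illustrative "audio provenance" record: each pipeline stage appends an entry,
# so any training example can be traced back to its original recording.
import hashlib
import json
from datetime import datetime, timezone


def file_digest(path: str) -> str:
    """Content hash of the audio file, so later copies can be verified."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def new_provenance(path: str, collection_session: str) -> dict:
    return {"sha256": file_digest(path), "collection_session": collection_session, "history": []}


def record_stage(provenance: dict, stage: str, **details) -> dict:
    provenance["history"].append({
        "stage": stage,  # e.g. "labelling", "augmentation", "training"
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **details,
    })
    return provenance


# Usage: create a toy file so the example runs end to end.
with open("device_07.wav", "wb") as f:
    f.write(b"\x00" * 16)

prov = new_provenance("device_07.wav", collection_session="session_042")
record_stage(prov, "labelling", labeller="annotator_3", label_type="smoke_alarm")
record_stage(prov, "augmentation", method="reverb", impulse_response="hallway_ir_12")
print(json.dumps(prov, indent=2))
```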
[00:21:31] Unknown:
And, yeah, with audio data, there's definitely a huge degree of variability along a number of different axes. Whereas, as you said, you've got your anechoic chambers for being able to isolate the sound to the specific piece that you're trying to collect, out in the real world that's often going to be overlaid with whatever the other background noise is, whether it's the sound of your washing machine in the next room or the sounds of engines going by outside, and then being able to isolate that sound. And then for your volunteers who are contributing the audio that you're using for this collection process, I imagine that there's variability in terms of the quality of the microphones that they're using, the sample rates that they're collecting the audio in, the specifics of the audio format that it's being collected in, the lengths of the segments. I'm wondering how you approach being able to try and encapsulate all of that variability and be able to standardize
[00:22:23] Unknown:
it in some way for being able to feed it through your model training process. In general, you're right. We think about it in terms of subject matter variability and channel variability, with channel variability split into two parts. There's sort of acoustic coupling variability, which would be the environment you're in, the acoustics of it. Is it a reverberant environment? Is it in the bathroom, or is it, you know, sort of in the hallway? And then you've got the actual device channel variability, which includes the microphone.
It includes all of the various parts of the audio subsystem before the input audio is received at ai3, which is the inference engine that we run to do the high quality sound recognition we do on the device. Tom, in terms of,
[00:23:09] Unknown:
the challenges with all that, if you wanna pick that piece up. Yeah. Sure. So, I mean, yes, you're absolutely right in terms of, like, a massive amount of data to try and encapsulate. Consider all the things you've talked about and then sort of draw it back to just one single file. For one single file, as you said, you'll have the particular sampling rate, the particular bit depth, the particular channel it was recorded on, a bunch of settings around the device you were using to record or the sound card. Then you'd have stuff like the room it was in, and perhaps, like you say, is there some sound in the background that's going on at the same time? Obviously, that will be in situations where we're recording in situ versus, say, in the anechoic chamber. And then there's the source variation, the gender of the dog, the age of the dog, all these sorts of things. In terms of capturing that data, we've got tools that help us structure the collection around this, but it's still quite a manual effort: you still have to match things up. When a volunteer comes in, you have to put the name of the dog in a particular record and make sure that that's stored along with the correct file. And, going back to the dog example, if you record one particular dog barking, you might be recording it with several tens of devices at the same time. So you need to make sure that the information around the dog is propagated to all those, say, 50-odd devices. But then the information about the individual devices is kept specific to the devices.
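As a rough illustration of the pairing Tom goes on to describe, the session-level metadata about the source (the dog) can be fanned out to every device's recording, while the device-specific details stay attached to each individual file. The structure and field names here are invented for the example.

```python
# Illustrative: copy session-level (source) metadata to every device recording
# from the same session, keeping per-device details specific to each file.
def pair_metadata(session_meta: dict, device_recordings: list[dict]) -> list[dict]:
    paired = []
    for rec in device_recordings:
        paired.append({
            "file": rec["file"],
            "source": dict(session_meta),   # shared: breed, age, sex of the dog, ...
            "channel": {                    # specific: device, sample rate, bit depth, ...
                "device": rec["device"],
                "sample_rate_hz": rec["sample_rate_hz"],
                "bit_depth": rec["bit_depth"],
            },
        })
    return paired


session = {"label_type": "dog_bark", "breed": "labrador", "age_years": 4, "sex": "male"}
recordings = [
    {"file": "s042/dev01.wav", "device": "reference_mic", "sample_rate_hz": 48000, "bit_depth": 24},
    {"file": "s042/dev02.wav", "device": "smart_speaker_a", "sample_rate_hz": 16000, "bit_depth": 16},
]
records = pair_metadata(session, recordings)
```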
And so, you know, we do have a pairing of a chunk of metadata with each individual sound file, and you can imagine that those numbers grow pretty vast in terms of,
[00:25:03] Unknown:
yeah, how many kind of elements of metadata we have a record of. And in terms of the taxonomy that you're building for being able to track and categorize these different audio segments, what was your approach for structuring the initial taxonomy? And how has it had to evolve? And what are the assumptions that have been challenged in the process of building and growing that taxonomy for being able to make that information useful in some sort of structural or hierarchical way? So,
[00:25:35] Unknown:
the taxonomy is structured on what's called an actor principle, which is why anthrophony, biophony, and geophony are the top level things. So, obviously, caused by humans, caused by geography, if that makes sense, and caused by biology. And then it cascades down from there. The actor principle is a fundamental one. It was a specific taxonomy principle we came up with because, obviously, something needs to cause those sounds in the environment. So using that as a fundamental building block means that you're not gonna go far wrong. Skipping to your last question, in terms of things that we've effectively learned that we didn't know we were gonna have to learn, one of my favorite examples is not realizing that sometimes the world conspires for you and sometimes it conspires somewhat against you. So there is a smoke alarm that's, I think, the 3rd or 4th most popular selling smoke alarm in North America, and it sounds identical to a bird species in the south of France.
Now, I'm pretty sure that that bird species hasn't evolved to mimic the smoke alarm, but that sort of thing is then presented to the machine learning engineers, saying, well, these things sound pretty much identical to humans, but you need to separate them out. Otherwise, people are being told that the smoke alarms in their house are going off when in fact it's just the bird that they keep in their living room tweeting away, and it happens to sound identical to this North American smoke alarm. Which the engineers solved. But those sorts of interesting quirks of, I suppose, fate, if you will, are fascinating to experience.
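To picture the actor principle, here is a toy slice of what such a taxonomy tree could look like. Only the three top-level classes and the rough scale of around 700 label types come from the conversation; the intermediate levels and example leaves are invented for illustration.

```python
# Toy slice of an actor-based sound taxonomy: top level by what causes the sound.
# Intermediate nodes and leaves are invented; only the top level is from the episode.
TAXONOMY = {
    "anthrophony": {                      # caused by humans and their artefacts
        "alarms": ["smoke_alarm", "co_alarm", "car_alarm"],
        "impacts": ["glass_break", "door_knock"],
        "vocal": ["baby_cry", "cough", "sneeze"],
    },
    "biophony": {                         # caused by other living things
        "domestic_animals": ["dog_bark", "cat_meow"],
        "birds": ["bird_song"],
    },
    "geophony": {                         # caused by the physical environment
        "weather": ["rain", "wind", "thunder"],
    },
}


def leaf_labels(node):
    """Walk the tree and yield every leaf label type."""
    if isinstance(node, list):
        yield from node
    else:
        for child in node.values():
            yield from leaf_labels(child)


print(sorted(leaf_labels(TAXONOMY)))      # the real system tracks roughly 700 of these
```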
Although they do give Tom and the rest of the team, I'm sure,
[00:27:30] Unknown:
sleepless nights of worry as they try and figure out how to best collect the data and best separate it out. And another element of this problem space is that a lot of the tools that have been built for being able to work with and process large volumes of data are generally oriented around textual and structured data. So I'm wondering what you've been able to use in terms of off the shelf components for being able to actually process and build these models, and how much of it you've had to custom build in house specific to your use case and your problem domain?
[00:28:03] Unknown:
Sure. Yeah. So, I mean, you're absolutely right. A lot of the tools, when it comes to the audio world, say audio and then bridging a bit into machine learning, are really for speech recognition type applications or kind of music oriented stuff. So you might have, for example, some tools around transcription, aiding with transcriptions for speech recognition type problems. But, obviously, as Chris has touched on a few times, sound isn't speech. And the other way you can consider it, there'll be, you know, even software like Audacity, absolutely fantastic at doing the job it does, but, again, it's specific to music, in this case recording and music production and that sort of thing. And so often the problem that we're trying to solve is quite specific and is, you know, difficult. There is an element of having to roll your own, or enhance if you can. Of course, it's easier and more time beneficial to do it that way, to enhance something. But if you have to roll your own, then for a lot of the stuff that we work on, we do kind of have to roll our own. I mean, consider the example of recording with many devices at one time. There isn't a magical start button that allows every single device to be started at the same time and stopped at the same time, because, well, actually, if there was, we'd like to hear about it, because that'll be extremely useful. But you then have to make sure that, if there are different offsets at the beginning of the files, the sounds occur in the same part of the audio across the many different devices. And so we've developed techniques to handle that problem ourselves.
And, yeah, I guess, you know, right through the pipeline, we have got a lot of stuff that is bespoke to us, to Audio Analytic, to just help solve those problems that aren't quite the same as, you know, in other areas.
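The episode doesn't say how Audio Analytic actually align multi-device recordings, but one standard way to solve the "no magic start button" problem is to estimate the relative offset between two recordings by cross-correlation and then trim or pad the lead-in. A minimal NumPy sketch of that generic technique, assuming both signals are mono and already at the same sample rate:

```python
# A standard (not necessarily Audio Analytic's) approach to aligning two
# recordings of the same event that were started at different times:
# find the lag that maximises their cross-correlation, then trim or pad.
import numpy as np


def estimate_offset(reference: np.ndarray, other: np.ndarray) -> int:
    """Positive result: `other` started recording earlier and has extra lead-in samples."""
    corr = np.correlate(other, reference, mode="full")
    return int(np.argmax(corr)) - (len(reference) - 1)


def align(reference: np.ndarray, other: np.ndarray) -> np.ndarray:
    lag = estimate_offset(reference, other)
    if lag > 0:
        other = other[lag:]                 # drop the extra lead-in
    else:
        other = np.pad(other, (-lag, 0))    # started late: pad the front with silence
    return other[: len(reference)]


# Usage with two synthetic signals, one delayed by 500 samples.
rng = np.random.default_rng(0)
sig = rng.standard_normal(8000)
delayed = np.concatenate([np.zeros(500), sig])[:8000]
aligned = align(sig, delayed)
```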
[00:29:58] Unknown:
And can you dig into a bit of how the actual data pipeline and data management is architected and the ways that you work with it for being able to train and build and deploy the models that you're working on?
[00:30:10] Unknown:
Yeah. Sure. So in terms of, like, you know, the whole pipeline of what we do at Audio Analytic, you can consider it as a standard machine learning delivery pipeline. We go through from the data collection right at the beginning through to the deploying of the models at the end, and there are many, many stages in between. So, focused around Alexandria, which is our, you know, massive database, the stages of the pipeline at the beginning are around the data collection side, as I've touched on, and the processing.
Like with all of these, you've got many different devices and many different formats that the audio comes in. Some devices will give it to you as one format, and others will give it to you as another format. The next stage will be the labeling part, and that's where you kind of marry up all of the metadata and the labels and the sound and get it into Alexandria so it can then be made available for the machine learning teams. And then there's a stage in between there you've got to consider as well. It's not just a bunch of audio data in a dataset, or, you know, from a dataset you download; on its own, it doesn't necessarily give you that much. You've then got to say, right, let's split it into those machine learning sets, where we need to make sure that, ultimately, the stuff that goes in that testing set is representative of what the real world situation is gonna be, and then you need to make sure you've got the same kind of distributions throughout. And so that's another very important aspect: making sure that you gather enough variation all across the board so that when you chop it up into these little sets, you've got it in the right places. And that's the point where the machine learning team will take this stuff, and they'll apply techniques like data augmentation, you know, to further increase the volumes. I mean, machine learning models nowadays are so data hungry that you have to apply these techniques.
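One common way to get the right distributions into the right sets is to split on the source or recording-session metadata rather than on individual clips, so the same dog or session never appears in both training and test data. A rough sketch of that general idea, not the company's actual tooling:

```python
# Rough sketch: group clips by their source (e.g. the individual dog or recording
# session) before splitting, so no source leaks across train/dev/test.
import random
from collections import defaultdict


def split_by_source(clips: list[dict], ratios=(0.7, 0.15, 0.15), seed=0) -> dict:
    by_source = defaultdict(list)
    for clip in clips:
        by_source[clip["source_id"]].append(clip)

    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)

    n_train = int(len(sources) * ratios[0])
    n_dev = int(len(sources) * ratios[1])
    groups = {
        "train": sources[:n_train],
        "dev": sources[n_train:n_train + n_dev],
        "test": sources[n_train + n_dev:],
    }
    return {split: [c for s in srcs for c in by_source[s]] for split, srcs in groups.items()}


clips = [
    {"file": "dog01_take1.wav", "source_id": "dog01", "label": "dog_bark"},
    {"file": "dog01_take2.wav", "source_id": "dog01", "label": "dog_bark"},
    {"file": "dog02_take1.wav", "source_id": "dog02", "label": "dog_bark"},
]
splits = split_by_source(clips)
```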
And then it goes into the training. You've got lots of different model evaluation type procedures going on, and then you have a more formal evaluation stage, then kind of deciding on what sort of models you're interested in, or what trade offs you're making, I guess, would be a better way to structure that. You'll then look at trying to compress these things. As Chris said, we run on the edge. We can't have massive models that take tons of computation; they need to be really small, they need to be really, really fast. We're not the only stuff on these devices, right? There'll be other functionality that these devices will have, and we're just one part of the stack, if you like. Yeah. And then that kind of leads into the deployment side of things. In terms
[00:32:42] Unknown:
of optimizing that pipeline, in that sort of evaluation stage, one of the interesting things we've done recently is around something called the polyphonic sound detection score. We found that, like with any challenge, well, any machine learning challenge and probably more broadly, you need to optimize for the right criteria. And the generic methods that were being used, just borrowed from machine learning, were not optimizing the systems appropriately, and the pipelines and everything else. So we released a bunch of GitHub code for this Polyphonic Sound Detection Score.
That's now being used. I think it was published at ICASSP this year, and it is being used as part of the DCASE competition, which is sort of the benchmarking world, the community standards, if you will, of sound recognition. So it's great to see the discipline grow out of its infancy into those more developed areas, and moving into what we start to call 2nd generation sound recognition.
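For reference, the code mentioned here is the open source psds_eval package (github.com/audioanalytic/psds_eval). The sketch below is recalled from that project's README rather than verified against the current release, so the constructor arguments and column names should be treated as assumptions and checked before use.

```python
# Hedged sketch of scoring a sound event detector with psds_eval.
# The exact API (class name, arguments, column names) is recalled from the
# project's README and may differ in the current release.
import pandas as pd
from psds_eval import PSDSEval

# Event tables: one row per sound event, times in seconds.
ground_truth = pd.DataFrame(
    [{"filename": "clip_001.wav", "onset": 1.2, "offset": 2.0, "event_label": "dog_bark"}]
)
metadata = pd.DataFrame([{"filename": "clip_001.wav", "duration": 10.0}])
detections = pd.DataFrame(
    [{"filename": "clip_001.wav", "onset": 1.3, "offset": 1.9, "event_label": "dog_bark"}]
)

psds_eval = PSDSEval(
    dtc_threshold=0.5,      # detection tolerance criterion
    gtc_threshold=0.5,      # ground truth intersection criterion
    cttc_threshold=0.3,     # cross-trigger tolerance criterion
    ground_truth=ground_truth,
    metadata=metadata,
)
psds_eval.add_operating_point(detections)   # one operating point per system threshold
result = psds_eval.psds(alpha_ct=0.0, alpha_st=0.0, max_efpr=100)
print(result.value)
```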
[00:33:58] Unknown:
And in terms of the models and the deployment of it, I'm wondering if you just deploy one model that works generically across all the different sound categories that you have collected, or if you train these models for specific deployment targets, where you have one that's specifically focused around security, where it has things like the glass break or sounds of, you know, a door being hammered on, and then you have a different model that you deploy that's focused on things like detecting a cough and sneeze and sniffles for health related environments?
[00:34:30] Unknown:
So it's a great question. We tend to think of the sound profiles, as we call them, which is one or a collection of sounds, broken down per device, because the value propositions tend to align at the device level. And then, of course, you can run multiple sound profiles together. In terms of those configurations, or of those underlying models, it really comes down to the individual use cases that are applicable on the devices. But we try and make sure that the end set of sound profiles we're delivering is optimized for that set of use cases for that set of devices.
[00:35:10] Unknown:
I imagine that helps too with restricting the size of the actual deployed model, rather than trying to fit everything into one thing and then also compress that down to a size that's runnable on an embedded device, versus just training the model for a specific use case and then reducing the overall scope of what it needs to be able to handle and thereby the size of the model that's being deployed?
[00:35:31] Unknown:
Yes. Size is somewhat proportional to the number of sounds that you add, especially when you're dealing with a smaller number of sounds; as it grows, that sort of relationship becomes less distinct. Of course, if there are sounds that you will never come across and never need to detect on a device category, it would seem a bit silly from a computational point of view to burden that device with looking for those sounds. So, yes, it definitely helps. This is about structuring a sense of hearing in line with what needs to be heard for those device categories, a sort of just-enough type approach. That means that you have that good trade off between computational smallness, or sort of resource size, and that sense of hearing. In fact, we did a recent set of demos at the Consumer Electronics Show this year in Las Vegas, in our private demo suites, which showed that we were capable of running a sense of hearing right down on an M0+ processor, which is the smallest grade of processor that ARM do. So really showing that you can push that sense of hearing down onto incredibly small processors.
[00:36:48] Unknown:
And then in terms of being able to actually evaluate and test the models that you're working with, I'm wondering what you have found as far as their capability of operating in noisy and complex and layered audio environments
[00:37:03] Unknown:
and how the overall testing of the model has fed back into your strategies for collecting the source data? So if I take the top piece, then, Tom, if you can relate it back to the source data piece. In terms of that feedback, generally, because we've got the world's largest collection of data for this area, we have a high degree of certainty with the models we're providing to the marketplace already. You know, we have large amounts of, say, 24/7 recordings, large amounts of environment recordings, and obviously large amounts of target sound recordings. So we typically find that we're, you know, pretty good in our guesses of what the performance will be for a new sound profile that we're producing. In terms of the sort of things we learn, going back to that example of things that you just can't predict, using the example of the bird in the south of France and the North American smoke alarm, that is something beyond the wit of man to sit in a room and figure out. You're only going to get that sort of insight from the actual field deployments.
Our technology's deployed in something like 160 countries worldwide, so we've got a very good sense of the sort of problems that are faced on a worldwide scale. In terms of how that feeds back, obviously, it feeds back into, do we need more data in a certain area? But, Tom, you're probably best placed to pick back up that full loop back to the beginning of the pipeline, the data collection piece. Yeah. I mean, well, like you say, it's beyond the wit of man to try and think of all these things. I mean, when we start these projects, I make sure we sit in a room and think about all the stuff that's gonna try and bite us in the backside.
[00:38:46] Unknown:
And you do just never think of everything. There are just too many bizarre things, and your head is so focused on one aspect of looking at the problem that, you know, you'll be completely blindsided by another even though you spent all that time trying to figure it out. And so, like, yeah, the delivering of these products is very iterative. You develop something and, you know, get it deployed, and then you realize you have these sorts of issues. And often, you know, going back to data collection is one of the important ways to address these problems. You can consider as well, for example, talking about how you actually do the evaluations: as Chris touched on, the 24/7 sort of use case for a lot of our products. We just have absolutely tons and tons of data that is just that kind of use case: it's like a product in a room, and it's just recording all that time, and we'll have, you know, many different examples of that. And so hopefully you try and identify some problems early on in your evaluation, in your product development cycle, by doing tests like that. And in terms of the actual problems that you make, obviously, that's then focusing more on the false positive area of the evaluation. And then in terms of focusing more on how well we actually do at detecting the sounds we're interested in, we mentioned the anechoic chamber earlier, and that really allows us to have a sort of green screen for sound. You know, like you said, you get this layering of background sounds and so on. And really, a good way of going about it is, again, let's go back to the dog bark analogy. If you want to know whether a particular model is going to detect a dog bark in a particular room with, say, tile flooring, and it's a really reverberant environment, and someone's hoovering in the background, and, you know, whatever else, maybe the device is 5 meters away or something, you have to literally test for that exact scenario. And so, by having these green screens, we can sort of test that. And, again, it's looking at this stuff iteratively, evaluating often, and then feeding it back in and seeing where you can make improvements.
[00:41:11] Unknown:
What we find, though, I think, is that the experience we now have with doing these things means that internal iteration speeds up and speeds up and speeds up. So we're producing sounds at an increasing rate, and those sounds are being produced in their first iterations internally at higher quality, just because, obviously, we're starting from a higher place. We've learned a lot, and that iteration cycle means that we now iterate internally very quickly, so that when we do release that product out into the marketplace, the customer can have high assurance that it's already working from a sound recognition perspective, and not thinking, I'm sort of getting a bit of a product, but I'm gonna have to feed back data to improve it, because clearly that's not gonna be acceptable to their customers. And in terms of the
[00:41:58] Unknown:
audio that you're working with, what have been some of the most interesting or unusual or strange sounds that you've had to try and collect and categorize?
[00:42:07] Unknown:
Well, I'll do three stories. There's strange in just sort of strange experiences of capturing the data, so I'll do that one first. We did some gunshot recordings. We, oddly enough, chose to do them in the UK, and machine guns anywhere in the world are not easy to come by, but in the UK they're particularly challenging to come by. We managed to arrange a set of machine guns to be recorded; one of them was an Uzi submachine gun. There are only two civilian sort of automatic ranges in the UK, and we had a rental van to go down there. I remember sitting there with the guy, and he laid out the guns and explained that we had to move our rental van. And I said, well, why is that? It seems to be nowhere near the targets. And he said, well, the Uzi doesn't so much aim bullets as sort of direct them vaguely in that area, so you probably wanna move it before you get a bunch of bullets in the van. And I'm pretty sure that would have caused us a whole range of fun, from at least the deposit on the rental van, let alone the explanation to the police. So that's the sort of strange data collection experience.
In terms of strange sounds, oh, that's a good one. Tom, what's your favorite strange sound that you've done the data collection for so far, that you're able to talk about, obviously? Yeah. For sure. Well, yeah. Again, I guess this is
[00:43:35] Unknown:
trying to think of individual sounds. I mean, I guess you come across sounds when you've got 24/7 data; you might come across stuff like geese honking, which is always quite an interesting one. In terms of weirdness of collection for me, I guess it'll be the outdoor glass break collection we did, where, perhaps similar weirdness to Chris's thing, but, like, you know, essentially we were tasked with just being on the street and smashing people's windows, of course with permission and everything else.
And we were paying people. We were saying, you know, we'll give you a good amount of money: you can replace your windows that are 20 years old, but we'll break them for you. And it was a really, really bizarre experience. I mean, you're literally in, like, a residential area, and, well, we'd put on fluorescent jackets, so we kind of looked like, you know, we were meant to be there, with all these different microphones and all these booms and stuff and all these wires going around, and then someone's smashing these windows with a sledgehammer and this sort of stuff. Yeah. That's how you know you kinda work for a very interesting company, when you're tasked with something like that, but, yeah, that's definitely weird.
[00:44:51] Unknown:
The one I'd add: we did a car alarm data collection exercise, and we'd done the sort of in-car model fits, but we were looking at the retrofit car alarms. Obviously, we wanted to know what they'd sound like when they're fitted into the car, so we had a board in the back of a car with these car alarms on it. We collected most of our data, but there was one challenge left, which was to go to a very reverberant real environment and collect them in there, because we just wanted some, you know, sort of high quality real reverberant recordings. So I remember sitting in a multi story car park with a car, and these things are very loud as well, so you wanna go a bit of a distance away, because, obviously, the car alarm is unlikely to go off when people are physically sitting in the car, which would dampen the sound. So we're sitting away from this car, with a wire sort of trailing from the car, with a big button that sets off all the car alarms.
And then somebody pointed out this might look a bit strange to the security guards watching us on the inevitable CCTV cameras that would be watching these things, in sort of some bizarre, I don't know, Italian Job kind of blow-the-doors-off type piece, or something, it's not entirely clear. And then realizing we'd better finish our experiments up rather quickly before we caused anybody a whole bunch of hassle. I think on that one, it gets interesting because the recording of audio itself presents a whole range of challenges from a data privacy, data ethics perspective.
So if you're trying to do audio recordings in public, there's a whole set of rules that govern that. Obviously, there are increasing rules and a direction of travel around what data you should be able to use in machine learning models, both legally and ethically, you know, from GDPR and the things that are going through the various stages of legislation. And even if you do something basic, like you want to record a park, when you try and square that with the various laws, you need to get permission to do that. But it's a public park. Who do you get permission from? You know, so you're down into really quite fundamental challenges if you want to make sure that not only is your data good to be training off today, but, as the rules change and evolve and society understands what it wants to do with machine learning, that data you based your models on doesn't start to be eaten away at, and you can't include it, or it's decided that you didn't have the right permissions, or the traceability around the data and what's gone into it isn't clear. You know, you don't wanna fall foul of any of those things. So the effort we go to to make sure that we've got that complete chain of evidence around our data is quite extreme, and that's been ingrained in us from day one.
It's obviously been expensive and time consuming to do, but that choice has paid off quite substantially with the direction that the machine learning world has gone in.
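To make that traceability point concrete, here is a minimal sketch of what a chain-of-evidence record for a single recording might look like. This is purely illustrative Python, not Audio Analytic's actual schema; every field name, file path, and consent reference below is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib
import json


@dataclass
class ProvenanceRecord:
    """Hypothetical chain-of-evidence entry for one recorded audio file."""
    file_path: str
    sha256: str          # content hash ties the record to the exact audio bytes
    recorded_at: str     # ISO-8601 timestamp of the recording session
    location: str        # where the capture took place
    consent_ref: str     # pointer to the signed consent/permission document
    licence_terms: str   # terms under which the audio may be used for training
    labels: list = field(default_factory=list)


def build_record(path: str, location: str, consent_ref: str, licence_terms: str) -> ProvenanceRecord:
    # Hash the raw bytes so any later question about "which file was trained on"
    # can be answered unambiguously.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return ProvenanceRecord(
        file_path=path,
        sha256=digest,
        recorded_at=datetime.now(timezone.utc).isoformat(),
        location=location,
        consent_ref=consent_ref,
        licence_terms=licence_terms,
    )


if __name__ == "__main__":
    # The path and consent reference here are invented for illustration.
    record = build_record(
        "captures/window_break_take3.wav",
        location="residential test site",
        consent_ref="consent/homeowner-042.pdf",
        licence_terms="internal-training-only",
    )
    print(json.dumps(record.__dict__, indent=2))
```

Keeping a record like this alongside every capture is one way to preserve the permissions trail as the rules around training data evolve.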
[00:47:43] Unknown:
And what are some of the other interesting or unexpected or challenging lessons that you've learned in the process of building out the technology and business aspects of Audio Analytic?
[00:47:56] Unknown:
On the business side, we license per sound, per device. The closest thing in the speech world would presumably be licensing either wake words, if that were the equivalent, or languages, so there weren't any directly comparable models, because those two are clearly different in various ways. We've had to go out and explain to people why sound detection is extremely valuable if you're designing products, and why it gives that very intuitive sense of the device behaving as you would expect it to. But that's been done from the ground up, and it was a new area for product owners to experience, which means that how they judge how we should license it, how they judge the commercial value and where the value is realized, and things like that, has been part of that journey. And it's been fascinating to experience how people perceive and how they think the world of sound should be monetized, I suppose.
[00:49:02] Unknown:
And so as you continue to build out the technical and business capacity of the company, what are some of the plans that you have for the future, either in terms of new problem spaces or use cases that you're looking to reach into, or enhancements to your existing processes of data collection and machine learning?
[00:49:24] Unknown:
Our road map is largely punctuated by which sounds for which use cases, is the way to think about it. We constantly add to that as we expand our ability to differentiate sounds and, obviously, as the target devices' capabilities themselves expand. In terms of future stuff, I think we did a blog post recently on our website, if you go to audioanalytic.com, on what we'd call context systems. Context systems look to not only detect individual sounds, but draw larger inferences off them. The example that was used in that blog post was: it sounds like you're leaving the house. So, you know, Tobias, if someone is leaving the house and you know that you've not reminded them to get bin bags, or whatever other small item you forgot to remind them about, you will know that. You'll sit in your living room, you'll hear it if it's in earshot, and you'll know what leaving the house sounds like and what preparing to leave the house sounds like. But it is not a single sound. It's a collection of sounds in a certain sequence, put into broader context.
And that sort of higher level reasoning requires even more sophisticated approaches to sound recognition, and it forms some of the foundations of what we mean by second generation. First generation sound recognition systems were built around a small range of sounds, typically safety and security applications, and typically triggered events through to the end consumer. So there might be push events through to a mobile phone: I've heard somebody breaking into your house, I've heard a smoke or CO alarm going off, that sort of thing. Second generation is about many more sounds, covering not just safety and security but entertainment, health, and well-being.
And it's the starting introduction of these context systems, which use that fundamental understanding of the individual sounds or scenes the device is hearing and build higher level inferences on top of them. That's an exciting world to see unfold, as that sense of hearing becomes more and more real, in line with what you and I would call the sense of hearing, so these devices can be more and more intelligent.
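As a rough illustration of that kind of higher level inference, here is a small, hypothetical Python sketch that takes a stream of already-detected sound events and infers a "leaving the house" context when the right sounds occur in sequence within a short window. The labels, the sequence, and the matching logic are invented for illustration; this is not Audio Analytic's actual approach or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SoundEvent:
    label: str        # hypothetical label emitted by a sound recognition model
    timestamp: float  # seconds since the start of the audio stream


# An invented higher-level pattern: these sounds, in this order and within a
# short window, suggest that someone is leaving the house.
LEAVING_HOUSE_SEQUENCE = ["keys_jingling", "footsteps", "door_open", "door_close"]


def infer_leaving_house(events: List[SoundEvent], window_s: float = 60.0) -> bool:
    """Return True if the sequence appears in order and spans no more than window_s."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    matched_times = []
    idx = 0
    for event in ordered:
        # Greedily match the next expected label in the sequence.
        if idx < len(LEAVING_HOUSE_SEQUENCE) and event.label == LEAVING_HOUSE_SEQUENCE[idx]:
            matched_times.append(event.timestamp)
            idx += 1
    if idx < len(LEAVING_HOUSE_SEQUENCE):
        return False
    return matched_times[-1] - matched_times[0] <= window_s


if __name__ == "__main__":
    stream = [
        SoundEvent("dog_bark", 1.0),
        SoundEvent("keys_jingling", 3.2),
        SoundEvent("footsteps", 7.8),
        SoundEvent("door_open", 11.5),
        SoundEvent("door_close", 14.1),
    ]
    print(infer_leaving_house(stream))  # True for this example
```

A real system would reason probabilistically over many overlapping sounds and scenes, but the sketch shows the basic idea of turning individual detections into a broader context.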
[00:51:53] Unknown:
And I'm also wondering, given the fact that you have collected this large volume of data, and it's well labeled and highly valuable for building your models, which is what you're licensing, whether you've also gone down the path of looking to license out the actual source data for other people who want to take advantage of it for other use cases. No. So typically,
[00:52:20] Unknown:
commercially, people license the model. We've got, as you brought up, Alexandria, which is the world's largest collection of this kind of data: about 15 million audio files, 700 label types, and about 200 million pieces of metadata. That helps steer where you put your research from an AI perspective. Coming back to those points I mentioned, there are fundamental architectural blocks from a machine learning point of view that you just can't use, like a language model or the speech acoustic model, and that materially impacts your performance rates. It means that if you just took an off-the-shelf machine learning technique, your performance rates wouldn't be good enough for the vast majority of sound recognition tasks.
So we've used that data to steer our research and come up with our own inference engines that are specifically designed for sound recognition. Those two pieces tend to go hand in hand, because even if you have the data, you'd still need to direct all of that research effort to get to a point where you have a well functioning second generation sound recognition system. So, generally, people want to license the output of that and take advantage of both of those pieces.
[00:53:38] Unknown:
Are there any other aspects of the work that you're doing at Audio Analytic, or the overall space of sound detection, that we didn't discuss that you'd like to cover before we close out the show?
[00:53:52] Unknown:
No, I think you're good, apart from that I'd love to send you over some bizarre audio files. We'll see if we can find you some and send them over, just because they're always fun.
[00:54:03] Unknown:
Well, maybe we can intersperse some of those throughout this conversation, just to give people something to entertain themselves with. For anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. So for data management,
[00:54:29] Unknown:
obviously, my answer is going to be biased, because we're solely focused on audio, so most of my concerns are going to be around the audio side. The machine learning frameworks have come on, the data pipeline side has come on, and the data management side has come on generically. But now we're into the specialisms, and the specialisms require us to extend what are, hopefully, extensible frameworks. I'd love to see more of those frameworks becoming more extensible and more modular, so we only have to roll our own pieces in the areas where that makes sense.
A lot of the time, we come across tools that just don't work well together. Even though there are good bits of functionality in tool A and good bits of functionality in tool B, you still need to specialize them to the task at hand, and then those two tools aren't as flexible as you'd like them to be. So if there is one area, it's recognizing that each individual task within machine learning has its own set of challenges, and baking that into the architecture of tools that could go across the industry would be incredibly valuable. And, Thomas, how about you?
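As a loose sketch of the kind of extensibility being described, here is a hypothetical plugin-style interface in Python: a generic pipeline framework exposes a small stage interface, and the audio-specific logic is rolled in as a custom stage rather than by forking the tool. The class names, fields, and threshold are made up for illustration and do not refer to any real framework.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List


class PipelineStage(ABC):
    """Minimal extension point: a framework exposing this interface lets
    domain teams plug in their own processing steps without modifying the tool."""

    @abstractmethod
    def process(self, item: Dict) -> Dict:
        ...


class LoudnessFilter(PipelineStage):
    """A hypothetical audio-specific stage: flag clips that are too quiet to label."""

    def __init__(self, min_rms: float):
        self.min_rms = min_rms

    def process(self, item: Dict) -> Dict:
        item["keep"] = item.get("rms", 0.0) >= self.min_rms
        return item


def run_pipeline(items: Iterable[Dict], stages: List[PipelineStage]) -> Iterator[Dict]:
    # The framework owns the orchestration; the stages own the domain logic.
    for item in items:
        for stage in stages:
            item = stage.process(item)
        yield item


if __name__ == "__main__":
    clips = [{"path": "a.wav", "rms": 0.02}, {"path": "b.wav", "rms": 0.2}]
    for result in run_pipeline(clips, [LoudnessFilter(min_rms=0.05)]):
        print(result)
```

The point is not the filter itself but the shape of the seam: a modular framework only needs to expose a stable interface like this for niche domains to specialize it.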
[00:55:44] Unknown:
Yeah, I'd echo Chris's answer, to be honest with you. As soon as you get into something that's a bit more niche than the mainstream applications for a lot of these things, you just have to start rolling your own, and I'd say that applies right across the spectrum of wherever you apply machine learning. That's been my experience before as well. Alright. Well, thank you both very much for taking the time today to join me and discuss the work that you're doing with Audio Analytic and sharing some of your interesting stories of collecting this audio data.
[00:56:15] Unknown:
It's definitely a very interesting use case and an interesting problem domain that you're working in, so I appreciate all of the time and effort you've put into it, and the time that you spent sharing your experiences with me, and I hope you enjoy the rest of your day. Great, Tobias. Thanks for having us on the show. It's been great.
[00:56:30] Unknown:
Looking forward to listening to future episodes as you go forward. Yeah, cheers, Tobias. It was great to speak to you. Thanks so much.
[00:56:42] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Audio Analytic
Chris and Thomas' Backgrounds
Building Sound Recognition Technology
Use Cases and Applications
Challenges in Deployment and Data Collection
Labeling and Metadata Management
Custom Tools and Data Pipeline
Model Evaluation and Real-World Testing
Interesting and Unusual Sound Collections
Business and Technical Challenges
Future Plans and Innovations
Closing Remarks and Contact Information