Interpreting human language is one of the most complex tasks in nature, and it is harder still for machines. People often ask us at Truleo how we are able to understand body-worn camera (BWC) interactions. This post details, step by step, how our machine-learned models analyze and understand a BWC interaction between an officer and a civilian.
Extracting audio features from video
For a machine-learned model to analyze spoken language, it first needs the data as numbers called “features”. For BWC footage, feature extraction is complicated by data-security concerns and CJIS compliance. At Truleo, we have developed a CJIS-compliant way of extracting audio features for our models on the fly, in memory, leaving the video intact in a customer’s evidence-retention platform.
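Truleo’s exact feature pipeline is not public, but a common choice for speech models is a log-magnitude spectrogram computed entirely in memory. The sketch below (NumPy only; the frame and hop sizes are illustrative assumptions, not Truleo’s actual parameters) shows how raw PCM samples can become model features without the audio ever touching disk:

```python
import numpy as np

def log_spectrogram(samples: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Frame a mono PCM signal and compute log-magnitude spectra in memory.

    frame_len=400 / hop=160 correspond to 25 ms / 10 ms windows at 16 kHz
    (illustrative defaults only).
    """
    n_frames = 1 + (len(samples) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([samples[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spectra + 1e-8)  # shape: (n_frames, frame_len // 2 + 1)

# One second of synthetic 16 kHz audio, processed with no file I/O.
signal = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = log_spectrogram(signal)
```

The resulting matrix of per-frame spectra is the kind of numeric “features” the rest of the pipeline consumes.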
Separating speakers

Once the audio features are available to Truleo’s models, our engine separates the audio into chunks that each contain a single speaker, a step known as speaker diarization. At this stage, speakers are tagged anonymously as “Speaker 1”, “Speaker 2”, and so forth.
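In a real diarizer the per-chunk speaker embeddings come from a neural model; as a stand-in, the sketch below greedily clusters embedding vectors by cosine similarity and hands out anonymous labels (the 0.8 threshold is an illustrative assumption):

```python
import numpy as np

def label_speakers(embeddings: np.ndarray, threshold: float = 0.8) -> list[str]:
    """Greedily cluster per-chunk speaker embeddings by cosine similarity,
    assigning anonymous labels ("Speaker 1", "Speaker 2", ...)."""
    centroids: list[np.ndarray] = []
    labels = []
    for emb in embeddings:
        emb = emb / np.linalg.norm(emb)  # unit-normalize so dot product = cosine
        sims = [float(emb @ c) for c in centroids]
        if sims and max(sims) >= threshold:
            labels.append(f"Speaker {sims.index(max(sims)) + 1}")
        else:
            centroids.append(emb)
            labels.append(f"Speaker {len(centroids)}")
    return labels

# Two distinct synthetic "voices" alternating across four chunks.
chunks = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
print(label_speakers(chunks))  # → ['Speaker 1', 'Speaker 2', 'Speaker 1', 'Speaker 2']
```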
Transcribing the audio

Transcription is one of the most universally understood steps of Truleo’s process: converting the audio into text to determine what was said. While easy to understand, this task is often challenging for speech recognition engines because of the noisy environments typical of BWC videos. Truleo’s model is trained specifically on BWC audio to achieve the world’s best BWC transcription accuracy.
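Truleo’s recognizer itself is proprietary, but many modern speech models emit per-frame character probabilities that a CTC-style greedy decoder collapses into text. A minimal sketch of that final decoding step, using a toy three-symbol alphabet:

```python
import numpy as np

# Toy alphabet for the decoder; index 0 is the CTC "blank" symbol.
ALPHABET = ["-", "h", "i"]

def ctc_greedy_decode(logits: np.ndarray) -> str:
    """Collapse per-frame argmax predictions: merge repeats, drop blanks."""
    best = logits.argmax(axis=1)
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:
            out.append(ALPHABET[idx])
        prev = idx
    return "".join(out)

# Frames predicting h, h, blank, i, i collapse to "hi".
logits = np.array([
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
])
print(ctc_greedy_decode(logits))  # → hi
```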
Identifying the officer

Once the audio is transcribed, Truleo’s models examine the voice characteristics of the audio and the text of what was said to anonymously tag one of the previously identified speakers as the “Officer”. This step allows further analysis to be restricted to only officers, only civilians, or both.
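Truleo’s model learns these cues from data, combining acoustic and lexical signals; purely to illustrate the lexical side, the sketch below scores each anonymous speaker’s transcript against a hand-written (hypothetical) list of officer-typical phrases:

```python
# Illustrative cue phrases — NOT Truleo's actual feature set.
OFFICER_CUES = ["license and registration", "step out of the vehicle",
                "you're under arrest", "dispatch"]

def tag_officer(transcripts: dict[str, str]) -> str:
    """Return the anonymous speaker label whose text looks most officer-like."""
    def score(text: str) -> int:
        lowered = text.lower()
        return sum(cue in lowered for cue in OFFICER_CUES)
    return max(transcripts, key=lambda spk: score(transcripts[spk]))

dialogue = {
    "Speaker 1": "License and registration please. Step out of the vehicle.",
    "Speaker 2": "Okay, okay, it's in the glovebox.",
}
print(tag_officer(dialogue))  # → Speaker 1
```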
Classifying events

Truleo’s event classification model uses natural language processing to determine what events may have occurred in the course of a dialogue, such as language cues indicating that an accident or arrest took place.
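The production model is a learned NLP classifier; a rule-based stand-in conveys the idea of mapping language cues in a dialogue to event labels (the cue lists here are hypothetical):

```python
# Hypothetical cue-to-event mapping for illustration only.
EVENT_CUES = {
    "arrest": ["under arrest", "put your hands behind your back"],
    "accident": ["rear-ended", "collision", "crashed"],
}

def classify_events(utterances: list[str]) -> set[str]:
    """Return the set of events whose cue phrases appear in the dialogue."""
    found = set()
    for line in utterances:
        lowered = line.lower()
        for event, cues in EVENT_CUES.items():
            if any(cue in lowered for cue in cues):
                found.add(event)
    return found

lines = ["You're under arrest.", "He rear-ended me at the light."]
print(sorted(classify_events(lines)))  # → ['accident', 'arrest']
```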
Recognizing entities

Truleo’s entity recognition model identifies key words and phrases within the text, such as directed profanity or empathy, and makes these tags available to Truleo’s “Risk Score” for ranking videos.
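Again, the real entity model is learned rather than pattern-based, but tagging key phrases can be sketched with a small (hypothetical) lexicon of positive-language cues:

```python
import re

# Illustrative phrase lexicons, not Truleo's actual tag set.
TAG_PATTERNS = {
    "empathy": re.compile(r"\b(i understand|i'm sorry)\b", re.IGNORECASE),
    "gratitude": re.compile(r"\b(thank you)\b", re.IGNORECASE),
}

def tag_entities(text: str) -> list[tuple[str, str]]:
    """Return (tag, matched phrase) pairs found in an utterance."""
    hits = []
    for tag, pattern in TAG_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((tag, match.group(0)))
    return hits

print(tag_entities("I understand, and thank you for your patience."))
# → [('empathy', 'I understand'), ('gratitude', 'thank you')]
```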
Calculating a risk score

Taking into account the amount of directed profanity, offensive language, and indications of use of force, Truleo’s Risk Model calculates a risk score that departments can use to identify the most at-risk interactions.
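The Risk Model’s actual formula and weights are not published; as a shape of the computation, a weighted sum of the tag counts from the previous steps, capped to a 0–100 scale, might look like this (all weights are made up):

```python
# Hypothetical weights — the real Risk Model is learned, not hand-tuned.
RISK_WEIGHTS = {"directed_profanity": 3.0, "offensive_language": 2.0,
                "use_of_force": 5.0}

def risk_score(tag_counts: dict[str, int]) -> float:
    """Weighted sum of risk-related tag counts, capped to a 0-100 scale."""
    raw = sum(RISK_WEIGHTS.get(tag, 0.0) * n for tag, n in tag_counts.items())
    return min(100.0, raw * 10)

print(risk_score({"directed_profanity": 2, "use_of_force": 1}))  # → 100.0
print(risk_score({"offensive_language": 1}))                     # → 20.0
```

Videos can then be ranked by this score so reviewers see the highest-risk interactions first.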
The final product of Truleo’s Audio Analysis is a rich analysis of thousands of conversations, enabling departments to quickly identify at-risk incidents and department-wide trends.