Joshua Lott / Reuters
SAY it out loud and the machines will know. Search engines are moving beyond the web and into the messy real world. And they’re finding some odd things.
Every call into or out of US prisons is recorded. It can be important to know what’s being said, because some inmates use phones to conduct illegal business on the outside. But the recordings generate huge quantities of audio that are prohibitively expensive to monitor with human ears.
To help, one jail in the Midwest recently used a machine-learning system developed by London firm Intelligent Voice to listen in on the thousands of hours of recordings generated every month.
The software saw the phrase “three-way” cropping up again and again in the calls – it was one of the most common non-trivial words or phrases used. At first, prison officials were surprised by the overwhelming popularity of what they thought was a sexual reference.
Then they worked out it was code. Prisoners are allowed to call only a few previously agreed numbers. So if an inmate wanted to speak to someone on a number not on the list, they would call their friends or parents and ask for a “three-way” with the person they really wanted to talk to – code for dialling a third party into the call. No one running the phone surveillance at the prison spotted the code until the software started churning through the recordings.
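A rough sketch of how a phrase such as "three-way" might float to the top of a pile of transcripts is to count short word sequences and ignore the trivial ones. The article does not say how Intelligent Voice's software does this; the function names, the tiny stop-word list and the sample calls below are all invented for illustration.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real system would use a far larger one.
STOP_WORDS = {
    "a", "an", "and", "the", "to", "i", "you", "it", "is", "of", "in",
    "on", "for", "me", "we", "can", "do", "with", "my", "that", "up",
}


def tokenize(text):
    """Lower-case the text and keep hyphenated words such as 'three-way' whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def frequent_phrases(transcripts, max_len=2, top=5):
    """Count words and short phrases that neither start nor end with a stop word."""
    counts = Counter()
    for transcript in transcripts:
        words = tokenize(transcript)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts.most_common(top)


# Hypothetical call transcripts, not real data.
calls = [
    "Can you set up a three-way with Mike",
    "I need a three-way call",
    "Do a three-way for me",
]
print(frequent_phrases(calls, top=3))
```

On real recordings the transcripts would first have to come from a speech recogniser, which is the hard part and is not shown here.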
Intelligent Voice originally developed the software for use by UK banks, which must record their calls to comply with industry regulations. As with prisons, this generates a vast amount of audio data that is hard to search through.
The company’s CEO Nigel Cannings says the breakthrough came when he decided to see what would happen if he pointed a machine-learning system at the waveform of the voice data – its pattern of spikes and troughs – rather than the audio recording directly. It worked brilliantly.
Training his system on this visual representation let him harness powerful existing techniques designed for image classification. “I built this dialect classification system based on pictures of the human voice,” he says.
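To make the idea of a "picture" of a voice concrete, here is a minimal sketch that draws a waveform's spikes and troughs onto a small grid of pixels, the kind of array an image classifier could take as input. The article does not describe how Intelligent Voice actually renders its images, so the function name, the image size and the synthetic tone are assumptions, and the classifier itself is left out.

```python
import math


def waveform_to_image(samples, width=64, height=32):
    """Rasterise audio samples (expected in [-1, 1]) into a height x width grid of 0/1 pixels."""
    image = [[0] * width for _ in range(height)]
    chunk = max(1, len(samples) // width)  # samples drawn into each pixel column
    for x in range(width):
        window = samples[x * chunk:(x + 1) * chunk]
        if not window:
            break
        # Clamp to [-1, 1] so out-of-range samples cannot index outside the grid.
        hi = max(-1.0, min(1.0, max(window)))
        lo = max(-1.0, min(1.0, min(window)))
        # Amplitude +1 maps to the top row (0), -1 to the bottom row.
        top = round((1 - hi) / 2 * (height - 1))
        bottom = round((1 - lo) / 2 * (height - 1))
        for y in range(top, bottom + 1):
            image[y][x] = 1
    return image


# A synthetic 440 Hz tone stands in for a speech recording here.
rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(rate // 10)]
picture = waveform_to_image(tone)

# Print the picture as text: '#' marks where the waveform passes.
for row in picture:
    print("".join("#" if pixel else "." for pixel in row))
```

Once the audio is a grid of pixels, it can in principle be handed to the same kind of network used to sort photographs.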
The trick let his system create its own models for recognising speech patterns and accents that were as good as the best hand-coded ones around, models built by dialect and computer science experts. “In our first run we were getting something like 88 per cent accuracy,” says Intelligent Voice developer Neil Glackin.
The software then taught itself to transcribe speech by using recordings of US congressional hearings, matching up the audio with the transcripts.
Cheap as chips
The power of machines that can listen and watch is not that they can do better than human ears or eyes. In fact, they perform much worse – especially when confronted with data from the real world. Their power, as with all applications of computation, lies in speed, scale and the relative cheapness of processing.
“The cost would work out at 4 pence per hour of audio,” says Cannings. Human transcription costs can be 1000 times that. An automated transcription service is something Intelligent Voice is considering, but for now they are focusing on search.
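The arithmetic behind that comparison is easy to check. The sketch below uses the two figures quoted above, 4 pence per hour and a human cost of up to 1000 times that; the monthly volume of 5000 hours is a made-up example, since the article says only that the jail generates thousands of hours a month.

```python
MACHINE_PENCE_PER_HOUR = 4   # figure quoted by Cannings
HUMAN_MULTIPLIER = 1000      # "can be 1000 times that", so an upper-end figure


def cost_in_pounds(hours, pence_per_hour):
    """Cost of processing the given hours of audio, converted from pence to pounds."""
    return hours * pence_per_hour / 100


hours_per_month = 5000  # hypothetical volume, not from the article

machine_cost = cost_in_pounds(hours_per_month, MACHINE_PENCE_PER_HOUR)
human_cost = cost_in_pounds(hours_per_month, MACHINE_PENCE_PER_HOUR * HUMAN_MULTIPLIER)

print(f"Machine: £{machine_cost:,.2f} a month")
print(f"Human:   £{human_cost:,.2f} a month")
```

At those rates an hour of human transcription comes to as much as £40, against 4 pence for the machine.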
Most large tech companies are developing neural networks for understanding speech, opening up data sets that were previously difficult, or impossible, to search. Voice-activated virtual assistants like Google Now, Apple’s Siri, Amazon’s Echo and Microsoft’s Cortana must also make sense of the quirks of human speech.
And Facebook recently announced that it has repurposed its image-recognition software to generate maps. These maps are of lower quality than those produced by humans but, again, the advantage is speed. Facebook’s system can map the entire land surface of the planet – every road and house – in just a few hours.
This article appeared in print under the headline “They are listening”