Most people we know have had this kind of dialog with interactive voice response (IVR) systems:
IVR: “Please say or input your order number.”
Me: “One-one-three-four-seven-two-one-nine.”
IVR: “I’m sorry. I didn’t understand. Please say or input your order number.”
Me: “One-one-three-four-seven-two-one-nine.”
IVR: “I’m sorry. I didn’t understand…” (the next sound is the dial tone after I hang up)
Does a call center really need to be this painful? We recently caught up with Dr. David Nahamoo, IBM’s speech technology guru, to hear about what he calls “super-human speech recognition.” No, he’s not talking about Spidey or Superman, but rather a project meant to substantially improve the quality, dialog management, and usability of speech technology by the end of the decade. The targets are dictation, call centers, cars, and a broad set of other applications with embedded computing power.
- “Can you understand me now?” How well the system recognizes your voice establishes value, credibility, and caller willingness to continue interacting with a machine. Nahamoo says that his team has been reducing error rates by 15 to 25% per year, but there’s still a long way to go before the technology reaches human levels of accuracy. One of his goals is to surpass a human at real-time transcription of a lecture, phone conversation, or broadcast, and he would like to do that for 50 languages with the same computer.
- Make dialogs more flexible. Nahamoo said that when we encounter an IVR “we often forget that we’re not interacting with a reasoning person. The IVR system has very little ability to adjust itself to caller nuance, something a human agent does without much difficulty.” IBM and its competitors in the space want to create a more powerful, more flexible dialog manager that could easily adapt to less structured interactions. Meanwhile, today’s more successful applications have been designed to be very directed, taking callers through a hierarchy of possibilities by essentially stringing them along one question at a time (hopefully more effectively than the Eliza Computer Therapist simulation).
- Simplify the tools. Before fully speech-enabled applications become ubiquitous, Nahamoo says that the technology must cross a simplicity threshold that would open it up to more developers. The speech recognition community converged around VoiceXML about 5 years ago, thereby abandoning proprietary interfaces. Nahamoo feels that the next step is for providers like IBM to encapsulate design principles and behaviors in templates. Sound familiar? Just as client/server and web development got their own tools (think Visual Basic and Dreamweaver, respectively), we’d say that speech needs its own GUI-based development environment.
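To make the “very directed” design from the second point a bit more concrete, here is a rough Python sketch of a dialog that walks callers through a hierarchy one question at a time. It is only an illustration under our own assumptions, not IBM’s dialog manager or any real IVR product: the `Node` class, the `run_dialog` function, the `menu` tree, and every prompt string are hypothetical, and typed strings stand in for recognized speech.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

# Hypothetical canned responses, invented for this sketch.
SORRY = "I'm sorry. I didn't understand."
TRANSFER = "Transferring you to an agent."


@dataclass
class Node:
    """One question in the hierarchy; a node with no choices is a leaf."""
    prompt: str
    choices: Dict[str, "Node"] = field(default_factory=dict)


def run_dialog(root: Node, utterances: Iterable[str], max_attempts: int = 3) -> List[str]:
    """Walk the tree one question at a time and return what the system said.

    `utterances` stands in for the recognizer's output, one string per caller turn.
    """
    transcript: List[str] = []
    heard = iter(utterances)
    node = root
    while node.choices:
        for _ in range(max_attempts):
            transcript.append(node.prompt)
            # A missing utterance (caller silence) is treated as an empty string.
            key = next(heard, "").strip().lower()
            if key in node.choices:
                node = node.choices[key]
                break
            transcript.append(SORRY)
        else:
            # Every attempt failed, so give up instead of re-prompting forever.
            transcript.append(TRANSFER)
            return transcript
    transcript.append(node.prompt)
    return transcript


# A hypothetical two-level menu.
menu = Node("Say 'orders' or 'billing'.", {
    "orders": Node("Say 'status' or 'cancel'.", {
        "status": Node("Your order has shipped."),
        "cancel": Node("Your order is cancelled."),
    }),
    "billing": Node("Connecting you to billing."),
})

for line in run_dialog(menu, ["orders", "status"]):
    print(line)
```

The rigidity is easy to see: an answer like “refund please” matches no branch, so the caller just hears the same prompt again. That gap is roughly what a more flexible dialog manager would need to close.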
What will we see in the coming year? On the usage side: Look for more re-use of speech-enabled components in companies with successful call center applications. Some GUI-based development tools will trickle out. Integration with machine translation will continue evolving. Figure on some innovative Web 2.0 mash-ups à la GOOG-411, structured integrations like speech-to-sign-language, and in-dash, hands-free operations for more cars. For the technology itself, you can expect another 20% improvement in quality. Maybe, just maybe, you’ll be 2% less frustrated the next time you’re asked to “say or input your order number” or when your car turns on the air conditioner instead of switching to track 3, disc 2 in your CD player.