At BBC newsHACK VIII last week, I was part of a WSJ team that put together Autocast, a small hands-free app that reads out the latest news. The idea was that those commuting by car could press play and have a stream of the latest articles read out to them, potentially ordered by their personal interests.
For our proof-of-concept (demo and source), we used the Factiva API[1] to grab a selection of articles and read them out using the HTML5 SpeechSynthesis API.
The HTML5 SpeechSynthesis API is a relatively new standard which does what it says on the tin: it allows text to be converted to (robotic) speech in-browser. It’s part of the Speech API, an open standard which also includes a SpeechRecognition API[2].
The API is, in theory, pretty straightforward: create a new instance of SpeechSynthesisUtterance and then read it out with speechSynthesis.speak():
var msg = new SpeechSynthesisUtterance( "Hello I am browser" );
window.speechSynthesis.speak( msg );
Unfortunately, even in the process of putting together our simple app at newsHACK I encountered a host of annoyances:
- The API is currently only supported in Chrome and Safari.
- Each browser/OS combo has a different set of available voices, and each has a different default voice.
- On iOS Safari, speech playback must be triggered by a user action (e.g. button press). This is presumably a feature rather than a bug.
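The first two points can be softened a little with feature detection and an explicit voice choice. What follows is only a minimal sketch: the helper names pickVoice and speakIfSupported, and the idea of passing the window object in as a parameter, are my own assumptions for illustration and are not taken from the Autocast source.

```javascript
// Hypothetical helper: choose a voice for a language tag, falling back to any
// voice that shares the primary language, then to null (browser default).
function pickVoice(voices, lang) {
  // Normalise case and underscores defensively before comparing tags.
  var norm = function (tag) { return String(tag).replace("_", "-").toLowerCase(); };
  var wanted = norm(lang);
  var primary = wanted.split("-")[0];
  var exact = voices.filter(function (v) { return norm(v.lang) === wanted; });
  if (exact.length) return exact[0];
  var loose = voices.filter(function (v) {
    return norm(v.lang).split("-")[0] === primary;
  });
  return loose.length ? loose[0] : null;
}

// Hypothetical wrapper: speak only when the API exists; returns false otherwise.
// `win` is normally the global window object.
function speakIfSupported(win, text, lang) {
  if (!win || !win.speechSynthesis || !win.SpeechSynthesisUtterance) return false;
  var msg = new win.SpeechSynthesisUtterance(text);
  var voice = pickVoice(win.speechSynthesis.getVoices(), lang);
  if (voice) msg.voice = voice; // otherwise leave the browser's default voice
  msg.lang = lang;
  win.speechSynthesis.speak(msg);
  return true;
}
```

In a page this would be called as speakIfSupported(window, "Hello I am browser", "en-GB"), and on iOS Safari from inside a tap or click handler, per the last point above. Note that getVoices() can return an empty list until the browser fires the voiceschanged event, in which case the sketch simply falls back to the default voice.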
Then there are inexplicable cross-browser bugs:
- On iOS Safari the speech rate is much faster than in any other browser.
- speechSynthesis.cancel() should clear any currently-playing or queued speech. However, on Safari v8.0.6 (OS X) it sometimes flat-out doesn’t work. (For example, try skipping forward and then pausing in the Autocast demo.)
- On Chrome on OS X, running speechSynthesis.speak() causes the new speech instance to be skipped. In Autocast, this means that skipping to the next item skips through almost everything in the queue (~100 articles).
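Skipping to the next item amounts to cancelling the current utterance and speaking the next one. One possible way to sidestep the runaway skipping is to keep the queue in your own code, hand the browser a single utterance at a time, and leave a short pause between cancel() and the next speak(). The sketch below assumes that approach; createPlayer, its parameters and the 250 ms delay are all hypothetical, and I have not verified that it cures the bugs listed above.

```javascript
// Hypothetical player: holds its own queue and gives the browser one
// utterance at a time. `schedule(fn, ms)` is injected so the delay can be
// swapped out; in a browser it would wrap setTimeout.
function createPlayer(synth, Utterance, schedule, delayMs) {
  var queue = [];
  var index = -1;
  var token = 0; // bumped on every move, so stale callbacks can be ignored

  function speakCurrent(mine) {
    if (index >= queue.length) return; // reached the end of the queue
    var msg = new Utterance(queue[index]);
    // Advance automatically only if nothing else has moved the position.
    msg.onend = function () { if (mine === token) next(false); };
    synth.speak(msg);
  }

  function next(userSkip) {
    index += 1;
    var mine = ++token;
    if (!userSkip) { speakCurrent(mine); return; }
    // Bump the token before cancelling, in case cancel() fires the old
    // utterance's end event, then speak after an (assumed) short pause.
    synth.cancel();
    schedule(function () { if (mine === token) speakCurrent(mine); }, delayMs);
  }

  return {
    load: function (texts) { queue = texts.slice(); index = -1; },
    play: function () { next(false); },
    skip: function () { next(true); },
    position: function () { return index; }
  };
}

// Possible browser usage (hypothetical):
// var player = createPlayer(window.speechSynthesis, window.SpeechSynthesisUtterance,
//   function (fn, ms) { setTimeout(fn, ms); }, 250);
// player.load(["First article", "Second article"]); player.play();
```

Because only the latest skip request survives the delay, pressing skip several times in quick succession should move forward by that many items and speak only the last one, rather than racing through the whole queue.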
Combined, these problems are a nightmare that cost me hours of valuable time during the hackathon and could easily scupper a production-level app. The SpeechSynthesis API is a neat idea, but until these issues are addressed it isn’t much more than that.
[1] Factiva is a Dow Jones product that provides access to articles from thousands of different publications. I’m not sure if the API is publicly available.
[2] A while back, I used the SpeechRecognition API in Kanji Voice Quiz. It worked OK at recognising set phrases (so could work well for voice commands), but I certainly wouldn’t attempt to use it for transcribing arbitrary speech.