The HTML5 SpeechSynthesis API is rubbish

At BBC newsHACK VIII last week, I was part of a WSJ team that put together Autocast, a small hands-free app that reads out the latest news. The idea was that those commuting by car could press play and have a stream of the latest articles read out to them, potentially ordered by their personal interests.

For our proof-of-concept (demo and source), we used the Factiva API¹ to grab a selection of articles and read them out using the HTML5 SpeechSynthesis API.

Autocast running on Jack's iPhone 6

The HTML5 SpeechSynthesis API is a relatively new standard which does what it says on the tin: it allows text to be converted to (robotic) speech in-browser. It’s part of the Web Speech API, an open specification which also includes a SpeechRecognition API².

The API is, in theory, pretty straightforward: create a new instance of SpeechSynthesisUtterance and then read it out with speechSynthesis.speak():

var msg = new SpeechSynthesisUtterance( "Hello I am browser" );
window.speechSynthesis.speak( msg );

Unfortunately, even in the process of putting together our simple app at newsHACK I encountered a host of annoyances:

The API is currently only supported in Chrome and Safari.
Each browser/OS combo has a different set of available voices, and each has a different default voice.
On iOS Safari, speech playback must be triggered by a user action (e.g. button press). This is presumably a feature rather than a bug.
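
The first two annoyances can at least be softened. The sketch below is illustrative rather than what Autocast does: it checks for support before speaking and picks a voice by language instead of trusting the browser's default. The names pickVoice and speakIfSupported are made up for this example, and "en-GB" is just an assumed preference.

```javascript
// Choose a voice by language rather than relying on the browser default.
// `voices` is the array returned by speechSynthesis.getVoices();
// `lang` is a language tag such as "en-GB".
function pickVoice(voices, lang) {
  var prefix = lang.split("-")[0];
  // Prefer an exact match ("en-GB"), then fall back to any voice
  // sharing the base language ("en-US", "en_AU", ...).
  var exact = voices.filter(function (v) { return v.lang === lang; });
  var loose = voices.filter(function (v) {
    return v.lang.split(/[-_]/)[0] === prefix;
  });
  return exact[0] || loose[0] || null;
}

// Speak `text` if the API exists; return false so the caller can
// show a fallback (e.g. plain text) when it does not.
function speakIfSupported(text, lang) {
  if (typeof window === "undefined" || !("speechSynthesis" in window)) {
    return false;
  }
  var msg = new SpeechSynthesisUtterance(text);
  var voice = pickVoice(window.speechSynthesis.getVoices(), lang);
  if (voice) {
    msg.voice = voice;
  }
  msg.lang = lang;
  window.speechSynthesis.speak(msg);
  return true;
}
```

On iOS Safari, speakIfSupported( "Hello I am browser", "en-GB" ) would still need to be called from inside a click or touch handler. Note also that getVoices() can return an empty list in some browsers until the voiceschanged event has fired, in which case this sketch simply falls back to the default voice.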

Then there are inexplicable cross-browser bugs:

On iOS Safari the speech rate is much faster than in any other browser.
speechSynthesis.cancel() should clear any currently-playing or queued speech. However, on Safari v8.0.6 (OS X) it sometimes flat-out doesn’t work. (For example, try skipping forward and then pausing in the Autocast demo.)
On Chrome on OS X, running speechSynthesis.cancel() before running speechSynthesis.speak() causes the new speech instance to be skipped. In Autocast, this means that skipping to the next item skips through almost everything in the queue (~100 articles).
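
A workaround worth trying for the cancel-then-speak problem is to defer the speak() call slightly, so the two calls don't run back to back. This is a sketch under assumptions, not a verified fix for the browser versions above: cancelThenSpeak is a made-up helper, the 250 ms default is a guess to be tuned, and the schedule parameter exists only so the timer can be swapped out.

```javascript
// Cancel current and queued speech, then speak after a short pause.
// `synth` is window.speechSynthesis (or anything with cancel/speak).
// `delayMs` defaults to 250, an arbitrary value rather than a measured one.
// `schedule` defaults to setTimeout.
function cancelThenSpeak(synth, utterance, delayMs, schedule) {
  schedule = schedule || setTimeout;
  synth.cancel();
  return schedule(function () {
    synth.speak(utterance);
  }, delayMs === undefined ? 250 : delayMs);
}
```

In an app like Autocast, a "skip" button handler might call cancelThenSpeak( window.speechSynthesis, nextUtterance ), where nextUtterance is the SpeechSynthesisUtterance for the next article. The cost is a short silence on every skip, and it does nothing for the cases where cancel() itself fails.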

Combined, these problems are a nightmare that cost me hours of valuable time during the hackathon and could easily scupper a production-level app. The SpeechSynthesis API is a neat idea, but until these issues are addressed it isn’t much more than that.

¹ Factiva is a Dow Jones product that provides access to articles from thousands of different publications. I’m not sure if the API is publicly available.
² A while back, I used the SpeechRecognition API in Kanji Voice Quiz. It worked OK at recognising set phrases, so it could work well for voice commands, but I certainly wouldn’t attempt to use it for transcribing arbitrary speech.

Published June 7th, 2015.

