The HTML5 SpeechSynthesis API is rubbish

At BBC newsHACK VIII last week, I was part of a WSJ team that put together Autocast, a small hands-free app that reads out the latest news. The idea was that those commuting by car could press play and have a stream of the latest articles read out to them, potentially ordered by their personal interests.

For our proof-of-concept (demo and source), we used the Factiva API [1] to grab a selection of articles and read them out using the HTML5 SpeechSynthesis API.

Autocast running on Jack's iPhone 6

The HTML5 SpeechSynthesis API is a relatively new standard which does what it says on the tin: it allows text to be converted to (robotic) speech in-browser. It’s part of the Web Speech API, an open standard which also includes a SpeechRecognition API [2].

The API is, in theory, pretty straightforward: create a new instance of SpeechSynthesisUtterance and then read it out with speechSynthesis.speak():

var msg = new SpeechSynthesisUtterance( "Hello I am browser" );
window.speechSynthesis.speak( msg );
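
In practice you usually want to know when the speech has finished or failed, and whether the browser supports the API at all. A minimal sketch of how that might look, wrapping the same two calls in a Promise. The speakText name and its env parameter are hypothetical (env only exists so the function can be tried outside a browser); lang, rate, onend and onerror are standard properties of SpeechSynthesisUtterance. This is not the code Autocast used.

```javascript
// Sketch only: speakText and env are illustrative names, not part of the API.
function speakText(text, options = {}, env = globalThis) {
  return new Promise((resolve, reject) => {
    // Feature-detect first: not every browser exposes the API.
    if (!env.speechSynthesis || !env.SpeechSynthesisUtterance) {
      reject(new Error("SpeechSynthesis is not supported here"));
      return;
    }

    const msg = new env.SpeechSynthesisUtterance(text);
    if (options.lang) msg.lang = options.lang; // e.g. "en-GB"
    if (options.rate) msg.rate = options.rate; // 1 is the default speed

    // Settle the Promise when the browser reports the outcome.
    msg.onend = () => resolve();
    msg.onerror = (event) => reject(new Error("Speech failed: " + event.error));

    env.speechSynthesis.speak(msg);
  });
}
```

In a browser, speakText( "Hello I am browser", { lang: "en-GB" } ).then( ... ) would run the callback once the utterance has been read out, assuming the browser actually fires the end event.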

Unfortunately, even in the process of putting together our simple app at newsHACK I encountered a host of annoyances:

Then there are inexplicable cross-browser bugs:

Combined, these problems are a nightmare that cost me hours of valuable time during the hackathon and could easily scupper a production-level app. The SpeechSynthesis API is a neat idea, but until these issues are addressed it isn’t much more than that.

  1. Factiva is a Dow Jones product that provides access to articles from thousands of different publications. I’m not sure if the API is publicly available.
  2. A while back, I used the SpeechRecognition API in Kanji Voice Quiz. It worked OK with recognising set phrases — so could work well for voice commands — but I certainly wouldn’t attempt to use it for transcribing arbitrary speech.
