The Web Speech API is still experimental and may not be available, or not fully functional, in every user agent. A common approach for modern websites and web applications is to progressively enhance the user experience, i.e. to use whatever is available to the user. This design principle is known as progressive enhancement. If you build a web application or website of your own, I strongly recommend starting with a smaller core feature set that supports a wide range of user agents, then optimizing and enhancing the user experience with newer technology while keeping a fallback in place.
Back to the Web Speech API: it provides two working modes, speech synthesis (text-to-speech) and speech recognition. We'll skip the first and build a simple, handy extension for eyeson.team that controls a video conference web GUI with speech input.
Speak My Language
When it comes to speech synthesis, it is essential that the language being dealt with is known beforehand. In terms of accessibility, a document-level language definition is essential, as assistive technologies using text-to-speech may otherwise fall back on the user's default language. Fortunately, this is as simple as setting the lang attribute on your opening html tag: you don't want a visitor from France listening to your English content read with a French pronunciation.
In our case we don't want to concentrate too much on the content, but on input via the user's voice. The API provides a configuration option to set the user's language; if none is provided, it will use the HTML document's lang attribute. In most cases this will be fine, but there's definitely room for improvement here. A British user dealing with the language setting en-GB instead of en-US will get surprisingly better matches and therefore a better user experience.
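As a rough sketch, language selection could look like the following; the helper name resolveLang is my own invention, not part of the API:

```javascript
// Resolve the language used for speech recognition: prefer an explicit
// setting, then the document's lang attribute, then a generic fallback.
function resolveLang(explicit, docLang) {
  return explicit || docLang || 'en-US';
}

// Browser usage (sketch, assuming the prefixed Chrome implementation):
// const recognition = new webkitSpeechRecognition();
// recognition.lang = resolveLang('en-GB', document.documentElement.lang);
```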
Interim Results, Grammar and other Options
Although speech recognition is quite a complex topic, the current specification has a short but powerful set of properties. Let's take a quick look at some of the options provided.
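A minimal configuration sketch, assuming a recognition instance is already available; the property names come from the specification, the chosen values mirror what this article's extension uses:

```javascript
// Apply the recognition options discussed in this section.
function configureRecognition(recognition) {
  recognition.continuous = true;      // keep listening after each result
  recognition.interimResults = false; // deliver final results only
  recognition.maxAlternatives = 1;    // a single transcript per result
  return recognition;
}
```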
When to Use
With popular voice assistants like Alexa, Siri, or OK Google, your users might already be used to speech control. This technology can significantly improve accessibility for a wide range of your users. Additionally, there are scenarios where simply no other input mechanism is available; think of remote support via headset, where both hands need to be free to get the job done.
If you do provide voice control, make sure your users are aware of it and of which commands are available.
Our extension listens for the voice command "ninja" and emits a toggle_video event in order to mute our camera and make it active again.
As described, we try to use any available SpeechRecognition implementation inside the initialization method, or we fall back to not offering this feature at all. If we successfully detect the API, we configure it to skip interim results, deliver a single alternative per result, and recognize continuously.
We use onresult to attach an input handler and start listening. Within a received result we're only interested in the transcript produced by the recognition: a received "ninja" gets our attention, and the input handler receives the video toggle event, switching between the muted and unmuted states.
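Putting the pieces together, a sketch of the whole extension could look like this. The handleCommand callback and the 'toggle_video' event name mirror the text above; matchesCommand and initVoiceControl are names I made up for this sketch, so adapt them to your setup:

```javascript
const COMMAND = 'ninja';

// Pure helper: does a recognised transcript contain our trigger word?
function matchesCommand(transcript, command) {
  return transcript.trim().toLowerCase().includes(command);
}

function initVoiceControl(handleCommand) {
  // Feature detection: bail out silently if the API is unavailable,
  // keeping the rest of the application fully functional.
  const Recognition =
    window.SpeechRecognition || window.webkitSpeechRecognition;
  if (!Recognition) {
    return null;
  }
  const recognition = new Recognition();
  recognition.continuous = true;      // keep listening after each result
  recognition.interimResults = false; // final results only
  recognition.maxAlternatives = 1;    // one transcript per result

  recognition.onresult = (event) => {
    // Only the newest result is of interest.
    const { transcript } = event.results[event.results.length - 1][0];
    if (matchesCommand(transcript, COMMAND)) {
      handleCommand('toggle_video');
    }
  };
  recognition.start();
  return recognition;
}
```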
Note that the speech recognition web API is fully available only in Google Chrome. A specification draft has been created and published at the W3C. The Firefox and Opera web browsers ship with support as well, but require manually changing configuration flags in order to enable the feature.
Although speech recognition might not be fully supported in every user agent, it may well be available to users who want, or even need, to use it. This powerful yet easy-to-handle control mechanism might give your website or web application a head start on the upcoming trend of interacting with software by voice.