If you want a winning voice app, implement SSML
by @Terren_in_VA



Terren Peterson


I’m an Alexa champion, and have been writing skills on the platform for the past year. These include the winner of the 2016 Internet of Voice Challenge. This blog offers ideas on how to improve the quality of your Alexa voice skills using an audio markup language called SSML.

Get better returns than 3%

The number of Alexa skills is proliferating, but developers are finding that these efforts aren’t building the active communities of users they intended. A recent study by VoiceLabs found that the average voice application sees only 3% of users coming back to the skill in the week after it’s enabled. That’s a poor showing, and it highlights that something is missing from the user experience. What’s missing is that developers aren’t yet using SSML, the Speech Synthesis Markup Language. Mastery of it will be fundamental to building great voice apps. A good comparison is CSS in a browser-rendered application. Take that away from web apps, and you would flatten the user experience.

More Voice Apps are Coming

The field of voice applications on Alexa is beginning to get crowded. This will get worse if the number of Alexa skills jumps by another 10x as it did in 2016. Similar growth in voice skills will come for the Google Home platform. Add new offerings by Microsoft and Apple, and building a brand for a single skill will be challenging. It will no longer be good enough to get something published. Differentiation will come from standout quality, so that users will even enable your app.


What’s Missing?

Features. The downside of making skills easy to publish is that many of them have limited functionality, with little to call users back for a second visit. Sure, it’s great that a developer can get something deployed in an hour, but achieving a ludicrously short development time is only one success metric to track. Let’s take advantage of these easy-to-use starter kits, but then push forward and add features to skills.

If we compare performance with other platforms, mobile applications have retention rates 3–4x higher than the current generation of voice skills. They also have a ten-year head start in developing patterns to enable features. I expect that voice applications will close the gap as functionality improves.

Let’s Start Building Better Voice Apps!

There are plenty of ways to get started in building more robust voice applications. Let’s start with the strength of this technology. The audio is streaming over a quality speaker, so let’s use that to our advantage. An overlooked aspect of Alexa is a markup language called SSML that gives detailed control over audio rendering. If you’re familiar with CSS for browser apps, SSML is the equivalent for audio apps. Imagine how poor the user experience for our favorite websites would be if they only rendered basic text. It’s the same for voice, which is why it is so important to learn SSML to further the audio experience.

The definitive guide for SSML is on the Amazon Alexa website, and in this post I’ve included some easy tips on how to get started. To use SSML in the response that a skill returns to Alexa, we need to expand the outputSpeech attribute like this.

"outputSpeech": {
    "type": "SSML",
    "ssml": "<speak>This output speech uses SSML.</speak>"
}

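As a rough sketch of where that attribute sits, the Python below builds a complete response body around it. The function name `build_ssml_response` is my own for illustration, and the surrounding `version` and `shouldEndSession` fields reflect my understanding of the Alexa response format rather than anything shown in this post, so check them against the official documentation.

```python
import json
from xml.sax.saxutils import escape


def build_ssml_response(text, end_session=True):
    """Wrap plain text in <speak> tags and return an Alexa-style response dict.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Escape &, < and > so plain text cannot break the SSML markup.
    ssml = "<speak>" + escape(text) + "</speak>"
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "SSML", "ssml": ssml},
            "shouldEndSession": end_session,
        },
    }


if __name__ == "__main__":
    response = build_ssml_response("This output speech uses SSML.")
    print(json.dumps(response, indent=4))
```

Escaping the text first matters because a stray ampersand in the spoken text would otherwise make the markup invalid.
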
Adding Pauses

The SSML attribute uses traditional markup tags similar to HTML. Here’s another example showing just the SSML attribute.

<speak>
    <p>This is a paragraph. There will be a pause after this.</p>
    <p>Followed by another paragraph.</p>
</speak>

Some skills read plenty of text, and begin to sound mechanical if there are no natural pauses. Using breaks between paragraphs can solve that, and the <p></p> markup provides a brief pause that can easily be controlled when scripting the response. At other times a longer break is necessary, which can be handled by setting a timed break (code below). The break can last up to ten seconds.

<speak>
    Okay, let's be mindful and take a deep breath.
    <break time="3s"/>
    Now don't we feel better?
</speak>
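
If a skill assembles its speech in code, the paragraph and timed-break tags above can be generated by small helpers. This is a minimal sketch: `paragraphs`, `timed_break` and `speak` are hypothetical names, and the ten second ceiling is simply the limit mentioned in this post.

```python
from xml.sax.saxutils import escape

MAX_BREAK_SECONDS = 10  # upper limit for a timed break, as stated in the post


def paragraphs(*texts):
    """Wrap each text in <p> tags so there is a brief pause after each one."""
    return "".join("<p>" + escape(t) + "</p>" for t in texts)


def timed_break(seconds):
    """Return a <break> tag for a whole number of seconds (1 to 10)."""
    if not isinstance(seconds, int) or not 1 <= seconds <= MAX_BREAK_SECONDS:
        raise ValueError("break must be a whole number from 1 to 10 seconds")
    return '<break time="{}s"/>'.format(seconds)


def speak(*parts):
    """Join already-formed SSML fragments inside a single <speak> element."""
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(speak(
        escape("Okay, let's be mindful and take a deep breath."),
        timed_break(3),
        escape("Now don't we feel better?"),
    ))
    print(speak(paragraphs("This is a paragraph.", "Followed by another paragraph.")))
```

Rejecting out-of-range breaks in the helper surfaces the mistake while developing, instead of leaving it to show up as a faulty response on the device.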

Adding Audio

Given that we are using a speaker, what could be better for the user experience than adding sounds or music? Once again, this requires some basic SSML. Here’s what it looks like.

<speak>    Welcome to Music Teacher.    

There are a few different lessons to take note of when creating these audio files.

  • The audio clip needs to be in MP3 format (MPEG version 2).
  • Bit rate should be 48 kbps.
  • The MP3 file must be hosted on an endpoint that uses HTTPS. The easiest way to comply is to host via an AWS S3 bucket.
  • The sample rate must be 16000 Hz.
  • The audio file cannot be longer than ninety (90) seconds.

A free tool to manipulate these settings is Audacity.

An easy way to get started is to add a brief audio cue at the beginning of the welcome message when starting the skill as in the example above. In my Music Teacher skill, I’ve included a short (3–5 second) clip of piano keys being played. It didn’t take much to record, format, and upload to S3, and is just one line of code.
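
For illustration, here is a minimal sketch of generating that one line with SSML's `<audio>` tag. The helper name `audio_tag` and the clip URL are hypothetical placeholders, not the ones used in Music Teacher, and where the clip sits relative to the greeting is my own choice. The helper only checks what a URL can reveal (HTTPS hosting and an `.mp3` extension); bit rate, sample rate and length still have to be verified in a tool such as Audacity.

```python
from urllib.parse import urlparse
from xml.sax.saxutils import quoteattr

# Hypothetical placeholder URL; replace it with your own HTTPS-hosted clip.
CLIP_URL = "https://example.com/audio/piano-intro.mp3"


def audio_tag(url):
    """Return an SSML <audio> tag after basic checks on the URL."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError("audio must be hosted on an HTTPS endpoint")
    if not parsed.path.lower().endswith(".mp3"):
        raise ValueError("audio clip must be an MP3 file")
    # quoteattr adds the surrounding quotes and escapes characters such as &.
    return "<audio src={}/>".format(quoteattr(url))


def welcome_message():
    """Greeting followed by a short audio cue."""
    return "<speak>Welcome to Music Teacher. " + audio_tag(CLIP_URL) + "</speak>"


if __name__ == "__main__":
    print(welcome_message())
```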

Overriding Default Speech

Other markup tags override the default audio experience within Alexa. Any guesses on what happens when you put this response out?

<speak>    Can you call me at 

The voice response by Alexa is “Eight million, six hundred seventy five thousand, three hundred and nine.” Not a very good user experience if we’re trying to call out a phone number. Let’s add markup to it using the “say-as” tag, like this.

<speak>    <say-as interpret-as="digits">Can you call me at 

The audio response is what we’re looking for: “eight-six-seven-five-three-zero-nine”.


There are many variations of what can be overridden, including date formatting, numerics, pronunciation, etc. An easy way to think of SSML is that it’s the styling, similar to CSS for browsers.
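
The say-as override can be sketched in code as well. The number 8675309 here is inferred from the spoken responses quoted above, and `say_as` is a hypothetical helper. The allowed set is only a small subset of `interpret-as` values that I believe Alexa accepts, so treat it as an assumption and confirm against the SSML reference.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumed subset of interpret-as values; the full list is in Amazon's SSML reference.
ALLOWED_INTERPRETATIONS = {"digits", "cardinal", "ordinal", "characters", "telephone"}


def say_as(text, interpret_as):
    """Wrap text in a <say-as> tag that overrides how it is read aloud."""
    if interpret_as not in ALLOWED_INTERPRETATIONS:
        raise ValueError("unsupported interpret-as value: " + interpret_as)
    return "<say-as interpret-as={}>{}</say-as>".format(
        quoteattr(interpret_as), escape(text)
    )


def phone_prompt(number):
    """Only the number is wrapped, so the surrounding words are read normally."""
    return "<speak>Can you call me at " + say_as(number, "digits") + "</speak>"


if __name__ == "__main__":
    print(phone_prompt("8675309"))
```

Keeping the tag around just the number leaves the rest of the sentence to Alexa's default reading.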

Looking for what’s possible?

Check out the top of the skill leaderboard that covers the most popular skills. I’d estimate that 80% of these are using SSML. They’re getting to the top of the pack because far more than 3% of their users are returning.


For example, the Jeopardy skill is consistently at the top of the “Top enabled skills” category as well as “Customer favorites this week”. Clearly there are brand recognition benefits that the show has going for it, but a simple way that they are tying the channels together is by including the same audio cues (buzzer & chime) that are on the broadcast version. As demonstrated above, that’s a handful of lines of code and a few MP3 recordings. Take those away, and it’s fairly flat content, and not differentiated from the hundreds of other trivia skills despite the huge brand.


If you haven’t tried the Wayne Investigation skill, enable it today. It also uses plenty of MP3 sound integration, and the user experience is fantastic. While it’s been out for six months, it stays on the leaderboard based on the user experience, and carries a 4.5 star rating. The gameplay itself is not complex, and similar gameplay can be built with the Interactive Game Tool Framework.

A recently introduced top app is The Tonight Show skill, and it’s a similar scenario. Rather than the standard Alexa voice reading over the prior night’s monologue, the recording of Jimmy Fallon’s voice is embedded via an MP3 file using SSML as described above. +1 on user experience!

Can Independent Developers do this?

This isn’t just for entertainment and media outlets, and can be done by independent developers. A skill that I recently published was Guitar Teacher.


In researching other skills instructing how to play the guitar, I noticed that most were being built with just the Alexa voice. That limits the functionality to using words to articulate how to place fingers. It’s a huge differentiator to include an actual recording of how the guitar sounds when you play, and makes it memorable for the user to come back again.


Let’s continue to celebrate the number of Alexa skills that have been published, but after the initial MVP, invest a few cycles learning SSML and using the full capabilities of the platform. It’s the path to break out of the 3% slump!
