Chasing simplicity: Playing audio files with Echo using SSML and Python

Amazon offers the option of playing audio files via Echo using SSML. To quote:

" in some cases you may want additional control over how Alexa generates the speech from the text in your response. For example, you may want a longer pause within the speech, or you may want a string of digits read back as a standard telephone number. The Alexa Skills Kit provides this type of control with Speech Synthesis Markup Language (SSML) support."

I looked around for examples of how this is done and ran into a brick wall. Here's what I learnt about playing an audio file on Echo, using Python as my language.

The <speak> tag: All SSML documents (text) need to be enclosed within the speak tag.
The <audio> tag: Lets you provide the URL of an audio file. There are some guidelines around the hosting and characteristics of the file you provide:

The MP3 must be hosted at an Internet-accessible HTTPS endpoint. (Best bet? Use S3.)
No sensitive or customer specific information
Sample rate of 16000 Hz, bit rate of 48 kbps
No longer than 90 seconds
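
Some of these points can be checked before the URL ever reaches Alexa. Here is a minimal sketch of such a check. The function name check_audio_url is my own, and the ".mp3" extension test is only a heuristic I am assuming, not one of Amazon's rules. Sample rate, bit rate and duration cannot be seen from the URL and are not checked.

```python
from urllib.parse import urlparse


def check_audio_url(url):
    """Return a list of problems found with an audio URL (empty if none).

    Only what is visible from the URL itself is checked; sample rate,
    bit rate and duration require inspecting the file.
    """
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != 'https':
        problems.append('URL must use HTTPS')
    if not parsed.netloc:
        problems.append('URL has no host')
    # Heuristic (my assumption): an MP3 file usually ends in ".mp3".
    if not parsed.path.lower().endswith('.mp3'):
        problems.append('file should be an MP3')
    return problems
```

An empty list means nothing suspicious was found in the URL; it does not prove that the file itself meets the audio requirements.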

How do we address the requirements around characteristics? Thankfully, Amazon even identifies the tools and commands with which you can achieve this. Two options (among the many available):

Command line: FFmpeg.

The following command converts the provided <input-file> to an MP3 file that works with the audio tag.

ffmpeg -i <input-file> -ac 2 -codec:a libmp3lame -b:a 48k -ar 16000 <output-file.mp3>

GUI: Audacity. (This needs the LAME library, available at: http://lame.buanzo.org/#lamewindl)

Open the file to convert.
Set the Project Rate in the lower-left corner to 16000.
Click File > Export Audio and change the Save as type to MP3 Files.
Set the Bit Rate Mode to Constant and Quality to 48 kbps.
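
If you have several files to convert, the FFmpeg command above can also be driven from Python. This is a minimal sketch, assuming ffmpeg is installed and on your PATH. build_ffmpeg_command and convert_for_alexa are hypothetical names of my own; the flags are exactly those of the command shown earlier.

```python
import subprocess


def build_ffmpeg_command(input_file, output_file):
    """Build the argument list for the FFmpeg command shown earlier."""
    return [
        'ffmpeg',
        '-i', input_file,          # source audio file
        '-ac', '2',                # two audio channels
        '-codec:a', 'libmp3lame',  # encode as MP3 using LAME
        '-b:a', '48k',             # bit rate of 48 kbps
        '-ar', '16000',            # sample rate of 16000 Hz
        output_file,
    ]


def convert_for_alexa(input_file, output_file):
    """Run FFmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.check_call(build_ffmpeg_command(input_file, output_file))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.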

What code changes are needed?

In the outputSpeech attribute:

set the type to 'SSML'
use the 'ssml' key for the marked-up text (instead of 'text')

So, in effect, if you're used to seeing:

def build_speechlet_response(title, output, reprompt_text, should_end_session):
    return {
        # What Alexa says in response
        'outputSpeech': {
            'type': 'PlainText',
            'text': output
        },
        # What is shown on the card in the Alexa app
        'card': {
            'type': 'Simple',
            'title': title,
            'content': output
        },
        # What Alexa says if the user does not reply
        'reprompt': {
            'outputSpeech': {
                'type': 'PlainText',
                'text': reprompt_text
            }
        },
        'shouldEndSession': should_end_session
    }

your function will now look something like:

def build_speechlet_response(title, output, reprompt_text, should_end_session):
    return {
        # output must now be an SSML string wrapped in <speak> tags
        'outputSpeech': {
            'type': 'SSML',
            'ssml': output
        },
        'card': {
            'type': 'Simple',
            'title': title,
            'content': output
        },
        # The reprompt is unchanged and still uses plain text
        'reprompt': {
            'outputSpeech': {
                'type': 'PlainText',
                'text': reprompt_text
            }
        },
        'shouldEndSession': should_end_session
    }
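
One side effect of the function above is that the card receives the same string as the speech. As far as I can tell, card content is shown as plain text, so the SSML markup would appear verbatim on the card. Here is a minimal sketch of a variant that takes the card text separately. The function name and the card_text parameter are my own additions, not part of Amazon's sample.

```python
def build_ssml_speechlet_response(title, ssml, card_text, reprompt_text,
                                  should_end_session):
    """Hypothetical variant: SSML for the speech, plain text for the card."""
    return {
        'outputSpeech': {
            'type': 'SSML',
            'ssml': ssml
        },
        'card': {
            'type': 'Simple',
            'title': title,
            'content': card_text  # plain text, no markup
        },
        'reprompt': {
            'outputSpeech': {
                'type': 'PlainText',
                'text': reprompt_text
            }
        },
        'shouldEndSession': should_end_session
    }
```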

Here is an example of valid output. Note that it is enclosed within <speak> and </speak> tags; replace the bucket name and file name with your own.

'<speak>This output speech uses SSML.<audio src="https://s3-us-west-2.amazonaws.com/<bucket name>/<file name.mp3>" />.</speak>'

When this is returned in outputSpeech, Echo will:

read out, in Alexa's normal voice: "This output speech uses SSML."
and then play the audio file the URL points to.
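
To avoid assembling that string by hand, here is a minimal sketch of a helper that builds it. build_audio_ssml is a name I made up. The escaping uses the standard library's xml.sax.saxutils, on the assumption that the spoken text may contain characters such as & that are not valid in SSML as-is.

```python
from xml.sax.saxutils import escape, quoteattr


def build_audio_ssml(spoken_text, audio_url):
    """Wrap some spoken text followed by an audio clip in a <speak> document."""
    # escape() handles &, < and > in the text; quoteattr() escapes the URL
    # and adds the surrounding quotes for the src attribute.
    return '<speak>{}<audio src={} /></speak>'.format(
        escape(spoken_text), quoteattr(audio_url))
```

The returned string can be passed as output to the SSML version of build_speechlet_response, for example build_audio_ssml('This output speech uses SSML.', 'https://example.com/clip.mp3'), where the example.com URL is only a placeholder.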

Thursday, January 19, 2017
