Thursday, January 19, 2017

Playing audio files with Echo using SSML and Python


Amazon offers the option of playing audio files via Echo using SSML. To quote:

" in some cases you may want additional control over how Alexa generates the speech from the text in your response. For example, you may want a longer pause within the speech, or you may want a string of digits read back as a standard telephone number. The Alexa Skills Kit provides this type of control with Speech Synthesis Markup Language (SSML) support."

I looked around for examples of how this is done and ran into a brick wall. Here's what I learnt about playing an audio file on Echo, using Python as my language.

  • The <speak> tag: All SSML text needs to be wrapped in the <speak> tag.
  • The <audio> tag: Lets you provide the URL of an audio file. There are some guidelines on the hosting and characteristics of the file you provide:
    • The MP3 must be hosted at an Internet-accessible HTTPS endpoint. (Best bet: use S3.)
    • No sensitive or customer specific information
    • Sample rate of 16000 Hz, bit rate of 48 kbps
    • No longer than 90 seconds
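
To make the two tags concrete, here is a minimal sketch (Python 3) that assembles an SSML string from some spoken text and an audio URL. The function name build_audio_ssml and the example.com URL are made up for illustration; only the HTTPS rule from the guidelines above is checked, so the sample rate, bit rate and length still have to be handled when the file is prepared.

```python
from urllib.parse import urlparse
from xml.sax.saxutils import escape, quoteattr


def build_audio_ssml(intro_text, audio_url):
    """Return SSML that speaks intro_text and then plays audio_url.

    Hypothetical helper, not part of the Alexa Skills Kit.
    """
    # The guidelines require an HTTPS endpoint, so reject anything else early.
    if urlparse(audio_url).scheme != "https":
        raise ValueError("audio URL must use HTTPS: %r" % audio_url)
    # escape() protects &, < and > in the spoken text;
    # quoteattr() escapes the URL and wraps it in quotes for use as an attribute.
    return "<speak>{}<audio src={} /></speak>".format(
        escape(intro_text), quoteattr(audio_url)
    )


if __name__ == "__main__":
    print(build_audio_ssml("Here is your clip.", "https://example.com/clip.mp3"))
```

Escaping the text matters because SSML is XML: a stray "&" or "<" in the spoken text would otherwise make the whole document invalid.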

How do we meet the requirements on the file's characteristics? Thankfully, Amazon even identifies tools and commands for this. Two options (among the many available):

  • Command line: FFmpeg
    • The following command converts the provided <input-file> to an MP3 file that works with the audio tag:
      ffmpeg -i <input-file> -ac 2 -codec:a libmp3lame -b:a 48k -ar 16000 <output-file.mp3>
  • GUI: Audacity. (This needs the LAME library, available at: http://lame.buanzo.org/#lamewindl)
    • Open the file to convert.
    • Set the Project Rate in the lower-left corner to 16000.
    • Click File > Export Audio and change the Save as type to MP3 Files.
    • Set the Bit Rate Mode to Constant and Quality to 48 kbps.
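
If you have several files to convert, the FFmpeg command above can be wrapped in a few lines of Python. This is only a sketch: it assumes ffmpeg is installed and on your PATH, and the function names and file names are placeholders of my own.

```python
import subprocess


def ffmpeg_command(input_path, output_path):
    """Build the argument list for the FFmpeg command shown above."""
    return [
        "ffmpeg",
        "-i", input_path,         # source audio file
        "-ac", "2",               # two audio channels
        "-codec:a", "libmp3lame", # encode as MP3 using LAME
        "-b:a", "48k",            # 48 kbps bit rate
        "-ar", "16000",           # 16000 Hz sample rate
        output_path,
    ]


def convert_for_alexa(input_path, output_path):
    """Run FFmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_command(input_path, output_path), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so this works without FFmpeg.
    print(" ".join(ffmpeg_command("input.wav", "output.mp3")))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.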
What code changes are needed?
  • In the outputSpeech attribute:
    • set the type to SSML
    • put the marked-up text under the 'ssml' key (instead of 'text')
So, in effect, if you're used to seeing:

def build_speechlet_response(title, output, reprompt_text, should_end_session):
    return {
        'outputSpeech': {
            'type': 'PlainText',
            'text': output
        },
        'card': {
            'type': 'Simple',
            'title': title,
            'content':  output
        },
        'reprompt': {
            'outputSpeech': {
                'type': 'PlainText',
                'text': reprompt_text
            }
        },
        'shouldEndSession': should_end_session
    }

Your function will now look something like:

def build_speechlet_response(title, output, reprompt_text, should_end_session):
    return {
        'outputSpeech': {
            'type': 'SSML',
            'ssml': output
        },
        'card': {
            'type': 'Simple',
            'title': title,
            'content':  output
        },
        'reprompt': {
            'outputSpeech': {
                'type': 'PlainText',
                'text': reprompt_text
            }
        },
        'shouldEndSession': should_end_session
    }
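
One thing to watch for: the function above also passes output to the card's content, so the card text would include the SSML markup. As far as I can tell, a Simple card is shown as plain text, so you may want to strip the tags first. Here is a rough sketch of one way to do that; ssml_to_card_text is a name I made up, and the regular expression is a simple approximation rather than a full XML parser.

```python
import re
from xml.sax.saxutils import unescape


def ssml_to_card_text(ssml):
    """Drop SSML tags so the remaining text can be shown on a card."""
    text = re.sub(r"<[^>]+>", " ", ssml)  # replace every tag with a space
    text = unescape(text)                 # turn &amp; &lt; &gt; back into characters
    return " ".join(text.split())         # collapse runs of whitespace


if __name__ == "__main__":
    sample = '<speak>Hello there.<audio src="https://example.com/clip.mp3" /></speak>'
    print(ssml_to_card_text(sample))
```

With this in place, the card entry could use 'content': ssml_to_card_text(output) while outputSpeech keeps the full SSML.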

Here is an example of valid output (note that it is enclosed within the <speak> </speak> tags; replace the bucket name and file name appropriately):

'<speak>This output speech uses SSML.<audio src="https://s3-us-west-2.amazonaws.com/<bucket name>/<file name.mp3>" />.</speak>'

When returned in outputSpeech, Echo will:
  • read out, in Alexa's normal voice: "This output speech uses SSML."
  • and then play the audio file the URL points to.
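
Since SSML is XML, a quick local check before deploying can catch simple mistakes such as a missing closing tag or a placeholder left in the URL. The sketch below uses Python's standard xml.etree.ElementTree; check_ssml is a hypothetical helper and it only tests the points mentioned in this post (a <speak> root and HTTPS audio URLs), not everything Alexa itself validates.

```python
import xml.etree.ElementTree as ET


def check_ssml(ssml):
    """Return a list of problems found in an SSML string (empty if none)."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as err:
        return ["not well-formed XML: {}".format(err)]
    problems = []
    # Everything must be wrapped in a single <speak> element.
    if root.tag != "speak":
        problems.append("root element must be <speak>")
    # Every <audio> tag needs an HTTPS source.
    for audio in root.iter("audio"):
        src = audio.get("src", "")
        if not src.startswith("https://"):
            problems.append("audio src is not HTTPS: {!r}".format(src))
    return problems


if __name__ == "__main__":
    ssml = ('<speak>This output speech uses SSML.'
            '<audio src="https://example.com/clip.mp3" />.</speak>')
    print(check_ssml(ssml))
```

An empty list means none of these checks failed; it does not guarantee the audio file itself meets the format requirements.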