A guide to getting started with Alexa Skills
Alexa is a cloud-based service provided by Amazon that sits at the core of its range of Echo devices. It can translate an ever-expanding number of voice commands into actionable requests. There are already a number of impressive videos on YouTube highlighting Alexa's capabilities: a user can simply ask the device to “dim the lights” or “play some music” in a natural, organic way, and the request is translated into complex tasks to be executed. Although this might seem like something that has existed for a while now (yes, iPhone users, Siri is similar, we know), it would be a mistake to classify Alexa as just another voice assistant. Instead of being targeted at phones and other mobile devices, Alexa is primarily designed for a home or indoor setup. It can hear someone across the room and, most importantly, its capabilities, or ‘skills’ as Amazon calls them, are open for the developer community to augment. This is perhaps the single most important aspect of Alexa, as it allows developers from all over the world to ‘teach’ it new skills every day, letting the cloud service grow its capabilities at an amazing rate. Of course, this has also spawned a number of quirky and hilarious skills, like the ‘philosoraptor’ skill which, when invoked, entertains the user with pseudo-deep thoughts such as “If camera lenses are round, why do pictures come out rectangular?”.
Now that we have some idea of what an Alexa skill is and how it is used, let us build one ourselves. We will explain each component of the skill as we work on it.
SKILLS AND INTENTS
An Alexa skill can be broadly divided into two components:
1. Voice User Interface (VUI)
2. Programming Logic
VOICE USER INTERFACE
This is the translator for Alexa: it converts the captured audio into one or more actions using a combination of speech recognition, machine learning and natural language processing. These ‘actions’ or ‘requirements’ output by the VUI are known as ‘intents’, and they are supplied to the backend for further processing. While speech recognition by itself is simple, the main hurdle is deriving the context of the speech in a sentence. A user saying “four miles” to indicate 4 miles (between two locations, say) sounds phonetically identical to “for Miles”, as in the singer Miles Davis. This is where differentiating the context is important, and Alexa achieves this by using a set of ‘utterances’ as training data for its pre-trained machine learning model. The voice user interface in Alexa is built on the developer.amazon.com portal.
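To make this hand-off concrete, here is a simplified sketch in Python of what an intent delivered to the backend looks like. The field names are modelled on the JSON request format the Alexa service sends, but the intent name and slot are invented for illustration, and real requests carry additional session and context metadata.

```python
# A simplified sketch of the request the Alexa VUI hands to the backend
# once it has resolved the user's speech into an intent. The intent name
# "GetDistanceIntent" and its slot are hypothetical examples.
sample_request = {
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "GetDistanceIntent",
            "slots": {
                "Distance": {"name": "Distance", "value": "4 miles"}
            },
        },
    }
}

def extract_intent(event):
    """Pull the resolved intent name and slot values out of a request."""
    intent = event["request"]["intent"]
    slots = {k: v.get("value") for k, v in intent.get("slots", {}).items()}
    return intent["name"], slots

name, slots = extract_intent(sample_request)
print(name, slots)  # GetDistanceIntent {'Distance': '4 miles'}
```

Notice that by the time the request reaches the backend, the hard work of speech recognition and disambiguation is already done; the backend only sees a structured intent.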
PROGRAMMING LOGIC
This is essentially the ‘backend’ of the application, which handles the intent supplied by the voice user interface. In the simplest sense, it is an HTTPS URL that is called when an action is recognized by the device. Amazon recommends using its AWS Lambda service to set up a serverless backend for Alexa, as it allows for easy integration with Amazon's existing ecosystem. The backend service can also respond with audio streaming, text-to-speech or SSML (Speech Synthesis Markup Language, an HTML-like syntax for text-to-speech). We will look at this in greater detail at a later stage.
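As a sketch of what such a backend looks like, the following is a minimal AWS Lambda handler in Python that answers an intent with a plain-text speech response. The intent name EggRecipeIntent and the reply text are made up for illustration; the response envelope follows the Alexa Skills Kit JSON response format.

```python
# Minimal AWS Lambda handler for an Alexa skill (illustrative sketch).
# "EggRecipeIntent" and the spoken replies are hypothetical; the
# response envelope follows the Alexa Skills Kit response format.
def build_response(text):
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }

def lambda_handler(event, context):
    request = event["request"]
    if request["type"] == "LaunchRequest":
        return build_response("Welcome! Ask me for an egg recipe.")
    if request["type"] == "IntentRequest":
        if request["intent"]["name"] == "EggRecipeIntent":
            return build_response("Try a three-minute soft-boiled egg.")
    return build_response("Sorry, I did not understand that.")
```

An SSML reply would simply swap the outputSpeech object for one of type "SSML", with the markup wrapped in a `<speak>` element.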
UTTERANCES
An utterance is a sample of a voice command that the user says in order for Alexa to map it to a specific skill and intent.
A full command consists of the wake word (usually ‘Alexa’), the starting phrase (‘ask’, ‘tell’, etc.), the skill invocation name (‘Ola’, ‘Zomato’) and the utterance itself (‘book me a cab to work’), which launches the specific intent for the skill.
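The pieces above compose into one spoken phrase, as this small sketch shows (the skill name and request are just example values):

```python
# The parts of a full Alexa voice command, composed into one phrase.
# "Ola" and the request text are example values, not a real configuration.
wake_word = "Alexa"
starting_phrase = "ask"
invocation_name = "Ola"
utterance = "to book me a cab to work"

command = f"{wake_word}, {starting_phrase} {invocation_name} {utterance}"
print(command)  # Alexa, ask Ola to book me a cab to work
```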
DESIGNING THE VOICE USER INTERFACE
We will first design the voice user interface for our skill on developer.amazon.com. To do this, log in to your developer account on Amazon and navigate to the Alexa tab. Once the skills home screen is open, click on the Add a New Skill button to open a wizard. As of today, there are four types of skills that you can create for the Alexa service:
• Custom Interaction Model
• Smart Home Skill
• Flash Briefing Skill
• Video Skill
For now, let us focus on the simple Custom Interaction Model. Select a language (for example, English (India)), write the name of the skill (e.g. Egg Recipe Skill) and then add the invocation name for your skill (let's call it Egginator).
The invocation name for your skill is crucial, as it is what users say to call on your skill through Alexa. Typically it should be at most two to three words long, easy to pronounce and free of reserved words, so as not to conflict with other skills. There are also a couple of global settings, such as whether you need a long audio response (over 90 seconds) for your skill, or whether you need to use the video/render template modules. Once this is done, click Next to move to the skill builder page, which has a Skill Builder button; clicking on it opens another wizard that allows easy creation of intents (defaults for the Help, Stop and Cancel intents are already provided).
To create a custom intent, simply provide a name for it and add a set of sample utterances that trigger it.
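As a sketch, an intent for our hypothetical Egginator skill might be declared with a handful of sample utterances like the following, shown here as a Python structure mirroring the interaction-model JSON the skill builder generates. The intent name and phrases are invented for illustration.

```python
# Sketch of a custom intent definition for the hypothetical Egginator
# skill, mirroring the interaction-model JSON the skill builder produces.
egg_recipe_intent = {
    "name": "EggRecipeIntent",
    "samples": [
        "give me an egg recipe",
        "how do I boil an egg",
        "suggest something to cook with eggs",
    ],
}

def matches_intent(intent, spoken_text):
    """Naive check: does the spoken text match one of the samples?
    (The real service generalizes with machine learning rather than
    requiring an exact match.)"""
    return spoken_text.lower() in (s.lower() for s in intent["samples"])

print(matches_intent(egg_recipe_intent, "How do I boil an egg"))  # True
```

The sample utterances are training data, not an exhaustive list: the more varied phrasings you provide, the better Alexa generalizes to commands you did not anticipate.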