How to transcribe and analyse a phone call in real time

In this post, I want to share an example of how to stream phone call audio through IBM Watson Speech to Text and IBM Watson Natural Language Understanding services, and show some ideas of what you could use this for.

Let’s start with a demo

That’s what I want to show you how to build.

At a high-level, this is what you will have seen in that video:

Faith made a phone call to a phone number managed by Twilio.

Twilio routed the phone call to me, and I answered the call.

We then started talking to each other. And while we were doing this:

Twilio streamed a copy of the audio from the phone call to a demo Node.js app.

The Node.js app sent audio to the Watson Speech to Text service for transcribing.

Watson Speech to Text asynchronously sent transcriptions to the Node.js app as soon as they were available.

The app then submitted the transcription text to Watson Natural Language Understanding for analysis.

All of this, both the transcriptions and the analyses, was displayed on the demo web page.

How did that all work?

I’ve written detailed instructions for how you can get and run the code for this yourself in a Code Pattern, but I want to describe a little more about what is happening here, too.

Step 1 – Collecting the phone number to connect to

Faith needed to call a phone number managed by Twilio.

The call processing was handled in two phases.

The first phase was implemented in TwiML Bin XML. This simple XML code generated the voice you can hear in the video, which asked her to enter my phone number on her phone keypad.

Then you can hear her tap in my mobile number, and then press hash.

The last thing this TwiML Bin code does is invoke a REST API that is implemented in the Node.js app.
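To give a feel for what that first-phase TwiML could look like, here is a minimal sketch. It is not the TwiML Bin from the demo: the prompt wording, the action URL, and the helper names buildGatherTwiml and escapeXml are all assumptions made for illustration. A real TwiML Bin would hold the XML directly; it is wrapped in a JavaScript function here only so that the URL of the REST API can be passed in.

```javascript
// Escape a value so that it is safe inside XML text or a quoted attribute.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: returns TwiML that speaks a prompt and collects digits.
// actionUrl is assumed to be the REST API exposed by the Node.js app.
function buildGatherTwiml(actionUrl) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    // finishOnKey="#" ends digit collection when the caller presses hash.
    // Twilio then requests the action URL, passing the digits that were
    // entered in a request parameter named "Digits".
    `  <Gather action="${escapeXml(actionUrl)}" method="POST" finishOnKey="#">`,
    '    <Say>Please enter the phone number you want to call, followed by the hash key.</Say>',
    '  </Gather>',
    '</Response>',
  ].join('\n');
}
```

Calling buildGatherTwiml with the URL of the app's REST API returns a complete TwiML document as a string.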

Step 2 – Connecting the call

The second phase of call handling was also implemented in TwiML Bin, but this was dynamically generated by the invoked REST API in the generateTwimlBin function.

It used the <Dial> verb to connect Faith’s call to my phone.

Step 3 – Forwarding call audio to the application

The generated TwiML Bin code also included two <Stream> verbs.

One instructed Twilio to stream the audio from Faith to the application’s websocket address /ws/caller, which is defined in lib/api.js.

The other instructed Twilio to stream the audio from my phone to the application’s websocket address /ws/receiver, also defined in lib/api.js.

This means that the application received the two audio streams independently, so there was no need for it to try to identify our separate voices from a combined audio stream.
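Steps 2 and 3 together can be sketched as a function that builds the TwiML returned by the REST API. This is a rough illustration under stated assumptions, not the actual generateTwimlBin code: the function name buildConnectTwiml, the stream names, and the mapping of inbound_track to the caller and outbound_track to the receiver are my assumptions about how the two streams could be split.

```javascript
// Escape a value so that it is safe inside XML text or a quoted attribute.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical sketch: wsBaseUrl is the app's websocket base address
// (for example "wss://my-app.example.com") and phoneNumber is the number
// that the caller keyed in during the first phase.
function buildConnectTwiml(wsBaseUrl, phoneNumber) {
  const base = escapeXml(wsBaseUrl);
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    // Assumed mapping: the inbound track is the audio Twilio receives
    // from the person who placed the call.
    '  <Start>',
    `    <Stream name="caller" url="${base}/ws/caller" track="inbound_track" />`,
    '  </Start>',
    // Assumed mapping: the outbound track is the audio played back to the
    // caller, which is the voice of the person who answered.
    '  <Start>',
    `    <Stream name="receiver" url="${base}/ws/receiver" track="outbound_track" />`,
    '  </Start>',
    // Connect the call once both streams have been started.
    `  <Dial>${escapeXml(phoneNumber)}</Dial>`,
    '</Response>',
  ].join('\n');
}
```

Starting the streams before the Dial element means the audio is already being forwarded when the second phone is answered.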

Step 4 – Sending call audio to Speech to Text

Call audio was received as base64-encoded audio data in a Twilio JSON payload format.

lib/phone-to-stt.js was responsible for extracting each base64-encoded audio string from the JSON objects, and sending them to the Watson Speech to Text service.
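The extraction step could look something like the sketch below. It assumes the Twilio Media Streams message shape, where audio arrives in messages with an event of "media" and the base64 string is in media.payload. The function name extractAudio is hypothetical and this is not the code from lib/phone-to-stt.js.

```javascript
// Hypothetical helper: takes one raw websocket message from Twilio and
// returns the decoded audio bytes, or null if the message carries no audio
// (for example the "connected", "start" and "stop" events).
function extractAudio(rawMessage) {
  let message;
  try {
    message = JSON.parse(rawMessage);
  } catch (err) {
    return null; // not valid JSON, so there is nothing to forward
  }
  if (!message || message.event !== 'media' || !message.media) {
    return null;
  }
  if (typeof message.media.payload !== 'string') {
    return null;
  }
  // Decode the base64 string into raw bytes ready to be written to the
  // Speech to Text connection.
  return Buffer.from(message.media.payload, 'base64');
}
```

Twilio Media Streams audio is 8 kHz mu-law, so a Speech to Text session fed with these bytes would need a matching content type such as audio/mulaw;rate=8000.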

Step 5 – Receiving transcriptions from Speech to Text

The Speech to Text service asynchronously sent transcriptions to the application when they were ready through a websocket connection.

The configuration used for the Speech to Text service meant that interim transcriptions were sent any time something was recognised, and final transcriptions were sent when Faith or I paused speaking.

Transcriptions were received as JSON objects, with both interim and final transcriptions processed by the handleSttData() function in lib/phone-to-stt.js.

These transcriptions were stored in-memory using lib/stt-store.js so that they were available for display in the web application (step 7).
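One way to handle the mix of interim and final results is sketched below. It assumes the Speech to Text result format, where each entry in results has a final flag and a list of alternatives with a transcript string. The TranscriptStore class and its method names are hypothetical, not the contents of lib/stt-store.js.

```javascript
// Hypothetical in-memory store for one speaker's transcript.
class TranscriptStore {
  constructor() {
    this.finals = [];   // completed utterances, in the order they were spoken
    this.interim = '';  // latest guess at the utterance still in progress
  }

  // Apply one JSON message received from Speech to Text.
  update(data) {
    if (!data || !Array.isArray(data.results)) {
      return; // e.g. state or error messages that carry no results
    }
    for (const result of data.results) {
      const best = result.alternatives && result.alternatives[0];
      if (!best || typeof best.transcript !== 'string') {
        continue;
      }
      const text = best.transcript.trim();
      if (result.final) {
        // A final result replaces the interim guesses that led up to it.
        if (text) {
          this.finals.push(text);
        }
        this.interim = '';
      } else {
        // Each interim result supersedes the previous one.
        this.interim = text;
      }
    }
  }

  // Everything said so far, including the utterance in progress.
  getTranscript() {
    return [...this.finals, this.interim].filter(Boolean).join(' ');
  }
}
```

Keeping only the latest interim result means the web page can show words as they are recognised without ever displaying the same phrase twice.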

Step 6 – Analysing transcriptions

When requested by the web application, a combined transcript of everything we had said on the call so far was submitted to the Watson Natural Language Understanding service for analysis. This was done by the analyze function in lib/nlu.js.

The configuration used for the Natural Language Understanding service meant that emotion in the text was assessed (e.g. sadness, joy, fear, anger). However, commented-out examples in NLU_CONFIG show how other analyses, such as entity extraction, tone or sentiment analysis, could be performed.
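As an illustration of working with an emotion analysis, the sketch below builds a request that asks only for emotion, and picks the strongest emotion out of a response. It assumes the Natural Language Understanding request and response shapes (a features object in the request, and scores under emotion.document.emotion in the result). The function names are hypothetical and this is not the NLU_CONFIG from the demo.

```javascript
// Hypothetical helper: parameters for an analysis that requests only the
// emotion feature. Other features (such as entities or sentiment) would be
// added as further keys of "features".
function buildEmotionRequest(text) {
  return {
    text,
    features: {
      emotion: {},
    },
  };
}

// Hypothetical helper: given the result object of an analysis, return the
// emotion with the highest score, or null if no emotion scores are present.
function dominantEmotion(analysis) {
  const scores = analysis
    && analysis.emotion
    && analysis.emotion.document
    && analysis.emotion.document.emotion;
  if (!scores) {
    return null;
  }
  let best = null;
  for (const [name, score] of Object.entries(scores)) {
    // On a tie, the emotion listed first is kept.
    if (best === null || score > best.score) {
      best = { name, score };
    }
  }
  return best;
}
```

A web page could use the returned name and score to highlight how the tone of the conversation changes as the call goes on.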

As transcriptions can be received very frequently while someone is speaking, analyses are cached to avoid the application making a large number of calls to NLU. The length of time for which a cached analysis is reused is determined by the CACHE_TIME_SECONDS constant.
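A time-based cache along these lines can be sketched as follows. This is an assumed implementation, not the code in lib/nlu.js: the value of CACHE_TIME_SECONDS, the createCachedAnalyzer name, and the injectable analyze and now parameters are all choices made for illustration.

```javascript
// Assumed value; the demo defines its own CACHE_TIME_SECONDS.
const CACHE_TIME_SECONDS = 5;

// Hypothetical wrapper: analyze is any function that takes a transcript and
// returns a promise of an analysis. now returns the current time in
// milliseconds and can be replaced with a fake clock for testing.
function createCachedAnalyzer(analyze, now = Date.now) {
  let cached = null; // { promise, expiresAt } for the most recent analysis

  return function cachedAnalyze(transcript) {
    const time = now();
    if (cached && time < cached.expiresAt) {
      // Reuse the recent analysis, even if the transcript has grown since.
      return cached.promise;
    }
    const entry = {
      promise: Promise.resolve(analyze(transcript)),
      expiresAt: time + CACHE_TIME_SECONDS * 1000,
    };
    // If the analysis fails, forget it so that the next request retries.
    entry.promise.catch(() => {
      if (cached === entry) {
        cached = null;
      }
    });
    cached = entry;
    return entry.promise;
  };
}
```

The trade-off is that the analysis shown can lag behind the transcript by up to CACHE_TIME_SECONDS, in exchange for a bounded number of NLU calls no matter how often the web page asks.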

Get the code

That was the demo!

If you think you could turn this into something useful for your own project, you can find all of the source code and instructions for how to build and run it at
