{"id":4630,"date":"2022-07-16T17:05:50","date_gmt":"2022-07-16T17:05:50","guid":{"rendered":"https:\/\/dalelane.co.uk\/blog\/?p=4630"},"modified":"2022-07-16T17:05:50","modified_gmt":"2022-07-16T17:05:50","slug":"how-to-transcribe-and-analyse-a-phone-call-in-real-time","status":"publish","type":"post","link":"https:\/\/dalelane.co.uk\/blog\/?p=4630","title":{"rendered":"How to transcribe and analyse a phone call in real-time"},"content":{"rendered":"<p><strong>In this post, I want to share an example of how to stream phone call audio through IBM Watson Speech to Text and IBM Watson Natural Language Understanding services, and show some ideas of what you could use this for.<\/strong><\/p>\n<p>Let&#8217;s start with a demo<\/p>\n<p><iframe loading=\"lazy\" width=\"450\" height=\"253\" src=\"https:\/\/www.youtube.com\/embed\/So3b4uJGaBw\" title=\"YouTube video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen=\"\"><\/iframe><\/p>\n<p>That&#8217;s what I want to show you how to build.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/github.com\/IBM\/phone-stt-demo\/raw\/master\/doc\/source\/images\/architecture.png\" style=\"border: thin black solid\"\/><\/p>\n<p>At a high-level, this is what you will have seen in that video:<\/p>\n<p><strong>1.<\/strong><br \/>\nFaith made a phone call to a phone number managed by Twilio.<\/p>\n<p><strong>2.<\/strong><br \/>\nTwilio routed the phone call to me, and I answered the call.<\/p>\n<p>We then started talking to each other. 
And while we were doing this:<\/p>\n<p><strong>3.<\/strong><br \/>\nTwilio streamed a copy of the audio from the phone call to a demo Node.js app.<\/p>\n<p><strong>4.<\/strong><br \/>\nThe Node.js app sent audio to the Watson Speech to Text service for transcribing.<\/p>\n<p><strong>5.<\/strong><br \/>\nWatson Speech to Text asynchronously sent transcriptions to the Node.js app as soon as they were available.<\/p>\n<p><strong>6.<\/strong><br \/>\nThe app then submitted the transcription text to Watson Natural Language Understanding for analysis.<\/p>\n<p><strong>7.<\/strong><br \/>\nAll of this &#8211; the transcriptions and analyses &#8211; was displayed on the demo web page.<\/p>\n<p><!--more--><\/p>\n<hr \/>\n<h2>How did that all work?<\/h2>\n<p>I&#8217;ve written detailed instructions for how you can get and run the code for this yourself in a <a href=\"https:\/\/developer.ibm.com\/patterns\/transcribe-a-phone-call-in-real-time-with-watson-speech-to-text-and-twilio\/\">Code Pattern on developer.ibm.com<\/a>, but I want to describe a little more about what is happening here, too.<\/p>\n<h3>Step 1 &#8211; Collecting the phone number to connect to<\/h3>\n<p>Faith needed to call a phone number managed by Twilio.<\/p>\n<p>The call processing was handled in two phases.<\/p>\n<p>The first phase was implemented in <strong>TwiML Bin XML<\/strong>. 
This simple XML code generated the voice you can hear in the video, asking her to enter my phone number on her phone keypad.<\/p>\n<p>Then you can hear her tap in my mobile number, and then press hash.<\/p>\n<p>The last thing this TwiML Bin code did was invoke a REST API that is implemented in the Node.js app.<\/p>\n<h3>Step 2 &#8211; Connecting the call<\/h3>\n<p>The second phase of call handling was also implemented in TwiML Bin, but this was <a href=\"https:\/\/github.com\/IBM\/phone-stt-demo\/blob\/56ec58252455ebf81427c7840a8daa72e970988b\/lib\/twilio.js#L25-L49\">dynamically generated by the invoked REST API<\/a> in the <code>generateTwimlBin<\/code> function.<\/p>\n<p>It used the <code>&lt;Dial&gt;<\/code> verb to connect Faith&#8217;s call to my phone.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/github.com\/IBM\/phone-stt-demo\/raw\/master\/doc\/source\/images\/architecture.png\" style=\"border: thin black solid\"\/><\/p>\n<h3>Step 3 &#8211; Forwarding call audio to the application<\/h3>\n<p>The <a href=\"https:\/\/github.com\/IBM\/phone-stt-demo\/blob\/56ec58252455ebf81427c7840a8daa72e970988b\/lib\/twilio.js#L39-L42\">generated TwiML Bin code<\/a> also included two <code>&lt;Stream&gt;<\/code> verbs.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/github.com\/IBM\/phone-stt-demo\/raw\/master\/doc\/source\/images\/source-generate-twiml-bin.png\" style=\"border: thin black solid\"\/><\/p>\n<p>One instructed Twilio to stream the audio from Faith to the application&#8217;s websocket address <code>\/ws\/caller<\/code>, which is defined in <a href=\"https:\/\/github.com\/IBM\/phone-stt-demo\/blob\/56ec58252455ebf81427c7840a8daa72e970988b\/lib\/api.js#L126\">lib\/api.js<\/a>.<\/p>\n<p>The other instructed Twilio to stream the audio from my phone to the application&#8217;s websocket address <code>\/ws\/receiver<\/code>, also defined in <a 
href=\"https:\/\/github.com\/IBM\/phone-stt-demo\/blob\/56ec58252455ebf81427c7840a8daa72e970988b\/lib\/api.js#L127\">lib\/api.js<\/a>.<\/p>\n<p>This means that the application received the two audio streams independently &#8211;  there was no need for it to try and identify our separate voices from a combined audio.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/github.com\/IBM\/phone-stt-demo\/raw\/master\/doc\/source\/images\/architecture.png\" style=\"border: thin black solid\"\/><\/p>\n<h3>Step 4 &#8211; Sending call audio to Speech to Text<\/h3>\n<p>Call audio was received as base64-encoded audio data in a <a href=\"https:\/\/www.twilio.com\/docs\/voice\/twiml\/stream#message-media\">Twilio JSON payload format<\/a>.<\/p>\n<p><a href=\"https:\/\/github.com\/IBM\/phone-stt-demo\/blob\/56ec58252455ebf81427c7840a8daa72e970988b\/lib\/phone-to-stt.js#L92-L132\"><code>lib\/phone-to-stt.js<\/code><\/a> was responsible for extracting each base64-encoded audio string from the JSON objects, and sending them to the Watson Speech to Text service.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/github.com\/IBM\/phone-stt-demo\/raw\/master\/doc\/source\/images\/source-send-to-stt.png\"\/><\/p>\n<h3>Step 5 &#8211; Receiving transcriptions from Speech to Text<\/h3>\n<p>The Speech to Text service asynchronously sent transcriptions to the application when they were ready through a websocket connection.<\/p>\n<p>The <a href=\"https:\/\/github.com\/IBM\/phone-stt-demo\/blob\/master\/lib\/config\/stt.js\">configuration used for the Speech to Text service<\/a> meant that interim transcriptions were sent any time something was recognised, and final transcriptions were sent when Faith or I paused speaking.<\/p>\n<p>Transcriptions were received as JSON objects, with both interim and final transcriptions processed by the <code>handleSttData()<\/code> function in <a 
href=\"https:\/\/github.com\/IBM\/phone-stt-demo\/blob\/56ec58252455ebf81427c7840a8daa72e970988b\/lib\/phone-to-stt.js#L149-L177\">lib\/phone-to-stt.js<\/a>.<\/p>\n<p>These transcriptions were stored in-memory using <a href=\"https:\/\/github.com\/IBM\/phone-stt-demo\/blob\/master\/lib\/stt-store.js\">lib\/stt-store.js<\/a> so that they were available for display in the web application (step 7).<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/github.com\/IBM\/phone-stt-demo\/raw\/master\/doc\/source\/images\/architecture.png\" style=\"border: thin black solid\"\/><\/p>\n<h3>Step 6 &#8211; Analysing transcriptions<\/h3>\n<p>When requested by the web application, a combined transcript of everything we had said on the call so far was submitted to the Watson Natural Language Understanding service for analysis. This was done by the <a href=\"https:\/\/github.com\/IBM\/phone-stt-demo\/blob\/56ec58252455ebf81427c7840a8daa72e970988b\/lib\/nlu.js#L50-L97\"><code>analyze<\/code><\/a> function in <a href=\"https:\/\/github.com\/IBM\/phone-stt-demo\/blob\/master\/lib\/nlu.js\"><code>lib\/nlu.js<\/code><\/a>.<\/p>\n<p>The <a href=\"https:\/\/github.com\/IBM\/phone-stt-demo\/blob\/master\/lib\/config\/nlu.js\">configuration used for the Natural Language Understanding service<\/a> meant that emotion in the text was assessed (e.g. sadness, joy, fear and anger). Commented-out examples in <a href=\"https:\/\/github.com\/IBM\/phone-stt-demo\/blob\/56ec58252455ebf81427c7840a8daa72e970988b\/lib\/config\/nlu.js#L6-L43\"><code>NLU_CONFIG<\/code><\/a> show how other analyses, such as entity extraction, tone or sentiment analysis, could be performed.<\/p>\n<p>Transcriptions can be received very frequently while someone is speaking, so analyses are cached to avoid the application making a large number of calls to NLU. 
The amount of time that a cached analysis should be reused is determined by the <a href=\"https:\/\/github.com\/IBM\/phone-stt-demo\/blob\/56ec58252455ebf81427c7840a8daa72e970988b\/lib\/config\/nlu.js#L45-L53\"><code>CACHE_TIME_SECONDS<\/code> constant<\/a>.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/github.com\/IBM\/phone-stt-demo\/raw\/master\/doc\/source\/images\/architecture.png\" style=\"border: thin black solid\"\/><\/p>\n<h2>Get the code<\/h2>\n<p>That was the demo!<\/p>\n<p>If you think you could turn this into something useful for your own project, you can find all of the source code and instructions for how to build and run it at <a href=\"https:\/\/developer.ibm.com\/patterns\/transcribe-a-phone-call-in-real-time-with-watson-speech-to-text-and-twilio\/\">developer.ibm.com<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this post, I want to share an example of how to stream phone call audio through IBM Watson Speech to Text and IBM Watson Natural Language Understanding services, and show some ideas of what you could use this for. Let&#8217;s start with a demo That&#8217;s what I want to show you how to build. 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":4636,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[578,529,607,505],"class_list":["post-4630","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-code","tag-ibmwatson","tag-nlp","tag-twilio","tag-watson"],"_links":{"self":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/4630","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=4630"}],"version-history":[{"count":0,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/4630\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/media\/4636"}],"wp:attachment":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=4630"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=4630"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=4630"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}