I've been speaking in public for about 40 years, and I am always looking to improve my pace, clarity, and informational and educational value. My new job, which I love, does not require as much public speaking as my old job, and I am a little out of practice. I have noticed a few filler words, such as "um" and "uh", creeping back into my presentations.
I thought, "Wouldn't it be great if, while I am presenting, an app were converting my spoken words to text in real time and flagging filler words?" The flagged words would be a little reminder to slow down and take more pauses if needed.
I am a student of machine learning and am familiar with the Google Cloud APIs for all sorts of tasks, so when a Google search for "Real Time Speech to Text" turned up a result for using Google Cloud Speech-to-Text for exactly this, I was all in. That page provides code samples in multiple programming languages for sending either recorded audio input (16 kHz, mono) or microphone input to Google Cloud and receiving the transcribed text back in real time.
Even though I have never used the Go programming language, I chose the Go samples, as Python is typically a nightmare of failed pip installs for me; in my experience, Python environments are simply unreliable.
I used Linux for this test. I started with macOS, but the real-time microphone input requires the gst-launch-1.0 program, which does not exist on macOS. gst-launch-1.0 passes the microphone input to the Go program in the correct format for the Google Speech-to-Text API:

gst-launch-1.0 -v pulsesrc ! audioconvert ! audioresample ! audio/x-raw,channels=1,rate=16000 ! filesink location=/dev/stdout | go run gcp-mic-text.go
To install Go 1.14.2 on Linux, download the official tarball, extract it to /usr/local, and add /usr/local/go/bin to your PATH. Once done, running the "go version" command should return this output: "go version go1.14.2 linux/amd64"
Creating the Real-Time Microphone to Text Go Program
I made a few changes to the sample real-time program provided by Google here. You can find my changed file in the Google folder of this repo.
Basically, I replaced the print of the entire array of alternative text results with a print of only index 0 (the highest-confidence transcription). I also printed just the text (.Transcript) rather than all the other fields, such as "Result: alternatives:<transcript:" and "confidence:0.9377223 > is_final:true result_end_time:<seconds:12 nanos:70000000 >". This made the output much easier to read.
My end goal was to flag my filler words such as "um" and "uh", so I Googled a bit of Go syntax and found the strings.ReplaceAll function. I added a bit of logic to replace all occurrences of "Um" with "**Um**" in the translated text.
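Since the transcript comes back as a plain string, the flagging logic is just a few strings.ReplaceAll calls. A minimal sketch of that idea (highlightFillers is my own helper name; note that ReplaceAll is a naive substring match, so a word like "umbrella" would also be flagged, and a regexp with word boundaries would be stricter):

```go
package main

import (
	"fmt"
	"strings"
)

// highlightFillers wraps each filler word in ** so it stands out in the
// printed transcript. ReplaceAll is a plain substring replace; a regexp
// with word boundaries (e.g. \bum\b) would avoid matching inside words.
func highlightFillers(transcript string) string {
	for _, filler := range []string{"Um", "um", "Uh", "uh"} {
		transcript = strings.ReplaceAll(transcript, filler, "**"+filler+"**")
	}
	return transcript
}

func main() {
	fmt.Println(highlightFillers("um so the uh next slide"))
	// → **um** so the **uh** next slide
}
```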
Imagine my surprise when, after all this work, I could not get the word "Um" to show up in my speech-to-text output no matter how many times I said um. It turns out that the Google speech-to-text model was built to suppress filler words; they will never show up in its transcripts. Oh, well. Let's see if Azure or IBM Watson leaves filler words in the transcript.
Version 2 - Azure (Confusing) then IBM Watson (Success!)
So, fresh from my semi-failure with Google speech to text, I searched for other speech-to-text APIs. I found this very helpful article which recommended Google, Microsoft, Dialogflow, IBM Watson, and Speechmatics: https://nordicapis.com/5-best-speech-to-text-apis/
Short Stop at Microsoft Azure Cognitive Services
I have done some simple development in Azure, so I thought I would try Microsoft Cognitive Services next. I started a project in Microsoft Cognitive Services but could not figure out how to upload audio to the Speaker Recognition API Quick Start page. I'm sure there is a way, but I Googled for a while and could not figure this out. On to IBM Watson Speech-to-Text...
On to IBM Watson Speech-to-Text
Next step: IBM Watson. I have never used any IBM Watson services before but have heard good things. Also, I do not have a billable account with IBM Watson as I do with AWS, Azure, and GCP. Thankfully, IBM Watson Speech to Text does not require a billing account for fewer than 500 minutes of audio transcribed per month. Thank you very much.
Curl is the easiest way to test that you have a working personalized IBM Speech to Text API key and service URL. To create your API key and service URL, create a speech-to-text service on this page: https://cloud.ibm.com/catalog.
If everything went well, you should see this output:
"transcript": "several tornadoes touched down as a line of severe thunderstorms swept through Colorado on Sunday "
If you want to create your own WAV files for testing, the files have to be 16 kHz and mono. On my Mac, I used the "To WAV Converter" app found on this web site, https://amvidia.com/wav-converter, and used these settings:
Creating a Custom Microphone to Text App with Trigger Words
Change #1 - Replace default keywords to spot
Not really a big deal as this field is user-editable before clicking "Record Audio" in the app, but still nice to find. You can modify the default keywords that come up in the app in the file /src/data/samples.json. Line 41 in the section "en-US_BroadbandModel" is where the default keywords are defined. You can change/reduce these to suit your needs.
During my testing, I was receiving alerts even when I had not said a filler word. It turns out that the demo program has a confidence floor of only 1% before a transcribed word is flagged. I wanted to raise this confidence floor to 50% and test again. While opening random files from the repo in Atom, I found this section of ./views/demo.jsx on lines 117-119:
keywords_threshold:keywords.length?0.01:undefined, // note: in normal usage, you'd probably set this a bit higher
I changed the 0.01 to 0.50 and the new section now looks like this:
keywords_threshold:keywords.length?0.50:undefined, // note: in normal usage, you'd probably set this a bit higher
You can find this modified file in the Watson directory of this repo.
Listening and Displaying Alerts on Another Device
So, now everything is working. Yay. Now what microphone do I use to listen for and alert on filler words? That microphone needs to be different from the microphone I am using to present over WebEx at my desk. I had a few options:
1. Connect a second microphone to my Mac and tell the browser on my Mac to listen to this second microphone
2. Connect a second microphone to the Linux VM used for the Google test and listen in this VM
3. Listen from the microphone and browser on my phone, propped in front of me while I present
I chose option 3 even though option 3 required a hack.
Google Chrome Hack to Allow Microphone Input for Insecure Web Site
Here's the issue I ran into: this demo code runs a simple HTTP web server, and most browsers, reasonably, do not allow microphone or camera input for insecure web servers. Opening the microphone on my Mac to a web server running on the same Mac (localhost) is OK; opening the microphone on my phone to some random IP address that just happens to be my Mac is not. Thankfully, you can override this microphone block for trusted, insecure web servers in Chrome by going to the chrome://flags/#unsafely-treat-insecure-origin-as-secure setting in Chrome:
Change the setting from Disabled to Enabled and type the full origin of your web server in the text field, for example http://192.168.86.24:3000. Once you do this, you will be prompted with a button to relaunch your browser. After the relaunch, you should be able to use the microphone on your phone (or any other device) to provide input to your speech-to-text application. Here is a photo of the entire system in use the other day as I was giving a PowerPoint presentation:
Thank you for taking the time to read this tutorial. Please let me know if you have any issues making this work for yourself, if you find any errors, or have any suggestions.