I’ve wanted to use speech detection in my personal projects for the longest time, but the Google API has gotten more and more restrictive over time. To make sure my projects could work even without an internet connection, I looked for another speech recognition package, preferably one that was easier to use. I found CMU Sphinx, the speech recognition suite from CMU, to be a really great speech-to-text package. However, documentation and sample code are practically non-existent, so it took me forever to get anything done. Finally, I’ve figured it out! The example code is at the bottom of this post, but you can also download it directly from GitHub here.
Here are the steps to take to get this working:
- Download SphinxBase and follow the install instructions
- Download PocketSphinx and follow the install instructions
- Download PocketSphinx-python and follow the install instructions
- Run the code below
The main problem I had with setting up PocketSphinx was the myriad of libraries that the main site told me to download. However, after lots of trial and error, I’ve realized that I really only need three:
- SphinxBase is the base package that all of the other Sphinx programs use
- PocketSphinx is the lightweight recognizer; I picked it because I was okay with the program being a bit inaccurate if it meant I could decode phrases faster
- PocketSphinx-python is the wrapper that lets us program in Python, the best scripting language ever
The code basically sets up the microphone and saves each detected phrase as a temporary .wav file, which the Sphinx decoder then translates into a list of strings representing the spoken words. A phrase is defined as a stretch of sound sandwiched between periods of silence. I stole most of the phrase detection code from someone else two years ago, though unfortunately, I can’t remember who. If you’re reading this, thank you! 🙂
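To make the "save each phrase as a temporary .wav file" step concrete, here is a minimal sketch using only Python's standard library. The function name `save_phrase` and the audio format constants are my assumptions for illustration (16 kHz, 16-bit, mono is the format the stock PocketSphinx models expect), not necessarily what the code at the bottom of this post uses.

```python
import os
import tempfile
import wave

# Assumed audio format: 16 kHz, 16-bit, mono.
SAMPLE_RATE = 16000
SAMPLE_WIDTH = 2  # bytes per sample
CHANNELS = 1


def save_phrase(frames):
    """Write a list of raw PCM byte chunks to a temporary .wav file.

    Returns the path of the file; the caller is responsible for deleting it.
    (Hypothetical helper, named here for illustration only.)
    """
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)  # wave.open reopens the file by path
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH)
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(b"".join(frames))
    return path
```

The returned path is what you would hand to the decoder, and then remove with `os.remove(path)` once the phrase has been decoded.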
Anyhow, when the run loop is initialized, we first set the loudness threshold below which sound counts as “silence”. Then we launch into an infinite loop that keeps listening to the microphone and calls the Sphinx decoder whenever a phrase has been saved. A sliding average of the recent loudness is used during phrase detection as well, to make things a bit more accurate. You can load different voice recognition models into the decoder config if you want this speech recognition code to work for different languages.
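Here is a rough, microphone-free sketch of the idea: a silence threshold estimated from ambient noise, and a sliding average of chunk loudness deciding where phrases start and end. The function names, the `margin`, `window`, and `max_silence` parameters, and the "average ambient loudness times a margin" rule are all my assumptions for illustration; the actual code may pick its threshold and window differently.

```python
import math
from array import array
from collections import deque


def rms(chunk):
    """Root-mean-square loudness of a chunk of 16-bit signed PCM bytes."""
    samples = array("h", chunk)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def silence_threshold(ambient_chunks, margin=1.5):
    """Assumed rule: average loudness of ambient noise times a safety margin."""
    levels = [rms(c) for c in ambient_chunks]
    return margin * sum(levels) / len(levels)


def detect_phrases(chunks, threshold, window=3, max_silence=2):
    """Yield phrases (lists of chunks) separated by silence.

    A chunk counts as loud when the sliding average of the last `window`
    loudness values exceeds `threshold`. A phrase ends once `max_silence`
    consecutive chunks are not loud.
    """
    recent = deque(maxlen=window)  # sliding window of loudness values
    phrase = []
    quiet = 0
    for chunk in chunks:
        recent.append(rms(chunk))
        loud = sum(recent) / len(recent) > threshold
        if loud:
            phrase.append(chunk)
            quiet = 0
        elif phrase:
            # Keep a little trailing silence, then close the phrase.
            phrase.append(chunk)
            quiet += 1
            if quiet >= max_silence:
                yield phrase
                phrase = []
                quiet = 0
    if phrase:
        yield phrase  # flush a phrase still open at end of input
```

In the real loop, `chunks` would be an endless stream of buffers read from the microphone, and each yielded phrase would be written to a .wav file and passed to the decoder. The sliding average smooths over very short dips in volume, so a brief pause in the middle of a word does not split the phrase.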
Now that I have this speech detection code in a neat little importable class, I’m really excited about future capabilities of my projects. So many ideas, so little time!
[Addendum] Thanks to Carl at email@example.com for getting this code working with Python3!