Speech recognition (Python package)¶
Summary¶
Python library to connect to several speech recognition APIs, including CMU Sphinx, Google Cloud, Wit.ai, … . It can be used for stream processing (in “real-time” with automatic detection of end according to the API) or for batch processing, both with synchronous speech recognition (response as soon as it is processed) for short audio sequences and asynchronous speech recognition (for larger files).
Below, we explain how to make it work with two particular speech recognition APIs, namely CMU Sphinx and the Google API. Let us quickly compare their pros and cons:
Note
We found the google API to be very precise straight out of the box. Sphinx performance very poor and was close to non-usuable without a grammar definition (explained in detail below) and good with a grammar.
Hardware requirements¶
- microphone, e.g. the internal microphone of the Kinect
Software requirements¶
- installed microphone drivers, e.g. as part of the Kinect drivers
- PyAudio (specifically considered as part of the installation procedure below)
Installation¶
Speech recognition package¶
- First, install the speech recognition package itself:
$ pip install SpeechRecognition
- The library requires PyAudio as a dependency (and PyAudio’s dependencies) in order to use real-time microphone input. The pip installation procedure should install PyAudio automatically, but the pip full pip installation is as of the date of this writing broken and might still be when used in the future. Note that a simple
pip install pyaudio
does not work. To resolve this, please follow the steps of the setup test below. If the output is correct, PyAudio has been installed correctly and you can jump to the next step. However, if the output is an error message indicating something similar to
AttributeError: Could not find PyAudio; check installation
then PyAudio has not been installed correctly and you have to follow the following steps:
a) Install several further dependencies of PyAudio manually by executingsudo apt install libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg libav-tools
. b) Now, install PyAudio manually by executingpip install pyaudio
. c) As a final step, re-run the setup test below and give Fezzik a hand for documenting all this struggle.
Now let’s install further dependencies for the underlying APIs. Note that you can install either the Google API or CMU Sphinx; they are independent.
Google Cloud API for speech recognition¶
Some useful links to start: * Documentation: https://cloud.google.com/speech/docs/ * Overview of Google speech recognition API: https://cloud.google.com/speech/
- Follow the steps under https://cloud.google.com/iam/docs/creating-managing-service-accounts#creating_a_service_account to create a service account. To do so, you need a Google account and have to provide billing information (a working credit card).
- At the end of this process, you should get a so-called service account key which is a
.json
file. You will need the content of this key for the speech recognition package. Store it in a location on your local machine (don’t share it via git, since it can be used to access the API).
Note
You only have to setup an API key and activate the service once. You can use it later on other machines, thus, you can skip this first step in such a case.
- Install the python API client (assuming you are using python) by executing
$ pip install --upgrade google-api-python-client
How to setup the CMU Sphinx package¶
$ sudo apt-get install python3 python3-all-dev python3-pip build-essential swig git libpulse-dev
$ pip3 install pocketsphinx
Setup test¶
This is a setup test for the speech recognition
package only.
- Run the demo program by executing:
python -m speech_recognition
This should start a demo program which, if correct, outputs something similar to:
Say something!
Got it! Now to recognize it...
You said hello hello
Say something!
Got it! Now to recognize it...
Oops! Didn't catch that
If not, consider the usage of this demo program in the installation instructions above.
Implementation¶
TODO
Limitations¶
- In noisy environments, speech recognition works poorly, in particular when using external surround microphones like the Kinect.
- When using speech recognition together with other ROS packages that strongly use ROS communication via a wireless LAN, using, in addition, a wireless internet connection can lead to broadband and latency issues. Furthermore, internet connection might be poor/uncontrollable when exiting an isolated environment (like the lab). We therefore favoured an offline
- using multiple APIs at once - limited possibility to configure -> one could make a pull request to the source code of the speech recognition package
What Didn’t Work¶
Note
TODO