Speech recognition in Python using pocketsphinx, or how I tried to make a voice assistant
This is a tutorial on using the pocketsphinx library in Python. I hope it helps you
get to grips with this library quickly and avoid stepping on my rake.
It all started with me wanting to make myself a voice assistant in Python. Initially I decided to use the speech_recognition library; as it turned out, I was not the only one. For recognition I used Google Speech Recognition, since it was the only backend that required no keys, passwords, and so on. For speech synthesis I took gTTS. In the end it turned out to be almost a clone of this assistant, which is why I could not calm down.
True, that was not the only reason I could not calm down: the response took a long time (recording did not stop right away, and sending speech to the server for recognition and text for synthesis was slow), speech was not always recognized correctly, I could not move more than half a meter away from the microphone, I had to speak very clearly, the speech Google synthesized sounded terrible, and there was no activation phrase, so sound was being recorded and sent to the server constantly.
The first improvement was speech synthesis using Yandex SpeechKit Cloud:

import requests

URL = ('https://tts.voicetech.yandex.net/generate?text=' + text +
       '&format=wav&lang=ru-RU&speaker=ermil&key=' + key +
       '&speed=1&emotion=good')
response = requests.get(URL)
if response.status_code == 200:
    # save the synthesized speech to a WAV file
    with open(speech_file_name, 'wb') as file:
        file.write(response.content)
Then came recognition's turn. I was immediately intrigued by the note "CMU Sphinx (works offline)" on the library page. I will not go over the basic concepts of pocketsphinx here, since chubakur already did that for me (many thanks to him) in this post.
I must say right away that installing pocketsphinx is not that easy (at least it did not go smoothly for me). Installing via pip will only work if you have swig installed; otherwise you need to download the prebuilt installer (msi).

Please note: the installer is only for Python 3.5!

Without swig,

pip install pocketsphinx

will not work: it fails with an error about building a wheel.
Speech recognition with pocketsphinx
Pocketsphinx can recognize speech both from a microphone and from a file. It can also search for hot phrases (this did not work out for me: for some reason the code that should run when the hot word is detected was executed several times, even though I said the word only once). Pocketsphinx differs from cloud solutions in that it works offline and can be restricted to a limited dictionary, which increases accuracy. If you are interested, there are examples on the library page; pay attention to the "Default config" item.
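For reference, keyword spotting is set up by disabling the language model and passing a key phrase. The sketch below follows the pattern from the pocketsphinx-python README; the phrase and the threshold are placeholders of my own and need tuning (and for Russian you would additionally pass hmm and dic, as shown later):

```python
def listen_for_keyphrase(keyphrase="hello computer", kws_threshold=1e-20):
    """Block on the microphone and print a line each time the hot
    phrase is detected.

    kws_threshold controls sensitivity: raise it for fewer false
    alarms, lower it for fewer misses. The default English acoustic
    model and dictionary are used here.
    """
    from pocketsphinx import LiveSpeech  # imported lazily: needs a mic and models

    speech = LiveSpeech(
        lm=False,                     # no language model: keyword search only
        keyphrase=keyphrase,
        kws_threshold=kws_threshold,
        sampling_rate=16000,
    )
    for _ in speech:                  # yields once per detection
        print("Hot phrase detected!")
```

In theory this should avoid the multiple-trigger problem by tuning kws_threshold, but as noted above I never got it to behave reliably.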
Russian language and acoustic model
Out of the box, pocketsphinx comes with an English acoustic model, language model, and dictionary. Russian ones can be downloaded from the CMU Sphinx site. The archive must be unpacked. Then the folder zero_ru.cd_cont_4000 must be moved from the folder into which you unpacked the archive to the model folder (the one returned by get_model_path()). The moved folder is the acoustic model. The same should be done with the files ru.lm and ru.dic. The file ru.lm is the language model, and ru.dic is the dictionary. If you did everything correctly, the following code should work.
import os
from pocketsphinx import LiveSpeech, get_model_path

model_path = get_model_path()
speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=os.path.join(model_path, 'ru.lm'),
    dic=os.path.join(model_path, 'ru.dic')
)
print("Say something!")
for phrase in speech:
    print(phrase)
First check that the microphone is connected and working. If the "Say something!" message does not appear for a long time, that is normal: most of this time is spent creating the LiveSpeech instance, which takes so long because the Russian language model weighs more than 500 (!) MB. For me the instance takes about two minutes to create.
This code should recognize almost any phrase you utter. Agree, the accuracy is disgusting. But it can be fixed, and creating the LiveSpeech instance can be sped up at the same time.
Instead of a language model, you can make pocketsphinx work with a simplified grammar. A JSGF grammar file is used for this, and using one also speeds up creating the LiveSpeech instance. How to write grammar files is described in the CMU Sphinx documentation. If a language model is present, the JSGF file will be ignored, so if you want to use your own grammar file, you need to write it like this:
speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=False,
    jsgf=os.path.join(model_path, 'grammar.jsgf'),
    dic=os.path.join(model_path, 'ru.dic')
)
Naturally, the grammar file must be created in the model_path folder. And one more thing: when using a JSGF grammar you will have to speak more clearly and separate the words.
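For illustration, a minimal JSGF grammar might look like this. The grammar name and the commands are my own example; every word used must also be present in the dictionary:

```
#JSGF V1.0;

grammar commands;

public <command> = <action> <object>;
<action> = включи | выключи;
<object> = свет | музыку;
```

This grammar accepts exactly four phrases ("turn on/off the light/music"), which is the kind of tiny search space that makes recognition both faster and more accurate than a full language model.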
Create your own dictionary
A dictionary is a set of words and their transcriptions; the smaller it is, the higher the recognition accuracy. To create a dictionary with Russian words you need the ru4sphinx project. Download it and unpack it. Then open a text editor, write out the words that should be in the dictionary, each on a new line, and save the file as my_dictionary.txt in the text2dict folder. Then open a console in that folder and run (Perl must be installed, since dict2transcript.pl is a Perl script):

perl dict2transcript.pl my_dictionary.txt my_dictionary_out.txt
Then open my_dictionary_out.txt and copy its contents. Open a text editor, paste the copied text, and save the file as my_dict.dic (select "all files" instead of "text file") in the model_path folder. To use the new dictionary, pass it when creating the LiveSpeech instance:

speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=os.path.join(model_path, 'ru.lm'),
    dic=os.path.join(model_path, 'my_dict.dic')
)
Some transcriptions may need to be tweaked.
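For reference, each line of the resulting dictionary pairs a word with its phonetic transcription. The exact phone symbols must match the acoustic model, so the entries below are only an approximation of what the script produces:

```
да d a
нет n' e t
```

This is exactly the sort of thing that "may need to be tweaked": if a word is consistently misrecognized, editing its transcription by hand often helps.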
Using pocketsphinx via speech_recognition
Using pocketsphinx through speech_recognition only makes sense if you are recognizing English speech. In speech_recognition you cannot specify an empty language model and use jsgf, so recognizing each fragment will take 2 minutes. Verified.
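For the English case, a file-based sketch with speech_recognition could look like this (a hypothetical helper; both speech_recognition and pocketsphinx must be installed, and the WAV path is yours to supply):

```python
def recognize_english_wav(wav_path):
    """Transcribe an English WAV file offline with CMU Sphinx.

    speech_recognition uses the English models bundled with
    pocketsphinx, so no extra model downloads are needed.
    """
    import speech_recognition as sr  # imported lazily: optional dependency

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the whole file into memory
    return recognizer.recognize_sphinx(audio)
```

recognize_sphinx also accepts a keyword_entries parameter for keyword spotting, but as noted above there is no way to plug in a jsgf grammar or an empty language model here.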
Having killed a few evenings on this, I realized I had wasted my time. On a dictionary of two words (yes and no), sphinx still manages to make mistakes, and often. It eats 30-40% of my Celeron, and with a language model loaded, a fat chunk of memory as well. Meanwhile Yandex recognizes almost any speech accurately, without eating memory or CPU. So decide for yourself whether it is worth undertaking at all.
P.S.: this is my first post, so I am looking forward to advice on the design and content of the article.
Which speech recognition solution do you like more?
- 25.0% sphinx 33
- 34.1% Yandex Speechkit Cloud 45
- 29.6% Google Cloud Speech API 39
- 11.4% Own option 15