Speech recognition in Python using pocketsphinx, or how I tried to make a voice assistant

  • Tutorial

This is a tutorial on using the pocketsphinx library in Python. I hope it helps you
get to grips with this library quickly and avoid stepping on the same rakes I did.


It all started with me wanting to build myself a voice assistant in Python. Initially, I decided to use the speech_recognition library for recognition. As it turned out, I'm not the only one. For the backend I used Google Speech Recognition, since it was the only one that required no keys, passwords, and so on. For speech synthesis I took gTTS. In the end it turned out to be almost a clone of this assistant, which is why I couldn't settle for it.


True, that wasn't the only reason I couldn't settle for it: the answer took a long time (recording did not stop right away, and sending speech to the server for recognition and text for synthesis took a lot of time), speech was not always recognized correctly, I had to speak from no farther than half a meter from the microphone and enunciate clearly, the speech synthesized by Google sounded terrible, and there was no activation phrase, meaning audio was constantly being recorded and sent to the server.


The first improvement was speech synthesis via Yandex SpeechKit Cloud:


import requests

# `text`, `key` and `speech_file_name` are assumed to be defined earlier.
URL = ('https://tts.voicetech.yandex.net/generate?text=' + text +
       '&format=wav&lang=ru-RU&speaker=ermil&key=' + key +
       '&speed=1&emotion=good')
response = requests.get(URL)
if response.status_code == 200:
    with open(speech_file_name, 'wb') as file:
        file.write(response.content)
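The same request can be wrapped into a small helper using only the standard library instead of requests. This is just a sketch: the function name and the injectable `opener` parameter are my additions for testability, and a valid SpeechKit key is still required for real use.

```python
import urllib.parse
import urllib.request

def synthesize(text, key, out_path='speech.wav', opener=urllib.request.urlopen):
    """Fetch synthesized speech from the (legacy) Yandex SpeechKit Cloud HTTP
    API and save it as a WAV file. Returns the output path.
    Note: urlopen raises HTTPError on non-200 responses instead of
    returning a status code to check."""
    params = urllib.parse.urlencode({
        'text': text, 'format': 'wav', 'lang': 'ru-RU',
        'speaker': 'ermil', 'key': key, 'speed': '1', 'emotion': 'good',
    })
    url = 'https://tts.voicetech.yandex.net/generate?' + params
    with opener(url) as response:
        data = response.read()
    with open(out_path, 'wb') as f:
        f.write(data)
    return out_path
```

URL-encoding the parameters also fixes a subtle problem with the string-concatenation version: spaces and Cyrillic characters in `text` are escaped properly.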

Then it was recognition's turn. I was immediately intrigued by the note "CMU Sphinx (works offline)" on the library page. I will not go over the basic concepts of pocketsphinx, since chubakur already did that (for which many thanks to him) in this post.


Install Pocketsphinx


I must say right away that installing pocketsphinx is not that simple (at least it didn't work for me straight away): a plain pip install pocketsphinx will fail, complaining about wheel. Installing via pip only works if you have swig installed. Otherwise, to install pocketsphinx you need to go here and download the installer (msi). Please note: the installer is only for Python 3.5!


Speech recognition with pocketsphinx


Pocketsphinx can recognize speech both from a microphone and from a file. It can also search for hot phrases (this did not work out for me: for some reason, the code that should run when the hot word is found runs several times, even though I pronounced the word only once). Pocketsphinx differs from cloud solutions in that it works offline and can operate with a limited dictionary, which increases accuracy. If you are interested, there are examples on the library page. Pay attention to the "Default config" section.
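One generic way to work around the repeated triggering I ran into is to debounce the hot-word callback, i.e. ignore repeats that arrive within a short cooldown window. This is my own workaround sketch, not a pocketsphinx feature; the helper name and the injectable `clock` parameter are my additions.

```python
import time

def make_debounced(callback, cooldown=2.0, clock=time.monotonic):
    """Wrap `callback` so that repeated calls within `cooldown` seconds
    of the last accepted call are silently ignored."""
    last_run = [float('-inf')]  # time of the last call that actually ran

    def wrapper(*args, **kwargs):
        now = clock()
        if now - last_run[0] >= cooldown:
            last_run[0] = now
            return callback(*args, **kwargs)
        return None  # suppressed duplicate trigger

    return wrapper
```

In the hot-phrase loop you would then call the debounced wrapper instead of the raw handler, so a hot word that fires several times in quick succession only executes the handler once.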


Russian language and acoustic model


Out of the box, pocketsphinx comes with English acoustic and language models and a dictionary. You can download Russian ones at this link. The archive must be unpacked. Then move the folder <your_folder>/zero_ru_cont_8k_v3/zero_ru.cd_cont_4000 into the folder C:/Users/tutam/AppData/Local/Programs/Python/Python35-32/Lib/site-packages/pocketsphinx/model , where <your_folder> is the folder into which you unpacked the archive. The moved folder is the acoustic model. Do the same with the files ru.lm and ru.dic from the folder <your_folder>/zero_ru_cont_8k_v3/ . The file ru.lm is the language model, and ru.dic is the dictionary. If you did everything correctly, the following code should work.


import os
from pocketsphinx import LiveSpeech, get_model_path

model_path = get_model_path()

speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=os.path.join(model_path, 'ru.lm'),
    dic=os.path.join(model_path, 'ru.dic')
)

print("Say something!")

for phrase in speech:
    print(phrase)

First, check that the microphone is connected and working. If the message Say something! does not appear for a long time, that is normal: most of this time is spent creating the LiveSpeech instance, which takes so long because the Russian language model weighs more than 500 (!) MB. For me, the LiveSpeech instance takes about 2 minutes to create.


This code should recognize almost any phrase you utter. You'll agree, the accuracy is disgusting. But it can be fixed, and the speed of creating LiveSpeech can be increased as well.
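The manual file copying described earlier can also be scripted. A sketch, assuming the archive layout described above; the helper name `install_ru_model` is mine:

```python
import os
import shutil

def install_ru_model(unpacked_dir, model_path):
    """Copy the Russian acoustic model folder plus ru.lm and ru.dic from the
    unpacked zero_ru_cont_8k_v3 archive into pocketsphinx's model directory."""
    src = os.path.join(unpacked_dir, 'zero_ru_cont_8k_v3')
    # The acoustic model is a whole directory; copy it recursively.
    shutil.copytree(os.path.join(src, 'zero_ru.cd_cont_4000'),
                    os.path.join(model_path, 'zero_ru.cd_cont_4000'))
    # The language model and dictionary are single files.
    for name in ('ru.lm', 'ru.dic'):
        shutil.copy(os.path.join(src, name), os.path.join(model_path, name))
```

In practice you would pass `get_model_path()` from pocketsphinx as `model_path`, so the hard-coded site-packages path is not needed.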


JSGF


Instead of a language model, you can make pocketsphinx work with a simplified grammar. A JSGF file is used for this, and using one speeds up the creation of the LiveSpeech instance. How to create grammar files is described here. If a language model is present, the JSGF file is ignored, so if you want to use your own grammar file, you need to write it like this:


speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=False,
    jsgf=os.path.join(model_path, 'grammar.jsgf'),
    dic=os.path.join(model_path, 'ru.dic')
)

Naturally, the grammar file must be created in the folder C:/Users/tutam/AppData/Local/Programs/Python/Python35-32/Lib/site-packages/pocketsphinx/model . One more thing: when using JSGF you will have to speak more clearly and separate the words.
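For reference, a minimal JSGF grammar might look like this. The grammar name and the command words here are only my illustration, not from the downloaded model; every word used in the grammar must also be present in the dictionary.

```jsgf
#JSGF V1.0 UTF-8 ru;

grammar commands;

public <command> = да | нет;
```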


Create your own dictionary


A dictionary is a set of words and their transcriptions; the smaller it is, the higher the recognition accuracy. To create a dictionary with Russian words, use the ru4sphinx project. Download it and unpack it. Then open a text editor and write the words that should be in the dictionary, each on a new line; save the file as my_dictionary.txt, in UTF-8 encoding, into the text2dict folder. Then open the console and run: C:\Users\tutam\Downloads\ru4sphinx-master\ru4sphinx-master\text2dict> perl dict2transcript.pl my_dictionary.txt my_dictionary_out.txt . Open my_dictionary_out.txt and copy its contents. Open a text editor again, paste the copied text, and save the file as my_dict.dic (select "all files" instead of "text file"), in UTF-8 encoding.
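The word-list step can be scripted instead of done by hand in notepad. A sketch; the words here are only an illustration, and the file must be UTF-8 for dict2transcript.pl to read Cyrillic correctly:

```python
# Write the word list that dict2transcript.pl expects:
# one word per line, UTF-8 encoded, no transcriptions yet.
words = ['да', 'нет', 'свет']  # illustrative words, not from the article

with open('my_dictionary.txt', 'w', encoding='utf-8') as f:
    for word in words:
        f.write(word + '\n')
```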


speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=os.path.join(model_path, 'ru.lm'),
    dic=os.path.join(model_path, 'my_dict.dic')
)

Some transcriptions may need to be tweaked.


Using pocketsphinx via speech_recognition


Using pocketsphinx through speech_recognition makes sense only if you are recognizing English speech. In speech_recognition you cannot specify an empty language model and use JSGF, so each fragment will take about 2 minutes to recognize. Verified.


Summary


Having killed a few evenings on this, I realized I had wasted my time. Even with a two-word dictionary (yes and no), sphinx manages to make mistakes, and often. It eats up 30-40% of a Celeron, and with the language model a fat chunk of memory on top. Yandex, meanwhile, recognizes almost any speech accurately while consuming neither memory nor CPU. So decide for yourself whether it is worth bothering with at all.


P.S.: this is my first post, so I am waiting for advice on the design and content of the article.


Which speech recognition solution do you like more?

  • 25.0% sphinx 33
  • 34.1% Yandex Speechkit Cloud 45
  • 29.6% Google Cloud Speech API 39
  • 11.4% Own option 15