Build a Locally Running Voice Assistant
by Sébastien Gilbert | Dec 2023

Ask an LLM a question without leaking private information

Image generated by the author, with help from openart.ai

I have to admit that I was initially skeptical about the ability of Large Language Models (LLMs) to generate code snippets that actually worked. I tried it expecting the worst, and I was pleasantly surprised. As with any interaction with a chatbot, the way the question is formatted matters, but with time, you learn how to specify the boundaries of the problem you need help with.

I was getting used to having an online chatbot service always available while writing code when my employer issued a company-wide policy prohibiting employees from using it. I could have gone back to my old googling habits, but I decided to build a locally running LLM service that I could query without leaking information outside the company walls. Thanks to the open-source LLM offering on HuggingFace and the chainlit project, I could put together a service that satisfies the need for coding assistance.

The next logical step was to add some voice interaction. Although voice is not well-suited for coding assistance (you want to see the generated code snippets, not hear them), there are situations where you need help with inspiration on a creative project. The feeling of being told a story adds value to the experience. On the other hand, you might be reluctant to use an online service because you want to keep your work private.

In this project, I'll take you through the steps to build an assistant that allows you to interact vocally with an open-source LLM. All the components run locally on your computer.

The architecture involves three separate components:

  • A wake-word detection service
  • A voice assistant service
  • A chat service
Flowchart of the three components. Image by the author.

The three components are standalone projects, each with its own GitHub repository. Let's walk through each component and see how they interact.

Chat service

The chat service runs the open-source LLM called HuggingFaceH4/zephyr-7b-alpha. The service receives a prompt through a POST call, passes the prompt to the LLM, and returns the output as the call response.

You can find the code here.

In …/chat_service/server/, rename chat_server_config.xml.example to chat_server_config.xml.

You can then start the chat server with the following command:

python chat_server.py

When the service runs for the first time, it takes several minutes to start because large files get downloaded from the HuggingFace website and saved in a local cache directory.

You get a confirmation from the terminal that the service is running:

Confirmation that the chat service is running. Image by the author.
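Once the server is running, you can exercise the POST interface directly. Here is a minimal client sketch; the port, endpoint path, and JSON field names are my assumptions, so check chat_server_config.xml and the server code for the values your setup actually uses:

import requests

def ask_chat_service(prompt: str, url: str = "http://localhost:5000/chat") -> str:
    # Send the prompt through a POST call and return the LLM output
    reply = requests.post(url, json={"prompt": prompt}, timeout=300)
    reply.raise_for_status()
    return reply.json()["response"]

print(ask_chat_service("Write a Python function that reverses a string."))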

If you want to test the interaction with the LLM, go to …/chat_service/chainlit_interface/.

Rename app_config.xml.example to app_config.xml. Launch the web chat service with

./start_interface.sh

Browse to the local address localhost:8000.

You should be able to interact with your locally running LLM through a text interface:

Text interaction with the locally running LLM. Image by the author.

Voice assistant service

The voice assistant service is where the speech-to-text and text-to-speech conversions happen. You can find the code here.

Go to …/voice_assistant/server/.

Rename voice_assistant_service_config.xml.example to voice_assistant_service_config.xml.

The assistant starts by playing a greeting to indicate that it is listening to the user. The greeting text is configured in voice_assistant_config.xml, under the <welcome_message> element:

The voice_assistant_config.xml file. Image by the author.
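As an illustration, the relevant entries could look like the following sketch. The element names and the example values come from this article, but the surrounding structure is an assumption; refer to the repository for the actual file:

<config>
    <welcome_message>Hey! How can I help you?</welcome_message>
    <goodbye_message>Goodbye</goodbye_message>
    <end_of_conversation_text>thank you and goodbye</end_of_conversation_text>
    <gibberish_prefix_list>
        <prefix>[</prefix>
        <prefix>i'm going to</prefix>
    </gibberish_prefix_list>
</config>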

The text-to-speech engine that allows the program to convert text into spoken audio, which you can hear through your audio output device, is pyttsx3. In my experience, this engine speaks with a reasonably natural tone, both in English and in French. Unlike other packages that rely on an API call, it runs locally.
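Speaking a message with pyttsx3 takes only a few lines; here is a minimal sketch (the rate value is an arbitrary example):

import pyttsx3

# Initialize the local text-to-speech engine (no external API call)
engine = pyttsx3.init()
engine.setProperty("rate", 150)  # speaking speed, in words per minute
engine.say("Hey! How can I help you?")
engine.runAndWait()  # blocks until the audio playback finishes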

A model called facebook/seamless-m4t-v2-large performs the speech-to-text inference. The model weights get downloaded when voice_assistant_service.py is first run.
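Here is a minimal speech-to-text sketch with this model, adapted from the HuggingFace model card; a one-second silent waveform stands in for the microphone capture, which must be mono audio sampled at 16 kHz:

import torch
from transformers import AutoProcessor, SeamlessM4Tv2Model

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

# Replace this stand-in with a real waveform captured from the microphone
waveform = torch.zeros(1, 16000)  # one second of silence at 16 kHz
inputs = processor(audios=waveform, sampling_rate=16000, return_tensors="pt")

# generate_speech=False asks the model for text tokens instead of synthesized audio
output_tokens = model.generate(**inputs, tgt_lang="eng", generate_speech=False)
text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(text)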

The main loop in voice_assistant_service.main() performs the following tasks:

  • Get a sentence from the microphone and convert it to text using the speech-to-text model.
  • Check whether the user spoke the message defined in the <end_of_conversation_text> element of the configuration file. In that case, the conversation ends, and the program terminates after playing the goodbye message.
  • Check whether the sentence is gibberish. The speech-to-text engine often outputs a valid English sentence, even when I didn't say anything. By chance, these unwanted outputs tend to repeat themselves. For example, gibberish sentences will often start with “[” or “i’m going to”. I collected a list of prefixes often associated with a gibberish sentence in the <gibberish_prefix_list> element of the configuration file (this list would likely change for another speech-to-text model). Whenever an audio input starts with one of the prefixes in the list, the sentence is ignored.
  • If the sentence doesn’t appear to be gibberish, send a request to the chat service. Play the response.
The main loop in voice_assistant_service.main(). Code by the author.
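Since the listing above appears as an image, here is a hedged sketch of the loop's structure; the helper names (load_config, play_text, get_sentence_from_microphone, send_prompt_to_chat_service) are hypothetical stand-ins, not the actual functions from the repository:

def main():
    config = load_config("voice_assistant_service_config.xml")  # hypothetical helper
    play_text(config.welcome_message)  # greet the user through pyttsx3

    while True:
        # 1. Capture a sentence and convert it to text with the speech-to-text model
        sentence = get_sentence_from_microphone()

        # 2. End the conversation when the user speaks the configured phrase
        if sentence.lower().startswith(config.end_of_conversation_text):
            play_text(config.goodbye_message)
            break

        # 3. Ignore sentences that start with a known gibberish prefix
        if any(sentence.lower().startswith(prefix) for prefix in config.gibberish_prefix_list):
            continue

        # 4. Otherwise, send the prompt to the chat service and play the response
        response = send_prompt_to_chat_service(sentence)
        play_text(response)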

Wake-word service

The last component is a service that continually listens to the user’s microphone. When the user speaks the wake-word, a system call starts the voice assistant service. The wake-word service runs a smaller model than the voice assistant service models. For this reason, it makes sense to have the wake-word service running continuously while the voice assistant service only launches when we need it.

You can find the wake-word service code here.

After cloning the project, move to …/wakeword_service/server.

Rename wakeword_service_gui_config.xml.example to wakeword_service_gui_config.xml.

Rename command.bat.example to command.bat. You’ll need to edit command.bat so the virtual environment activation and the call to voice_assistant_service.py correspond to your directory structure.
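For illustration, an edited command.bat could look like the following (the paths are placeholders; adapt them to your own installation):

@echo off
REM Activate the virtual environment, then launch the voice assistant service.
call C:\projects\voice_assistant\venv\Scripts\activate.bat
python C:\projects\voice_assistant\server\voice_assistant_service.py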

You can start the service by the following call:

python gui.py

The core of the wake-word detection service is the openwakeword project. Out of a few wake-word models, I picked the “hey jarvis” model. I found that simply saying “Jarvis?” will trigger the detection.
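At its core, the detection amounts to streaming short audio frames through the model and checking the score. Here is a minimal sketch; the model key "hey_jarvis", the 0.5 threshold, and the use of pyaudio for microphone capture are my assumptions, not necessarily the project's choices:

import numpy as np
import pyaudio
from openwakeword.model import Model

# Load the pre-trained "hey jarvis" wake-word model
oww_model = Model(wakeword_models=["hey_jarvis"])

# Stream 16 kHz, 16-bit mono audio from the default microphone
CHUNK = 1280  # 80 ms frames
audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=16000,
                    input=True, frames_per_buffer=CHUNK)

while True:
    frame = np.frombuffer(stream.read(CHUNK), dtype=np.int16)
    scores = oww_model.predict(frame)  # {model_name: score between 0 and 1}
    if max(scores.values()) > 0.5:
        print("Wake-word detected!")  # the service would run command.bat here
        break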

Whenever the wake-word is detected, a command file gets called, as specified in the <command_on_wakeword> element of the configuration file. In our case, the command.bat file activates the virtual environment and starts the voice assistant service.

The configuration file of the wake-word detection service GUI. Image by the author.

If you want to use the wake-word detection service for something else, you can edit the command.bat file to make it start whatever program you want.

When the wake-word is detected, the wake-word detection service emits a two-beep audio signal to indicate it was triggered. This quick feedback tells the user that they must wait as the text-to-speech and the speech-to-text models load in the computer’s GPU, which can take several seconds. When both models are loaded, the voice assistant service plays the greeting, and the conversation can begin. After the conversation ends, the voice assistant service terminates, and the wake-word detection service emits another two-beep signal.

When the three components are installed and tested, we can start the chat and the wake-word detection services. The wake-word detection service will constantly listen to the microphone. When the wake-word gets detected (“Jarvis?”), it will start the voice assistant service. You can then ask questions out loud and receive a spoken answer. When the end-of-conversation phrase (“Thank you and goodbye”) is detected, the voice assistant service ends.

Here is an example of an interaction with the voice assistant:

“Jarvis?” [beep-beep]

[Several seconds of silence as the voice assistant models load]

“Hey! How can I help you?”

“What are Saturn’s rings made of?”

“Saturn’s rings are primarily composed of ice particles, with the majority being made up of water ice. However, there are also smaller amounts of other ices, such as ammonia ice and rocky debris, in the rings. The exact composition of the rings varies depending on the distance from Saturn, with the inner rings being denser and made up of smaller particles, while the outer rings are more spread out and contain larger ice particles.”

“Tell me the plot of the movie ‘Back to the Future’.”

“The movie ‘Back to the Future’ follows the story of Marty McFly, a high school student who is accidentally transported back to 1955 in a time-traveling DeLorean car created by his eccentric scientist friend, Doc Brown. Marty must navigate the challenges of the past while trying to repair the damage to the present and prevent his teenage mother from falling in love with his teenage father. Along the way, Marty befriends a younger version of Doc Brown and helps him overcome a personal tragedy. The movie explores themes of fate, family, and the consequences of changing the past.”

“Thank you and goodbye”

“Goodbye” [beep-beep]

In this project, we implemented a voice assistant system composed of a wake-word detection service, a voice assistant service, and a chat service. Each of these components is a standalone building block, reusable for other projects.

Among the many surprises I had while working on this voice assistant, what struck me the most was the quality of the speech-to-text conversion. If you're like me, you have probably struggled with automated voice recognition systems that fail to transcribe simple commands such as "Turn down the volume"! I expected speech-to-text conversion to be the main stumbling block of the pipeline. After experimenting with a few unsatisfying models, I landed on facebook/seamless-m4t-v2-large and was impressed with the quality of the results. I can even speak a sentence in French, and the neural network automatically translates it into English. Nothing short of amazing!

I hope you'll try this fun project, and let me know what you use it for!
