Speech to Text to Speech with AI Using Python — a How-To Guide | by Naomi Kriger | Feb, 2024


How to Create a Speech-to-Text-to-Speech Program

Photo by Mariia Shalabaieva on Unsplash

It’s been exactly a decade since I started attending GeekCon (yes, a geeks’ conference 🙂) — a weekend-long hackathon-makeathon in which all projects must be useless and just-for-fun — and this year there was an exciting twist: all projects were required to incorporate some form of AI.

My team’s project was a speech-to-text-to-speech game, and here’s how it works: the user selects a character to talk to, and then verbally says anything they’d like to the character. This spoken input is transcribed and sent to ChatGPT, which responds as if it were the character. The response is then read aloud using text-to-speech technology.

Now that the game is up and running, bringing laughs and fun, I’ve crafted this how-to guide to help you create a similar game on your own. Throughout the article, we’ll also explore the various considerations and decisions we made during the hackathon.

Want to see the full code? Here is the link!

Once the server is running, the user will hear the app “speaking”, prompting them to choose the figure they want to talk to and start conversing with their chosen character. Whenever they want to talk out loud, they should press and hold a key on the keyboard while speaking. When they finish talking (and release the key), their recording will be transcribed by Whisper (a speech-to-text model by OpenAI), and the transcription will be sent to ChatGPT for a response. The response will be read out loud using a text-to-speech library, and the user will hear it.

Disclaimer

Note: The project was developed on a Windows operating system and incorporates the pyttsx3 library, which lacks compatibility with M1/M2 chips. As pyttsx3 isn’t supported on Mac, users are advised to explore alternative text-to-speech libraries that are compatible with macOS environments.

OpenAI Integration

I used two OpenAI models: Whisper, for speech-to-text transcription, and the ChatGPT API for generating responses based on the user’s input to their chosen figure. While doing so costs money, the pricing model is very cheap, and personally, my bill is still under $1 for all my usage. To get started, I made an initial deposit of $5; so far I haven’t exhausted it, and this initial deposit won’t expire until a year from now.
I’m not receiving any payment or benefits from OpenAI for writing this.

Once you get your OpenAI API key, set it as an environment variable to use when making the API calls. Make sure not to push your key to the codebase or any public location, and not to share it unsafely.
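A small helper makes the "fail fast if the key is missing" pattern explicit. This is a minimal sketch (the helper name is my own, not from the repo):

```python
import os

def load_api_key() -> str:
    # Fail fast with a clear message instead of sending unauthenticated requests.
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
    return key
```

Reading the key from the environment keeps it out of version control and lets each developer (or CI job) supply their own credentials.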

Speech to Text — Create Transcription

The speech-to-text feature was implemented using Whisper, an OpenAI model.

Below is the code snippet for the function responsible for transcription:

import asyncio
import os
from threading import Thread
from typing import Optional

import openai

async def get_transcript(audio_file_path: str,
                         text_to_draw_while_waiting: str) -> Optional[str]:
    openai.api_key = os.environ.get("OPENAI_API_KEY")
    audio_file = open(audio_file_path, "rb")
    transcript = None

    async def transcribe_audio() -> None:
        nonlocal transcript
        try:
            response = openai.Audio.transcribe(
                model="whisper-1", file=audio_file, language="en")
            transcript = response.get("text")
        except Exception as e:
            print(e)

    # Pass the callable and its argument separately so the thread runs it,
    # rather than calling it here and handing Thread its return value.
    draw_thread = Thread(target=print_text_while_waiting_for_transcription,
                         args=(text_to_draw_while_waiting,))
    draw_thread.start()

    transcription_task = asyncio.create_task(transcribe_audio())
    await transcription_task

    if transcript is None:
        print("Transcription not available within the specified timeout.")

    return transcript

This function is marked as asynchronous (async) since the API call may take a while to return a response, and we await it to ensure that the program doesn’t progress until the response is received.

As you can see, the get_transcript function also invokes the print_text_while_waiting_for_transcription function. Why? Since obtaining the transcription is a time-consuming task, we wanted to keep the user informed that the program is actively processing their request and not stuck or unresponsive. As a result, this text is gradually printed while the user awaits the next step.
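The body of print_text_while_waiting_for_transcription lives in the repo and isn’t reproduced in this article; a minimal stand-in (the exact animation style is my assumption) could look like this:

```python
import sys
import time

def print_waiting_indicator(text: str, cycles: int = 6,
                            interval: float = 0.3) -> None:
    # Re-print the message with a growing trail of dots so the user can
    # see the program is still working on the transcription.
    for i in range(cycles):
        dots = "." * (i % 3 + 1)
        sys.stdout.write(f"\r{text}{dots}   ")
        sys.stdout.flush()
        time.sleep(interval)
    sys.stdout.write("\n")
```

Because it runs in a separate thread, the animation keeps updating while the async transcription call is awaited.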

String Matching Using FuzzyWuzzy for Text Comparison

After transcribing the speech into text, we either used it as is, or tried to compare it with an existing string.

The comparison use cases were: selecting a figure from a predefined list of options, deciding whether to continue playing or not, and, when opting to continue, deciding whether to choose a new figure or stick with the current one.

In such cases, we wanted to compare the user’s spoken input transcription with the options in our lists, and therefore we decided to use the FuzzyWuzzy library for string matching.

This enabled choosing the closest option from the list, as long as the matching score exceeded a predefined threshold.

Here’s a snippet of our function:

from typing import List

from fuzzywuzzy import fuzz

def detect_chosen_option_from_transcript(
        transcript: str, options: List[str]) -> str:
    best_match_score = 0
    best_match = ""

    for option in options:
        score = fuzz.token_set_ratio(transcript.lower(), option.lower())
        if score > best_match_score:
            best_match_score = score
            best_match = option

    if best_match_score >= 70:
        return best_match
    else:
        return ""

If you want to learn more about the FuzzyWuzzy library and its functions, you can check out an article I wrote about it here.
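If you’d rather avoid the extra dependency, Python’s standard-library difflib offers a rough substitute. Note that SequenceMatcher compares whole strings, so unlike token_set_ratio it won’t reward an option buried inside a longer utterance; this is only a sketch, not the project’s actual approach:

```python
import difflib
from typing import List

def detect_option_stdlib(transcript: str, options: List[str]) -> str:
    # Rough stand-in for fuzz.token_set_ratio, scored on a 0-100 scale.
    best_match, best_score = "", 0.0
    for option in options:
        score = difflib.SequenceMatcher(
            None, transcript.lower(), option.lower()).ratio() * 100
        if score > best_score:
            best_match, best_score = option, score
    return best_match if best_score >= 70 else ""
```

For short, direct answers ("yes", "Shrek") the two behave similarly; for longer sentences FuzzyWuzzy’s token-based scoring is noticeably more forgiving.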

Get ChatGPT Response

Once we have the transcription, we can send it over to ChatGPT to get a response.

For each ChatGPT request, we added a prompt asking for a short and funny response. We also told ChatGPT which figure to pretend to be.

So our function looked as follows:

import logging

def get_gpt_response(transcript: str, chosen_figure: str) -> str:
    system_instructions = get_system_instructions(chosen_figure)
    try:
        return make_openai_request(
            system_instructions=system_instructions,
            user_question=transcript).choices[0].message["content"]
    except Exception as e:
        logging.error(f"could not get ChatGPT response. error: {str(e)}")
        raise e

and the system instructions looked as follows:

def get_system_instructions(figure: str) -> str:
    return f"You provide funny and short answers. You are: {figure}"

Text to Speech

For the text-to-speech part, we opted for a Python library called pyttsx3. This choice was not only simple to implement but also offered several additional advantages. It’s free of charge, provides two voice options — female and male — and lets you select the speaking rate in words per minute (speech speed).

When a user starts the game, they pick a character from a predefined list of options. If we couldn’t find a match for what they said within our list, we’d randomly select a character from our “fallback figures” list. In both lists, each character was associated with a gender, so our text-to-speech function also received the voice ID corresponding to the selected gender.
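The Gender enum referenced by the snippet below isn’t reproduced in the article; here is a minimal sketch of what it might look like — the figure-to-gender mapping is a made-up example, and the real character lists live in the repo:

```python
from enum import Enum

class Gender(Enum):
    FEMALE = "female"
    MALE = "male"

# Hypothetical mapping for illustration only.
FIGURE_GENDERS = {"Shrek": Gender.MALE.value, "Oprah Winfrey": Gender.FEMALE.value}

def voice_gender_for(figure: str) -> str:
    # Fall back to the female voice when a figure isn't in the mapping.
    return FIGURE_GENDERS.get(figure, Gender.FEMALE.value)
```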

This is what our text-to-speech function looked like:

import pyttsx3

def text_to_speech(text: str, gender: str = Gender.FEMALE.value) -> None:
    engine = pyttsx3.init()

    engine.setProperty("rate", WORDS_PER_MINUTE_RATE)
    voices = engine.getProperty("voices")
    voice_id = voices[0].id if gender == "male" else voices[1].id
    engine.setProperty("voice", voice_id)

    engine.say(text)
    engine.runAndWait()

The Main Flow

Now that we’ve more or less got all the pieces of our app in place, it’s time to dive into the gameplay! The main flow is outlined below. You might notice some functions we haven’t delved into (e.g. choose_figure, play_round), but you can explore the full code by checking out the repo. Eventually, most of these higher-level functions tie into the internal functions we’ve covered above.

Here’s a snippet of the main game flow:

import asyncio

from src.handle_transcript import text_to_speech
from src.main_flow_helpers import (choose_figure, start, play_round,
                                   is_another_round)

def farewell() -> None:
    farewell_message = ("It was great having you here, "
                        "hope to see you again soon!")
    print(f"\n{farewell_message}")
    text_to_speech(farewell_message)

async def get_round_settings(figure: str) -> dict:
    new_round_choice = await is_another_round()
    if new_round_choice == "new figure":
        return {"figure": "", "another_round": True}
    elif new_round_choice == "no":
        return {"figure": "", "another_round": False}
    elif new_round_choice == "yes":
        return {"figure": figure, "another_round": True}

async def main():
    start()
    another_round = True
    figure = ""

    while True:
        if not figure:
            figure = await choose_figure()

        while another_round:
            await play_round(chosen_figure=figure)
            user_choices = await get_round_settings(figure)
            figure, another_round = (user_choices.get("figure"),
                                     user_choices.get("another_round"))
            if not figure:
                break

        if another_round is False:
            farewell()
            break

if __name__ == "__main__":
    asyncio.run(main())

We had several ideas in mind that we didn’t get to implement during the hackathon, either because we didn’t find an API we were happy with during that weekend, or because time constraints prevented us from developing certain features. These are the paths we didn’t take for this project:

Matching the Response Voice with the Chosen Figure’s “Actual” Voice

Imagine if the user chose to talk to Shrek, Trump, or Oprah Winfrey. We wanted our text-to-speech library or API to articulate responses using voices that matched the chosen figure. However, we couldn’t find a library or API during the hackathon that offered this feature at a reasonable cost. We’re still open to suggestions if you have any =)

Let the Users Talk to “Themselves”

Another intriguing idea was to prompt users to provide a vocal sample of themselves speaking. We would then train a model using this sample and have all the responses generated by ChatGPT read aloud in the user’s own voice. In this scenario, the user could choose the tone of the responses (affirmative and supportive, sarcastic, angry, etc.), but the voice would closely resemble that of the user. However, we couldn’t find an API that supported this within the constraints of the hackathon.

Adding a Frontend to Our Application

Our initial plan was to include a frontend component in our application. However, due to a last-minute change in the number of participants in our group, we decided to prioritize the backend development. As a result, the application currently runs on the command line interface (CLI) and doesn’t have a frontend.

Latency is what bothers me most at the moment.

There are several components in the flow with a relatively high latency that, in my opinion, slightly hurt the user experience. For example: the time between finishing providing the audio input and receiving a transcription, and the time between the user pressing a button and the system actually starting to record the audio. So if the user starts talking right after pressing the key, there will be at least one second of audio that won’t be recorded due to this lag.

Want to see the whole project? It’s right here!

Also, warm credit goes to Lior Yardeni, my hackathon partner with whom I created this game.

In this article, we learned how to create a speech-to-text-to-speech game using Python, and intertwined it with AI. We used the Whisper model by OpenAI for speech recognition, played around with the FuzzyWuzzy library for text matching, tapped into ChatGPT’s conversational magic via their developer API, and brought it all to life with pyttsx3 for text-to-speech. While OpenAI’s services (Whisper and ChatGPT for developers) do come with a modest cost, it’s budget-friendly.

We hope you’ve found this guide enlightening and that it motivates you to embark on your own projects.

Cheers to coding and enjoyable! 🚀
