A Beginner's Guide to Building Knowledge Graphs from Videos | by Mohammed Mohammed | Jan, 2024


Build a pipeline to analyze and store the data within videos.

Before diving into the technical side of the article, let's set the context and answer the question you might have: what is a knowledge graph?

To answer this, imagine that instead of storing information in cabinets, we store it in a fabric web. Every fact, concept, piece of information about people, places, events, and even abstract ideas is a knot, and the line connecting them together is the relationship they have with one another. This intricate web, my friends, is the essence of a knowledge graph.

Photo by Shubham Dhage on Unsplash

Think of it like a bustling city map, not just showing streets but revealing the connections between landmarks, parks, and shops. Similarly, a knowledge graph doesn't just store cold facts; it captures the rich tapestry of how things are linked. For example, you might learn that Marie Curie discovered radium, then follow a thread to see that radium is used in medical treatments, which in turn connect to hospitals and cancer research. See how one fact effortlessly leads to another, painting a bigger picture?
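To make the knot-and-thread picture concrete, here is a tiny sketch of those Marie Curie facts as directed triples in Python; the entity and relation names are just illustrative:

import networkx as nx

# Illustrative knowledge-graph triples from the Marie Curie example
facts = [
    ("Marie Curie", "discovered", "radium"),
    ("radium", "used in", "medical treatments"),
    ("medical treatments", "offered at", "hospitals"),
]

G = nx.DiGraph()
for source, relation, target in facts:
    G.add_edge(source, target, label=relation)

# Follow the thread from one fact to the next
print(list(G.successors("Marie Curie")))  # ['radium']
print(list(G.successors("radium")))       # ['medical treatments']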

So why is this map-like way of storing knowledge so popular? Well, imagine searching for information online. Traditional methods often leave you with isolated bits and pieces, like finding only buildings on a map without knowing the streets that connect them. A knowledge graph, however, takes you on a journey, guiding you from one fact to another, like having a friendly guide whisper fascinating stories behind every corner of the information world. Fascinating, right? I know.

Since I discovered this magic, it captured my attention, and I explored and played around with many potential applications. In this article, I'll show you how to build a pipeline that extracts audio from video, transcribes that audio, and, from the transcription, builds a knowledge graph, allowing for a more nuanced and interconnected representation of the information within the video.

I will be using Google Drive to upload the video sample. I will also use Google Colab to write the code, and finally, you need access to the GPT Plus API for this project. I'll break this down into steps to make it clear and easy for beginners:

  • Setting up everything.
  • Extracting audio from video.
  • Transcribing audio to text.
  • Building the knowledge graph.

By the end of this article, you'll construct a graph with the following schema.

Image by the author

Let's dive right into it!

As mentioned, we will be using Google Drive and Colab. In the first cell, let's connect Google Drive to Colab and create our directory folders (video_files, audio_files, text_files). The following code can get this done. (If you want to follow along with the code, I've uploaded all the code for this project on GitHub; you can access it from here.)

# Installing required libraries
!pip install pydub
!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install ffmpeg
!pip install networkx matplotlib
!pip install openai
!pip install requests

# Connecting Google Drive to import video samples
from google.colab import drive
import os

drive.mount('/content/drive')

video_files = '/content/drive/My Drive/video_files'
audio_files = '/content/drive/My Drive/audio_files'
text_files = '/content/drive/My Drive/text_files'

folders = [video_files, audio_files, text_files]
for folder in folders:
    # Check if the output folder exists
    if not os.path.exists(folder):
        # If not, create the folder
        os.makedirs(folder)

Or you can create the folders manually and upload your video sample to the "video_files" folder, whichever is easier for you.

Now we have our three folders, with a video sample in the "video_files" folder to test the code.

The next thing we want to do is import our video and extract the audio from it. We can use the Pydub library, a high-level audio processing library that can help us do this. Let's see the code and then explain it below.

from pydub import AudioSegment

# Extract audio from videos
for video_file in os.listdir(video_files):
    if video_file.endswith('.mp4'):
        video_path = os.path.join(video_files, video_file)
        audio = AudioSegment.from_file(video_path, format="mp4")

        # Save audio as WAV
        audio.export(os.path.join(audio_files, f"{video_file[:-4]}.wav"), format="wav")

After installing our package pydub, we imported the AudioSegment class from the Pydub library. Then, we created a loop that iterates through all the video files in the "video_files" folder we created earlier and passes each file to AudioSegment.from_file to load the audio from the video file. The loaded audio is then exported as a WAV file using audio.export and saved in the specified "audio_files" folder with the same name as the video file but with the extension .wav.

At this point, you can go to the "audio_files" folder in Google Drive, where you will see the extracted audio.

In the third step, we will transcribe the audio file we have into text and save it as a .txt file in the "text_files" folder. Here I used the Whisper ASR (Automatic Speech Recognition) system from OpenAI to do this. I used it because it's easy and fairly accurate; besides, it has different models for different accuracy levels. But the more accurate the model, the larger it is and the slower it loads, hence I will be using the medium one just for demonstration. To make the code cleaner, let's create a function that transcribes the audio and then use a loop to apply the function to all the audio files in our directory.

import re
import subprocess

# Function to transcribe an audio file and save the output in a txt file
def transcribe_and_save(audio_file_path, text_files, model='medium.en'):
    # Construct the Whisper command
    whisper_command = f"whisper '{audio_file_path}' --model {model}"
    # Run the Whisper command
    transcription = subprocess.check_output(whisper_command, shell=True, text=True)

    # Clean and join the sentences
    output_without_time = re.sub(r'\[\d+:\d+\.\d+ --> \d+:\d+\.\d+\] ', '', transcription)
    sentences = [line.strip() for line in output_without_time.split('\n') if line.strip()]
    joined_text = ' '.join(sentences)

    # Create the corresponding text file name
    audio_file_name = os.path.basename(audio_file_path)
    text_file_name = os.path.splitext(audio_file_name)[0] + '.txt'
    file_path = os.path.join(text_files, text_file_name)

    # Save the output as a txt file
    with open(file_path, 'w') as file:
        file.write(joined_text)

    print(f'Text for {audio_file_name} has been saved to: {file_path}')

# Transcribing all the audio files in the directory
for audio_file in os.listdir(audio_files):
    if audio_file.endswith('.wav'):
        audio_file_path = os.path.join(audio_files, audio_file)
        transcribe_and_save(audio_file_path, text_files)

Libraries Used:

  • os: Provides a way of interacting with the operating system, used for handling file paths and names.
  • re: Regular expression module for pattern matching and substitution.
  • subprocess: Allows spawning additional processes, used here to execute the Whisper ASR system from the command line.

We created a Whisper command and saved it as a variable to facilitate the process. After that, we used subprocess.check_output to run the Whisper command and save the resulting transcription in the transcription variable. But the transcription at this point is not clean (you can check by printing the transcription variable outside the function; it has timestamps and some lines that aren't relevant to the transcription), so we added cleaning code that removes the timestamps using re.sub and joins the sentences together. After that, we created a text file within the "text_files" folder with the same name as the audio and saved the cleaned transcription in it.
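If you're curious what the cleaning step actually removes, here is a quick illustration on a made-up line in the timestamped format Whisper prints:

import re

# A made-up Whisper output line (timestamp + text)
raw_line = "[00:00.000 --> 00:04.500]  Welcome to this video about knowledge graphs."
cleaned = re.sub(r'\[\d+:\d+\.\d+ --> \d+:\d+\.\d+\] ', '', raw_line)
print(cleaned.strip())  # Welcome to this video about knowledge graphs.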

Now if you go to the "text_files" folder, you can see the text file that contains the transcription. Whoa, step 3 done successfully! Congratulations!

This is the essential part, and maybe the longest. I'll follow a modular approach with five functions to handle this task. But before that, let's begin with the libraries and modules necessary for making HTTP requests (requests), handling JSON (json), working with data frames (pandas), and creating and visualizing graphs (networkx and matplotlib), and set the global constants, which are variables used throughout the code: API_ENDPOINT is the endpoint for OpenAI's API, api_key is where the OpenAI API key will be stored, and prompt_text will store the text used as input for the OpenAI prompt. All of this is done in this code:

import requests
import json
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# Global constants: API endpoint, API key, prompt text
API_ENDPOINT = "https://api.openai.com/v1/completions"
api_key = "your_openai_api_key_goes_here"
prompt_text = """Given a prompt, extrapolate as many relationships as possible from it and provide a list of updates.
If an update is a relationship, provide [ENTITY 1, RELATIONSHIP, ENTITY 2]. The relationship is directed, so the order matters.
Example:
prompt: Sun is the source of solar energy. It is also the source of Vitamin D.
updates:
[["Sun", "source of", "solar energy"],["Sun","source of", "Vitamin D"]]
prompt: $prompt
updates:"""

Then let's proceed with breaking down the structure of our functions:

The first function is create_graph(). The task of this function is to create a graph visualization using the networkx library. It takes a DataFrame df and a dictionary of edge labels rel_labels (which will be created in the next function) as input. Then, it uses the DataFrame to create a directed graph and visualizes it using matplotlib with some customization, outputting the beautiful graph we need.

# Graph creation function
def create_graph(df, rel_labels):
    G = nx.from_pandas_edgelist(df, "source", "target",
                                edge_attr=True, create_using=nx.MultiDiGraph())
    plt.figure(figsize=(12, 12))

    pos = nx.spring_layout(G)
    nx.draw(G, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
    nx.draw_networkx_edge_labels(
        G,
        pos,
        edge_labels=rel_labels,
        font_color='red'
    )
    plt.show()

The DataFrame df and the edge labels rel_labels are the output of the next function, preparing_data_for_graph(). This function takes the OpenAI api_response (which will be created by the following function) as input and extracts the entity-relation triples (source, target, edge) from it. Since the response arrives already parsed as JSON, we only need the json module to decode the completion text into a list of triples, then filter out elements that have missing data. After that, we build a knowledge base DataFrame kg_df from the triples and, finally, create a dictionary (relation_labels) mapping pairs of nodes to their corresponding edge labels, and of course return the DataFrame and the dictionary.

# Data preparation function
def preparing_data_for_graph(api_response):
    # Extract the list of [source, relation, target] triples from the completion text
    entity_relation_lst = json.loads(api_response["choices"][0]["text"])
    # Filter out elements with missing data
    entity_relation_lst = [x for x in entity_relation_lst if len(x) == 3]
    source = [i[0] for i in entity_relation_lst]
    target = [i[2] for i in entity_relation_lst]
    relations = [i[1] for i in entity_relation_lst]

    kg_df = pd.DataFrame({'source': source, 'target': target, 'edge': relations})
    relation_labels = dict(zip(zip(kg_df.source, kg_df.target), kg_df.edge))
    return kg_df, relation_labels
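Before spending API credits, you can sanity-check this function by feeding it a mocked response shaped like the completions payload; the values below are made up:

# A mocked API response with illustrative triples
mock_response = {
    "choices": [
        {"text": '[["Sun", "source of", "solar energy"], ["Sun", "source of", "Vitamin D"]]'}
    ]
}

df, rel_labels = preparing_data_for_graph(mock_response)
print(df)          # two rows with source, target, edge columns
print(rel_labels)  # {('Sun', 'solar energy'): 'source of', ('Sun', 'Vitamin D'): 'source of'}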

The third function is call_gpt_api(), which is responsible for making a POST request to the OpenAI API and returning the api_response. Here we construct the data payload with model information, the prompt, and other parameters like the model (in this case: gpt-3.5-turbo-instruct), max_tokens, stop, and temperature. Then we send the request using requests.post and return the parsed response. I've also included simple error handling to print an error message in case an exception occurs. The try block contains the code that may raise an exception from the request during execution, so if an exception occurs during this process (for example, due to network issues, API errors, and so on), the code within the except block will be executed.

# OpenAI API call function
def call_gpt_api(api_key, prompt_text):
    global API_ENDPOINT
    try:
        data = {
            "model": "gpt-3.5-turbo-instruct",
            "prompt": prompt_text,
            "max_tokens": 3000,
            "stop": "\n",
            "temperature": 0
        }
        headers = {"Content-Type": "application/json", "Authorization": "Bearer " + api_key}
        r = requests.post(url=API_ENDPOINT, headers=headers, json=data)
        response_data = r.json()  # Parse the response as JSON
        print("Response content:", response_data)
        return response_data
    except Exception as e:
        print("Error:", e)

The next-to-last function is the main() function, which orchestrates the main flow of the script. First, it reads the text file contents from the "text_files" folder we filled earlier and saves them in the variable kb_text. It then takes the global prompt template prompt_text and replaces its placeholder ($prompt) with the text file content kb_text, keeping the template itself intact so it can be reused for other files. It then calls the call_gpt_api() function, giving it the api_key and the filled prompt to get the OpenAI API response. The response is passed to preparing_data_for_graph() to prepare the data and get the DataFrame and the edge labels dictionary; finally, these two values are passed to the create_graph() function to build the knowledge graph.

# Main function
def main(text_file_path, api_key):
    with open(text_file_path, 'r') as file:
        kb_text = file.read()

    # Fill in the $prompt placeholder, keeping the global template intact
    # so it can be reused for other text files
    filled_prompt = prompt_text.replace("$prompt", kb_text)

    api_response = call_gpt_api(api_key, filled_prompt)
    df, rel_labels = preparing_data_for_graph(api_response)
    create_graph(df, rel_labels)

Finally, we have the start() function, which iterates through all the text files in our "text_files" folder (in case we have more than one), gets the name and path of each file, and passes it along with the api_key to the main() function to do its job.

# Start function
def start():
    for filename in os.listdir(text_files):
        if filename.endswith(".txt"):
            # Construct the full path to the text file
            text_file_path = os.path.join(text_files, filename)
            main(text_file_path, api_key)
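With everything defined, running the whole pipeline over the transcripts is a single call:

# Build a knowledge graph for every transcript in text_files
start()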

If you have correctly followed the steps, after running the start() function, you should see a similar visualization.

Image by the author

You can, of course, save this knowledge graph in a Neo4j database and take it further.
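As a starting point, here is a minimal sketch of how the kg_df triples returned by preparing_data_for_graph() could be pushed into Neo4j with the official Python driver. The URI, credentials, and label names are placeholders to adapt to your own setup; since Cypher cannot parameterize relationship types, the relation is stored as a property on a generic RELATES_TO edge:

from neo4j import GraphDatabase

# Placeholder connection details for your own Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def save_triple(tx, source, edge, target):
    # MERGE keeps nodes and relationships unique across re-runs
    tx.run(
        "MERGE (a:Entity {name: $source}) "
        "MERGE (b:Entity {name: $target}) "
        "MERGE (a)-[:RELATES_TO {label: $edge}]->(b)",
        source=source, target=target, edge=edge,
    )

with driver.session() as session:
    for _, row in kg_df.iterrows():
        session.execute_write(save_triple, row["source"], row["edge"], row["target"])

driver.close()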

NOTE: This workflow ONLY applies to videos you own or whose terms allow this kind of download/processing.

Knowledge graphs use semantic relationships to represent data, enabling a more nuanced and context-aware understanding. This semantic richness allows for more sophisticated querying and analysis, as the relationships between entities are explicitly defined.
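To see what that buys you in practice, here is a small illustrative sketch: with the kg_df triples from earlier, you can rebuild the graph and ask directed, relationship-aware questions (the "Sun" node is just an example name from the prompt):

import networkx as nx

# Rebuild a directed graph from the triples DataFrame
G = nx.from_pandas_edgelist(kg_df, "source", "target",
                            edge_attr="edge", create_using=nx.MultiDiGraph())

# Direct relationships: what does "Sun" point to, and via which edge?
for _, neighbor, attrs in G.out_edges("Sun", data=True):
    print(f"Sun --[{attrs['edge']}]--> {neighbor}")

# Multi-hop context: every entity reachable from "Sun" within two hops
print(nx.single_source_shortest_path_length(G, "Sun", cutoff=2))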

In this article, I outlined detailed steps for building a pipeline that involves extracting audio from videos, transcribing that audio with OpenAI's Whisper ASR, and crafting a knowledge graph. As someone interested in this field, I hope this article makes the topic easier for beginners to understand, demonstrating the potential and versatility of knowledge graph applications.

And as always, the complete code is available on GitHub.
