How to Build Data Pipelines for Machine Learning

We start by importing a few libraries and a secret YouTube API key. If you don't have an API key, you can create one following this guide.

import requests
import json
import polars as pl
from my_sk import my_key

from youtube_transcript_api import YouTubeTranscriptApi
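
Note that my_sk is simply a local Python file that holds the secret API key so it stays out of the main script. A minimal sketch of what such a file might contain (the file name and variable match the import above; the key value is a placeholder):

# my_sk.py - local secrets file, kept out of version control
my_key = "YOUR_YOUTUBE_API_KEY"  # placeholder - replace with your own key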

Next, we will define variables to help us extract video data from the YouTube API. Here, I specify the ID of my YouTube channel and the API URL, initialize page_token, and create a list for storing video records.

# define channel ID
channel_id = 'UCa9gErQ9AE5jT2DZLjXBIdA'

# define url for API
url = 'https://www.googleapis.com/youtube/v3/search'

# initialize page token
page_token = None

# initialize list to store video records
video_record_list = []

The next chunk of code can look scary, so I'll explain what's happening first. We'll perform GET requests to YouTube's search API. This is just like searching for videos on YouTube, but instead of using the UI, we perform the searches programmatically.

Since search results are limited to 50 per page, we need to repeatedly perform searches, page by page, to return every video that matches the search criteria. Here's what that looks like in Python code.

# extract video data across multiple search result pages

while page_token != 0:
    # define parameters for API call
    params = {'key': my_key, 'channelId': channel_id,
              'part': ["snippet", "id"], 'order': "date",
              'maxResults': 50, 'pageToken': page_token}
    # make get request
    response = requests.get(url, params=params)

    # append video records from page results to list
    video_record_list += getVideoRecords(response)

    try:
        # grab next page token
        page_token = json.loads(response.text)['nextPageToken']
    except:
        # if there is no next page token, kill the while loop
        page_token = 0

getVideoRecords() is a user-defined function that extracts the relevant information from an API response.

# extract video data from a single search result page

def getVideoRecords(response: requests.models.Response) -> list:
    """
    Function to extract YouTube video data from a GET request response
    """

    # initialize list to store video records from page results
    video_record_list = []

    for raw_item in json.loads(response.text)['items']:

        # only execute for youtube videos
        if raw_item['id']['kind'] != "youtube#video":
            continue

        # extract relevant data
        video_record = {}
        video_record['video_id'] = raw_item['id']['videoId']
        video_record['datetime'] = raw_item['snippet']['publishedAt']
        video_record['title'] = raw_item['snippet']['title']

        # append record to list
        video_record_list.append(video_record)

    return video_record_list

Now that we have information about all my YouTube videos, let's extract the automatically generated captions. To make the video IDs easier to access, I'll store the video data in a Polars dataframe.

# store data in polars dataframe
df = pl.DataFrame(video_record_list)
print(df.head())
Head of dataframe. Image by author.

To pull the video captions, I'll use the youtube_transcript_api Python library. I'll loop through each video ID in the dataframe and extract the associated transcript.

# initialize list to store video captions
transcript_text_list = []

# loop through each row of dataframe
for i in range(len(df)):

    # try to extract captions
    try:
        # get transcript
        transcript = YouTubeTranscriptApi.get_transcript(df['video_id'][i])
        # extract text transcript
        transcript_text = extract_text(transcript)
    # if no captions available, set as n/a
    except:
        transcript_text = "n/a"

    # append transcript text to list
    transcript_text_list.append(transcript_text)

Again, I use a user-defined function called extract_text() to pull the necessary information out of the API response.

def extract_text(transcript: list) -> str:
    """
    Function to extract text from transcript dictionary
    """

    text_list = [transcript[i]['text'] for i in range(len(transcript))]
    return ' '.join(text_list)

Then we can add the transcripts for each video to the dataframe.

# add transcripts to dataframe
df = df.with_columns(pl.Series(name="transcript", values=transcript_text_list))
print(df.head())
Head of dataframe with transcripts. Image by author.

With the data extracted, we can transform it so it's ready for the downstream use case. This requires some exploratory data analysis (EDA).

Handling duplicates

A good place to start EDA is to examine the number of unique rows and elements in each column. Here, we expect each row to be uniquely identified by the video_id. Additionally, no column should have repeating elements, apart from videos for which no transcript was available, which we set to “n/a”.

Here's some code to probe that information. We can see from the output that the data match our expectations.

# shape + unique values
print("shape:", df.shape)
print("n unique rows:", df.n_unique())
for j in range(df.shape[1]):
    print("n unique elements (" + df.columns[j] + "):", df[:,j].n_unique())

### output
# shape: (84, 4)
# n unique rows: 84
# n unique elements (video_id): 84
# n unique elements (datetime): 84
# n unique elements (title): 84
# n unique elements (transcript): 82
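
Since the unique counts match the number of rows, no deduplication is needed for this dataset. If duplicates had shown up, though, Polars can drop them in one line. A quick sketch of that (assuming we treat video_id as the unique key):

# keep only the first row for each video_id (not needed here, since counts already match)
df = df.unique(subset=['video_id'], keep='first')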

Check dtypes

Next, we can examine the data types of each column. In the image above, we saw that all columns are strings.
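
If you prefer to verify the dtypes programmatically rather than reading them off the printed dataframe, Polars exposes the schema directly; a minimal check on the same df:

# inspect column names and dtypes (all four columns currently show up as strings)
print(df.schema)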

While this is appropriate for the video_id, title, and transcript columns, it isn't a good choice for the datetime column. We can change this type in the following way.

# change datetime to Datetime dtype
df = df.with_columns(pl.col('datetime').cast(pl.Datetime))
print(df.head())
Head of dataframe after updating datetime dtype. Image by author.

Handling special characters

Since we are working with text data, it's important to look out for special character strings. This requires a bit of manual skimming of the text, but after a few minutes, I found 2 special cases: &#39; → ' and &amp; → &

In the code below, I replace these strings with the appropriate characters and change “sha” to “Shaw”.

# list all special strings and their replacements
special_strings = ['&#39;', '&amp;', 'sha ']
special_string_replacements = ["'", "&", "Shaw "]

# replace each special string appearing in the title and transcript columns
for i in range(len(special_strings)):
    df = df.with_columns(df['title'].str.replace(special_strings[i],
                            special_string_replacements[i]).alias('title'))
    df = df.with_columns(df['transcript'].str.replace(special_strings[i],
                            special_string_replacements[i]).alias('transcript'))
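
As an aside, instead of manual skimming, one could surface candidate special strings programmatically by scanning the raw text for HTML-escaped sequences before doing the replacements. A rough sketch of that idea (the regex pattern is my own, not part of the original pipeline):

import re

# collect any HTML-escaped sequences (e.g. &#39; or &amp;) found in titles and transcripts
candidate_strings = set()
for text in df['title'].to_list() + df['transcript'].to_list():
    candidate_strings.update(re.findall(r'&#?\w+;', text))
print(candidate_strings)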

Since the dataset here is very small (84 rows and 4 columns, ~900k characters), we can store the data directly in the project directory. This can be done in a single line of code using the write_parquet() method in Polars. The final file size is 341 KB.

# write data to file
df.write_parquet('data/video-transcripts.parquet')
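
Downstream steps can then load the data back just as easily; a usage sketch using the same file path written above:

# read the transcripts back into a dataframe
df = pl.read_parquet('data/video-transcripts.parquet')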

Here, we discussed the basics of building data pipelines in the context of Full Stack Data Science and walked through a concrete example using real-world data.

In the next article of this series, we will continue going down the data science tech stack and discuss how we can use this data pipeline to develop a semantic search system for my YouTube videos.

More on this series 👇
