Dune — A Hidden Network. In this article, with Patrik Szigeti… | by Milan Janosov | Mar, 2024

Following the success of Dune both at the box office and with the critics in 2021, Dune: Part Two was one of the most anticipated movies of 2024, and it didn't disappoint. On track to earn more, and holding higher scores on both Rotten Tomatoes and IMDb than its prequel at the time of writing, with its ever-changing political landscape, Dune is the perfect franchise to dive into through network science. In this short piece, we aim to explore the connections between the different Houses and people of the Imperium based on the first three books of Frank Herbert: Dune (1965), Dune Messiah (1969), and Children of Dune (1976).

In the first part of this article, we present a Python-based approach to collecting character profile data from the Dune Wiki and turning these profiles into a catchy network graph. Then, in the second, rather spoiler-heavy section, we dive into the depths of the network and extract all the stories it has to tell about the first Dune trilogy.

All images were created by the authors.

First, we use Python to collect the full list of Dune characters. Then, we download their biography profiles from each character's fan wiki site and count the number of times each character's story mentions any other character's story, assuming these mentions encode various interactions between any two characters. Finally, we use network science to turn these relationships into a complex graph.

1.1 Collecting the list of characters

First off, we collected the list of all relevant characters from the Dune fan wiki site. In particular, we used urllib and bs4 to extract the names and fan wiki IDs of every character that is mentioned and has its own wiki page, encoded by its ID. We did this for the first three books: Dune, Dune Messiah and Children of Dune. These three books cover the rise of the House of Atreides.

Sources:

First, we download the character listing site's HTML:

import os
import bs4 as bs
from urllib.request import urlopen

dune_meta = {
    'Dune': {'url': 'https://dune.fandom.com/wiki/Dune_(novel)'},
    'Dune Messiah': {'url': 'https://dune.fandom.com/wiki/Dune_Messiah'},
    'Children of Dune': {'url': 'https://dune.fandom.com/wiki/Children_of_Dune_(novel)'}
}

# download each book's character listing page and collect all list items
for book, url in dune_meta.items():
    sauce = urlopen(url['url']).read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    dune_meta[book]['chars'] = soup.find_all('li')

A little manual help to fine-tune the character names and IDs:

dune_meta['Dune']['char_start'] = 'Abulurd'
dune_meta['Dune']['char_end'] = 'Arrakis'
dune_meta['Dune Messiah']['char_start'] = 'Abumojandis'
dune_meta['Dune Messiah']['char_end'] = 'Arrakis'
dune_meta['Children of Dune']['char_start'] = '2018 Edition'
dune_meta['Children of Dune']['char_end'] = 'Categories'

Then, we extracted all the potentially relevant names and the corresponding profile URLs. Here, we manually checked from which tag blocks the names start (e.g., as opposed to the outline of the character listing site). Additionally, we decided to drop the characters marked by 'XD' and 'DE', corresponding to the extended series, as well as characters that were "Mentioned only" in a given book:

for k, v in dune_meta.items():
    names_urls = {}
    keep_row = False
    print(f'----- {k} -----')
    for char in v['chars']:
        # only keep list items between the first and the last character entry
        if v['char_start'] in char.text.strip():
            keep_row = True
        if v['char_end'] in char.text.strip():
            keep_row = False
        if keep_row and 'Video' not in char.text:
            try:
                url = 'https://dune.fandom.com' + str(char).split('href="')[1].split('" title')[0]
                name = char.text.strip()
                if 'wiki' in url and 'XD' not in name and 'DE' not in name and '(Mentioned only)' not in name:
                    names_urls[name] = url
                    print(name)
            except:
                pass
    dune_meta[k]['names_urls'] = names_urls

This code block then outputs the list of characters, such as:

Example of extracted names.

Finally, we check the number of characters we collected and save their profile URLs and identifiers for the next subchapter.

dune_names_urls = {}
for k, v in dune_meta.items():
    dune_names_urls.update(dune_meta[k]['names_urls'])

# map each character name to its wiki identifier (the last part of the URL)
names_ids = {n : u.split('/')[-1] for n, u in dune_names_urls.items()}

print(len(dune_names_urls))

The output of this cell, showing 119 characters with profile URLs:

1.2 Downloading character profiles

Our goal is to map out the social network of the Dune characters, which means that we need to figure out who interacted with whom. In the previous subchapter, we obtained the list of all the 'whoms', and now we will get the information about their personal stories. We obtain these stories using simple web scraping techniques again, and then save the source of each character's personal site in a separate file locally:

# output folder for the profile htmls
folderout = 'fandom_profiles'
if not os.path.exists(folderout):
    os.makedirs(folderout)

# crawl and save the profile htmls, skipping already downloaded ones
for ind, (name, url) in enumerate(dune_names_urls.items()):
    if not os.path.exists(folderout + '/' + name + '.html'):
        try:
            with open(folderout + '/' + name + '.html', 'w') as fout:
                fout.write(str(urlopen(url).read()))
        except:
            pass

The result of running this code is a folder in our local directory containing the fan wiki profile of every selected character.
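As a quick sanity check (our addition, not part of the original pipeline), we can count the saved files to verify that the crawl covered the characters we collected:

# sanity check: count the profile htmls saved to the local folder
print(len([fn for fn in os.listdir(folderout) if fn.endswith('.html')]))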

1.3 Building the network

To build the network between characters, we count the number of times each character's wiki page source references any other character's wiki identifier, using the following logic. Here, we build up the edge list: the list of connections containing the source and target nodes (characters) of each connection, as well as the weight (co-reference frequency) between the two characters' pages.

# extract the name mentions from the html sources
# and build the list of edges in a dictionary
edges = {}

for fn in [fn for fn in os.listdir(folderout) if '.html' in fn]:

    name = fn.split('.html')[0]

    with open(folderout + '/' + fn) as myfile:
        text = myfile.read()
        soup = bs.BeautifulSoup(text, 'lxml')
        # keep only the body paragraphs of the profile page
        text = ' '.join([str(a) for a in soup.find_all('p')[2:]])

        for n, i in names_ids.items():
            # count the links to the other character's wiki id,
            # ignoring everything after the Image Gallery section
            w = text.split('Image Gallery')[0].count('/' + i)
            if w > 0:
                edge = '\t'.join(sorted([name, n]))
                if edge not in edges:
                    edges[edge] = w
                else:
                    edges[edge] += w

len(edges)

Once we run this block of code, we get 307 as the number of edges connecting the 119 Dune characters.
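Before turning the edge list into a graph, it can be worth peeking at the strongest connections. This small snippet (added here for illustration; it simply sorts the edges dictionary built above) prints the ten heaviest edges:

# print the ten most frequently co-referenced character pairs
for edge, weight in sorted(edges.items(), key=lambda x: -x[1])[:10]:
    print(edge.replace('\t', ' -- '), weight)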

Next, we use the NetworkX graph analytics library to turn the edge list into a graph object and output the number of nodes and edges in the graph:

# create the networkx graph from the dict of edges
import networkx as nx

G = nx.Graph()
for e, w in edges.items():
    if w > 0:
        e1, e2 = e.split('\t')
        G.add_edge(e1, e2, weight=w)

# drop self-loops, e.g. profiles referencing their own id
G.remove_edges_from(nx.selfloop_edges(G))

print('Number of nodes: ', G.number_of_nodes())
print('Number of edges: ', G.number_of_edges())

The result of this code block:

The number of nodes is only 72, meaning that 47 characters were not linked to any central member in their (probably rather brief) wiki profiles. Additionally, we see a decrease of four in the number of edges, because a few self-loops were removed as well.
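If we want to see exactly which characters dropped out, a small set difference between the collected names and the graph's nodes does the trick (a quick sketch we added for checking; it assumes names_ids and G are still in memory):

# list the characters that did not make it into the network
isolated = sorted(set(names_ids) - set(G.nodes()))
print(len(isolated), isolated[:10])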

Let's take a brief look at the network using the built-in Matplotlib plotter:

# take a very brief look at the network
import matplotlib.pyplot as plt

f, ax = plt.subplots(1, 1, figsize=(15, 15))
nx.draw(G, ax=ax, with_labels=True)

The output of this cell:

Initial network visualization of the Dune characters.
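Before moving on to Gephi, the built-in plot can be made somewhat more readable. As a rough sketch (our addition; the layout seed and the size scaling are arbitrary choices, not tuned), we can fix the layout and scale each node by its degree:

# a slightly more readable version: fixed spring layout, node size by degree
pos = nx.spring_layout(G, seed=42)
sizes = [100 * G.degree(node) for node in G.nodes()]
f, ax = plt.subplots(1, 1, figsize=(15, 15))
nx.draw(G, pos=pos, ax=ax, with_labels=True, node_size=sizes, font_size=8)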

While these quick plots already show some network structure, we exported the graph into a Gephi file using the following line of code and designed the network shown in the figure below (the how-to of such network visuals will be the topic of an upcoming tutorial article):

nx.write_gexf(G, 'dune_network.gexf')
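Optionally, before exporting, simple metrics can be attached as node attributes so that Gephi can use them directly for sizing or coloring. A minimal sketch of that idea, using standard NetworkX calls:

# store each node's degree as an attribute, then export for Gephi
nx.set_node_attributes(G, dict(G.degree()), 'degree')
nx.write_gexf(G, 'dune_network.gexf')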

The full Dune network:
