Beyond RAG: Network Analysis through LLMs for Knowledge Extraction | by Andrea D'Agostino | Feb, 2024


End-to-end data science project using Streamlit, Upstash, and OpenAI to build better knowledge navigation and comprehension through network analysis

Photo by USGS on Unsplash

This article will guide you through an end-to-end data science project using several state-of-the-art tools in the AI space. The tool is called Mind Mapper because it allows you to create conceptual maps by injecting information into a knowledge base and retrieving it in a smart way.

The motivation was to go beyond the "simple" RAG framework, where a user queries a vector database and its response is then fed to an LLM like GPT-4 for an enriched answer.

Mind Mapper leverages RAG to create intermediate result representations useful for performing a kind of knowledge intelligence, which in turn allows us to better understand the output of RAG over long, unstructured documents.

Simply put, I want to use RAG as a foundational step to build diverse responses, not just textual ones. A mind map is one such response.

Here are some of the tool's features:

  • Manages text in essentially all forms: copy-pasted, typed, and originating from an audio source (video is contemplated too, if the project is well received)
  • Uses an in-project SQLite database for data persistence
  • Leverages the state-of-the-art Upstash vector database to store vectors efficiently
  • Chunks from the vector database are then used to create a knowledge graph of the information
  • A final LLM call comments on the knowledge graph and extracts insights

We'll use Streamlit as the library for frontend rendering of our logic. All of the code will be written in Python.

If you want to try the app you'll be building, check it out here

I've uploaded a series of text documents copy-pasted from Wikipedia about prominent individuals in the AI world, like Sam Altman, Andrej Karpathy, and more. We'll query this knowledge base to demonstrate how the project works.

A mind map looks like this, when using a prompt like

"Who is Andrej Karpathy?"

Example of a mind map. Image by author.

Feel free to navigate the linked application, provide your OpenAI API key and Upstash REST URL + token, and prompt the existing knowledge base for some demo insights.

The deployed Streamlit app has the inputs section disabled to avoid exposing the database publicly. If you build the app from the ground up or clone it from GitHub, you'll have the database available under the main branch of the project.

If this introduction piqued your curiosity, then join me and let's dive deeper into the explanations and code!

Here's the GitHub repo of the project if you want to follow along.

The software works following this algorithm:

  1. the user uploads or pastes text into the software and saves the data into a database. The user can also upload an audio track, which gets transcribed thanks to OpenAI's Whisper model
Input section of the software. Image by author.

2. when the data is saved, it is split into textual chunks, and these chunks are then embedded using OpenAI's ada-002 model

3. vectors are saved into the Upstash vector database, with metadata attached

4. when the user asks the assistant a question, the query is embedded using the same model, and that vector is used to retrieve the top n most similar chunks using the dot product similarity metric

5. these similar chunks of text, which are related to the input query, are fed to an AI agent responsible for extracting entities and relationships from all the chunks

6. these entities and relationships make up a Python dictionary which is then used to build the mind map

7. another agent reads the contents of the same dictionary and creates a comment to describe the mind map and highlight relevant information

END.
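To make the flow concrete, here is a minimal glue-code sketch of how the pieces built in the rest of this post fit together. The actual orchestration lives in the Streamlit app, and the module paths and top_n value are illustrative assumptions.

# Illustrative only - the real wiring happens in the Streamlit frontend.
import json

from src.vector_db import query_vector_db            # step 4: retrieve similar chunks
from src.llm.llm import (
    extract_mind_map_data,                            # step 5: extract entities + relationships
    extract_information_from_mind_map_data,           # step 7: comment on the mind map
)
from src.mind_map import create_plotly_mind_map       # step 6: build the graph figure


def answer_with_mind_map(index, openai_client, question: str):
    context = query_vector_db(index, openai_client, question, top_n=10)
    relationships = json.loads(extract_mind_map_data(openai_client, context))
    figure = create_plotly_mind_map(relationships)
    comment = extract_information_from_mind_map_data(openai_client, relationships)
    return figure, comment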

Let's briefly go through the project dependencies to get a better understanding of the blocks that make up the logic.

Poetry

I use Poetry for basically all of my projects. It's a convenient and simple Python environment and package manager. You can download Poetry from this link.

If you cloned the repository, all you have to do is run poetry install inside the project's folder in your terminal. Poetry will install and take care of everything.

Upstash Vector Database

Upstash was a fairly recent discovery for me, and I felt I wanted to try it out with a real project. While Upstash has been releasing state-of-the-art products for some time, it was missing a vector database. Less than a month ago, the company launched its vector database, which is fully on the cloud and free for experimentation and more. I found myself enjoying its API, and the web service had zero lag.

OpenAI

As mentioned, this project leverages Whisper for audio file transcription and GPT-4 to power the agents that extract and comment on the mind map. We could also use open-source models if we wanted to.

If you haven't already, you can set up an OpenAI API key at this link:

https://platform.openai.com

NetworkX

NetworkX powers the mind map component of the software. It takes care of creating nodes for entities and the edges between them. With Plotly, the interactive visualization library, you can really visualize complex networks. You can read more about the library at this link.

Streamlit

There are a bunch of core libraries like Pandas and NumPy, but I won't even list them here. Streamlit, however, has to be mentioned because it makes the frontend possible. A real boon for data scientists who have little knowledge of frontend frameworks and JavaScript.

Now that we have a better idea of the main components of our software, let's start building it from scratch. Sit tight, because it's going to be quite a long read.

This is how the whole project looks:

Clearly, the logic is contained in the src folder. It holds the bulk of the logic, while there's a dedicated folder for the LLM components. We'll go step by step and build all the scripts, starting with the one dedicated to the data structure, i.e. schema.py.

Let's start by defining the data schema. It's often the very first thing I do when working with data. We'll use SQLModel and Pydantic to define an Information object that will store the data and allow table creation in SQLite.

# schema.py

from sqlmodel import SQLModel, Field
from typing import Optional

import datetime
from enum import Enum


class FileType(Enum):
    AUDIO = "audio"
    TEXT = "text"
    VIDEO = "video"


class Information(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    filename: str = Field()
    title: Optional[str] = Field(default="NA", unique=False)
    hash_id: str = Field(unique=True)
    created_at: float = Field(default=datetime.datetime.now().timestamp())
    file_type: FileType
    text: str = Field(default="")
    embedded: bool = Field(default=False)

    __table_args__ = {"extend_existing": True}

Each text we enter in the database will be an Information object. It will have

  • an ID, which acts as a primary key and is therefore auto-incremented
  • a filename, indicating the name of the uploaded file, as a string
  • a title that the user can optionally specify, as a string
  • a hash_id, created by hashing the text with MD5. We'll use the hash ID to perform database operations like read, delete and update.
  • created_at, generated automatically using the current time as the default value, indicating when the item was saved in the database
  • file_type, indicating whether the input data was textual, audio or video (the latter is not implemented, but it could be)
  • text, containing the source data used for the entire logic
  • embedded, a boolean value that helps us point to the items that have been embedded and are therefore present in the cloud vector database

Note: the line __table_args__ = {"extend_existing": True} is necessary to be able to access and manipulate data in the database from Streamlit.
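For illustration, creating one such record could look like this (a hypothetical snippet, not part of the project files; the app builds these objects from the user's input):

import hashlib

from src.schema import FileType, Information

text = "Andrej Karpathy is a Slovak-Canadian computer scientist..."
info = Information(
    filename="andrej_karpathy.txt",
    title="Andrej Karpathy",
    hash_id=hashlib.md5(text.encode()).hexdigest(),  # MD5 of the text, used as lookup key
    file_type=FileType.TEXT,
    text=text,
)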

Now that we've got the data schema down, let's write our first utility function: the logger. It's an incredibly useful thing to have, and thanks to the Rich library we'll also enjoy some nice colors in the terminal.

# logger.py

import logging
from rich.logging import RichHandler
from typing import Optional


def get_console_logger(name: Optional[str] = "default") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:
        logger.setLevel(logging.DEBUG)
        console_handler = RichHandler()
        console_handler.setLevel(logging.DEBUG)
        formatter = logging.Formatter(
            "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
        )
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)

    return logger

We'll simply import it in all of our core scripts.

While we're at it, let's also write our utils.py script with some helper functions.

# utils.py

import wave
import contextlib
from pydub import AudioSegment

import hashlib
import datetime

from src import logger

logger = logger.get_console_logger("utils")


def compute_cost_of_audio_track(audio_track_file_path: str):
    file_extension = audio_track_file_path.split(".")[-1].lower()
    duration_seconds = 0
    if file_extension == "wav":
        with contextlib.closing(wave.open(audio_track_file_path, "rb")) as f:
            frames = f.getnframes()
            rate = f.getframerate()
            duration_seconds = frames / float(rate)
    elif file_extension == "mp3":
        audio = AudioSegment.from_mp3(audio_track_file_path)
        duration_seconds = len(audio) / 1000.0  # pydub returns duration in milliseconds
    else:
        logger.error(f"Unsupported file format: {file_extension}")
        return

    audio_duration_in_minutes = duration_seconds / 60
    cost = round(audio_duration_in_minutes, 2) * 0.006  # default price of the Whisper model
    logger.info(f"Cost to convert {audio_track_file_path} is ${cost:.2f}")
    return cost


def hash_text(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()


def convert_timestamp_to_datetime(timestamp: str) -> str:
    return datetime.datetime.fromtimestamp(int(timestamp)).strftime("%Y-%m-%d %H:%M:%S")
We won't end up using the compute_cost_of_audio_track function in this version of the tool, but I've included it anyway in case you want to use it.

hash_text is going to be used a lot to create the hash IDs we insert in the database, while convert_timestamp_to_datetime is useful for making sense of the default timestamp placed in the database upon item creation.
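A quick illustration of both helpers (the outputs shown are examples, not fixed values):

import datetime

from src.utils import convert_timestamp_to_datetime, hash_text

hash_id = hash_text("some source text")     # 32-character MD5 hex digest
now = datetime.datetime.now().timestamp()
print(convert_timestamp_to_datetime(now))   # e.g. "2024-02-10 14:32:05" (local time)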

Now let's take care of the database setup. We'll set up the usual CRUD interface:

# db.py

from sqlmodel import SQLModel, create_engine, Session, select
from src.schema import Information
from src.logger import get_console_logger

sqlite_file_name = "database.db"
sqlite_url = f"sqlite:///{sqlite_file_name}"
engine = create_engine(sqlite_url, echo=False)

logger = get_console_logger("db")

SQLModel.metadata.create_all(engine)


def read_one(hash_id: str):
    with Session(engine) as session:
        statement = select(Information).where(Information.hash_id == hash_id)
        information = session.exec(statement).first()
        return information


def add_one(data: dict):
    with Session(engine) as session:
        if session.exec(
            select(Information).where(Information.hash_id == data.get("hash_id"))
        ).first():
            logger.warning(f"Item with hash_id {data.get('hash_id')} already exists")
            return None  # or raise an exception, or handle as needed
        information = Information(**data)
        session.add(information)
        session.commit()
        session.refresh(information)
        logger.info(f"Item with hash_id {data.get('hash_id')} added to the database")
        return information


def update_one(hash_id: str, data: dict):
    with Session(engine) as session:
        # Check if the item with the given hash_id exists
        information = session.exec(
            select(Information).where(Information.hash_id == hash_id)
        ).first()
        if not information:
            logger.warning(f"No item with hash_id {hash_id} found for update")
            return None  # or raise an exception, or handle as needed
        for key, value in data.items():
            setattr(information, key, value)
        session.commit()
        logger.info(f"Item with hash_id {hash_id} updated in the database")
        return information


def delete_one(id: int):
    with Session(engine) as session:
        # Check if the item with the given hash_id exists
        information = session.exec(
            select(Information).where(Information.hash_id == id)
        ).first()
        if not information:
            logger.warning(f"No item with hash_id {id} found for deletion")
            return None  # or raise an exception, or handle as needed
        session.delete(information)
        session.commit()
        logger.info(f"Item with hash_id {id} deleted from the database")


def add_many(data: list):
    with Session(engine) as session:
        for info in data:
            # Reuse the add_one function for each item
            result = add_one(info)
            if result is None:
                logger.warning(
                    f"Item with hash_id {info.get('hash_id')} could not be added"
                )
            else:
                logger.info(
                    f"Item with hash_id {info.get('hash_id')} added to the database"
                )
        session.commit()  # Commit at the end of the loop


def delete_many(ids: list):
    with Session(engine) as session:
        for id in ids:
            # Reuse the delete_one function for each item
            result = delete_one(id)
            if result is None:
                logger.warning(f"No item with hash_id {id} found for deletion")
            else:
                logger.info(f"Item with hash_id {id} deleted from the database")
        session.commit()  # Commit at the end of the loop


def read_all(query: dict = None):
    with Session(engine) as session:
        statement = select(Information)
        if query:
            statement = statement.where(
                *[getattr(Information, key) == value for key, value in query.items()]
            )
        information = session.exec(statement).all()
        return information


def delete_all():
    with Session(engine) as session:
        session.query(Information).delete()
        session.commit()
        logger.info("All items deleted from the database")

With this script, we'll be able to create the database and easily read, create, delete and update items one by one or in bulk.
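As a quick illustration, here is how these helpers could be used (a hypothetical snippet; the real call sites live in the Streamlit app):

from src import db
from src.schema import FileType
from src.utils import hash_text

text = "Sam Altman is an American entrepreneur..."
item = {
    "filename": "sam_altman.txt",
    "title": "Sam Altman",
    "hash_id": hash_text(text),
    "file_type": FileType.TEXT,
    "text": text,
}

db.add_one(item)                                     # insert (skipped if the hash_id already exists)
stored = db.read_one(item["hash_id"])                # fetch the single matching row
db.update_one(item["hash_id"], {"embedded": True})   # flag it as embedded
print(len(db.read_all()))                            # count of stored items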

Now that we have our information structure and an interface to the database, we'll move on to handling audio files.

This was a purely optional step, but I wanted to spice things up. Our code will allow users to upload any .mp3 or .wav files and transcribe their contents through OpenAI's Whisper model. The persona I had in mind was a university student collecting notes via voice recordings.

Keep in mind that Whisper is a paid model. At the time of writing this article, the price was $0.006 / minute. You can learn more at this link.

Let's create whisper.py with a single function called create_transcript.

# whisper.py

from src.logger import get_console_logger

logger = get_console_logger("whisper")


def create_transcript(openai_client, file_path: str) -> str:
    audio_file = open(file_path, "rb")
    logger.info(f"Creating transcript for {file_path}")
    transcript = openai_client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )
    logger.info(f"Transcript created for {file_path}")
    return transcript.text

This function is very simple; it's just a thin wrapper around OpenAI's audio module.

The attentive eye will notice that openai_client is an argument to the function. That's not a mistake: as we'll see in the Streamlit app, the client is created once in the frontend, from the API key the user provides, and is then passed down to every function that needs OpenAI.
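A minimal sketch of that pattern (the API key and file path are placeholders):

from openai import OpenAI

from src.whisper import create_transcript

openai_client = OpenAI(api_key="sk-...")  # built once from the user-supplied key
transcript = create_transcript(openai_client, "notes/lecture_01.mp3")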

Now we can handle text in all (of the supported) forms, which are plain text and audio. It's time to vectorize these texts and push them to our Upstash vector database.

We'll be using a couple more tools here to properly embed our documents for vector search and RAG.

  • Tiktoken: the well-known library by OpenAI that allows for simple and efficient tokenization based on an LLM (in our case, GPT-3.5)
  • LangChain: I love this library, and find it very flexible despite what a portion of the community says about it. In this project, I borrow the RecursiveCharacterTextSplitter object from it

Again, if you cloned the repo, Poetry will install the required dependencies automatically. If not, just run the command poetry add langchain tiktoken.

Of course, we'll also need to install Upstash Vector — the command is poetry add upstash-vector. Once installed, go to https://console.upstash.com/ to set up your cloud environment.

Make sure you choose 1536 as the vector dimensionality to match the size of OpenAI's ADA model.

As I mentioned before, Upstash is a paid tool, but it has a very generous free tier that I used extensively for this project.

Free: The free plan is suitable for small projects. It has a limit of 10,000 queries and 10,000 updates per day.

This is great for getting started with projects like these. Scalability, moreover, isn't an issue since you can easily tune your requirements.

Once that's done, grab your REST URL and token.

The endpoint and the token are needed to establish the connection via Python. Image by author.
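For reference, a minimal connection sketch (assuming the REST URL and token are exported as environment variables; the app itself collects them from user input):

import os
from upstash_vector import Index

# Connect to the Upstash vector index using the REST credentials from the console
index = Index(
    url=os.environ["UPSTASH_VECTOR_REST_URL"],
    token=os.environ["UPSTASH_VECTOR_REST_TOKEN"],
)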

Now we're ready to write our script.

# vector_db.py

from src.logger import get_console_logger

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter
from upstash_vector import Vector
from tqdm import tqdm
import random

logger = get_console_logger("vector_db")

MODEL = "text-embedding-ada-002"
ENCODER = tiktoken.encoding_for_model("gpt-3.5-turbo")


def token_len(text):
    """Calculate the token length of a given text.

    Args:
        text (str): The text to calculate the token length for.

    Returns:
        int: The number of tokens in the text.
    """
    return len(ENCODER.encode(text))


def get_embeddings(openai_client, chunks, model=MODEL):
    """Get embeddings for a list of text chunks using the specified model.

    Args:
        openai_client: The OpenAI client instance to use for generating embeddings.
        chunks (list of str): The text chunks to embed.
        model (str): The model identifier to use for embedding.

    Returns:
        list of list of float: A list of embeddings, each corresponding to a chunk.
    """
    chunks = [c.replace("\n", " ") for c in chunks]
    res = openai_client.embeddings.create(input=chunks, model=model).data
    return [r.embedding for r in res]


def get_embedding(openai_client, text, model=MODEL):
    """Get the embedding for a single text using the specified model.

    Args:
        openai_client: The OpenAI client instance to use for generating the embedding.
        text (str): The text to embed.
        model (str): The model identifier to use for embedding.

    Returns:
        list of float: The embedding of the given text.
    """
    # text = text.replace("\n", " ")
    return get_embeddings(openai_client, [text], model)[0]


def query_vector_db(index, openai_client, question, top_n=1):
    """Query the vector database for vectors similar to the given question.

    Args:
        index: The vector database index to query.
        openai_client: The OpenAI client instance to use for generating the question embedding.
        question (str): The question to query the vector database with.
        top_n (int, optional): The number of top similar vectors to return. Defaults to 1.

    Returns:
        str: A string containing the concatenated texts of the top similar vectors.
    """
    logger.info("Creating vector for question...")
    question_embedding = get_embedding(openai_client, question)
    logger.info("Querying vector database...")
    res = index.query(vector=question_embedding, top_k=top_n, include_metadata=True)
    context = "\n-".join([r.metadata["text"] for r in res])
    logger.info(f"Context returned. Length: {len(context)} characters.")
    return context


def create_chunks(text, chunk_size=150, chunk_overlap=20):
    """Create text chunks based on the specified size and overlap.

    Args:
        text (str): The text to split into chunks.
        chunk_size (int, optional): The desired size of each chunk. Defaults to 150.
        chunk_overlap (int, optional): The number of overlapping tokens between chunks. Defaults to 20.

    Returns:
        list of str: A list of text chunks.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=token_len,
        separators=["\n\n", "\n", " ", ""],
    )
    return text_splitter.split_text(text)


def add_chunks_to_vector_db(index, openai_client, chunks, metadata):
    """Embed text chunks and add them to the vector database.

    Args:
        index: The vector database index to add chunks to.
        openai_client: The OpenAI client instance used to embed each chunk.
        chunks (list of str): The text chunks to embed and add.
        metadata (dict): The metadata to associate with each chunk.

    Returns:
        None
    """
    for chunk in chunks:
        random_id = random.randint(0, 1000000)  # workaround while waiting for metadata search to be implemented
        metadata["text"] = chunk
        vec = Vector(
            id=f"chunk-{random_id}", vector=get_embedding(openai_client, chunk), metadata=metadata
        )
        index.upsert(vectors=[vec])
        logger.info(f"Added chunk to vector db: {chunk}")


def fetch_by_source_hash_id(index, source_hash_id: str, max_results=10000):
    """Fetch vector IDs from the database by source hash ID.

    Args:
        index: The vector database index to search.
        source_hash_id (str): The source hash ID to filter the vectors by.
        max_results (int, optional): The maximum number of results to scan. Defaults to 10000.

    Returns:
        list of str: A list of vector IDs that match the source hash ID.
    """
    ids = []
    for i in tqdm(range(0, max_results, 1000)):
        search = index.range(
            cursor=str(i), limit=1000, include_vectors=False, include_metadata=True
        ).vectors
        for result in search:
            if result.metadata["source_hash_id"] == source_hash_id:
                ids.append(result.id)
    return ids


def fetch_all(index):
    """Fetch all vectors from the database.

    Args:
        index: The vector database index to fetch vectors from.

    Returns:
        list: A list of vectors from the database.
    """
    return index.range(
        cursor="0", limit=1000, include_vectors=False, include_metadata=True
    ).vectors

There's more going on in this script, so let me dive deeper for a moment.

get_embedding and get_embeddings are used to encode one or multiple texts. They're just conveniently placed here for better control.

query_vector_db allows us to query Upstash for items similar to our query vector. In this function, we embed the question and perform the lookup through the index's .query method. The index, together with OpenAI's client, is passed in as an argument later in the Streamlit app. The returned object is a string called context, which is a concatenation of the top N items most similar to the input query.

Continuing, we leverage LangChain's RecursiveCharacterTextSplitter to efficiently create textual chunks from the documents.

Now a bit of a CRUD interface for the vector DB as well: adding and fetching data (updating and deletion are easily implemented too, and we'll do that in the frontend).

Note: at the time of writing this article, Upstash doesn't yet support search on metadata. This means that, since we're using hash_id to identify our documents, these aren't directly queryable. I've added a simple workaround in the code to browse through a bunch (100k) of documents and look for the hash ID manually. I've read online that they'll be implementing this functionality soon.
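For instance, removing every chunk belonging to a stored document could look like this (a hedged sketch combining the helper above with the index's delete call; `index` and the document's hash_id are assumed to be available):

# Sketch: delete every vector whose metadata points to a given source document
ids_to_delete = fetch_by_source_hash_id(index, source_hash_id="<document hash_id>")
if ids_to_delete:
    index.delete(ids_to_delete)  # remove the matching vectors from Upstash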

We'll start coding our LLM behaviors by working on the prompts first.

There are going to be two agents. The first one is responsible for extracting network data from the text, while the second is responsible for analyzing that network data.

The prompt for the first agent is the following:

You are an expert in creating network graphs from textual data.
You are also a note-taking expert and you are able to create mind maps from text.
You are tasked with creating a mind map from a given text data by extracting the concepts and relationships from the text.\n
The relationships should be among objects, people, or places mentioned in the text.\n

TYPES should only be one of the following:
- is a
- is related to
- is part of
- is similar to
- is different from
- is a type of

Your output should be a JSON containing the following:
{ "relationships": [{"source": ..., "target": ..., "type": ..., "origin": _source_or_target_}, {...}] } \n
- source: The source node\n
- target: The target node\n
- type: The type of the relationship between the source and target nodes\n

NEVER change this output format. ENGLISH is the output language. NEVER change the output language.
Your response will be used as a Python dictionary, so always be mindful of the syntax and the data types to return a JSON object.\n

INPUT TEXT:\n

The analyzer agent instead uses this prompt:

You are a senior business intelligence analyst, who is able to extract valuable insights from data.
You are tasked with extracting information from a given mind map data.\n
The mind map data is a JSON containing the following:
{{ "relationships": [{{"source": ..., "target": ..., "type": ..., "origin": _source_or_target_}}, {{...}}] }} \n
- source: The source node\n
- target: The target node\n
- type: The type of the relationship between the source and target nodes\n
- origin: The origin node from which the relationship originates\n

You are to extract insights from the mind map data and provide a summary of the relationships.\n

Your output should be a brief comment on the mind map data, highlighting relevant insights and relationships using centrality and other graph analysis techniques.\n

NEVER change this output format. ENGLISH is the output language. NEVER change the output language.\n
Keep your output very brief. Just a comment to highlight the most relevant information.

MIND MAP DATA:\n
{mind_map_data}

These two prompts will be imported the Pythonic way: that is, as scripts.

Let's create a script in the llm folder called prompts.py and create a dictionary of intents where we place the prompts as values.

# llm.prompts.py

PROMPTS = {
    "mind_map_of_one": """You are an expert in creating network graphs from textual data.
You are also a note-taking expert and you are able to create mind maps from text.
You are tasked with creating a mind map from a given text data by extracting the concepts and relationships from the text.\n
The relationships should be among objects, people, or places mentioned in the text.\n

TYPES should only be one of the following:
- is a
- is related to
- is part of
- is similar to
- is different from
- is a type of

Your output should be a JSON containing the following:
{ "relationships": [{"source": ..., "target": ..., "type": ...}, {...}] } \n
- source: The source node\n
- target: The target node\n
- type: The type of the relationship between the source and target nodes\n

NEVER change this output format. ENGLISH is the output language. NEVER change the output language.
Your response will be used as a Python dictionary, so always be mindful of the syntax and the data types to return a JSON object.\n

INPUT TEXT:\n
""",
    "inspector_of_mind_map": """
You are a senior business intelligence analyst, who is able to extract valuable insights from data.
You are tasked with extracting information from a given mind map data.\n
The mind map data is a JSON containing the following:
{{ "relationships": [{{"source": ..., "target": ..., "type": ...}}, {{...}}] }} \n
- source: The source node\n
- target: The target node\n
- type: The type of the relationship between the source and target nodes\n
- origin: The origin node from which the relationship originates\n

You are to extract insights from the mind map data and provide a summary of the relationships.\n

Your output should be a brief comment on the mind map data, highlighting relevant insights and relationships using centrality and other graph analysis techniques.\n

NEVER change this output format. ENGLISH is the output language. NEVER change the output language.\n
Keep your output very brief. Just a comment to highlight the most relevant information.

MIND MAP DATA:\n
{mind_map_data}
""",
}

This way we can easily import and use the prompts simply by pointing at the agent's intent (mind_map_of_one, inspector_of_mind_map). We'll import the prompts in the llm.py script.

# llm.llm.py

from src.logger import get_console_logger
from src.llm.prompts import PROMPTS

logger = get_console_logger("llm")
MIND_MAP_EXTRACTION_MODEL = "gpt-4-turbo-preview"
MIND_MAP_INSPECTION_MODEL = "gpt-4"


def extract_mind_map_data(openai_client: object, text: str) -> str:
    logger.info("Extracting mind map data from text...")
    response = openai_client.chat.completions.create(
        model=MIND_MAP_EXTRACTION_MODEL,
        response_format={"type": "json_object"},
        temperature=0,
        messages=[
            {"role": "system", "content": PROMPTS["mind_map_of_one"]},
            {"role": "user", "content": f"{text}"},
        ],
    )
    return response.choices[0].message.content


def extract_mind_map_data_of_two(
    openai_client: object, source_text: str, target_text: str
) -> str:
    # expects a "mind_map_of_many" prompt (not shown in this post)
    logger.info("Extracting mind map data from two texts...")
    user_prompt = PROMPTS["mind_map_of_many"].format(
        source_text=source_text, target_text=target_text
    )
    response = openai_client.chat.completions.create(
        model=MIND_MAP_INSPECTION_MODEL,
        response_format={"type": "json_object"},  # this is fundamental!
        messages=[
            {"role": "system", "content": PROMPTS["mind_map_of_many"]},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content


def extract_information_from_mind_map_data(openai_client: object, data: dict) -> str:
    logger.info("Extracting information from mind map data...")
    user_prompt = PROMPTS["inspector_of_mind_map"].format(mind_map_data=data)
    response = openai_client.chat.completions.create(
        model=MIND_MAP_INSPECTION_MODEL,
        messages=[
            {"role": "system", "content": PROMPTS["inspector_of_mind_map"]},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

All the heavy lifting is done by these two simple functions that just connect a GPT agent to the appropriate prompt. Note response_format={"type": "json_object"} in the first function. This ensures that GPT-4 builds a JSON representation of the text's network data, which we can then parse (for instance with json.loads) into the Python dictionary used downstream. Without this line, the whole application becomes highly unstable.

Let's put the logic to the test. When passed the prompt "Who is Andrej Karpathy?", the first agent creates this network representation:

{
"relationships":[
{
"source":"Andrej Karpathy",
"target":"Slovak-Canadian",
"type":"is a"
},
{
"source":"Andrej Karpathy",
"target":"computer scientist",
"type":"is a"
},
{
"source":"Andrej Karpathy",
"target":"director of artificial intelligence and Autopilot Vision at Tesla",
"type":"served as"
},
{
"source":"Andrej Karpathy",
"target":"OpenAI",
"type":"worked at"
},
{
"source":"Andrej Karpathy",
"target":"deep learning",
"type":"specialized in"
},
{
"source":"Andrej Karpathy",
"target":"computer vision",
"type":"specialized in"
},
{
"source":"Andrej Karpathy",
"target":"Bratislava, Czechoslovakia",
"type":"was born in"
},
{
"source":"Andrej Karpathy",
"target":"Toronto",
"type":"moved to"
},
{
"source":"Andrej Karpathy",
"target":"University of Toronto",
"type":"completed degrees at"
},
{
"source":"Andrej Karpathy",
"target":"University of British Columbia",
"type":"completed master's degree at"
},
{
"source":"Andrej Karpathy",
"target":"OpenAI",
"type":"is a founding member of"
},
{
"source":"Andrej Karpathy",
"target":"Tesla",
"type":"became director of artificial intelligence at"
},
{
"source":"Andrej Karpathy",
"target":"Elon Musk",
"type":"reported to"
},
{
"source":"Andrej Karpathy",
"target":"MIT Technology Review's Innovators Under 35 for 2020",
"type":"was named one of"
},
{
"source":"Andrej Karpathy",
"target":"YouTube videos on how to create artificial neural networks",
"type":"makes"
},
{
"source":"Andrej Karpathy",
"target":"Stanford University",
"type":"received a PhD from"
},
{
"source":"Fei-Fei Li",
"target":"Stanford University",
"type":"is part of"
},
{
"source":"Andrej Karpathy",
"target":"natural language processing",
"type":"focused on"
},
{
"source":"Andrej Karpathy",
"target":"CS 231n: Convolutional Neural Networks for Visual Recognition",
"type":"authored and was the primary instructor of"
},
{
"source":"CS 231n: Convolutional Neural Networks for Visual Recognition",
"target":"Stanford",
"type":"is part of"
}
]
}

This data comes from unstructured Wikipedia text uploaded into the tool for testing purposes. The representation looks just fine! Feel free to edit the prompts to extract even more potential information.

All that remains now is to use this Python dictionary of relationships to create our interactive mind map with NetworkX and Plotly.

There's going to be one function only, but it's going to be fairly intense if you've never worked with NetworkX before. It's not the simplest framework to work with, but the outputs you can get by becoming proficient with it are valuable.

What we'll do is initialize a graph object with G = nx.DiGraph(), which creates a new directed graph. The function iterates over the list of relationships provided in the data dictionary. For each relationship, it adds an edge to the graph G from the source node to the target node, with a type attribute that describes the relationship.

for relationship in data["relationships"]:
    G.add_edge(
        relationship["source"], relationship["target"], type=relationship["type"]
    )

Once done, the graph's layout is computed using the spring layout algorithm, which positions the nodes in a way that tries to minimize the overlap between edges and keep the edge lengths uniform. The seed parameter ensures that the layout is reproducible.

Finally, Plotly's Graph Objects (go) module takes care of creating a scatter plot for each data point, representing a node on the chart.

Here's how the mind_map.py script looks.

# mind_map.py

import networkx as nx
from graphviz import Digraph

import plotly.express as px
import plotly.graph_objects as go


def create_plotly_mind_map(data: dict) -> go.Figure:
    """
    data is a dictionary containing the following
    { "relationships": [{"source": ..., "target": ..., "type": ...}, {...}] }
    source: The source node
    target: The target node
    type: The type of the relationship between the source and target nodes
    """

    ### START - NETWORKX LOGIC ###
    # Create a directed graph
    G = nx.DiGraph()

    # Add edges to the graph
    for relationship in data["relationships"]:
        G.add_edge(
            relationship["source"], relationship["target"], type=relationship["type"]
        )

    # Create a layout for our nodes
    layout = nx.spring_layout(G, seed=42)

    traces = []
    for relationship in data["relationships"]:
        x0, y0 = layout[relationship["source"]]
        x1, y1 = layout[relationship["target"]]
        edge_trace = go.Scatter(
            x=[x0, x1, None],
            y=[y0, y1, None],
            line=dict(width=0.5, color="#888"),  # Set a single color for all edges
            hoverinfo="none",
            mode="lines",
        )
        traces.append(edge_trace)

    # Collect node coordinates from the layout
    node_x = []
    node_y = []
    for node in G.nodes():
        x, y = layout[node]
        node_x.append(x)
        node_y.append(y)

    ### END - NETWORKX LOGIC ###

    node_trace = go.Scatter(
        x=node_x,
        y=node_y,
        mode="markers+text",
        # add text to the nodes
        text=[node for node in G.nodes()],
        hoverinfo="text",
        marker=dict(
            showscale=False,
            colorscale="Greys",  # Grayscale colorscale
            reversescale=True,
            size=20,
            color="#505050",  # Set node color to grey
            line_width=2,
        ),
    )

    # Add node and edge labels
    edge_annotations = []
    for edge in G.edges(data=True):
        x0, y0 = layout[edge[0]]
        x1, y1 = layout[edge[1]]
        edge_annotations.append(
            dict(
                x=(x0 + x1) / 2,
                y=(y0 + y1) / 2,
                xref="x",
                yref="y",
                text=edge[2]["type"],
                showarrow=False,
                font=dict(size=10),
            )
        )

    node_annotations = []
    for node in G.nodes():
        x, y = layout[node]
        node_annotations.append(
            dict(
                x=x,
                y=y,
                xref="x",
                yref="y",
                text=node,
                showarrow=False,
                font=dict(size=12),
            )
        )

    node_trace.text = [node for node in G.nodes()]

    # Create the figure
    fig = go.Figure(
        data=traces + [node_trace],
        layout=go.Layout(
            showlegend=False,
            hovermode="closest",
            margin=dict(b=20, l=5, r=5, t=40),
            annotations=edge_annotations,
            xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
            yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        ),
    )

    # Adjust the layout to include the legend
    fig.update_layout(
        legend=dict(
            title="Origins",
            traceorder="normal",
            font=dict(size=12),
        )
    )

    # Adjust the node text color for better visibility on a dark background
    node_trace.textfont = dict(color="white")

    # Adjust the layout to include the legend and set the plot background to dark
    fig.update_layout(
        paper_bgcolor="rgba(0,0,0,1)",  # Set the background color to black
        plot_bgcolor="rgba(0,0,0,1)",  # Set the plot area background color to black
        legend=dict(
            title="Origins",
            traceorder="normal",
            font=dict(size=12, color="white"),  # Set legend text color to white
        ),
        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    )

    for annotation in edge_annotations:
        annotation["font"]["color"] = "white"  # Set edge annotation text color to white

    # Update the color of the node annotations for better visibility
    for annotation in node_annotations:
        annotation["font"]["color"] = "white"  # Set node annotation text color to white

    # Update the edge trace color to be more visible on a dark background
    for trace in traces:
        if "line" in trace:
            trace["line"][
                "color"
            ] = "#888"  # Set a single edge color for all edges

    # Update the node trace marker border color for better visibility
    node_trace.marker.line.color = "white"

    return fig

Feel free to simply copy-paste this function into your own logic and change it as you please.
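A quick way to try it outside of Streamlit (assuming mind_map_data is the parsed dictionary returned by the extraction agent):

from src.mind_map import create_plotly_mind_map

fig = create_plotly_mind_map(mind_map_data)  # dict with the "relationships" list
fig.show()                                   # opens the interactive figure in the browser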

And this is how the mind map looks for the prompt "Who is Sam Altman?"

How a mind map looks. Image by author.

Great work! We're done with the backend logic! Our last step is to implement the Streamlit app.
