Constructing a Chat Utility with LangChain, LLMs, and Streamlit for Complicated SQL Database Interplay | by Hamza Gharbi

Machine Learning

Constructing a Chat Utility with LangChain, LLMs, and Streamlit for Complicated SQL Database Interplay | by Hamza Gharbi | Feb, 2024

hhhhm

2024年2月10日

Constructing a Chat Utility with LangChain, LLMs, and Streamlit for Complicated SQL Database Interplay | by Hamza Gharbi | Feb, 2024

[ad_1]

Given the complexity of our desk, the agent may not absolutely perceive the data within the database by solely inspecting the schema and a pattern row. For instance, the agent ought to recognise {that a} question relating to automobiles equates to looking the ‘class’ column for the worth ‘Vehicles et moyens de déplacement’ (i.e., ‘Vehicles and technique of transportation’). Subsequently, further instruments are obligatory to supply the agent with extra context concerning the database.

Right here’s a breakdown of the additional instruments we plan to make use of:

get_categories_and_sub_categories: This instrument is designed to assist the agent fetch an inventory of distinct objects from the class and sub_category columns. This method is efficient because of the comparatively low variety of distinctive values inside these columns. If the columns contained a whole lot or hundreds of distinctive values, it is likely to be higher to make use of a retrieval instrument. In such circumstances, when a consumer asks a couple of class, the agent may search for essentially the most related classes in a vector database, which shops embeddings of assorted values. The agent would then use these classes for its SQL queries. Nevertheless, on condition that our class and sub_category columns do not have a variety of distinctive values, we’ll merely return the checklist immediately.

from langchain.instruments import instrument, Deviceimport ast
import json
from sql_agent.sql_db import db
def run_query_save_results(db, question):
res = db.run(question)
res = [el for sub in ast.literal_eval(res) for el in sub]
return res
def get_categories(question: str) -> str:
"""
Helpful to get classes and sub_categories. A json is returned the place the important thing could be class or sub_category,
and the worth is an inventory of distinctive itmes for both each.
"""
sub_cat = run_query_save_results(
db, "SELECT DISTINCT sous_categorie_de_produit FROM rappel_conso_table"
)
cat = run_query_save_results(
db, "SELECT DISTINCT categorie_de_produit FROM rappel_conso_table"
)
category_str = (
"Record of distinctive values of the categorie_de_produit column : n"
+ json.dumps(cat, ensure_ascii=False)
)
sub_category_str = (
"n Record of distinctive values of the sous_categorie_de_produit column : n"
+ json.dumps(sub_cat, ensure_ascii=False)
)
return category_str + sub_category_str

get_columns_descriptions: Since we are able to’t feed the columns descriptions within the schema immediately, we created an additional instrument that returns brief description for each ambiguous column. Some examples embrace:

"reference_fiche": "main key of the database and distinctive identifier within the database. ",
"nom_de_la_marque_du_produit": "A string representing the Title of the product model. Instance: Apple, Carrefour, and so on ... Whenever you filter by this column,you could use LOWER() perform to make the comparability case insensitive and you could use LIKE operator to make the comparability fuzzy.",
"noms_des_modeles_ou_references": "Names of the fashions or references. Can be utilized to get particular infos concerning the product. Instance: iPhone 12, and so on, sweet X, product Y, bread, butter ...",
"identification_des_produits": "Identification of the merchandise, for instance the gross sales lot.",


def get_columns_descriptions(question: str) -> str:
"""
Helpful to get the outline of the columns within the rappel_conso_table desk.
"""
return json.dumps(COLUMNS_DESCRIPTIONS)

get_today_date : instrument that retrieves in the present day’s date utilizing python datetime library. The agent will use this instrument when requested about temporality. For instance: “What are the recalled merchandise since final week ?”

from datetime import datetimedef get_today_date(question: str) -> str:
"""
Helpful to get the date of in the present day.
"""
# Getting in the present day's date in string format
today_date_string = datetime.now().strftime("%Y-%m-%d")
return today_date_string

Lastly we create an inventory of all these instruments and we feed it to the create_sql_agent perform. For each instrument we should outline a novel title inside the set of instruments supplied to the agent. The outline is non-compulsory however could be very really useful as it may be used to supply extra info.

def sql_agent_tools():
instruments = [
Tool.from_function(
func=get_categories,
name="get_categories_and_sub_categories",
description="""
Useful to get categories and sub_categories. A json is returned where the key can be category or sub_category, 
and the value is a list of unique items for either both.
""",
),
Tool.from_function(
func=get_columns_descriptions,
name="get_columns_descriptions",
description="""
Useful to get the description of the columns in the rappel_conso_table table.
""",
),
Tool.from_function(
func=get_today_date,
name="get_today_date",
description="""
Useful to get the date of today.
""",
),
]
return instruments

extra_tools = sql_agent_tools()agent = create_sql_agent(
llm=llm_agent,
toolkit=toolkit,
agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
extra_tools=extra_tools,
verbose=True,
)

Generally, the instrument descriptions aren’t sufficient for the agent to know when to make use of them. To deal with this, we are able to change the ending a part of the agent LLM immediate, referred to as the suffix. In our setup, the immediate has three sections:

Prefix: It is a string positioned earlier than the instrument checklist. We’re sticking with the default prefix, which instructs the agent on find out how to create and execute SQL queries in response to consumer questions, set a restrict on consequence numbers to 10 , test the queries fastidiously, and keep away from making adjustments to the database.
The checklist of instruments: This half lists out all of the instruments that the agent has at its disposal.
Suffix: That is the half the place we give the agent instructions on find out how to course of and take into consideration the consumer’s query.

Right here’s the default suffix for the SQL ReAct agent in Langchain:

SQL_SUFFIX = """Start!Query: {enter}
Thought: I ought to take a look at the tables within the database to see what I can question.  Then I ought to question the schema of essentially the most related tables.
{agent_scratchpad}"""

enter and agent_scratchpad are two placeholders. enter represents the consumer’s question and agent_scratchpad will symbolize the historical past of instrument invocations and the corresponding instrument outputs.

We are able to make the “Thought” half longer to provide extra directions on which instruments to make use of and when:

CUSTOM_SUFFIX = """Start!Query: {enter}
Thought Course of: It's crucial that I don't fabricate info not current within the database or have interaction in hallucination; 
sustaining trustworthiness is essential. If the consumer specifies a class, I ought to try to align it with the classes within the `categories_produits` 
or `sous_categorie_de_produit` columns of the `rappel_conso_table` desk, using the `get_categories` instrument with an empty string because the argument. 
Subsequent, I'll purchase the schema of the `rappel_conso_table` desk utilizing the `sql_db_schema` instrument. 
Using the `get_columns_descriptions` instrument is extremely advisable for a deeper understanding of the `rappel_conso_table` columns, aside from easy duties. 
When supplied with a product model, I'll search within the `nom_de_la_marque_du_produit` column; for a product kind, within the `noms_des_modeles_ou_references` column. 
The `get_today_date` instrument, requiring an empty string as an argument, will present in the present day's date. 
In SQL queries involving string or TEXT comparisons, I need to use the `LOWER()` perform for case-insensitive comparisons and the `LIKE` operator for fuzzy matching. 
Queries for at present recalled merchandise ought to return rows the place `date_de_fin_de_la_procedure_de_rappel` (the recall's ending date) is null or later than in the present day's date. 
When presenting merchandise, I'll embrace picture hyperlinks from the `liens_vers_les_images` column, formatted strictly as:  [lien vers l'image] url1, [lien vers l'image] url2 ... Preceded by the point out within the question's language "right here is(are) the picture(s) :"
Moreover, the precise recalled product lot will probably be included from the `identification_des_produits` column. 
My ultimate response have to be delivered within the language of the consumer's question.
{agent_scratchpad}
"""

This manner, the agent doesn’t simply know what instruments it has but in addition will get higher steerage on when to make use of them.

Now let’s modify the arguments for the create_sql_agent to account for brand spanking new suffix:

agent = create_sql_agent(
llm=llm_agent,
toolkit=toolkit,
agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
suffix=CUSTOM_SUFFIX,
extra_tools=agent_tools,
verbose=True,
)

Another choice we thought-about was to incorporate the directions within the prefix. Nevertheless, our empirical observations indicated that this had little to no influence on the ultimate response. Subsequently, we selected to retain the directions within the suffix. Conducting a extra in depth analysis of the mannequin outputs may very well be useful for an in depth comparability of the 2 approaches.

[ad_2]