To better understand the waitAndScreenshot function, let’s take a look at the log of the function in action:
After the page is completely loaded, all interactive elements are highlighted and a screenshot is taken.
import { Page } from "puppeteer";

export const waitTillHTMLRendered = async (
  page: Page,
  timeout: number = 30000,
  checkOnlyHTMLBody: boolean = false
) => {
  const waitTimeBetweenChecks: number = 1000;
  const maximumChecks: number = timeout / waitTimeBetweenChecks; // assuming the check itself takes no time
  let lastHTMLSize = 0;
  let stableSizeCount = 0;
  const COUNT_THRESHOLD = 3;

  const isSizeStable = (currentSize: number, lastSize: number) => {
    if (currentSize !== lastSize) {
      return false; // still rendering
    } else if (currentSize === lastSize && lastSize === 0) {
      return false; // page remains empty - failed to render
    } else {
      return true; // stable
    }
  };

  for (let i = 0; i < maximumChecks; i++) {
    const html = await page.content();
    const currentHTMLSize = html.length;
    const currentBodyHTMLSize = await page.evaluate(
      () => document.body.innerHTML.length
    );
    const currentSize = checkOnlyHTMLBody
      ? currentBodyHTMLSize
      : currentHTMLSize;
    // logging
    console.log(
      "last: ",
      lastHTMLSize,
      " <> curr: ",
      currentHTMLSize,
      " body html size: ",
      currentBodyHTMLSize
    );
    stableSizeCount = isSizeStable(currentSize, lastHTMLSize)
      ? stableSizeCount + 1
      : 0;
    console.log(`Stable size count: ${stableSizeCount}`);
    // if the size stays the same for 3 consecutive checks (about 3 seconds), assume the page has finished loading
    if (stableSizeCount >= COUNT_THRESHOLD) {
      console.log("Page rendered fully..");
      break;
    }
    lastHTMLSize = currentSize;
    await page.waitForTimeout(waitTimeBetweenChecks);
  }
};
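To see the stability rule in isolation, here is a small sketch that applies the same counting logic to a plain array of sampled sizes instead of a live page. The helper name findStableIndex and the sample numbers are made up for illustration and are not part of the agent's code.

```typescript
// Hypothetical helper (not from the article): given HTML sizes sampled at a
// fixed interval, return the index of the sample at which the page would be
// treated as rendered, or -1 if the sizes never stabilise.
const findStableIndex = (
  sizes: number[],
  countThreshold: number = 3
): number => {
  let lastSize = 0;
  let stableCount = 0;

  for (let i = 0; i < sizes.length; i++) {
    const currentSize = sizes[i];
    // Stable only when the size is unchanged AND the page is not empty.
    const isStable = currentSize === lastSize && currentSize !== 0;
    stableCount = isStable ? stableCount + 1 : 0;
    if (stableCount >= countThreshold) {
      return i;
    }
    lastSize = currentSize;
  }
  return -1;
};

// The size grows for two samples, then stays at 9800.
console.log(findStableIndex([1200, 5400, 9800, 9800, 9800, 9800])); // 5
// An empty page never counts as stable.
console.log(findStableIndex([0, 0, 0, 0, 0])); // -1
```

One consequence is visible here: the counter only increases when a sample equals the previous one, so with a threshold of 3 the same size has to be observed four times in a row before the loop stops.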
Step 2 (cont.) - Click Response Flow: The clickNavigationAndScreenshot function
This function clicks a specific element on the page, waits for the page to load completely, and then takes a screenshot. For the click action, it uses another function called clickOnLink.
import { Browser, Page } from "puppeteer";

export const clickNavigationAndScreenshot = async (
  linkText: string,
  page: Page,
  browser: Browser
) => {
  let imagePath;

  try {
    const navigationPromise = page.waitForNavigation();
    // The click action
    const clickResponse = await clickOnLink(linkText, page);
    if (!clickResponse) {
      // if the link triggers a navigation on the same page, wait for the page to load completely and then take a screenshot
      await navigationPromise;
      imagePath = await waitAndScreenshot(page);
    } else {
      // if the link opens in a new tab, ignore the navigationPromise as there won't be any navigation
      navigationPromise.catch(() => undefined);
      // switch to the new tab and take a screenshot
      const newPage = await newTabNavigation(clickResponse, page, browser);
      if (newPage === undefined) {
        throw new Error("The new page cannot be opened");
      }
      imagePath = await waitAndScreenshot(newPage);
    }
    return imagePath;
  } catch (err) {
    throw err;
  }
};
The clickOnLink function
This function loops through all the elements with the gpt-link-text attribute (the unique identifier obtained during element annotation) and clicks on the one that matches the link text provided by the LLM.
const clickOnLink = async (linkText: string, page: Page) => {
  try {
    const clickResponse = await page.evaluate(async (linkText) => {
      const isHTMLElement = (element: Element): element is HTMLElement => {
        return element instanceof HTMLElement;
      };
      const elements = document.querySelectorAll("[gpt-link-text]");
      // loop through all elements with the `gpt-link-text` attribute
      for (const element of elements) {
        if (!isHTMLElement(element)) {
          continue;
        }
        // find the element that contains the targeted link text
        if (
          element
            .getAttribute("gpt-link-text")
            ?.includes(linkText.trim().toLowerCase())
        ) {
          // This if statement handles the case where the link opens in a new tab
          if (element.getAttribute("target") === "_blank") {
            return element.getAttribute("gpt-link-text");
          }
          // highlight and perform the click action
          element.style.backgroundColor = "rgba(255,255,0,0.25)";
          element.click();
          return;
        }
      }
      // only reached if the loop ends without returning
      throw new Error(`Link with text not found: "${linkText}"`);
    }, linkText);
    return clickResponse;
  } catch (err) {
    if (err instanceof Error) {
      throw err;
    }
  }
};
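The matching rule inside clickOnLink is a substring test, so the first annotated element whose identifier contains the requested text wins, even when a later element matches it exactly. The sketch below reproduces that comparison on a plain array so it can run without a browser. Both helper names and the sample identifiers are hypothetical, and the exact-match-first variant is only one possible refinement, not something the original code does.

```typescript
// Hypothetical helper: same comparison as clickOnLink (trim, lowercase,
// substring test), applied to identifiers listed in document order.
const findFirstMatch = (
  identifiers: string[],
  linkText: string
): string | undefined => {
  const needle = linkText.trim().toLowerCase();
  return identifiers.find((identifier) => identifier.includes(needle));
};

// Possible refinement (an assumption, not the article's behaviour):
// prefer an exact match and fall back to the substring test.
const findExactThenPartial = (
  identifiers: string[],
  linkText: string
): string | undefined => {
  const needle = linkText.trim().toLowerCase();
  return (
    identifiers.find((identifier) => identifier === needle) ??
    identifiers.find((identifier) => identifier.includes(needle))
  );
};

const sampleIdentifiers = ["pricing plans", "pricing", "contact us"];

console.log(findFirstMatch(sampleIdentifiers, " Pricing ")); // "pricing plans"
console.log(findExactThenPartial(sampleIdentifiers, " Pricing ")); // "pricing"
console.log(findFirstMatch(sampleIdentifiers, "blog")); // undefined
```

Whether the looser substring match is acceptable depends on the pages the agent visits. It tolerates small differences in the text the LLM returns, at the cost of occasionally clicking a longer, similarly named link.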
Element Annotation Service
Let’s look deeper into the highlightInteractiveElements function that is called inside waitAndScreenshot.
It is a service that annotates the interactive HTML elements for the agent. It can highlight elements with a red bounding box and add unique identifiers to them.
Imagine giving your AI agent a special pair of glasses that lets it see the interactive spots on a website (the buttons, links, and fields) like glowing treasures on a treasure map.
That’s essentially what the highlightInteractiveElements function does. It’s like a highlighter for the digital world, sketching red boxes around clickable items and tagging them with digital nametags.
With the annotation, the accuracy of the agent’s interpretation of the image is greatly improved. This concept is called Set-of-Mark Prompting.
Here is an example of the annotated screenshot:
There is a research paper discussing the importance of this topic in detail: Set-of-Mark Prompting.
Here’s how it works:
- It starts by removing any old digital nametags (the HTML attribute gpt-link-text) that might confuse our AI.
- Then, it lights up every clickable thing it finds with a red outline to help the AI spot where to ‘click’.
- Each interactive element gets a unique nametag. This tag/attribute will be used to identify the element that Puppeteer can later interact with.
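Because the nametag is derived from an element's text, it helps to see how that text is normalised and what happens when two elements share the same text. The following is a minimal sketch, assuming the same trim-and-lowercase rule the annotation code uses. The helper names toIdentifier and findDuplicateIdentifiers, and the sample strings, are hypothetical.

```typescript
// Same normalisation the annotation applies to an element's text content.
const toIdentifier = (textContent: string): string =>
  textContent.trim().toLowerCase();

// Hypothetical check: list identifiers that more than one element would share.
const findDuplicateIdentifiers = (texts: string[]): string[] => {
  const counts = new Map<string, number>();
  for (const text of texts) {
    const identifier = toIdentifier(text);
    counts.set(identifier, (counts.get(identifier) ?? 0) + 1);
  }
  const duplicates: string[] = [];
  counts.forEach((count, identifier) => {
    if (count > 1) {
      duplicates.push(identifier);
    }
  });
  return duplicates;
};

console.log(toIdentifier("  Sign In\n")); // "sign in"
console.log(findDuplicateIdentifiers(["  Read more ", "Read More", "Sign in"])); // ["read more"]
```

In other words, the identifier is only as unique as the visible text. On a page with several "Read more" links, they would all carry the same attribute value, and a text-based lookup would reach the first one.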
One key detail to remember is that when dealing with Puppeteer or any other testing framework that programmatically interacts with the web, the element with a link text may not be visible. Here is a simple example:
<div style="display: none">
  <a href="https://www.example.com">
    <span>Click me</span>
  </a>
</div>
The parent div is hidden, so the link is not visible. This element should be excluded. Recursively checking the parent elements is necessary to ensure the element is visible. See the graph below for the logic:
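The same parent-walking rule can be tried without a browser. The sketch below models the example markup with a tiny stand-in type and climbs from the element to the root, stopping at the first hidden ancestor. SketchNode and isNodeVisible are names invented for this illustration. In a real page the check would read computed styles instead, and it has to climb because display is not an inherited CSS property, so the link's own style does not report none.

```typescript
// Minimal stand-in for a DOM element: only the fields this check needs.
// This type is an assumption for the sketch, not a DOM or Puppeteer type.
interface SketchNode {
  name: string;
  display: string; // e.g. "block", "inline", "none"
  parent: SketchNode | null;
}

// An element is visible only if neither it nor any ancestor is hidden.
const isNodeVisible = (node: SketchNode): boolean => {
  let current: SketchNode | null = node;
  while (current !== null) {
    if (current.display === "none") {
      return false;
    }
    current = current.parent;
  }
  return true;
};

// Mirrors the markup above: div (hidden) > a > span.
const hiddenDiv: SketchNode = { name: "div", display: "none", parent: null };
const link: SketchNode = { name: "a", display: "inline", parent: hiddenDiv };
const span: SketchNode = { name: "span", display: "inline", parent: link };

// The same link under a visible parent.
const shownDiv: SketchNode = { name: "div", display: "block", parent: null };
const shownLink: SketchNode = { name: "a", display: "inline", parent: shownDiv };

console.log(isNodeVisible(link)); // false
console.log(isNodeVisible(span)); // false
console.log(isNodeVisible(shownLink)); // true
```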
Code implementation of the highlightInteractiveElements function
import { Page } from "puppeteer";

const INTERACTIVE_ELEMENTS = [
  "a",
  "button",
  "input",
  "textarea",
  "[role=button]",
  "[role=treeitem]",
  '[onclick]:not([onclick=""])',
];
/**
 * Remove the unique identifier attribute from previously annotated elements
 * @param page
 */
const resetUniqueIdentifierAttribute = async (page: Page): Promise<void> => {
  await page.evaluate(() => {
    const UNIQUE_IDENTIFIER_ATTRIBUTE = "gpt-link-text";
    const elements = document.querySelectorAll(
      `[${UNIQUE_IDENTIFIER_ATTRIBUTE}]`
    );
    for (const element of elements) {
      element.removeAttribute(UNIQUE_IDENTIFIER_ATTRIBUTE);
    }
  });
};
/**
 * This function annotates all the interactive elements on the page
 * @param page
 */
const annotateAllInteractiveElements = async (page: Page) => {
  // The $$eval method runs Array.from(document.querySelectorAll(selector)) within the `page` and passes the result as the first argument to the pageFunction.
  // If no elements match the selector, the first argument to the pageFunction is [].
  await page.$$eval(
    INTERACTIVE_ELEMENTS.join(", "), // the selector can be defined outside the browser context
    // the argument `elements` can be an empty array if no elements match the selector
    function (elements) {
      // any console.log will not be visible in the node terminal
      // instead, it will be visible in the browser console
      // handle empty array
      if (elements.length === 0) {
        throw new Error("No elements found");
      }
      //======================================VALIDATE ELEMENT CAN INTERACT=================================================
      // This run-time check must be defined inside the pageFunction as it is running in the browser context. If defined outside, it will throw an error: "ReferenceError: isHTMLElement is not defined"
      const isHTMLElement = (element: Element): element is HTMLElement => {
        // this type guard allows an Element to be treated as an HTMLElement, which has the `style` property
        return element instanceof HTMLElement;
      };
      const isElementStyleVisible = (element: Element) => {
        const style = window.getComputedStyle(element);
        return (
          style.display !== "none" &&
          style.visibility !== "hidden" &&
          style.opacity !== "0" &&
          style.width !== "0px" &&
          style.height !== "0px"
        );
      };
      const isElementVisible = (element: Element | undefined | null) => {
        if (element === null || element === undefined) {
          throw new Error("isElementVisible: Element is null or undefined");
        }
        // walk up the tree: the element is visible only if every ancestor is visible
        let currentElement: Element | null = element;
        while (currentElement) {
          if (!isElementStyleVisible(currentElement)) {
            return false;
          }
          currentElement = currentElement.parentElement;
        }
        return true;
      };
      //========================================PREPARE UNIQUE IDENTIFIER================================================
      const setUniqueIdentifierBasedOnTextContent = (element: Element) => {
        const UNIQUE_IDENTIFIER_ATTRIBUTE = "gpt-link-text";
        const { textContent } = element;
        // if the node is a document or doctype, textContent will be null
        if (textContent === null) {
          return;
        }
        element.setAttribute(
          UNIQUE_IDENTIFIER_ATTRIBUTE,
          textContent.trim().toLowerCase()
        );
      };
      //========================================HIGHLIGHT INTERACTIVE ELEMENTS================================================
      for (const element of elements) {
        if (isHTMLElement(element)) {
          // highlight all the interactive elements with a red bounding box
          element.style.outline = "2px solid red";
        }
        // assign a unique identifier to the element
        if (isElementVisible(element)) {
          // set a unique identifier attribute on the element
          // this attribute will be used to identify the element that puppeteer should interact with
          setUniqueIdentifierBasedOnTextContent(element);
        }
      }
    }
  );
};
/**
 * This function highlights all the interactive elements on the page
 * @param page
 */
export const highlightInteractiveElements = async (page: Page) => {
  await resetUniqueIdentifierAttribute(page);
  await annotateAllInteractiveElements(page);
};
In this article, we have gone through the architecture of the AI agent, the code implementation of each step, and some concepts behind the design, such as Set-of-Mark Prompting. The agent is an elegant system that requires careful orchestration of different services to work effectively, and currently it has plenty of issues and limitations. If you have any questions or suggestions, please feel free to reach out to me. I would be happy to discuss this topic further.
Jason Li (Tianyi Li, LinkedIn) is a Full-stack Developer working at Mindset Health in Melbourne, Australia. Jason is passionate about AI, front-end development and space-related technologies.
Selina Li (Selina Li, LinkedIn) is a Principal Data Engineer working at Officeworks in Melbourne, Australia. Selina is passionate about AI/ML, data engineering and investment.
Jason and Selina would love to explore technologies to help people achieve their goals.
Unless otherwise noted, all images are by the authors.