diff --git a/README.md b/README.md
index 55a888b..0fdf3b9 100644
--- a/README.md
+++ b/README.md
@@ -4,12 +4,7 @@
-
-
🎉We are going open source🎉
-
- Let us know if you're interested in contributing! We're working on integrating the core logic for getting elements and extraction into the sdk!
-
-
+> **Notice:** The Dendrite SDK is no longer under active development. However, the project will remain fully open source so that you and others can learn from it. Feel free to fork, study, or adapt this code for your own projects. Reach out to us on Discord if you have questions! We love chatting about web AI agents. 🤖
## What is Dendrite?
@@ -24,33 +19,60 @@
#### A simple outlook integration
-With Dendrite, it's easy to create web interaction tools for your agent.
+With Dendrite it's easy to create web interaction tools for your agent.
+
+Here's how you can send an email:
```python
-from dendrite import Dendrite
+from dendrite import AsyncDendrite
+
-def send_email():
- client = Dendrite(auth="outlook.live.com")
+async def send_email(to, subject, message):
+ client = AsyncDendrite(auth="outlook.live.com")
# Navigate
- client.goto(
- "https://outlook.live.com/mail/0/",
- expected_page="An email inbox"
+ await client.goto(
+ "https://outlook.live.com/mail/0/", expected_page="An email inbox"
)
# Create new email and populate fields
- client.click("The new email button")
- client.fill_fields({
- "Recipient": to,
- "Subject": subject,
- "Message": message
- })
+ await client.click("The new email button")
+ await client.fill("The recipient field", to)
+ await client.press("Enter")
+ await client.fill("The subject field", subject)
+ await client.fill("The message field", message)
# Send email
- client.click("The send button")
+ await client.press("Enter", hold_cmd=True)
+
+
+if __name__ == "__main__":
+ import asyncio
+
+ asyncio.run(send_email("test@example.com", "Hello", "This is a test email"))
+
```
-To authenticate you'll need to use our Chrome Extension **Dendrite Vault**, you can download it [here](https://chromewebstore.google.com/detail/dendrite-vault/faflkoombjlhkgieldilpijjnblgabnn). Read more about authentication [in our docs](https://docs.dendrite.systems/examples/authentication-instagram).
+You'll need to add your own Anthropic API key to a `.env` file, or [configure which LLMs to use yourself](https://docs.dendrite.systems/concepts/config):
+
+
+```bash
+ANTHROPIC_API_KEY=sk-...
+```
+
+To **authenticate** on any web service with Dendrite, follow these steps:
+
+1. Run the authentication command
+
+ ```bash
+ dendrite auth --url outlook.live.com
+ ```
+
+2. This command opens a browser window where you can log in to the service.
+
+3. After you've logged in, press Enter in your terminal. Your cookies will be saved locally so they can be reused in your code (see the sketch below).
+
+Read more about authentication [in our docs](https://docs.dendrite.systems/examples/authentication).
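+
+Once saved, the cookies are reused whenever you pass the same domain to the `auth` argument, just like in the email example above. Here's a minimal sketch (assuming you've already run `dendrite auth --url outlook.live.com`):
+
+```python
+from dendrite import AsyncDendrite
+
+
+async def open_inbox():
+    # Reuses the session saved by `dendrite auth --url outlook.live.com`
+    client = AsyncDendrite(auth="outlook.live.com")
+    await client.goto(
+        "https://outlook.live.com/mail/0/", expected_page="An email inbox"
+    )
+
+
+if __name__ == "__main__":
+    import asyncio
+
+    asyncio.run(open_inbox())
+```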
## Quickstart
@@ -60,81 +82,159 @@ pip install dendrite && dendrite install
#### Simple navigation and interaction
-Initialize the Dendrite client and start doing web interactions without boilerplate.
-
-[Get your API key here](https://dendrite.systems/app)
-
```python
-from dendrite import Dendrite
+from dendrite import AsyncDendrite
+
+async def main():
+ client = AsyncDendrite()
-client = Dendrite(dendrite_api_key="sk...")
+ await client.goto("https://google.com")
+ await client.fill("Search field", "Hello world")
+ await client.press("Enter")
-client.goto("https://google.com")
-client.fill("Search field", "Hello world")
-client.press("Enter")
+if __name__ == "__main__":
+ import asyncio
+ asyncio.run(main())
```
In the example above, we go to Google, populate the search field with "Hello world" and simulate an Enter keypress. It's a small example, but it hints at the endless possibilities with Dendrite: you can now create tools for your agents that have access to the full web, without depending on APIs.
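+
+As a sketch of what such a tool could look like, here's an illustrative function that searches the web and returns the results as markdown for an LLM. The function name and prompts are our own; the sketch only combines calls shown elsewhere in this README (`goto`, `fill`, `press`, `wait_for` and `markdown`):
+
+```python
+from dendrite import AsyncDendrite
+
+
+async def search_web(query: str) -> str:
+    """Illustrative agent tool: search Google and return the results as markdown."""
+    browser = AsyncDendrite()
+    await browser.goto("https://google.com")
+    await browser.fill("Search field", query)
+    await browser.press("Enter")
+    await browser.wait_for("the search results to load")
+    # Passing a prompt to markdown() returns only the matching part of the page
+    return await browser.markdown("the search results")
+```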
-## More powerful examples
+## More Examples
-Now, let's have some fun. Earlier we showed you a simple send_email example. And sending emails is cool, but if that's all our agent can do it kind of sucks. So let's create two cooler examples.
+### Get any page as markdown
-### Download Bank Transactions
+This is a simple example of how to get any page as markdown, great for feeding to an LLM.
+
+```python
+from dendrite import AsyncDendrite
+from dotenv import load_dotenv
+
+# Load ANTHROPIC_API_KEY (and any other keys) from your .env file
+load_dotenv()
+
+async def main():
+ browser = AsyncDendrite()
+
+ await browser.goto("https://dendrite.systems")
+ await browser.wait_for("the page to load")
+
+ # Get the entire page as markdown
+ md = await browser.markdown()
+ print(md)
+ print("=" * 200)
+
+ # Only get a certain part of the page as markdown
+ data_extraction_md = await browser.markdown("the part about data extraction")
+ print(data_extraction_md)
+
+if __name__ == "__main__":
+ import asyncio
+ asyncio.run(main())
+```
-First up, a tool that allows our AI agent to download our bank's monthly transactions so that they can be analyzed and compiled into a report that can be sent to stakeholders with `send_email`.
+### Get Company Data from Y Combinator
+
+The classic web data extraction test, made easy:
```python
-from dendrite import Dendrite
+from dendrite import AsyncDendrite
+import pprint
+import asyncio
-def get_transactions() -> str:
- client = Dendrite(auth="mercury.com")
- # Navigate and wait for loading
- client.goto(
- "https://app.mercury.com/transactions",
- expected_page="Dashboard with transactions"
- )
- client.wait_for("The transactions to finish loading")
+async def main():
+ browser = AsyncDendrite()
- # Modify filters
- client.click("The 'add filter' button")
- client.click("The 'show transactions for' dropdown")
- client.click("The 'this month' option")
+ # Navigate
+ await browser.goto("https://www.ycombinator.com/companies")
+
+ # Find and fill the search field with "AI agent"
+ await browser.fill(
+ "Search field", value="AI agent"
+    )  # The element selector is cached from previous runs
+ await browser.press("Enter")
+
+ # Extract startups with natural language description
+ # Once created by our agent, the same script will be cached and reused
+ startups = await browser.extract(
+ "All companies. Return a list of dicts with name, location, description and url"
+ )
+ pprint.pprint(startups, indent=2)
- # Download file
- client.click("The 'export filtered' button")
- transactions = client.get_download()
- # Save file locally
- path = "files/transactions.xlsx"
- transactions.save_as(path)
+if __name__ == "__main__":
+ asyncio.run(main())
- return path
+```
-def analyze_transactions(path: str):
- ...
+This returns:
```
+[ { 'description': 'Book accommodations around the world.',
+ 'location': 'San Francisco, CA, USA',
+ 'name': 'Airbnb',
+ 'url': 'https://www.ycombinator.com/companies/airbnb'},
+ { 'description': 'Digital Analytics Platform',
+ 'location': 'San Francisco, CA, USA',
+ 'name': 'Amplitude',
+ 'url': 'https://www.ycombinator.com/companies/amplitude'},
+...
+]
+```
+
-### Extract Google Analytics
+### Extract Data from Google Analytics
-Finally, it would be cool if we could add the amount of monthly visitors from Google Analytics to our report. We can do that by using the `extract` function:
+Here's how to get the number of monthly visitors from Google Analytics using the `extract` function:
```python
-def get_visitor_count() -> int:
- client = Dendrite(auth="analytics.google.com")
+async def get_visitor_count() -> int:
+ client = AsyncDendrite(auth="analytics.google.com")
- client.goto(
+ await client.goto(
"https://analytics.google.com/analytics/web",
expected_page="Google Analytics dashboard"
)
# The Dendrite extract agent will create a web script that is cached
# and reused. It will self-heal when the website updates
- visitor_count = client.extract("The amount of visitors this month", int)
+ visitor_count = await client.extract("The amount of visitors this month", int)
return visitor_count
```
+### Download Bank Transactions
+
+Here's a tool that allows an AI agent to download our bank's monthly transactions so that they can be analyzed and compiled into a report.
+
+```python
+from dendrite import AsyncDendrite
+
+async def get_transactions() -> str:
+ client = AsyncDendrite(auth="mercury.com")
+
+ # Navigate and wait for loading
+ await client.goto(
+ "https://app.mercury.com/transactions",
+ expected_page="Dashboard with transactions"
+ )
+ await client.wait_for("The transactions to finish loading")
+
+ # Modify filters
+ await client.click("The 'add filter' button")
+ await client.click("The 'show transactions for' dropdown")
+ await client.click("The 'this month' option")
+
+ # Download file
+ await client.click("The 'export filtered' button")
+ transactions = await client.get_download()
+
+ # Save file locally
+ path = "files/transactions.xlsx"
+ await transactions.save_as(path)
+
+ return path
+
+async def analyze_transactions(path: str):
+    ...  # Analyze the transactions with an LLM of our choice
+```
+
+
## Documentation
[Read the full docs here](https://docs.dendrite.systems)
@@ -145,7 +245,7 @@ def get_visitor_count() -> int:
When you want to scale up your AI agents, we support using browsers hosted by Browserbase. This way you can run many agents in parallel without having to worry about the infrastructure.
-To start using Browserbase just swap out the `Dendrite` class with `DendriteRemoteBrowser` and add your Browserbase API key and project id, either in the code or in a `.env` file like this:
+To start using Browserbase, just swap out the `AsyncDendrite` class for `AsyncDendriteRemoteBrowser` and add your Browserbase API key and project ID, either in the code or in a `.env` file like this:
```bash
# ... previous keys
@@ -154,17 +254,19 @@ BROWSERBASE_PROJECT_ID=
```
```python
-# from dendrite import Dendrite
-from dendrite import DendriteRemoteBrowser
-
-...
-
-# client = Dendrite(...)
-client = DendriteRemoteBrowser(
- # Use interchangeably with the Dendrite class
- browserbase_api_key="...", # or specify the browsebase keys in the .env file
- browserbase_project_id="..."
-)
+# from dendrite import AsyncDendrite
+from dendrite import AsyncDendriteRemoteBrowser
+
+async def main():
+ # client = AsyncDendrite(...)
+ client = AsyncDendriteRemoteBrowser(
+ # Use interchangeably with the AsyncDendrite class
+        browserbase_api_key="...",  # or specify the Browserbase keys in the .env file
+ browserbase_project_id="..."
+ )
+ ...
-...
+if __name__ == "__main__":
+ import asyncio
+ asyncio.run(main())
```
diff --git a/dendrite/__init__.py b/dendrite/__init__.py
index 931c728..8184634 100644
--- a/dendrite/__init__.py
+++ b/dendrite/__init__.py
@@ -1,33 +1,22 @@
import sys
-from loguru import logger
-from dendrite.async_api import (
- AsyncDendrite,
- AsyncElement,
- AsyncPage,
- AsyncElementsResponse,
-)
-from dendrite.sync_api import (
+from dendrite._loggers.d_logger import logger
+from dendrite.browser.async_api import AsyncDendrite, AsyncElement, AsyncPage
+from dendrite.logic.config import Config
+
+from dendrite.browser.sync_api import (
Dendrite,
Element,
Page,
- ElementsResponse,
)
-logger.remove()
-
-fmt = "{time: HH:mm:ss.SSS} | {level: <8}- {message}"
-
-logger.add(sys.stderr, level="INFO", format=fmt)
-
__all__ = [
"AsyncDendrite",
"AsyncElement",
"AsyncPage",
- "AsyncElementsResponse",
"Dendrite",
"Element",
"Page",
- "ElementsResponse",
+ "Config",
]
diff --git a/dendrite/_common/_exceptions/__init__.py b/dendrite/_cli/__init__.py
similarity index 100%
rename from dendrite/_common/_exceptions/__init__.py
rename to dendrite/_cli/__init__.py
diff --git a/dendrite/_cli/main.py b/dendrite/_cli/main.py
index 370e4de..4dc798b 100644
--- a/dendrite/_cli/main.py
+++ b/dendrite/_cli/main.py
@@ -1,7 +1,11 @@
import argparse
+import asyncio
import subprocess
import sys
+from dendrite.browser.async_api import AsyncDendrite
+from dendrite.logic.config import Config
+
def run_playwright_install():
try:
@@ -17,14 +21,35 @@ def run_playwright_install():
sys.exit(1)
+async def setup_auth(url: str):
+ try:
+ async with AsyncDendrite() as browser:
+ await browser.setup_auth(
+ url=url,
+ message="Please log in to the website. Once done, press Enter to continue...",
+ )
+ except Exception as e:
+ print(f"Error during authentication setup: {e}")
+ sys.exit(1)
+
+
def main():
parser = argparse.ArgumentParser(description="Dendrite SDK CLI tool")
- parser.add_argument("command", choices=["install"], help="Command to execute")
+ parser.add_argument(
+ "command", choices=["install", "auth"], help="Command to execute"
+ )
+
+ # Add auth-specific arguments
+ parser.add_argument("--url", help="URL to navigate to for authentication")
args = parser.parse_args()
if args.command == "install":
run_playwright_install()
+ elif args.command == "auth":
+ if not args.url:
+ parser.error("The --url argument is required for the auth command")
+ asyncio.run(setup_auth(args.url))
if __name__ == "__main__":
diff --git a/dendrite/_loggers/d_logger.py b/dendrite/_loggers/d_logger.py
new file mode 100644
index 0000000..0ff3276
--- /dev/null
+++ b/dendrite/_loggers/d_logger.py
@@ -0,0 +1,7 @@
+import sys
+
+from loguru import logger
+
+logger.remove()
+fmt = "{time:HH:mm:ss.SSS} | {level: <8} | {message}"
+logger.add(sys.stderr, level="DEBUG", format=fmt)
diff --git a/dendrite/async_api/__init__.py b/dendrite/async_api/__init__.py
deleted file mode 100644
index 48accf0..0000000
--- a/dendrite/async_api/__init__.py
+++ /dev/null
@@ -1,12 +0,0 @@
-from loguru import logger
-from ._core.dendrite_browser import AsyncDendrite
-from ._core.dendrite_element import AsyncElement
-from ._core.dendrite_page import AsyncPage
-from ._core.models.response import AsyncElementsResponse
-
-__all__ = [
- "AsyncDendrite",
- "AsyncElement",
- "AsyncPage",
- "AsyncElementsResponse",
-]
diff --git a/dendrite/async_api/_api/_http_client.py b/dendrite/async_api/_api/_http_client.py
deleted file mode 100644
index 9e694a6..0000000
--- a/dendrite/async_api/_api/_http_client.py
+++ /dev/null
@@ -1,66 +0,0 @@
-import os
-from typing import Optional
-import httpx
-from loguru import logger
-
-
-from dendrite.async_api._core.models.api_config import APIConfig
-
-
-class HTTPClient:
- def __init__(self, api_config: APIConfig, session_id: Optional[str] = None):
- self.api_key = api_config.dendrite_api_key
- self.session_id = session_id
- self.base_url = self.resolve_base_url()
-
- def resolve_base_url(self):
- base_url = (
- "http://localhost:8000/api/v1"
- if os.environ.get("DENDRITE_DEV")
- else "https://dendrite-server.azurewebsites.net/api/v1"
- )
- return base_url
-
- async def send_request(
- self,
- endpoint: str,
- params: Optional[dict] = None,
- data: Optional[dict] = None,
- headers: Optional[dict] = None,
- method: str = "GET",
- ) -> httpx.Response:
- url = f"{self.base_url}/{endpoint}"
-
- headers = headers or {}
- headers["Content-Type"] = "application/json"
- if self.api_key:
- headers["Authorization"] = f"Bearer {self.api_key}"
- if self.session_id:
- headers["X-Session-ID"] = self.session_id
-
- async with httpx.AsyncClient(timeout=300) as client:
- try:
- response = await client.request(
- method, url, params=params, json=data, headers=headers
- )
- response.raise_for_status()
- # logger.debug(
- # f"{method} to '{url}', that took: { time.time() - start_time }\n\nResponse: {dict_res}\n\n"
- # )
- return response
- except httpx.HTTPStatusError as http_err:
- logger.debug(
- f"HTTP error occurred: {http_err.response.status_code}: {http_err.response.text}"
- )
- raise
- except httpx.ConnectError as connect_err:
- logger.error(
- f"Connection error occurred: {connect_err}. {url} Server might be down"
- )
- raise
- except httpx.RequestError as req_err:
- # logger.debug(f"Request error occurred: {req_err}")
- raise
- except Exception as err:
- # logger.debug(f"An error occurred: {err}")
- raise
diff --git a/dendrite/async_api/_api/browser_api_client.py b/dendrite/async_api/_api/browser_api_client.py
deleted file mode 100644
index 2de035d..0000000
--- a/dendrite/async_api/_api/browser_api_client.py
+++ /dev/null
@@ -1,120 +0,0 @@
-from typing import Optional
-
-from loguru import logger
-from dendrite.async_api._api.response.cache_extract_response import (
- CacheExtractResponse,
-)
-from dendrite.async_api._api.response.selector_cache_response import (
- SelectorCacheResponse,
-)
-from dendrite.async_api._core.models.authentication import AuthSession
-from dendrite.async_api._api.response.get_element_response import GetElementResponse
-from dendrite.async_api._api.dto.ask_page_dto import AskPageDTO
-from dendrite.async_api._api.dto.authenticate_dto import AuthenticateDTO
-from dendrite.async_api._api.dto.get_elements_dto import GetElementsDTO
-from dendrite.async_api._api.dto.make_interaction_dto import MakeInteractionDTO
-from dendrite.async_api._api.dto.extract_dto import ExtractDTO
-from dendrite.async_api._api.dto.try_run_script_dto import TryRunScriptDTO
-from dendrite.async_api._api.dto.upload_auth_session_dto import UploadAuthSessionDTO
-from dendrite.async_api._api.response.ask_page_response import AskPageResponse
-from dendrite.async_api._api.response.interaction_response import (
- InteractionResponse,
-)
-from dendrite.async_api._api.response.extract_response import ExtractResponse
-from dendrite.async_api._api._http_client import HTTPClient
-from dendrite._common._exceptions.dendrite_exception import (
- InvalidAuthSessionError,
-)
-from dendrite.async_api._api.dto.get_elements_dto import CheckSelectorCacheDTO
-
-
-class BrowserAPIClient(HTTPClient):
-
- async def authenticate(self, dto: AuthenticateDTO):
- res = await self.send_request(
- "actions/authenticate", data=dto.model_dump(), method="POST"
- )
-
- if res.status_code == 204:
- raise InvalidAuthSessionError(domain=dto.domains)
-
- return AuthSession(**res.json())
-
- async def upload_auth_session(self, dto: UploadAuthSessionDTO):
- await self.send_request(
- "actions/upload-auth-session", data=dto.dict(), method="POST"
- )
-
- async def check_selector_cache(
- self, dto: CheckSelectorCacheDTO
- ) -> SelectorCacheResponse:
- res = await self.send_request(
- "actions/check-selector-cache", data=dto.dict(), method="POST"
- )
- return SelectorCacheResponse(**res.json())
-
- async def get_interactions_selector(
- self, dto: GetElementsDTO
- ) -> GetElementResponse:
- res = await self.send_request(
- "actions/get-interaction-selector", data=dto.dict(), method="POST"
- )
- return GetElementResponse(**res.json())
-
- async def make_interaction(self, dto: MakeInteractionDTO) -> InteractionResponse:
- res = await self.send_request(
- "actions/make-interaction", data=dto.dict(), method="POST"
- )
- res_dict = res.json()
- return InteractionResponse(
- status=res_dict["status"], message=res_dict["message"]
- )
-
- async def check_extract_cache(self, dto: ExtractDTO) -> CacheExtractResponse:
- res = await self.send_request(
- "actions/check-extract-cache", data=dto.dict(), method="POST"
- )
- return CacheExtractResponse(**res.json())
-
- async def extract(self, dto: ExtractDTO) -> ExtractResponse:
- res = await self.send_request(
- "actions/extract-page", data=dto.dict(), method="POST"
- )
- res_dict = res.json()
- return ExtractResponse(
- status=res_dict["status"],
- message=res_dict["message"],
- return_data=res_dict["return_data"],
- created_script=res_dict.get("created_script", None),
- used_cache=res_dict.get("used_cache", False),
- )
-
- async def ask_page(self, dto: AskPageDTO) -> AskPageResponse:
- res = await self.send_request(
- "actions/ask-page", data=dto.dict(), method="POST"
- )
- res_dict = res.json()
- return AskPageResponse(
- status=res_dict["status"],
- description=res_dict["description"],
- return_data=res_dict["return_data"],
- )
-
- async def try_run_cached(self, dto: TryRunScriptDTO) -> Optional[ExtractResponse]:
- res = await self.send_request(
- "actions/try-run-cached", data=dto.dict(), method="POST"
- )
- if res is None:
- return None
- res_dict = res.json()
- loaded_value = res_dict["return_data"]
- if loaded_value is None:
- return None
-
- return ExtractResponse(
- status=res_dict["status"],
- message=res_dict["message"],
- return_data=loaded_value,
- created_script=res_dict.get("created_script", None),
- used_cache=res_dict.get("used_cache", False),
- )
diff --git a/dendrite/async_api/_api/dto/ask_page_dto.py b/dendrite/async_api/_api/dto/ask_page_dto.py
deleted file mode 100644
index 770d172..0000000
--- a/dendrite/async_api/_api/dto/ask_page_dto.py
+++ /dev/null
@@ -1,11 +0,0 @@
-from typing import Any, Optional
-from pydantic import BaseModel
-from dendrite.async_api._core.models.api_config import APIConfig
-from dendrite.async_api._core.models.page_information import PageInformation
-
-
-class AskPageDTO(BaseModel):
- prompt: str
- return_schema: Optional[Any]
- page_information: PageInformation
- api_config: APIConfig
diff --git a/dendrite/async_api/_api/dto/authenticate_dto.py b/dendrite/async_api/_api/dto/authenticate_dto.py
deleted file mode 100644
index f5a1de7..0000000
--- a/dendrite/async_api/_api/dto/authenticate_dto.py
+++ /dev/null
@@ -1,6 +0,0 @@
-from typing import Union
-from pydantic import BaseModel
-
-
-class AuthenticateDTO(BaseModel):
- domains: Union[str, list[str]]
diff --git a/dendrite/async_api/_api/dto/get_interaction_dto.py b/dendrite/async_api/_api/dto/get_interaction_dto.py
deleted file mode 100644
index 1d93432..0000000
--- a/dendrite/async_api/_api/dto/get_interaction_dto.py
+++ /dev/null
@@ -1,10 +0,0 @@
-from pydantic import BaseModel
-
-from dendrite.async_api._core.models.api_config import APIConfig
-from dendrite.async_api._core.models.page_information import PageInformation
-
-
-class GetInteractionDTO(BaseModel):
- page_information: PageInformation
- api_config: APIConfig
- prompt: str
diff --git a/dendrite/async_api/_api/dto/get_session_dto.py b/dendrite/async_api/_api/dto/get_session_dto.py
deleted file mode 100644
index 6414cc3..0000000
--- a/dendrite/async_api/_api/dto/get_session_dto.py
+++ /dev/null
@@ -1,7 +0,0 @@
-from typing import List
-from pydantic import BaseModel
-
-
-class GetSessionDTO(BaseModel):
- user_id: str
- domain: str
diff --git a/dendrite/async_api/_api/dto/google_search_dto.py b/dendrite/async_api/_api/dto/google_search_dto.py
deleted file mode 100644
index 8a16a1f..0000000
--- a/dendrite/async_api/_api/dto/google_search_dto.py
+++ /dev/null
@@ -1,12 +0,0 @@
-from typing import Optional
-from pydantic import BaseModel
-from dendrite.async_api._core.models.api_config import APIConfig
-from dendrite.async_api._core.models.page_information import PageInformation
-
-
-class GoogleSearchDTO(BaseModel):
- query: str
- country: Optional[str] = None
- filter_results_prompt: Optional[str] = None
- page_information: PageInformation
- api_config: APIConfig
diff --git a/dendrite/async_api/_api/dto/make_interaction_dto.py b/dendrite/async_api/_api/dto/make_interaction_dto.py
deleted file mode 100644
index 8edbc06..0000000
--- a/dendrite/async_api/_api/dto/make_interaction_dto.py
+++ /dev/null
@@ -1,19 +0,0 @@
-from typing import Literal, Optional
-from pydantic import BaseModel
-from dendrite.async_api._core.models.api_config import APIConfig
-from dendrite.async_api._core.models.page_diff_information import (
- PageDiffInformation,
-)
-
-
-InteractionType = Literal["click", "fill", "hover"]
-
-
-class MakeInteractionDTO(BaseModel):
- url: str
- dendrite_id: str
- interaction_type: InteractionType
- value: Optional[str] = None
- expected_outcome: Optional[str]
- page_delta_information: PageDiffInformation
- api_config: APIConfig
diff --git a/dendrite/async_api/_api/dto/try_run_script_dto.py b/dendrite/async_api/_api/dto/try_run_script_dto.py
deleted file mode 100644
index 2926401..0000000
--- a/dendrite/async_api/_api/dto/try_run_script_dto.py
+++ /dev/null
@@ -1,14 +0,0 @@
-from typing import Any, Optional
-from pydantic import BaseModel
-from dendrite.async_api._core.models.api_config import APIConfig
-
-
-class TryRunScriptDTO(BaseModel):
- url: str
- raw_html: str
- api_config: APIConfig
- prompt: str
- db_prompt: Optional[str] = (
- None # If you wish to cache a script based of a fixed prompt use this value
- )
- return_data_json_schema: Any
diff --git a/dendrite/async_api/_api/dto/upload_auth_session_dto.py b/dendrite/async_api/_api/dto/upload_auth_session_dto.py
deleted file mode 100644
index ecb68e1..0000000
--- a/dendrite/async_api/_api/dto/upload_auth_session_dto.py
+++ /dev/null
@@ -1,11 +0,0 @@
-from pydantic import BaseModel
-
-from dendrite.async_api._core.models.authentication import (
- AuthSession,
- StorageState,
-)
-
-
-class UploadAuthSessionDTO(BaseModel):
- auth_data: AuthSession
- storage_state: StorageState
diff --git a/dendrite/async_api/_api/response/cache_extract_response.py b/dendrite/async_api/_api/response/cache_extract_response.py
deleted file mode 100644
index 463d03b..0000000
--- a/dendrite/async_api/_api/response/cache_extract_response.py
+++ /dev/null
@@ -1,5 +0,0 @@
-from pydantic import BaseModel
-
-
-class CacheExtractResponse(BaseModel):
- exists: bool
diff --git a/dendrite/async_api/_api/response/extract_response.py b/dendrite/async_api/_api/response/extract_response.py
deleted file mode 100644
index ffc0e34..0000000
--- a/dendrite/async_api/_api/response/extract_response.py
+++ /dev/null
@@ -1,15 +0,0 @@
-from typing import Generic, Optional, TypeVar
-from pydantic import BaseModel
-
-from dendrite.async_api._common.status import Status
-
-
-T = TypeVar("T")
-
-
-class ExtractResponse(BaseModel, Generic[T]):
- return_data: T
- message: str
- created_script: Optional[str] = None
- status: Status
- used_cache: bool
diff --git a/dendrite/async_api/_api/response/google_search_response.py b/dendrite/async_api/_api/response/google_search_response.py
deleted file mode 100644
index d435b71..0000000
--- a/dendrite/async_api/_api/response/google_search_response.py
+++ /dev/null
@@ -1,12 +0,0 @@
-from typing import List
-from pydantic import BaseModel
-
-
-class SearchResult(BaseModel):
- url: str
- title: str
- description: str
-
-
-class GoogleSearchResponse(BaseModel):
- results: List[SearchResult]
diff --git a/dendrite/async_api/_api/response/interaction_response.py b/dendrite/async_api/_api/response/interaction_response.py
deleted file mode 100644
index 3d24a6a..0000000
--- a/dendrite/async_api/_api/response/interaction_response.py
+++ /dev/null
@@ -1,7 +0,0 @@
-from pydantic import BaseModel
-from dendrite.async_api._common.status import Status
-
-
-class InteractionResponse(BaseModel):
- message: str
- status: Status
diff --git a/dendrite/async_api/_api/response/selector_cache_response.py b/dendrite/async_api/_api/response/selector_cache_response.py
deleted file mode 100644
index 4c0e388..0000000
--- a/dendrite/async_api/_api/response/selector_cache_response.py
+++ /dev/null
@@ -1,5 +0,0 @@
-from pydantic import BaseModel
-
-
-class SelectorCacheResponse(BaseModel):
- exists: bool
diff --git a/dendrite/async_api/_api/response/session_response.py b/dendrite/async_api/_api/response/session_response.py
deleted file mode 100644
index 2d03b97..0000000
--- a/dendrite/async_api/_api/response/session_response.py
+++ /dev/null
@@ -1,7 +0,0 @@
-from typing import List
-from pydantic import BaseModel
-
-
-class SessionResponse(BaseModel):
- cookies: List[dict]
- origins_storage: List[dict]
diff --git a/dendrite/async_api/_core/_impl_browser.py b/dendrite/async_api/_core/_impl_browser.py
deleted file mode 100644
index c4e0f99..0000000
--- a/dendrite/async_api/_core/_impl_browser.py
+++ /dev/null
@@ -1,88 +0,0 @@
-from abc import ABC, abstractmethod
-from typing import TYPE_CHECKING
-
-if TYPE_CHECKING:
- from dendrite.async_api._core.dendrite_browser import AsyncDendrite
-
-from dendrite.async_api._core._type_spec import PlaywrightPage
-from playwright.async_api import Download, Browser, Playwright
-
-
-class ImplBrowser(ABC):
- @abstractmethod
- def __init__(self, settings):
- pass
- # self.settings = settings
-
- @abstractmethod
- async def get_download(
- self, dendrite_browser: "AsyncDendrite", pw_page: PlaywrightPage, timeout: float
- ) -> Download:
- """
- Retrieves the download event from the browser.
-
- Returns:
- Download: The download event.
-
- Raises:
- Exception: If there is an issue retrieving the download event.
- """
- pass
-
- @abstractmethod
- async def start_browser(self, playwright: Playwright, pw_options: dict) -> Browser:
- """
- Starts the browser session.
-
- Returns:
- Browser: The browser session.
-
- Raises:
- Exception: If there is an issue starting the browser session.
- """
- pass
-
- @abstractmethod
- async def configure_context(self, browser: "AsyncDendrite") -> None:
- """
- Configures the browser context.
-
- Args:
- browser (AsyncDendrite): The browser to configure.
-
- Raises:
- Exception: If there is an issue configuring the browser context.
- """
- pass
-
- @abstractmethod
- async def stop_session(self) -> None:
- """
- Stops the browser session.
-
- Raises:
- Exception: If there is an issue stopping the browser session.
- """
- pass
-
-
-class LocalImpl(ImplBrowser):
- def __init__(self) -> None:
- pass
-
- async def start_browser(self, playwright: Playwright, pw_options) -> Browser:
- return await playwright.chromium.launch(**pw_options)
-
- async def get_download(
- self,
- dendrite_browser: "AsyncDendrite",
- pw_page: PlaywrightPage,
- timeout: float,
- ) -> Download:
- return await dendrite_browser._download_handler.get_data(pw_page, timeout)
-
- async def configure_context(self, browser: "AsyncDendrite"):
- pass
-
- async def stop_session(self):
- pass
diff --git a/dendrite/async_api/_core/_impl_mapping.py b/dendrite/async_api/_core/_impl_mapping.py
deleted file mode 100644
index 3268943..0000000
--- a/dendrite/async_api/_core/_impl_mapping.py
+++ /dev/null
@@ -1,34 +0,0 @@
-from typing import Any, Dict, Optional, Type
-
-from dendrite.async_api._core._impl_browser import ImplBrowser, LocalImpl
-
-from dendrite.async_api._ext_impl.browserbase._impl import BrowserBaseImpl
-from dendrite.async_api._ext_impl.browserless._impl import BrowserlessImpl
-from dendrite.remote.browserless_config import BrowserlessConfig
-from dendrite.remote.browserbase_config import BrowserbaseConfig
-from dendrite.remote import Providers
-
-IMPL_MAPPING: Dict[Type[Providers], Type[ImplBrowser]] = {
- BrowserbaseConfig: BrowserBaseImpl,
- BrowserlessConfig: BrowserlessImpl,
- # BFloatProviderConfig: ,
-}
-
-SETTINGS_CLASSES: Dict[str, Type[Providers]] = {
- "browserbase": BrowserbaseConfig,
- "browserless": BrowserlessConfig,
-}
-
-
-def get_impl(remote_provider: Optional[Providers]) -> ImplBrowser:
- if remote_provider is None:
- return LocalImpl()
-
- try:
- provider_class = IMPL_MAPPING[type(remote_provider)]
- except KeyError:
- raise ValueError(
- f"No implementation for {type(remote_provider)}. Available providers: {', '.join(map(lambda x: x.__name__, IMPL_MAPPING.keys()))}"
- )
-
- return provider_class(remote_provider)
diff --git a/dendrite/async_api/_core/_type_spec.py b/dendrite/async_api/_core/_type_spec.py
deleted file mode 100644
index 8252e08..0000000
--- a/dendrite/async_api/_core/_type_spec.py
+++ /dev/null
@@ -1,44 +0,0 @@
-import inspect
-from typing import Any, Dict, Literal, Type, TypeVar, Union
-from pydantic import BaseModel
-from playwright.async_api import Page
-
-
-Interaction = Literal["click", "fill", "hover"]
-
-T = TypeVar("T")
-PydanticModel = TypeVar("PydanticModel", bound=BaseModel)
-PrimitiveTypes = PrimitiveTypes = Union[Type[bool], Type[int], Type[float], Type[str]]
-JsonSchema = Dict[str, Any]
-TypeSpec = Union[PrimitiveTypes, PydanticModel, JsonSchema]
-
-PlaywrightPage = Page
-
-
-def to_json_schema(type_spec: TypeSpec) -> Dict[str, Any]:
- if isinstance(type_spec, dict):
- # Assume it's already a JSON schema
- return type_spec
- if inspect.isclass(type_spec) and issubclass(type_spec, BaseModel):
- # Convert Pydantic model to JSON schema
- return type_spec.model_json_schema()
- if type_spec in (bool, int, float, str):
- # Convert basic Python types to JSON schema
- type_map = {bool: "boolean", int: "integer", float: "number", str: "string"}
- return {"type": type_map[type_spec]}
-
- raise ValueError(f"Unsupported type specification: {type_spec}")
-
-
-def convert_to_type_spec(type_spec: TypeSpec, return_data: Any) -> TypeSpec:
- if isinstance(type_spec, type):
- if issubclass(type_spec, BaseModel):
- return type_spec.model_validate(return_data)
- if type_spec in (str, float, bool, int):
- return type_spec(return_data)
-
- raise ValueError(f"Unsupported type: {type_spec}")
- if isinstance(type_spec, dict):
- return return_data
-
- raise ValueError(f"Unsupported type specification: {type_spec}")
diff --git a/dendrite/async_api/_core/_utils.py b/dendrite/async_api/_core/_utils.py
deleted file mode 100644
index f030135..0000000
--- a/dendrite/async_api/_core/_utils.py
+++ /dev/null
@@ -1,123 +0,0 @@
-from typing import Optional, Union, List, TYPE_CHECKING
-from playwright.async_api import FrameLocator, ElementHandle, Error, Frame
-from bs4 import BeautifulSoup
-from loguru import logger
-
-from dendrite.async_api._api.response.get_element_response import GetElementResponse
-from dendrite.async_api._core._type_spec import PlaywrightPage
-from dendrite.async_api._core.dendrite_element import AsyncElement
-from dendrite.async_api._core.models.response import AsyncElementsResponse
-
-if TYPE_CHECKING:
- from dendrite.async_api._core.dendrite_page import AsyncPage
-
-from dendrite.async_api._core._js import (
- GENERATE_DENDRITE_IDS_IFRAME_SCRIPT,
-)
-from dendrite.async_api._dom.util.mild_strip import mild_strip_in_place
-
-
-async def expand_iframes(
- page: PlaywrightPage,
- page_soup: BeautifulSoup,
-):
- async def get_iframe_path(frame: Frame):
- path_parts = []
- current_frame = frame
- while current_frame.parent_frame is not None:
- iframe_element = await current_frame.frame_element()
- iframe_id = await iframe_element.get_attribute("d-id")
- if iframe_id is None:
- # If any iframe_id in the path is None, we cannot build the path
- return None
- path_parts.insert(0, iframe_id)
- current_frame = current_frame.parent_frame
- return "|".join(path_parts)
-
- for frame in page.frames:
- if frame.parent_frame is None:
- continue # Skip the main frame
- iframe_element = await frame.frame_element()
- iframe_id = await iframe_element.get_attribute("d-id")
- if iframe_id is None:
- continue
- iframe_path = await get_iframe_path(frame)
- if iframe_path is None:
- continue
- try:
- await frame.evaluate(
- GENERATE_DENDRITE_IDS_IFRAME_SCRIPT, {"frame_path": iframe_path}
- )
- frame_content = await frame.content()
- frame_tree = BeautifulSoup(frame_content, "lxml")
- mild_strip_in_place(frame_tree)
- merge_iframe_to_page(iframe_id, page_soup, frame_tree)
- except Error as e:
- logger.debug(f"Error processing frame {iframe_id}: {e}")
- continue
-
-
-def merge_iframe_to_page(
- iframe_id: str,
- page: BeautifulSoup,
- iframe: BeautifulSoup,
-):
- iframe_element = page.find("iframe", {"d-id": iframe_id})
- if iframe_element is None:
- logger.debug(f"Could not find iframe with ID {iframe_id} in page soup")
- return
-
- iframe_element.replace_with(iframe)
-
-
-async def _get_all_elements_from_selector_soup(
- selector: str, soup: BeautifulSoup, page: "AsyncPage"
-) -> List[AsyncElement]:
- dendrite_elements: List[AsyncElement] = []
-
- elements = soup.select(selector)
-
- for element in elements:
- frame = page._get_context(element)
- d_id = element.get("d-id", "")
- locator = frame.locator(f"xpath=//*[@d-id='{d_id}']")
-
- if not d_id:
- continue
-
- if isinstance(d_id, list):
- d_id = d_id[0]
- dendrite_elements.append(
- AsyncElement(d_id, locator, page.dendrite_browser, page._browser_api_client)
- )
-
- return dendrite_elements
-
-
-async def get_elements_from_selectors_soup(
- page: "AsyncPage",
- soup: BeautifulSoup,
- res: GetElementResponse,
- only_one: bool,
-) -> Union[Optional[AsyncElement], List[AsyncElement], AsyncElementsResponse]:
- if isinstance(res.selectors, dict):
- result = {}
- for key, selectors in res.selectors.items():
- for selector in selectors:
- dendrite_elements = await _get_all_elements_from_selector_soup(
- selector, soup, page
- )
- if len(dendrite_elements) > 0:
- result[key] = dendrite_elements[0]
- break
- return AsyncElementsResponse(result)
- elif isinstance(res.selectors, list):
- for selector in reversed(res.selectors):
- dendrite_elements = await _get_all_elements_from_selector_soup(
- selector, soup, page
- )
-
- if len(dendrite_elements) > 0:
- return dendrite_elements[0] if only_one else dendrite_elements
-
- return None
diff --git a/dendrite/async_api/_core/mixin/extract.py b/dendrite/async_api/_core/mixin/extract.py
deleted file mode 100644
index a0c4347..0000000
--- a/dendrite/async_api/_core/mixin/extract.py
+++ /dev/null
@@ -1,253 +0,0 @@
-import asyncio
-import time
-from typing import Any, Optional, Type, overload, List
-from dendrite.async_api._api.dto.extract_dto import ExtractDTO
-from dendrite.async_api._api.response.cache_extract_response import (
- CacheExtractResponse,
-)
-from dendrite.async_api._api.response.extract_response import ExtractResponse
-from dendrite.async_api._core._type_spec import (
- JsonSchema,
- PydanticModel,
- TypeSpec,
- convert_to_type_spec,
- to_json_schema,
-)
-from dendrite.async_api._core.protocol.page_protocol import DendritePageProtocol
-from dendrite.async_api._core._managers.navigation_tracker import NavigationTracker
-from loguru import logger
-
-
-CACHE_TIMEOUT = 5
-
-
-class ExtractionMixin(DendritePageProtocol):
- """
- Mixin that provides extraction functionality for web pages.
-
- This mixin provides various `extract` methods that allow extracting
- different types of data (e.g., bool, int, float, string, Pydantic models, etc.)
- from a web page based on a given prompt.
- """
-
- @overload
- async def extract(
- self,
- prompt: str,
- type_spec: Type[bool],
- use_cache: bool = True,
- timeout: int = 180,
- ) -> bool: ...
-
- @overload
- async def extract(
- self,
- prompt: str,
- type_spec: Type[int],
- use_cache: bool = True,
- timeout: int = 180,
- ) -> int: ...
-
- @overload
- async def extract(
- self,
- prompt: str,
- type_spec: Type[float],
- use_cache: bool = True,
- timeout: int = 180,
- ) -> float: ...
-
- @overload
- async def extract(
- self,
- prompt: str,
- type_spec: Type[str],
- use_cache: bool = True,
- timeout: int = 180,
- ) -> str: ...
-
- @overload
- async def extract(
- self,
- prompt: Optional[str],
- type_spec: Type[PydanticModel],
- use_cache: bool = True,
- timeout: int = 180,
- ) -> PydanticModel: ...
-
- @overload
- async def extract(
- self,
- prompt: Optional[str],
- type_spec: JsonSchema,
- use_cache: bool = True,
- timeout: int = 180,
- ) -> JsonSchema: ...
-
- @overload
- async def extract(
- self,
- prompt: str,
- type_spec: None = None,
- use_cache: bool = True,
- timeout: int = 180,
- ) -> Any: ...
-
- async def extract(
- self,
- prompt: Optional[str],
- type_spec: Optional[TypeSpec] = None,
- use_cache: bool = True,
- timeout: int = 180,
- ) -> TypeSpec:
- """
- Extract data from a web page based on a prompt and optional type specification.
- Args:
- prompt (Optional[str]): The prompt to describe the information to extract.
- type_spec (Optional[TypeSpec], optional): The type specification for the extracted data.
- use_cache (bool, optional): Whether to use cached results. Defaults to True.
- timeout (int, optional): Maximum time in milliseconds for the entire operation. If use_cache=True,
- up to 5000ms will be spent attempting to use cached scripts before falling back to the
- extraction agent for the remaining time that will attempt to generate a new script. Defaults to 15000 (15 seconds).
-
- Returns:
- ExtractResponse: The extracted data wrapped in a ExtractResponse object.
- Raises:
- TimeoutError: If the extraction process exceeds the specified timeout.
- """
-
- logger.info(f"Starting extraction with prompt: {prompt}")
-
- json_schema = None
- if type_spec:
- json_schema = to_json_schema(type_spec)
- logger.debug(f"Type specification converted to JSON schema: {json_schema}")
-
- if prompt is None:
- prompt = ""
-
- start_time = time.time()
- page = await self._get_page()
- navigation_tracker = NavigationTracker(page)
- navigation_tracker.start_nav_tracking()
-
- # Check if a script exists in the cache
- if use_cache:
- cache_available = await check_if_extract_cache_available(
- self, prompt, json_schema
- )
-
- if cache_available:
- logger.info("Cache available, attempting to use cached extraction")
- result = await attempt_extraction_with_backoff(
- self,
- prompt,
- json_schema,
- remaining_timeout=CACHE_TIMEOUT,
- only_use_cache=True,
- )
- if result:
- return convert_and_return_result(result, type_spec)
-
- logger.info(
- "Using extraction agent to perform extraction, since no cache was found or failed."
- )
- result = await attempt_extraction_with_backoff(
- self,
- prompt,
- json_schema,
- remaining_timeout=timeout - (time.time() - start_time),
- only_use_cache=False,
- )
-
- if result:
- return convert_and_return_result(result, type_spec)
-
- logger.error(f"Extraction failed after {time.time() - start_time:.2f} seconds")
- return None
-
-
-async def check_if_extract_cache_available(
- obj: DendritePageProtocol, prompt: str, json_schema: Optional[JsonSchema]
-) -> bool:
- page = await obj._get_page()
- page_information = await page.get_page_information(include_screenshot=False)
- dto = ExtractDTO(
- page_information=page_information,
- api_config=obj._get_dendrite_browser().api_config,
- prompt=prompt,
- return_data_json_schema=json_schema,
- )
- cache_response: CacheExtractResponse = (
- await obj._get_browser_api_client().check_extract_cache(dto)
- )
- return cache_response.exists
-
-
-async def attempt_extraction_with_backoff(
- obj: DendritePageProtocol,
- prompt: str,
- json_schema: Optional[JsonSchema],
- remaining_timeout: float = 180.0,
- only_use_cache: bool = False,
-) -> Optional[ExtractResponse]:
- TIMEOUT_INTERVAL: List[float] = [0.15, 0.45, 1.0, 2.0, 4.0, 8.0]
- total_elapsed_time = 0
- start_time = time.time()
-
- for current_timeout in TIMEOUT_INTERVAL:
- if total_elapsed_time >= remaining_timeout:
- logger.error(f"Timeout reached after {total_elapsed_time:.2f} seconds")
- return None
-
- request_start_time = time.time()
- page = await obj._get_page()
- page_information = await page.get_page_information(
- include_screenshot=not only_use_cache
- )
- extract_dto = ExtractDTO(
- page_information=page_information,
- api_config=obj._get_dendrite_browser().api_config,
- prompt=prompt,
- return_data_json_schema=json_schema,
- use_screenshot=True,
- use_cache=only_use_cache,
- force_use_cache=only_use_cache,
- )
-
- res = await obj._get_browser_api_client().extract(extract_dto)
- request_duration = time.time() - request_start_time
-
- if res.status == "impossible":
- logger.error(f"Impossible to extract data. Reason: {res.message}")
- return None
-
- if res.status == "success":
- logger.success(
- f"Extraction successful: '{res.message}'\nUsed cache: {res.used_cache}\nUsed script:\n\n{res.created_script}"
- )
- return res
-
- sleep_duration = max(0, current_timeout - request_duration)
- logger.info(
- f"Extraction attempt failed. Status: {res.status}\nMessage: {res.message}\nSleeping for {sleep_duration:.2f} seconds"
- )
- await asyncio.sleep(sleep_duration)
- total_elapsed_time = time.time() - start_time
-
- logger.error(
- f"All extraction attempts failed after {total_elapsed_time:.2f} seconds"
- )
- return None
-
-
-def convert_and_return_result(
- res: ExtractResponse, type_spec: Optional[TypeSpec]
-) -> TypeSpec:
- converted_res = res.return_data
- if type_spec is not None:
- logger.debug("Converting extraction result to specified type")
- converted_res = convert_to_type_spec(type_spec, res.return_data)
-
- logger.info("Extraction process completed successfully")
- return converted_res
diff --git a/dendrite/async_api/_core/mixin/get_element.py b/dendrite/async_api/_core/mixin/get_element.py
deleted file mode 100644
index 4d54f67..0000000
--- a/dendrite/async_api/_core/mixin/get_element.py
+++ /dev/null
@@ -1,340 +0,0 @@
-import asyncio
-import time
-from typing import Dict, List, Literal, Optional, Union, overload
-
-from loguru import logger
-
-from dendrite.async_api._api.dto.get_elements_dto import GetElementsDTO
-from dendrite.async_api._api.response.get_element_response import GetElementResponse
-from dendrite.async_api._api.dto.get_elements_dto import CheckSelectorCacheDTO
-from dendrite.async_api._core._utils import get_elements_from_selectors_soup
-from dendrite.async_api._core.dendrite_element import AsyncElement
-from dendrite.async_api._core.models.response import AsyncElementsResponse
-from dendrite.async_api._core.protocol.page_protocol import DendritePageProtocol
-from dendrite.async_api._core.models.api_config import APIConfig
-
-
-CACHE_TIMEOUT = 5
-
-
-class GetElementMixin(DendritePageProtocol):
- @overload
- async def get_elements(
- self,
- prompt_or_elements: str,
- use_cache: bool = True,
- timeout: int = 15000,
- context: str = "",
- ) -> List[AsyncElement]:
- """
- Retrieves a list of Dendrite elements based on a string prompt.
-
- Args:
- prompt_or_elements (str): The prompt describing the elements to be retrieved.
- use_cache (bool, optional): Whether to use cached results. Defaults to True.
- timeout (int, optional): Maximum time in milliseconds for the entire operation. If use_cache=True,
- up to 5000ms will be spent attempting to use cached selectors before falling back to the
- find element agent for the remaining time. Defaults to 15000 (15 seconds).
- context (str, optional): Additional context for the retrieval. Defaults to an empty string.
-
- Returns:
- List[AsyncElement]: A list of Dendrite elements found on the page.
- """
-
- @overload
- async def get_elements(
- self,
- prompt_or_elements: Dict[str, str],
- use_cache: bool = True,
- timeout: int = 15000,
- context: str = "",
- ) -> AsyncElementsResponse:
- """
- Retrieves Dendrite elements based on a dictionary.
-
- Args:
- prompt_or_elements (Dict[str, str]): A dictionary where keys are field names and values are prompts describing the elements to be retrieved.
- use_cache (bool, optional): Whether to use cached results. Defaults to True.
- timeout (int, optional): Maximum time in milliseconds for the entire operation. If use_cache=True,
- up to 5000ms will be spent attempting to use cached selectors before falling back to the
- find element agent for the remaining time. Defaults to 15000 (15 seconds).
- context (str, optional): Additional context for the retrieval. Defaults to an empty string.
-
- Returns:
- AsyncElementsResponse: A response object containing the retrieved elements with attributes matching the keys in the dict.
- """
-
- async def get_elements(
- self,
- prompt_or_elements: Union[str, Dict[str, str]],
- use_cache: bool = True,
- timeout: int = 15000,
- context: str = "",
- ) -> Union[List[AsyncElement], AsyncElementsResponse]:
- """
- Retrieves Dendrite elements based on either a string prompt or a dictionary of prompts.
-
- This method determines the type of the input (string or dictionary) and retrieves the appropriate elements.
- If the input is a string, it fetches a list of elements. If the input is a dictionary, it fetches elements for each key-value pair.
-
- Args:
- prompt_or_elements (Union[str, Dict[str, str]]): The prompt or dictionary of prompts for element retrieval.
- use_cache (bool, optional): Whether to use cached results. Defaults to True.
- timeout (int, optional): Maximum time in milliseconds for the entire operation. If use_cache=True,
- up to 5000ms will be spent attempting to use cached selectors before falling back to the
- find element agent for the remaining time. Defaults to 15000 (15 seconds).
- context (str, optional): Additional context for the retrieval. Defaults to an empty string.
-
- Returns:
- Union[List[AsyncElement], AsyncElementsResponse]: A list of elements or a response object containing the retrieved elements.
-
- Raises:
- ValueError: If the input is neither a string nor a dictionary.
- """
-
- return await self._get_element(
- prompt_or_elements,
- only_one=False,
- use_cache=use_cache,
- timeout=timeout / 1000,
- )
-
- async def get_element(
- self,
- prompt: str,
- use_cache=True,
- timeout=15000,
- ) -> Optional[AsyncElement]:
- """
- Retrieves a single Dendrite element based on the provided prompt.
-
- Args:
- prompt (str): The prompt describing the element to be retrieved.
- use_cache (bool, optional): Whether to use cached results. Defaults to True.
- timeout (int, optional): Maximum time in milliseconds for the entire operation. If use_cache=True,
- up to 5000ms will be spent attempting to use cached selectors before falling back to the
- find element agent for the remaining time. Defaults to 15000 (15 seconds).
-
- Returns:
- AsyncElement: The retrieved element.
- """
- return await self._get_element(
- prompt,
- only_one=True,
- use_cache=use_cache,
- timeout=timeout / 1000,
- )
-
- @overload
- async def _get_element(
- self,
- prompt_or_elements: str,
- only_one: Literal[True],
- use_cache: bool,
- timeout,
- ) -> Optional[AsyncElement]:
- """
- Retrieves a single Dendrite element based on the provided prompt.
-
- Args:
- prompt (Union[str, Dict[str, str]]): The prompt describing the element to be retrieved.
- only_one (Literal[True]): Indicates that only one element should be retrieved.
- use_cache (bool): Whether to use cached results.
- timeout (int, optional): Maximum time in milliseconds for the entire operation. If use_cache=True,
- up to 5000ms will be spent attempting to use cached selectors before falling back to the
- find element agent for the remaining time. Defaults to 15000 (15 seconds).
-
- Returns:
- AsyncElement: The retrieved element.
- """
-
- @overload
- async def _get_element(
- self,
- prompt_or_elements: Union[str, Dict[str, str]],
- only_one: Literal[False],
- use_cache: bool,
- timeout,
- ) -> Union[List[AsyncElement], AsyncElementsResponse]:
- """
- Retrieves a list of Dendrite elements based on the provided prompt.
-
- Args:
- prompt (str): The prompt describing the elements to be retrieved.
- only_one (Literal[False]): Indicates that multiple elements should be retrieved.
- use_cache (bool): Whether to use cached results.
- timeout (int, optional): Maximum time in milliseconds for the entire operation. If use_cache=True,
- up to 5000ms will be spent attempting to use cached selectors before falling back to the
- find element agent for the remaining time. Defaults to 15000 (15 seconds).
-
- Returns:
- List[AsyncElement]: A list of retrieved elements.
- """
-
- async def _get_element(
- self,
- prompt_or_elements: Union[str, Dict[str, str]],
- only_one: bool,
- use_cache: bool,
- timeout: float,
- ) -> Union[
- Optional[AsyncElement],
- List[AsyncElement],
- AsyncElementsResponse,
- ]:
- """
- Retrieves Dendrite elements based on the provided prompt, either a single element or a list of elements.
-
- This method sends a request with the prompt and retrieves the elements based on the `only_one` flag.
-
- Args:
- prompt_or_elements (Union[str, Dict[str, str]]): The prompt or dictionary of prompts for element retrieval.
- only_one (bool): Whether to retrieve only one element or a list of elements.
- use_cache (bool): Whether to use cached results.
- timeout (int, optional): Maximum time in milliseconds for the entire operation. If use_cache=True,
- up to 5000ms will be spent attempting to use cached selectors before falling back to the
- find element agent for the remaining time. Defaults to 15000 (15 seconds).
-
- Returns:
- Union[AsyncElement, List[AsyncElement], AsyncElementsResponse]: The retrieved element, list of elements, or response object.
- """
-
- api_config = self._get_dendrite_browser().api_config
- start_time = time.time()
-
- # First, let's check if there is a cached selector
- page = await self._get_page()
- cache_available = await test_if_cache_available(
- self, prompt_or_elements, page.url
- )
-
- # If we have cached elements, attempt to use them with an exponentation backoff
- if cache_available and use_cache == True:
- logger.info(f"Cache available, attempting to use cached selectors")
- res = await attempt_with_backoff(
- self,
- prompt_or_elements,
- only_one,
- api_config,
- remaining_timeout=CACHE_TIMEOUT,
- only_use_cache=True,
- )
- if res:
- return res
- else:
- logger.debug(
- f"After attempting to use cached selectors several times without success, let's find the elements using the find element agent."
- )
-
- # Now that no cached selectors were found or they failed repeatedly, let's use the find element agent to find the requested elements.
- logger.info(
- "Proceeding to use the find element agent to find the requested elements."
- )
- res = await attempt_with_backoff(
- self,
- prompt_or_elements,
- only_one,
- api_config,
- remaining_timeout=timeout - (time.time() - start_time),
- only_use_cache=False,
- )
- if res:
- return res
-
- logger.error(
- f"Failed to retrieve elements within the specified timeout of {timeout} seconds"
- )
- return None
-
-
-async def test_if_cache_available(
- obj: DendritePageProtocol, prompt_or_elements: Union[str, Dict[str, str]], url: str
-) -> bool:
- dto = CheckSelectorCacheDTO(
- url=url,
- prompt=prompt_or_elements,
- )
- cache_available = await obj._get_browser_api_client().check_selector_cache(dto)
-
- return cache_available.exists
-
-
-async def attempt_with_backoff(
- obj: DendritePageProtocol,
- prompt_or_elements: Union[str, Dict[str, str]],
- only_one: bool,
- api_config: APIConfig,
- remaining_timeout: float,
- only_use_cache: bool = False,
-) -> Union[Optional[AsyncElement], List[AsyncElement], AsyncElementsResponse]:
- TIMEOUT_INTERVAL: List[float] = [0.15, 0.45, 1.0, 2.0, 4.0, 8.0]
- total_elapsed_time = 0
- start_time = time.time()
-
- for current_timeout in TIMEOUT_INTERVAL:
- if total_elapsed_time >= remaining_timeout:
- logger.error(f"Timeout reached after {total_elapsed_time:.2f} seconds")
- return None
-
- request_start_time = time.time()
- page = await obj._get_page()
- page_information = await page.get_page_information(
- include_screenshot=not only_use_cache
- )
- dto = GetElementsDTO(
- page_information=page_information,
- prompt=prompt_or_elements,
- api_config=api_config,
- use_cache=only_use_cache,
- only_one=only_one,
- force_use_cache=only_use_cache,
- )
- res = await obj._get_browser_api_client().get_interactions_selector(dto)
- request_duration = time.time() - request_start_time
-
- if res.status == "impossible":
- logger.error(
- f"Impossible to get elements for '{prompt_or_elements}'. Reason: {res.message}"
- )
- return None
-
- if res.status == "success":
- response = await get_elements_from_selectors_soup(
- page, await page._get_previous_soup(), res, only_one
- )
- if response:
- return response
-
- sleep_duration = max(0, current_timeout - request_duration)
- logger.info(
- f"Failed to get elements for prompt:\n\n'{prompt_or_elements}'\n\nStatus: {res.status}\n\nMessage: {res.message}\n\nSleeping for {sleep_duration:.2f} seconds"
- )
- await asyncio.sleep(sleep_duration)
- total_elapsed_time = time.time() - start_time
-
- logger.error(f"All attempts failed after {total_elapsed_time:.2f} seconds")
- return None
-
-
-async def get_elements_from_selectors(
- obj: DendritePageProtocol, res: GetElementResponse, only_one: bool
-) -> Union[Optional[AsyncElement], List[AsyncElement], AsyncElementsResponse]:
- if isinstance(res.selectors, dict):
- result = {}
- for key, selectors in res.selectors.items():
- for selector in selectors:
- page = await obj._get_page()
- dendrite_elements = await page._get_all_elements_from_selector(selector)
- if len(dendrite_elements) > 0:
- result[key] = dendrite_elements[0]
- break
- return AsyncElementsResponse(result)
- elif isinstance(res.selectors, list):
- for selector in reversed(res.selectors):
- page = await obj._get_page()
- dendrite_elements = await page._get_all_elements_from_selector(selector)
-
- if len(dendrite_elements) > 0:
- return dendrite_elements[0] if only_one else dendrite_elements
-
- return None
diff --git a/dendrite/async_api/_core/models/api_config.py b/dendrite/async_api/_core/models/api_config.py
deleted file mode 100644
index fd92cac..0000000
--- a/dendrite/async_api/_core/models/api_config.py
+++ /dev/null
@@ -1,33 +0,0 @@
-from typing import Optional
-from pydantic import BaseModel, model_validator
-
-from dendrite._common._exceptions.dendrite_exception import MissingApiKeyError
-
-
-class APIConfig(BaseModel):
- """
- Configuration model for API keys used in the Dendrite SDK.
-
- Attributes:
- dendrite_api_key (Optional[str]): The API key for Dendrite services.
- openai_api_key (Optional[str]): The API key for OpenAI services. If you wish to use your own API key, you can do so by passing it to the AsyncDendrite.
- anthropic_api_key (Optional[str]): The API key for Anthropic services. If you wish to use your own API key, you can do so by passing it to the AsyncDendrite.
-
- Raises:
- ValueError: If a valid dendrite_api_key is not provided.
- """
-
- dendrite_api_key: Optional[str] = None
- openai_api_key: Optional[str] = None
- anthropic_api_key: Optional[str] = None
-
- @model_validator(mode="before")
- def _check_api_keys(cls, values):
- dendrite_api_key = values.get("dendrite_api_key")
-
- if not dendrite_api_key:
- raise MissingApiKeyError(
- "A valid dendrite_api_key must be provided. Make sure you have set the DENDRITE_API_KEY environment variable or passed it to the AsyncDendrite."
- )
-
- return values
diff --git a/dendrite/async_api/_core/models/authentication.py b/dendrite/async_api/_core/models/authentication.py
deleted file mode 100644
index 3c2656e..0000000
--- a/dendrite/async_api/_core/models/authentication.py
+++ /dev/null
@@ -1,47 +0,0 @@
-from pydantic import BaseModel
-from typing import List, Literal, Optional
-from typing_extensions import TypedDict
-
-
-class Cookie(TypedDict, total=False):
- name: str
- value: str
- domain: str
- path: str
- expires: float
- httpOnly: bool
- secure: bool
- sameSite: Literal["Lax", "None", "Strict"]
-
-
-class LocalStorageEntry(TypedDict):
- name: str
- value: str
-
-
-class OriginState(TypedDict):
- origin: str
- localStorage: List[LocalStorageEntry]
-
-
-class StorageState(TypedDict, total=False):
- cookies: List[Cookie]
- origins: List[OriginState]
-
-
-class DomainState(BaseModel):
- domain: str
- storage_state: StorageState
-
-
-class AuthSession(BaseModel):
- user_agent: Optional[str]
- domain_states: List[DomainState]
-
- def to_storage_state(self) -> StorageState:
- cookies = []
- origins = []
- for domain_state in self.domain_states:
- cookies.extend(domain_state.storage_state.get("cookies", []))
- origins.extend(domain_state.storage_state.get("origins", []))
- return StorageState(cookies=cookies, origins=origins)
diff --git a/dendrite/async_api/_core/models/page_diff_information.py b/dendrite/async_api/_core/models/page_diff_information.py
deleted file mode 100644
index 786bbc3..0000000
--- a/dendrite/async_api/_core/models/page_diff_information.py
+++ /dev/null
@@ -1,7 +0,0 @@
-from pydantic import BaseModel
-from dendrite.async_api._core.models.page_information import PageInformation
-
-
-class PageDiffInformation(BaseModel):
- page_before: PageInformation
- page_after: PageInformation
diff --git a/dendrite/async_api/_core/models/page_information.py b/dendrite/async_api/_core/models/page_information.py
deleted file mode 100644
index 67e1909..0000000
--- a/dendrite/async_api/_core/models/page_information.py
+++ /dev/null
@@ -1,15 +0,0 @@
-from typing import Dict, Optional
-from typing_extensions import TypedDict
-from pydantic import BaseModel
-
-
-class InteractableElementInfo(TypedDict):
- attrs: Optional[str]
- text: Optional[str]
-
-
-class PageInformation(BaseModel):
- url: str
- raw_html: str
- screenshot_base64: str
- time_since_frame_navigated: float
diff --git a/dendrite/async_api/_core/models/response.py b/dendrite/async_api/_core/models/response.py
deleted file mode 100644
index 79b216f..0000000
--- a/dendrite/async_api/_core/models/response.py
+++ /dev/null
@@ -1,55 +0,0 @@
-from typing import Dict, Iterator
-
-from dendrite.async_api._core.dendrite_element import AsyncElement
-
-
-class AsyncElementsResponse:
- """
- AsyncElementsResponse is a class that encapsulates a dictionary of Dendrite elements,
- allowing for attribute-style access and other convenient interactions.
-
- This class is used to store and access the elements retrieved by the `get_elements` function.
- The attributes of this class dynamically match the keys of the dictionary passed to the `get_elements` function,
- allowing for direct attribute-style access to the corresponding `AsyncElement` objects.
-
- Attributes:
- _data (Dict[str, AsyncElement]): A dictionary where keys are the names of elements and values are the corresponding `AsyncElement` objects.
-
- Args:
- data (Dict[str, AsyncElement]): The dictionary of elements to be encapsulated by the class.
-
- Methods:
- __getattr__(name: str) -> AsyncElement:
- Allows attribute-style access to the elements in the dictionary.
-
- __getitem__(key: str) -> AsyncElement:
- Enables dictionary-style access to the elements.
-
- __iter__() -> Iterator[str]:
- Provides an iterator over the keys in the dictionary.
-
- __repr__() -> str:
- Returns a string representation of the class instance.
- """
-
- _data: Dict[str, AsyncElement]
-
- def __init__(self, data: Dict[str, AsyncElement]):
- self._data = data
-
- def __getattr__(self, name: str) -> AsyncElement:
- try:
- return self._data[name]
- except KeyError:
- raise AttributeError(
- f"'{self.__class__.__name__}' object has no attribute '{name}'"
- )
-
- def __getitem__(self, key: str) -> AsyncElement:
- return self._data[key]
-
- def __iter__(self) -> Iterator[str]:
- return iter(self._data)
-
- def __repr__(self) -> str:
- return f"{self.__class__.__name__}({self._data})"
diff --git a/dendrite/async_api/_core/protocol/page_protocol.py b/dendrite/async_api/_core/protocol/page_protocol.py
deleted file mode 100644
index 2aa9449..0000000
--- a/dendrite/async_api/_core/protocol/page_protocol.py
+++ /dev/null
@@ -1,20 +0,0 @@
-from typing import TYPE_CHECKING, Protocol
-
-from dendrite.async_api._api.browser_api_client import BrowserAPIClient
-
-if TYPE_CHECKING:
- from dendrite.async_api._core.dendrite_page import AsyncPage
- from dendrite.async_api._core.dendrite_browser import AsyncDendrite
-
-
-class DendritePageProtocol(Protocol):
- """
- Protocol that specifies the required methods and attributes
- for the `ExtractionMixin` to work.
- """
-
- def _get_dendrite_browser(self) -> "AsyncDendrite": ...
-
- def _get_browser_api_client(self) -> BrowserAPIClient: ...
-
- async def _get_page(self) -> "AsyncPage": ...
diff --git a/dendrite/async_api/_dom/util/mild_strip.py b/dendrite/async_api/_dom/util/mild_strip.py
deleted file mode 100644
index 54050fb..0000000
--- a/dendrite/async_api/_dom/util/mild_strip.py
+++ /dev/null
@@ -1,52 +0,0 @@
-from bs4 import BeautifulSoup, Doctype, Tag, Comment
-
-
-def mild_strip(soup: Tag, keep_d_id: bool = True) -> BeautifulSoup:
- new_soup = BeautifulSoup(str(soup), "html.parser")
- _mild_strip(new_soup, keep_d_id)
- return new_soup
-
-
-def mild_strip_in_place(soup: BeautifulSoup, keep_d_id: bool = True) -> None:
- _mild_strip(soup, keep_d_id)
-
-
-def _mild_strip(soup: BeautifulSoup, keep_d_id: bool = True) -> None:
- for element in soup(text=lambda text: isinstance(text, Comment)):
- element.extract()
-
- # for text in soup.find_all(text=lambda text: isinstance(text, NavigableString)):
- # if len(text) > 200:
- # text.replace_with(text[:200] + f"... [{len(text)-200} more chars]")
-
- for tag in soup(
- ["head", "script", "style", "path", "polygon", "defs", "svg", "br", "Doctype"]
- ):
- tag.extract()
-
- for element in soup.contents:
- if isinstance(element, Doctype):
- element.extract()
-
- # for tag in soup.find_all(True):
- # tag.attrs = {
- # attr: (value[:100] if isinstance(value, str) else value)
- # for attr, value in tag.attrs.items()
- # }
- # if keep_d_id == False:
- # del tag["d-id"]
- for tag in soup.find_all(True):
- if tag.attrs.get("is-interactable-d_id") == "true":
- continue
-
- tag.attrs = {
- attr: (value[:100] if isinstance(value, str) else value)
- for attr, value in tag.attrs.items()
- }
- if keep_d_id == False:
- del tag["d-id"]
-
- # if browser != None:
- # for elem in list(soup.descendants):
- # if isinstance(elem, Tag) and not browser.element_is_visible(elem):
- # elem.extract()
diff --git a/dendrite/async_api/_api/__init__.py b/dendrite/browser/__init__.py
similarity index 100%
rename from dendrite/async_api/_api/__init__.py
rename to dendrite/browser/__init__.py
diff --git a/dendrite/async_api/_api/dto/__init__.py b/dendrite/browser/_common/_exceptions/__init__.py
similarity index 100%
rename from dendrite/async_api/_api/dto/__init__.py
rename to dendrite/browser/_common/_exceptions/__init__.py
diff --git a/dendrite/_common/_exceptions/_constants.py b/dendrite/browser/_common/_exceptions/_constants.py
similarity index 100%
rename from dendrite/_common/_exceptions/_constants.py
rename to dendrite/browser/_common/_exceptions/_constants.py
diff --git a/dendrite/_common/_exceptions/dendrite_exception.py b/dendrite/browser/_common/_exceptions/dendrite_exception.py
similarity index 98%
rename from dendrite/_common/_exceptions/dendrite_exception.py
rename to dendrite/browser/_common/_exceptions/dendrite_exception.py
index 4d62481..ddfdeed 100644
--- a/dendrite/_common/_exceptions/dendrite_exception.py
+++ b/dendrite/browser/_common/_exceptions/dendrite_exception.py
@@ -5,7 +5,7 @@
from loguru import logger
-from dendrite._common._exceptions._constants import INVALID_AUTH_SESSION_MSG
+from dendrite.browser._common._exceptions._constants import INVALID_AUTH_SESSION_MSG
class BaseDendriteException(Exception):
@@ -110,8 +110,6 @@ class IncorrectOutcomeError(BaseDendriteException):
Inherits from BaseDendriteException.
"""
- pass
-
class BrowserNotLaunchedError(BaseDendriteException):
"""
diff --git a/dendrite/async_api/_common/constants.py b/dendrite/browser/_common/constants.py
similarity index 100%
rename from dendrite/async_api/_common/constants.py
rename to dendrite/browser/_common/constants.py
diff --git a/dendrite/sync_api/_common/status.py b/dendrite/browser/_common/types.py
similarity index 100%
rename from dendrite/sync_api/_common/status.py
rename to dendrite/browser/_common/types.py
diff --git a/dendrite/browser/async_api/__init__.py b/dendrite/browser/async_api/__init__.py
new file mode 100644
index 0000000..87168f2
--- /dev/null
+++ b/dendrite/browser/async_api/__init__.py
@@ -0,0 +1,11 @@
+from loguru import logger
+
+from .dendrite_browser import AsyncDendrite
+from .dendrite_element import AsyncElement
+from .dendrite_page import AsyncPage
+
+__all__ = [
+ "AsyncDendrite",
+ "AsyncElement",
+ "AsyncPage",
+]
diff --git a/dendrite/async_api/_common/event_sync.py b/dendrite/browser/async_api/_event_sync.py
similarity index 92%
rename from dendrite/async_api/_common/event_sync.py
rename to dendrite/browser/async_api/_event_sync.py
index db93358..a953aee 100644
--- a/dendrite/async_api/_common/event_sync.py
+++ b/dendrite/browser/async_api/_event_sync.py
@@ -1,8 +1,8 @@
-import time
import asyncio
-from typing import Generic, Optional, Type, TypeVar, Union, cast
-from playwright.async_api import Page, Download, FileChooser
+import time
+from typing import Generic, Optional, Type, TypeVar
+from playwright.async_api import Download, FileChooser, Page
Events = TypeVar("Events", Download, FileChooser)
diff --git a/dendrite/browser/async_api/_utils.py b/dendrite/browser/async_api/_utils.py
new file mode 100644
index 0000000..3ccf4f4
--- /dev/null
+++ b/dendrite/browser/async_api/_utils.py
@@ -0,0 +1,157 @@
+import inspect
+from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union
+
+import tldextract
+from bs4 import BeautifulSoup
+from loguru import logger
+from playwright.async_api import Error, Frame
+from pydantic import BaseModel
+
+from dendrite.models.selector import Selector
+
+from .dendrite_element import AsyncElement
+from .types import PlaywrightPage, TypeSpec
+
+if TYPE_CHECKING:
+ from .dendrite_page import AsyncPage
+
+from dendrite.logic.dom.strip import mild_strip_in_place
+
+from .js import GENERATE_DENDRITE_IDS_IFRAME_SCRIPT
+
+
+def get_domain_w_suffix(url: str) -> str:
+ parsed_url = tldextract.extract(url)
+ if parsed_url.suffix == "":
+ raise ValueError(f"Invalid URL: {url}")
+
+ return f"{parsed_url.domain}.{parsed_url.suffix}"
+
+
+async def expand_iframes(
+ page: PlaywrightPage,
+ page_soup: BeautifulSoup,
+):
+ async def get_iframe_path(frame: Frame):
+ path_parts = []
+ current_frame = frame
+ while current_frame.parent_frame is not None:
+ iframe_element = await current_frame.frame_element()
+ iframe_id = await iframe_element.get_attribute("d-id")
+ if iframe_id is None:
+ # If any iframe_id in the path is None, we cannot build the path
+ return None
+ path_parts.insert(0, iframe_id)
+ current_frame = current_frame.parent_frame
+ return "|".join(path_parts)
+
+ for frame in page.frames:
+ if frame.parent_frame is None:
+ continue # Skip the main frame
+ try:
+ iframe_element = await frame.frame_element()
+
+ iframe_id = await iframe_element.get_attribute("d-id")
+ if iframe_id is None:
+ continue
+ iframe_path = await get_iframe_path(frame)
+ except Error as e:
+ continue
+
+ if iframe_path is None:
+ continue
+
+ try:
+ await frame.evaluate(
+ GENERATE_DENDRITE_IDS_IFRAME_SCRIPT, {"frame_path": iframe_path}
+ )
+ frame_content = await frame.content()
+ frame_tree = BeautifulSoup(frame_content, "lxml")
+ mild_strip_in_place(frame_tree)
+ merge_iframe_to_page(iframe_id, page_soup, frame_tree)
+ except Error as e:
+ continue
+
+
+def merge_iframe_to_page(
+ iframe_id: str,
+ page: BeautifulSoup,
+ iframe: BeautifulSoup,
+):
+ iframe_element = page.find("iframe", {"d-id": iframe_id})
+ if iframe_element is None:
+ logger.debug(f"Could not find iframe with ID {iframe_id} in page soup")
+ return
+
+ iframe_element.replace_with(iframe)
+
+
+async def _get_all_elements_from_selector_soup(
+ selector: str, soup: BeautifulSoup, page: "AsyncPage"
+) -> List[AsyncElement]:
+ dendrite_elements: List[AsyncElement] = []
+
+ elements = soup.select(selector)
+
+ for element in elements:
+ frame = page._get_context(element)
+ d_id = element.get("d-id", "")
+ locator = frame.locator(f"xpath=//*[@d-id='{d_id}']")
+
+ if not d_id:
+ continue
+
+ if isinstance(d_id, list):
+ d_id = d_id[0]
+ dendrite_elements.append(
+ AsyncElement(d_id, locator, page.dendrite_browser, page._browser_api_client)
+ )
+
+ return dendrite_elements
+
+
+async def get_elements_from_selectors_soup(
+ page: "AsyncPage",
+ soup: BeautifulSoup,
+ selectors: List[Selector],
+ only_one: bool,
+) -> Union[Optional[AsyncElement], List[AsyncElement]]:
+
+ for selector in reversed(selectors):
+ dendrite_elements = await _get_all_elements_from_selector_soup(
+ selector.selector, soup, page
+ )
+
+ if len(dendrite_elements) > 0:
+ return dendrite_elements[0] if only_one else dendrite_elements
+
+ return None
+
+
+def to_json_schema(type_spec: TypeSpec) -> Dict[str, Any]:
+ if isinstance(type_spec, dict):
+ # Assume it's already a JSON schema
+ return type_spec
+ if inspect.isclass(type_spec) and issubclass(type_spec, BaseModel):
+ # Convert Pydantic model to JSON schema
+ return type_spec.model_json_schema()
+ if type_spec in (bool, int, float, str):
+ # Convert basic Python types to JSON schema
+ type_map = {bool: "boolean", int: "integer", float: "number", str: "string"}
+ return {"type": type_map[type_spec]}
+
+ raise ValueError(f"Unsupported type specification: {type_spec}")
+
+
+def convert_to_type_spec(type_spec: TypeSpec, return_data: Any) -> TypeSpec:
+ if isinstance(type_spec, type):
+ if issubclass(type_spec, BaseModel):
+ return type_spec.model_validate(return_data)
+ if type_spec in (str, float, bool, int):
+ return type_spec(return_data)
+
+ raise ValueError(f"Unsupported type: {type_spec}")
+ if isinstance(type_spec, dict):
+ return return_data
+
+ raise ValueError(f"Unsupported type specification: {type_spec}")
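The type-spec helpers above are the two halves of one round trip: `to_json_schema` turns a user-supplied type into a JSON schema for the extraction request, and `convert_to_type_spec` validates the returned data back into that type. A minimal sketch of the round trip, using a hypothetical `Product` model:

```python
from pydantic import BaseModel

from dendrite.browser.async_api._utils import (
    convert_to_type_spec,
    get_domain_w_suffix,
    to_json_schema,
)


class Product(BaseModel):  # hypothetical model, for illustration only
    name: str
    price: float


schema = to_json_schema(Product)  # -> Product.model_json_schema()
raw = {"name": "Widget", "price": 9.99}  # e.g. data returned by an extraction
product = convert_to_type_spec(Product, raw)  # -> validated Product instance

# get_domain_w_suffix keeps only the registered domain plus public suffix:
assert get_domain_w_suffix("https://mail.google.com/inbox") == "google.com"
```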
diff --git a/dendrite/async_api/_ext_impl/__init__.py b/dendrite/browser/async_api/browser_impl/__init__.py
similarity index 100%
rename from dendrite/async_api/_ext_impl/__init__.py
rename to dendrite/browser/async_api/browser_impl/__init__.py
diff --git a/dendrite/async_api/_ext_impl/browserbase/__init__.py b/dendrite/browser/async_api/browser_impl/browserbase/__init__.py
similarity index 100%
rename from dendrite/async_api/_ext_impl/browserbase/__init__.py
rename to dendrite/browser/async_api/browser_impl/browserbase/__init__.py
diff --git a/dendrite/async_api/_ext_impl/browserbase/_client.py b/dendrite/browser/async_api/browser_impl/browserbase/_client.py
similarity index 97%
rename from dendrite/async_api/_ext_impl/browserbase/_client.py
rename to dendrite/browser/async_api/browser_impl/browserbase/_client.py
index 0689641..b29b607 100644
--- a/dendrite/async_api/_ext_impl/browserbase/_client.py
+++ b/dendrite/browser/async_api/browser_impl/browserbase/_client.py
@@ -1,11 +1,12 @@
import asyncio
-from pathlib import Path
import time
+from pathlib import Path
from typing import Optional, Union
+
import httpx
from loguru import logger
-from dendrite._common._exceptions.dendrite_exception import DendriteException
+from dendrite.browser._common._exceptions.dendrite_exception import DendriteException
class BrowserbaseClient:
diff --git a/dendrite/async_api/_ext_impl/browserbase/_download.py b/dendrite/browser/async_api/browser_impl/browserbase/_download.py
similarity index 92%
rename from dendrite/async_api/_ext_impl/browserbase/_download.py
rename to dendrite/browser/async_api/browser_impl/browserbase/_download.py
index d18561c..7a92880 100644
--- a/dendrite/async_api/_ext_impl/browserbase/_download.py
+++ b/dendrite/browser/async_api/browser_impl/browserbase/_download.py
@@ -1,13 +1,16 @@
-from pathlib import Path
import re
import shutil
-from typing import Union
import zipfile
+from pathlib import Path
+from typing import Union
+
from loguru import logger
from playwright.async_api import Download
-from dendrite.async_api._core.models.download_interface import DownloadInterface
-from dendrite.async_api._ext_impl.browserbase._client import BrowserbaseClient
+from dendrite.browser.async_api.browser_impl.browserbase._client import (
+ BrowserbaseClient,
+)
+from dendrite.browser.async_api.protocol.download_protocol import DownloadInterface
class AsyncBrowserbaseDownload(DownloadInterface):
diff --git a/dendrite/async_api/_ext_impl/browserbase/_impl.py b/dendrite/browser/async_api/browser_impl/browserbase/_impl.py
similarity index 80%
rename from dendrite/async_api/_ext_impl/browserbase/_impl.py
rename to dendrite/browser/async_api/browser_impl/browserbase/_impl.py
index c67846e..b44b219 100644
--- a/dendrite/async_api/_ext_impl/browserbase/_impl.py
+++ b/dendrite/browser/async_api/browser_impl/browserbase/_impl.py
@@ -1,21 +1,23 @@
from typing import TYPE_CHECKING, Optional
-from dendrite._common._exceptions.dendrite_exception import BrowserNotLaunchedError
-from dendrite.async_api._core._impl_browser import ImplBrowser
-from dendrite.async_api._core._type_spec import PlaywrightPage
-from dendrite.remote.browserbase_config import BrowserbaseConfig
+
+from dendrite.browser._common._exceptions.dendrite_exception import (
+ BrowserNotLaunchedError,
+)
+from dendrite.browser.async_api.protocol.browser_protocol import BrowserProtocol
+from dendrite.browser.async_api.types import PlaywrightPage
+from dendrite.browser.remote.browserbase_config import BrowserbaseConfig
if TYPE_CHECKING:
- from dendrite.async_api._core.dendrite_browser import AsyncDendrite
-from dendrite.async_api._ext_impl.browserbase._client import BrowserbaseClient
-from playwright.async_api import Playwright
+ from dendrite.browser.async_api.dendrite_browser import AsyncDendrite
+
from loguru import logger
+from playwright.async_api import Playwright
-from dendrite.async_api._ext_impl.browserbase._download import (
- AsyncBrowserbaseDownload,
-)
+from ._client import BrowserbaseClient
+from ._download import AsyncBrowserbaseDownload
-class BrowserBaseImpl(ImplBrowser):
+class BrowserbaseImpl(BrowserProtocol):
def __init__(self, settings: BrowserbaseConfig) -> None:
self.settings = settings
self._client = BrowserbaseClient(
diff --git a/dendrite/async_api/_ext_impl/browserless/__init__.py b/dendrite/browser/async_api/browser_impl/browserless/__init__.py
similarity index 100%
rename from dendrite/async_api/_ext_impl/browserless/__init__.py
rename to dendrite/browser/async_api/browser_impl/browserless/__init__.py
diff --git a/dendrite/async_api/_ext_impl/browserless/_impl.py b/dendrite/browser/async_api/browser_impl/browserless/_impl.py
similarity index 71%
rename from dendrite/async_api/_ext_impl/browserless/_impl.py
rename to dendrite/browser/async_api/browser_impl/browserless/_impl.py
index e5b87b4..698557d 100644
--- a/dendrite/async_api/_ext_impl/browserless/_impl.py
+++ b/dendrite/browser/async_api/browser_impl/browserless/_impl.py
@@ -1,23 +1,30 @@
import json
from typing import TYPE_CHECKING, Optional
-from dendrite._common._exceptions.dendrite_exception import BrowserNotLaunchedError
-from dendrite.async_api._core._impl_browser import ImplBrowser
-from dendrite.async_api._core._type_spec import PlaywrightPage
-from dendrite.remote.browserless_config import BrowserlessConfig
+
+from dendrite.browser._common._exceptions.dendrite_exception import (
+ BrowserNotLaunchedError,
+)
+from dendrite.browser.async_api.protocol.browser_protocol import BrowserProtocol
+from dendrite.browser.async_api.types import PlaywrightPage
+from dendrite.browser.remote.browserless_config import BrowserlessConfig
if TYPE_CHECKING:
- from dendrite.async_api._core.dendrite_browser import AsyncDendrite
-from dendrite.async_api._ext_impl.browserbase._client import BrowserbaseClient
-from playwright.async_api import Playwright
-from loguru import logger
+ from dendrite.browser.async_api.dendrite_browser import AsyncDendrite
+
import urllib.parse
-from dendrite.async_api._ext_impl.browserbase._download import (
+from loguru import logger
+from playwright.async_api import Playwright
+
+from dendrite.browser.async_api.browser_impl.browserbase._client import (
+ BrowserbaseClient,
+)
+from dendrite.browser.async_api.browser_impl.browserbase._download import (
AsyncBrowserbaseDownload,
)
-class BrowserlessImpl(ImplBrowser):
+class BrowserlessImpl(BrowserProtocol):
def __init__(self, settings: BrowserlessConfig) -> None:
self.settings = settings
self._session_id: Optional[str] = None
diff --git a/dendrite/browser/async_api/browser_impl/impl_mapping.py b/dendrite/browser/async_api/browser_impl/impl_mapping.py
new file mode 100644
index 0000000..d588769
--- /dev/null
+++ b/dendrite/browser/async_api/browser_impl/impl_mapping.py
@@ -0,0 +1,34 @@
+from typing import Dict, Optional, Type
+
+from dendrite.browser.remote import Providers
+from dendrite.browser.remote.browserbase_config import BrowserbaseConfig
+from dendrite.browser.remote.browserless_config import BrowserlessConfig
+
+from ..protocol.browser_protocol import BrowserProtocol
+from .browserbase._impl import BrowserbaseImpl
+from .browserless._impl import BrowserlessImpl
+from .local._impl import LocalImpl
+
+IMPL_MAPPING: Dict[Type[Providers], Type[BrowserProtocol]] = {
+ BrowserbaseConfig: BrowserbaseImpl,
+ BrowserlessConfig: BrowserlessImpl,
+}
+
+SETTINGS_CLASSES: Dict[str, Type[Providers]] = {
+ "browserbase": BrowserbaseConfig,
+ "browserless": BrowserlessConfig,
+}
+
+
+def get_impl(remote_provider: Optional[Providers]) -> BrowserProtocol:
+ if remote_provider is None:
+ return LocalImpl()
+
+ try:
+ provider_class = IMPL_MAPPING[type(remote_provider)]
+ except KeyError:
+ raise ValueError(
+ f"No implementation for {type(remote_provider)}. Available providers: {', '.join(map(lambda x: x.__name__, IMPL_MAPPING.keys()))}"
+ )
+
+ return provider_class(remote_provider)
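`get_impl` is a plain dispatch table: the concrete type of the provider config selects the matching `BrowserProtocol` implementation, and `None` falls back to a local Playwright launch. A sketch of the dispatch; the `BrowserbaseConfig` constructor arguments are not shown in this patch, so its construction here is an assumption:

```python
from dendrite.browser.async_api.browser_impl.impl_mapping import get_impl
from dendrite.browser.remote.browserbase_config import BrowserbaseConfig

local_impl = get_impl(None)  # -> LocalImpl: plain playwright.chromium.launch

# Assumption: credentials are picked up from env vars or config defaults.
config = BrowserbaseConfig()
remote_impl = get_impl(config)  # -> BrowserbaseImpl via IMPL_MAPPING[type(config)]
```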
diff --git a/dendrite/browser/async_api/browser_impl/local/_impl.py b/dendrite/browser/async_api/browser_impl/local/_impl.py
new file mode 100644
index 0000000..ebc5010
--- /dev/null
+++ b/dendrite/browser/async_api/browser_impl/local/_impl.py
@@ -0,0 +1,52 @@
+from pathlib import Path
+from typing import TYPE_CHECKING, Optional, Union, overload
+
+from loguru import logger
+from typing_extensions import Literal
+
+from dendrite.browser._common.constants import STEALTH_ARGS
+
+if TYPE_CHECKING:
+ from dendrite.browser.async_api.dendrite_browser import AsyncDendrite
+
+import os
+import shutil
+import tempfile
+
+from playwright.async_api import (
+ Browser,
+ BrowserContext,
+ Download,
+ Playwright,
+ StorageState,
+)
+
+from dendrite.browser.async_api.protocol.browser_protocol import BrowserProtocol
+from dendrite.browser.async_api.types import PlaywrightPage
+
+
+class LocalImpl(BrowserProtocol):
+ def __init__(self) -> None:
+ pass
+
+ async def start_browser(
+ self,
+ playwright: Playwright,
+ pw_options: dict,
+ storage_state: Optional[StorageState] = None,
+ ) -> Browser:
+ return await playwright.chromium.launch(**pw_options)
+
+ async def get_download(
+ self,
+ dendrite_browser: "AsyncDendrite",
+ pw_page: PlaywrightPage,
+ timeout: float,
+ ) -> Download:
+ return await dendrite_browser._download_handler.get_data(pw_page, timeout)
+
+ async def configure_context(self, browser: "AsyncDendrite"):
+ pass
+
+ async def stop_session(self):
+ pass
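`LocalImpl` is the no-op end of the `BrowserProtocol` contract: `start_browser` simply launches Chromium with the given options, while the context and session hooks do nothing. A minimal sketch of driving it directly, mirroring what `AsyncDendrite._launch` does under the hood:

```python
import asyncio

from playwright.async_api import async_playwright

from dendrite.browser.async_api.browser_impl.local._impl import LocalImpl


async def main():
    playwright = await async_playwright().start()
    impl = LocalImpl()
    browser = await impl.start_browser(playwright, {"headless": True})
    context = await browser.new_context()
    page = await context.new_page()
    await page.goto("https://example.com")
    await browser.close()
    await playwright.stop()


asyncio.run(main())
```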
diff --git a/dendrite/async_api/_core/dendrite_browser.py b/dendrite/browser/async_api/dendrite_browser.py
similarity index 66%
rename from dendrite/async_api/_core/dendrite_browser.py
rename to dendrite/browser/async_api/dendrite_browser.py
index 07722ee..ca51778 100644
--- a/dendrite/async_api/_core/dendrite_browser.py
+++ b/dendrite/browser/async_api/dendrite_browser.py
@@ -1,53 +1,48 @@
-from abc import ABC, abstractmethod
+import os
import pathlib
import re
-from typing import Any, List, Literal, Optional, Sequence, Union
+from abc import ABC
+from typing import Any, List, Optional, Sequence, Union
from uuid import uuid4
-import os
+
from loguru import logger
from playwright.async_api import (
- async_playwright,
- Playwright,
- BrowserContext,
- FileChooser,
Download,
Error,
+ FileChooser,
FilePayload,
+ StorageState,
+ async_playwright,
)
-from dendrite.async_api._api.dto.authenticate_dto import AuthenticateDTO
-from dendrite.async_api._api.dto.upload_auth_session_dto import UploadAuthSessionDTO
-from dendrite.async_api._common.event_sync import EventSync
-from dendrite.async_api._core._impl_browser import ImplBrowser
-from dendrite.async_api._core._impl_mapping import get_impl
-from dendrite.async_api._core._managers.page_manager import (
- PageManager,
-)
-
-from dendrite.async_api._core._type_spec import PlaywrightPage
-from dendrite.async_api._core.dendrite_page import AsyncPage
-from dendrite.async_api._common.constants import STEALTH_ARGS
-from dendrite.async_api._core.mixin.ask import AskMixin
-from dendrite.async_api._core.mixin.click import ClickMixin
-from dendrite.async_api._core.mixin.extract import ExtractionMixin
-from dendrite.async_api._core.mixin.fill_fields import FillFieldsMixin
-from dendrite.async_api._core.mixin.get_element import GetElementMixin
-from dendrite.async_api._core.mixin.keyboard import KeyboardMixin
-from dendrite.async_api._core.mixin.screenshot import ScreenshotMixin
-from dendrite.async_api._core.mixin.wait_for import WaitForMixin
-from dendrite.async_api._core.mixin.markdown import MarkdownMixin
-from dendrite.async_api._core.models.authentication import (
- AuthSession,
-)
-
-from dendrite.async_api._core.models.api_config import APIConfig
-from dendrite.async_api._api.browser_api_client import BrowserAPIClient
-from dendrite._common._exceptions.dendrite_exception import (
+from dendrite.browser._common._exceptions.dendrite_exception import (
BrowserNotLaunchedError,
DendriteException,
IncorrectOutcomeError,
)
-from dendrite.remote import Providers
+from dendrite.browser._common.constants import STEALTH_ARGS
+from dendrite.browser.async_api._utils import get_domain_w_suffix
+from dendrite.browser.remote import Providers
+from dendrite.logic.config import Config
+from dendrite.logic import AsyncLogicEngine
+
+from ._event_sync import EventSync
+from .browser_impl.impl_mapping import get_impl
+from .dendrite_page import AsyncPage
+from .manager.page_manager import PageManager
+from .mixin import (
+ AskMixin,
+ ClickMixin,
+ ExtractionMixin,
+ FillFieldsMixin,
+ GetElementMixin,
+ KeyboardMixin,
+ MarkdownMixin,
+ ScreenshotMixin,
+ WaitForMixin,
+)
+from .protocol.browser_protocol import BrowserProtocol
+from .types import PlaywrightPage
class AsyncDendrite(
@@ -87,51 +82,36 @@ class AsyncDendrite(
def __init__(
self,
- auth: Optional[Union[str, List[str]]] = None,
- dendrite_api_key: Optional[str] = None,
- openai_api_key: Optional[str] = None,
- anthropic_api_key: Optional[str] = None,
playwright_options: Any = {
"headless": False,
"args": STEALTH_ARGS,
},
remote_config: Optional[Providers] = None,
+ config: Optional[Config] = None,
+ auth: Optional[Union[List[str], str]] = None,
):
"""
- Initializes AsyncDendrite with API keys and Playwright options.
+ Initialize AsyncDendrite with optional domain authentication.
Args:
- auth (Optional[Union[str, List[str]]]): The domains on which the browser should try and authenticate.
- dendrite_api_key (Optional[str]): The Dendrite API key. If not provided, it's fetched from the environment variables.
- openai_api_key (Optional[str]): Your own OpenAI API key, provide it, along with other custom API keys, if you wish to use Dendrite without paying for a license.
- anthropic_api_key (Optional[str]): The own Anthropic API key, provide it, along with other custom API keys, if you wish to use Dendrite without paying for a license.
- playwright_options (Any): Options for configuring Playwright. Defaults to running in non-headless mode with stealth arguments.
-
- Raises:
- MissingApiKeyError: If the Dendrite API key is not provided or found in the environment variables.
+ playwright_options: Options for configuring Playwright
+ remote_config: Remote browser provider configuration
+ config: Configuration object
+ auth: List of domains or single domain to load authentication state for
"""
-
- api_config = APIConfig(
- dendrite_api_key=dendrite_api_key or os.environ.get("DENDRITE_API_KEY"),
- openai_api_key=openai_api_key,
- anthropic_api_key=anthropic_api_key,
- )
-
self._impl = self._get_impl(remote_config)
-
- self.api_config = api_config
- self.playwright: Optional[Playwright] = None
- self.browser_context: Optional[BrowserContext] = None
+ self._playwright_options = playwright_options
+ self._config = config or Config()
+ auth_url = [auth] if isinstance(auth, str) else auth or []
+ self._auth_domains = [get_domain_w_suffix(url) for url in auth_url]
self._id = uuid4().hex
- self._playwright_options = playwright_options
self._active_page_manager: Optional[PageManager] = None
self._user_id: Optional[str] = None
self._upload_handler = EventSync(event_type=FileChooser)
self._download_handler = EventSync(event_type=Download)
self.closed = False
- self._auth = auth
- self._browser_api_client = BrowserAPIClient(api_config, self._id)
+ self._browser_api_client: AsyncLogicEngine = AsyncLogicEngine(self._config)
@property
def pages(self) -> List[AsyncPage]:
@@ -150,10 +130,12 @@ async def _get_page(self) -> AsyncPage:
active_page = await self.get_active_page()
return active_page
- def _get_browser_api_client(self) -> BrowserAPIClient:
+ @property
+ def logic_engine(self) -> AsyncLogicEngine:
return self._browser_api_client
- def _get_dendrite_browser(self) -> "AsyncDendrite":
+ @property
+ def dendrite_browser(self) -> "AsyncDendrite":
return self
async def __aenter__(self):
@@ -163,15 +145,10 @@ async def __aexit__(self, exc_type, exc_val, exc_tb):
# Ensure cleanup is handled
await self.close()
- def _get_impl(self, remote_provider: Optional[Providers]) -> ImplBrowser:
+ def _get_impl(self, remote_provider: Optional[Providers]) -> BrowserProtocol:
return get_impl(remote_provider)
- async def _get_auth_session(self, domains: Union[str, list[str]]):
- dto = AuthenticateDTO(domains=domains)
- auth_session: AuthSession = await self._browser_api_client.authenticate(dto)
- return auth_session
-
async def get_active_page(self) -> AsyncPage:
"""
Retrieves the currently active page managed by the PageManager.
@@ -294,18 +271,23 @@ async def _launch(self):
os.environ["PW_TEST_SCREENSHOT_NO_FONTS_READY"] = "1"
self._playwright = await async_playwright().start()
- # browser = await self._playwright.chromium.launch(**self._playwright_options)
+ # Get and merge storage states for authenticated domains
+ storage_states = []
+ for domain in self._auth_domains:
+ state = await self._get_domain_storage_state(domain)
+ if state:
+ storage_states.append(state)
+
+ # Launch browser
browser = await self._impl.start_browser(
self._playwright, self._playwright_options
)
- if self._auth:
- auth_session = await self._get_auth_session(self._auth)
- self.browser_context = await browser.new_context(
- storage_state=auth_session.to_storage_state(),
- user_agent=auth_session.user_agent,
- )
+ # Create context with merged storage state if available
+ if storage_states:
+ merged_state = await self._merge_storage_states(storage_states)
+ self.browser_context = await browser.new_context(storage_state=merged_state)
else:
self.browser_context = (
browser.contexts[0]
@@ -314,7 +296,6 @@ async def _launch(self):
)
self._active_page_manager = PageManager(self, self.browser_context)
-
await self._impl.configure_context(self)
return browser, self.browser_context, self._active_page_manager
@@ -336,38 +317,34 @@ async def add_cookies(self, cookies):
async def close(self):
"""
- Closes the browser and uploads authentication session data if available.
+ Closes the browser and updates storage states for authenticated domains before cleanup.
- This method stops the Playwright instance, closes the browser context, and uploads any
- stored authentication session data if applicable.
+ This method updates the storage states for authenticated domains, stops the Playwright
+ instance, and closes the browser context.
Returns:
None
Raises:
- Exception: If there is an issue closing the browser or uploading session data.
+ Exception: If there is an issue closing the browser or updating session data.
"""
-
self.closed = True
+
try:
- if self.browser_context:
- if self._auth:
- auth_session = await self._get_auth_session(self._auth)
- storage_state = await self.browser_context.storage_state()
- dto = UploadAuthSessionDTO(
- auth_data=auth_session, storage_state=storage_state
- )
- await self._browser_api_client.upload_auth_session(dto)
+ if self.browser_context and self._auth_domains:
+ # Update storage state for each authenticated domain
+ for domain in self._auth_domains:
+ await self.save_auth(domain)
+
await self._impl.stop_session()
await self.browser_context.close()
except Error:
pass
+
try:
if self._playwright:
await self._playwright.stop()
- except AttributeError:
- pass
- except Exception:
+        except Exception:
pass
def _is_launched(self):
@@ -464,3 +441,111 @@ async def _get_filechooser(
Exception: If there is an issue uploading files.
"""
return await self._upload_handler.get_data(pw_page, timeout=timeout)
+
+ async def save_auth(self, url: str) -> None:
+ """
+ Save authentication state for a specific domain.
+
+ Args:
+            url (str): URL to save authentication for; its registered domain is used as the cache key (e.g., "github.com")
+ """
+ if not self.browser_context:
+ raise DendriteException("Browser context not initialized")
+
+ domain = get_domain_w_suffix(url)
+
+ # Get current storage state
+ storage_state = await self.browser_context.storage_state()
+
+ # Filter storage state for specific domain
+ filtered_state = {
+ "origins": [
+ origin
+ for origin in storage_state.get("origins", [])
+ if domain in origin.get("origin", "")
+ ],
+ "cookies": [
+ cookie
+ for cookie in storage_state.get("cookies", [])
+ if domain in cookie.get("domain", "")
+ ],
+ }
+
+ # Save to cache
+ self._config.storage_cache.set(
+ {"domain": domain}, StorageState(**filtered_state)
+ )
+
+ async def setup_auth(
+ self,
+ url: str,
+ message: str = "Please log in to the website. Once done, press Enter to continue...",
+ ) -> None:
+ """
+ Set up authentication for a specific URL.
+
+ Args:
+ url (str): URL to navigate to for login
+ message (str): Message to show while waiting for user input
+ """
+        # Extract the registered domain from the URL
+ domain = get_domain_w_suffix(url)
+
+ try:
+ # Start Playwright
+ self._playwright = await async_playwright().start()
+
+ # Launch browser with normal context
+ browser = await self._impl.start_browser(
+ self._playwright, {**self._playwright_options, "headless": False}
+ )
+
+ self.browser_context = await browser.new_context()
+ self._active_page_manager = PageManager(self, self.browser_context)
+
+ # Navigate to login page
+ await self.goto(url)
+
+ # Wait for user to complete login
+ print(message)
+ input()
+
+ # Save the storage state for this domain
+ await self.save_auth(domain)
+
+ finally:
+ # Clean up
+ await self.close()
+
+ async def _get_domain_storage_state(self, domain: str) -> Optional[StorageState]:
+ """Get storage state for a specific domain"""
+ return self._config.storage_cache.get({"domain": domain}, index=0)
+
+ async def _merge_storage_states(self, states: List[StorageState]) -> StorageState:
+ """Merge multiple storage states into one"""
+ merged = {"origins": [], "cookies": []}
+ seen_origins = set()
+ seen_cookies = set()
+
+ for state in states:
+ # Merge origins
+ for origin in state.get("origins", []):
+ origin_key = origin.get("origin", "")
+ if origin_key not in seen_origins:
+ merged["origins"].append(origin)
+ seen_origins.add(origin_key)
+
+ # Merge cookies
+ for cookie in state.get("cookies", []):
+ cookie_key = (
+ f"{cookie.get('name')}:{cookie.get('domain')}:{cookie.get('path')}"
+ )
+ if cookie_key not in seen_cookies:
+ merged["cookies"].append(cookie)
+ seen_cookies.add(cookie_key)
+
+ return StorageState(**merged)
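Taken together, `setup_auth`, `save_auth`, and the launch-time merge replace the old server-side auth-session upload: login state is captured once, filtered per domain, cached locally via `Config.storage_cache`, and merged back into the browser context on the next launch. A sketch of the round trip (URLs illustrative, and assuming the async context manager launches the browser as `__aenter__` suggests):

```python
import asyncio

from dendrite.browser.async_api import AsyncDendrite


async def main():
    # One-time, interactive: opens a visible browser, waits for you to log in,
    # then caches the domain-filtered storage state locally.
    await AsyncDendrite().setup_auth("https://github.com")

    # Later runs: pass the domain via `auth` so _launch merges the cached
    # cookies and localStorage into the new browser context.
    async with AsyncDendrite(auth="github.com") as browser:
        await browser.goto("https://github.com/notifications")


asyncio.run(main())
```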
diff --git a/dendrite/async_api/_core/dendrite_element.py b/dendrite/browser/async_api/dendrite_element.py
similarity index 86%
rename from dendrite/async_api/_core/dendrite_element.py
rename to dendrite/browser/async_api/dendrite_element.py
index e4e4fed..bca3159 100644
--- a/dendrite/async_api/_core/dendrite_element.py
+++ b/dendrite/browser/async_api/dendrite_element.py
@@ -1,4 +1,5 @@
from __future__ import annotations
+
import asyncio
import base64
import functools
@@ -8,20 +9,19 @@
from loguru import logger
from playwright.async_api import Locator
-from dendrite.async_api._api.browser_api_client import BrowserAPIClient
-from dendrite._common._exceptions.dendrite_exception import IncorrectOutcomeError
+from dendrite.browser._common._exceptions.dendrite_exception import (
+ IncorrectOutcomeError,
+)
+from dendrite.logic import AsyncLogicEngine
if TYPE_CHECKING:
- from dendrite.async_api._core.dendrite_browser import AsyncDendrite
-from dendrite.async_api._core._managers.navigation_tracker import NavigationTracker
-from dendrite.async_api._core.models.page_diff_information import (
- PageDiffInformation,
-)
-from dendrite.async_api._core._type_spec import Interaction
-from dendrite.async_api._api.response.interaction_response import (
- InteractionResponse,
-)
-from dendrite.async_api._api.dto.make_interaction_dto import MakeInteractionDTO
+ from .dendrite_browser import AsyncDendrite
+
+from dendrite.models.dto.make_interaction_dto import VerifyActionDTO
+from dendrite.models.response.interaction_response import InteractionResponse
+
+from .manager.navigation_tracker import NavigationTracker
+from .types import Interaction
def perform_action(interaction_type: Interaction):
@@ -51,11 +51,11 @@ async def wrapper(
await func(self, *args, **kwargs)
return InteractionResponse(status="success", message="")
- api_config = self._dendrite_browser.api_config
-
page_before = await self._dendrite_browser.get_active_page()
page_before_info = await page_before.get_page_information()
-
+ soup = await page_before._get_previous_soup()
+ screenshot_before = page_before_info.screenshot_base64
+ tag_name = soup.find(attrs={"d-id": self.dendrite_id})
# Call the original method here
await func(
self,
@@ -67,25 +67,24 @@ async def wrapper(
await self._wait_for_page_changes(page_before.url)
page_after = await self._dendrite_browser.get_active_page()
- page_after_info = await page_after.get_page_information()
- page_delta_information = PageDiffInformation(
- page_before=page_before_info, page_after=page_after_info
+ screenshot_after = (
+ await page_after.screenshot_manager.take_full_page_screenshot()
)
- dto = MakeInteractionDTO(
+ dto = VerifyActionDTO(
url=page_before.url,
dendrite_id=self.dendrite_id,
interaction_type=interaction_type,
expected_outcome=expected_outcome,
- page_delta_information=page_delta_information,
- api_config=api_config,
+ screenshot_before=screenshot_before,
+ screenshot_after=screenshot_after,
+ tag_name=str(tag_name),
)
- res = await self._browser_api_client.make_interaction(dto)
+ res = await self._browser_api_client.verify_action(dto)
if res.status == "failed":
raise IncorrectOutcomeError(
- message=res.message,
- screenshot_base64=page_delta_information.page_after.screenshot_base64,
+ message=res.message, screenshot_base64=screenshot_after
)
return res
@@ -108,7 +107,7 @@ def __init__(
dendrite_id: str,
locator: Locator,
dendrite_browser: AsyncDendrite,
- browser_api_client: BrowserAPIClient,
+ browser_api_client: AsyncLogicEngine,
):
"""
        Initialize an AsyncElement.
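The reworked `perform_action` decorator verifies interactions from a before/after screenshot pair plus the target element's tag, rather than shipping a full `PageDiffInformation` diff. From the call site nothing changes: passing `expected_outcome` is what triggers the verification round trip. A hedged sketch, assuming element lookup via the `GetElementMixin` prompt API:

```python
# Assumes `browser` is a launched AsyncDendrite instance.
element = await browser.get_element("The submit button")
if element:
    # expected_outcome triggers the VerifyActionDTO check above; on a
    # "failed" verdict an IncorrectOutcomeError is raised with a screenshot.
    res = await element.click(expected_outcome="A confirmation banner is shown")
    print(res.status)  # "success"
```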
diff --git a/dendrite/async_api/_core/dendrite_page.py b/dendrite/browser/async_api/dendrite_page.py
similarity index 88%
rename from dendrite/async_api/_core/dendrite_page.py
rename to dendrite/browser/async_api/dendrite_page.py
index 7d45eeb..f2b78e2 100644
--- a/dendrite/async_api/_core/dendrite_page.py
+++ b/dendrite/browser/async_api/dendrite_page.py
@@ -1,57 +1,35 @@
-import re
import asyncio
import pathlib
+import re
import time
-
-from typing import (
- TYPE_CHECKING,
- Any,
- List,
- Literal,
- Optional,
- Sequence,
- Union,
-)
+from typing import TYPE_CHECKING, Any, List, Literal, Optional, Sequence, Union
from bs4 import BeautifulSoup, Tag
from loguru import logger
-
-from playwright.async_api import (
- FrameLocator,
- Keyboard,
- Download,
- FilePayload,
-)
-
-
-from dendrite.async_api._api.browser_api_client import BrowserAPIClient
-from dendrite.async_api._core._js import GENERATE_DENDRITE_IDS_SCRIPT
-from dendrite.async_api._core._type_spec import PlaywrightPage
-from dendrite.async_api._core.dendrite_element import AsyncElement
-from dendrite.async_api._core.mixin.ask import AskMixin
-from dendrite.async_api._core.mixin.click import ClickMixin
-from dendrite.async_api._core.mixin.extract import ExtractionMixin
-from dendrite.async_api._core.mixin.fill_fields import FillFieldsMixin
-from dendrite.async_api._core.mixin.get_element import GetElementMixin
-from dendrite.async_api._core.mixin.keyboard import KeyboardMixin
-from dendrite.async_api._core.mixin.markdown import MarkdownMixin
-from dendrite.async_api._core.mixin.wait_for import WaitForMixin
-from dendrite.async_api._core.models.page_information import PageInformation
-
+from playwright.async_api import Download, FilePayload, FrameLocator, Keyboard
+
+from dendrite.logic import AsyncLogicEngine
+from dendrite.models.page_information import PageInformation
+
+from .dendrite_element import AsyncElement
+from .js import GENERATE_DENDRITE_IDS_SCRIPT
+from .mixin.ask import AskMixin
+from .mixin.click import ClickMixin
+from .mixin.extract import ExtractionMixin
+from .mixin.fill_fields import FillFieldsMixin
+from .mixin.get_element import GetElementMixin
+from .mixin.keyboard import KeyboardMixin
+from .mixin.markdown import MarkdownMixin
+from .mixin.wait_for import WaitForMixin
+from .types import PlaywrightPage
if TYPE_CHECKING:
- from dendrite.async_api._core.dendrite_browser import AsyncDendrite
-
+ from .dendrite_browser import AsyncDendrite
-from dendrite.async_api._core._managers.screenshot_manager import ScreenshotManager
-from dendrite._common._exceptions.dendrite_exception import (
- DendriteException,
-)
+from dendrite.browser._common._exceptions.dendrite_exception import DendriteException
-
-from dendrite.async_api._core._utils import (
- expand_iframes,
-)
+from ._utils import expand_iframes
+from .manager.screenshot_manager import ScreenshotManager
class AsyncPage(
@@ -75,14 +53,14 @@ def __init__(
self,
page: PlaywrightPage,
dendrite_browser: "AsyncDendrite",
- browser_api_client: "BrowserAPIClient",
+ browser_api_client: AsyncLogicEngine,
):
self.playwright_page = page
self.screenshot_manager = ScreenshotManager(page)
- self.dendrite_browser = dendrite_browser
self._browser_api_client = browser_api_client
self._last_main_frame_url = page.url
self._last_frame_navigated_timestamp = time.time()
+ self._dendrite_browser = dendrite_browser
self.playwright_page.on("framenavigated", self._on_frame_navigated)
@@ -91,6 +69,10 @@ def _on_frame_navigated(self, frame):
self._last_main_frame_url = frame.url
self._last_frame_navigated_timestamp = time.time()
+ @property
+ def dendrite_browser(self) -> "AsyncDendrite":
+ return self._dendrite_browser
+
@property
def url(self):
"""
@@ -114,10 +96,8 @@ def keyboard(self) -> Keyboard:
async def _get_page(self) -> "AsyncPage":
return self
- def _get_dendrite_browser(self) -> "AsyncDendrite":
- return self.dendrite_browser
-
- def _get_browser_api_client(self) -> BrowserAPIClient:
+ @property
+ def logic_engine(self) -> AsyncLogicEngine:
return self._browser_api_client
async def goto(
@@ -292,7 +272,7 @@ async def _generate_dendrite_ids(self):
await self.playwright_page.wait_for_load_state(
state="load", timeout=3000
)
- logger.debug(
+ logger.exception(
f"Failed to generate dendrite IDs: {e}, attempt {tries+1}/3"
)
tries += 1
diff --git a/dendrite/async_api/_core/_js/__init__.py b/dendrite/browser/async_api/js/__init__.py
similarity index 100%
rename from dendrite/async_api/_core/_js/__init__.py
rename to dendrite/browser/async_api/js/__init__.py
diff --git a/dendrite/async_api/_core/_js/eventListenerPatch.js b/dendrite/browser/async_api/js/eventListenerPatch.js
similarity index 100%
rename from dendrite/async_api/_core/_js/eventListenerPatch.js
rename to dendrite/browser/async_api/js/eventListenerPatch.js
diff --git a/dendrite/sync_api/_core/_js/generateDendriteIDs.js b/dendrite/browser/async_api/js/generateDendriteIDs.js
similarity index 97%
rename from dendrite/sync_api/_core/_js/generateDendriteIDs.js
rename to dendrite/browser/async_api/js/generateDendriteIDs.js
index 1d4b348..d03b8cd 100644
--- a/dendrite/sync_api/_core/_js/generateDendriteIDs.js
+++ b/dendrite/browser/async_api/js/generateDendriteIDs.js
@@ -9,6 +9,7 @@ var hashCode = (str) => {
return hash;
}
+
const getElementIndex = (element) => {
let index = 1;
let sibling = element.previousElementSibling;
@@ -42,7 +43,8 @@ const usedHashes = new Map();
var markHidden = (hidden_element) => {
// Mark the hidden element itself
- hidden
+ hidden_element.setAttribute('data-hidden', 'true');
+
}
document.querySelectorAll('*').forEach((element, index) => {
diff --git a/dendrite/async_api/_core/_js/generateDendriteIDsIframe.js b/dendrite/browser/async_api/js/generateDendriteIDsIframe.js
similarity index 100%
rename from dendrite/async_api/_core/_js/generateDendriteIDsIframe.js
rename to dendrite/browser/async_api/js/generateDendriteIDsIframe.js
diff --git a/dendrite/async_api/_api/response/__init__.py b/dendrite/browser/async_api/manager/__init__.py
similarity index 100%
rename from dendrite/async_api/_api/response/__init__.py
rename to dendrite/browser/async_api/manager/__init__.py
diff --git a/dendrite/async_api/_core/_managers/navigation_tracker.py b/dendrite/browser/async_api/manager/navigation_tracker.py
similarity index 97%
rename from dendrite/async_api/_core/_managers/navigation_tracker.py
rename to dendrite/browser/async_api/manager/navigation_tracker.py
index dc80337..2ae51aa 100644
--- a/dendrite/async_api/_core/_managers/navigation_tracker.py
+++ b/dendrite/browser/async_api/manager/navigation_tracker.py
@@ -1,10 +1,9 @@
import asyncio
import time
-
from typing import TYPE_CHECKING, Dict, Optional
if TYPE_CHECKING:
- from dendrite.async_api._core.dendrite_page import AsyncPage
+ from ..dendrite_page import AsyncPage
class NavigationTracker:
diff --git a/dendrite/async_api/_core/_managers/page_manager.py b/dendrite/browser/async_api/manager/page_manager.py
similarity index 79%
rename from dendrite/async_api/_core/_managers/page_manager.py
rename to dendrite/browser/async_api/manager/page_manager.py
index 0d30cbf..e9069af 100644
--- a/dendrite/async_api/_core/_managers/page_manager.py
+++ b/dendrite/browser/async_api/manager/page_manager.py
@@ -1,12 +1,13 @@
-from typing import Optional, TYPE_CHECKING
+from typing import TYPE_CHECKING, Optional
from loguru import logger
from playwright.async_api import BrowserContext, Download, FileChooser
if TYPE_CHECKING:
- from dendrite.async_api._core.dendrite_browser import AsyncDendrite
-from dendrite.async_api._core._type_spec import PlaywrightPage
-from dendrite.async_api._core.dendrite_page import AsyncPage
+ from ..dendrite_browser import AsyncDendrite
+
+from ..dendrite_page import AsyncPage
+from ..types import PlaywrightPage
class PageManager:
@@ -16,6 +17,17 @@ def __init__(self, dendrite_browser, browser_context: BrowserContext):
self.browser_context = browser_context
self.dendrite_browser: AsyncDendrite = dendrite_browser
+ # Handle existing pages in the context
+ existing_pages = browser_context.pages
+ if existing_pages:
+ for page in existing_pages:
+ client = self.dendrite_browser.logic_engine
+ dendrite_page = AsyncPage(page, self.dendrite_browser, client)
+ self.pages.append(dendrite_page)
+ # Set the first existing page as active
+ if self.active_page is None:
+ self.active_page = dendrite_page
+
browser_context.on("page", self._page_on_open_handler)
async def new_page(self) -> AsyncPage:
@@ -25,7 +37,7 @@ async def new_page(self) -> AsyncPage:
if self.active_page and new_page == self.active_page.playwright_page:
return self.active_page
- client = self.dendrite_browser._get_browser_api_client()
+ client = self.dendrite_browser.logic_engine
dendrite_page = AsyncPage(new_page, self.dendrite_browser, client)
self.pages.append(dendrite_page)
self.active_page = dendrite_page
@@ -75,7 +87,7 @@ def _page_on_open_handler(self, page: PlaywrightPage):
page.on("download", self._page_on_download_handler)
page.on("filechooser", self._page_on_filechooser_handler)
- client = self.dendrite_browser._get_browser_api_client()
+ client = self.dendrite_browser.logic_engine
dendrite_page = AsyncPage(page, self.dendrite_browser, client)
self.pages.append(dendrite_page)
self.active_page = dendrite_page
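`PageManager` now also adopts tabs that already exist on the context (useful when attaching to a persistent or remote context) and keeps `active_page` pointed at the most recently opened tab via the `page` event. A sketch of the resulting behavior, assuming `dendrite_browser` and `browser_context` come from an already-launched `AsyncDendrite`:

```python
from playwright.async_api import BrowserContext

from dendrite.browser.async_api.manager.page_manager import PageManager


async def demo(dendrite_browser, browser_context: BrowserContext):
    # Tabs already open on the context are wrapped eagerly in __init__,
    # and the first one becomes active_page.
    manager = PageManager(dendrite_browser, browser_context)

    page = await manager.new_page()     # opens a tab, wraps it in AsyncPage
    assert manager.active_page is page  # newest tab is always the active one
```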
diff --git a/dendrite/async_api/_core/_managers/screenshot_manager.py b/dendrite/browser/async_api/manager/screenshot_manager.py
similarity index 97%
rename from dendrite/async_api/_core/_managers/screenshot_manager.py
rename to dendrite/browser/async_api/manager/screenshot_manager.py
index 6fce4b1..2c2613c 100644
--- a/dendrite/async_api/_core/_managers/screenshot_manager.py
+++ b/dendrite/browser/async_api/manager/screenshot_manager.py
@@ -2,7 +2,7 @@
import os
from uuid import uuid4
-from dendrite.async_api._core._type_spec import PlaywrightPage
+from ..types import PlaywrightPage
class ScreenshotManager:
diff --git a/dendrite/browser/async_api/mixin/__init__.py b/dendrite/browser/async_api/mixin/__init__.py
new file mode 100644
index 0000000..046a61c
--- /dev/null
+++ b/dendrite/browser/async_api/mixin/__init__.py
@@ -0,0 +1,21 @@
+from .ask import AskMixin
+from .click import ClickMixin
+from .extract import ExtractionMixin
+from .fill_fields import FillFieldsMixin
+from .get_element import GetElementMixin
+from .keyboard import KeyboardMixin
+from .markdown import MarkdownMixin
+from .screenshot import ScreenshotMixin
+from .wait_for import WaitForMixin
+
+__all__ = [
+ "AskMixin",
+ "ClickMixin",
+ "ExtractionMixin",
+ "FillFieldsMixin",
+ "GetElementMixin",
+ "KeyboardMixin",
+ "MarkdownMixin",
+ "ScreenshotMixin",
+ "WaitForMixin",
+]
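The new `mixin` package re-exports every capability mixin from one place, so composite classes can pull them in with a single import. A sketch of the pattern, mirroring how `AsyncDendrite` and `AsyncPage` are declared elsewhere in this patch (the composite class here is hypothetical):

```python
from dendrite.browser.async_api.mixin import ClickMixin, ExtractionMixin


class MyPageLike(ExtractionMixin, ClickMixin):
    """Hypothetical composite; a real one must also satisfy the
    DendritePageProtocol methods (_get_page, logic_engine, ...)."""
```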
diff --git a/dendrite/async_api/_core/mixin/ask.py b/dendrite/browser/async_api/mixin/ask.py
similarity index 93%
rename from dendrite/async_api/_core/mixin/ask.py
rename to dendrite/browser/async_api/mixin/ask.py
index 05f6a04..b7efd7a 100644
--- a/dendrite/async_api/_core/mixin/ask.py
+++ b/dendrite/browser/async_api/mixin/ask.py
@@ -4,16 +4,12 @@
from loguru import logger
-from dendrite.async_api._api.dto.ask_page_dto import AskPageDTO
-from dendrite.async_api._core._type_spec import (
- JsonSchema,
- PydanticModel,
- TypeSpec,
- convert_to_type_spec,
- to_json_schema,
-)
-from dendrite.async_api._core.protocol.page_protocol import DendritePageProtocol
-from dendrite._common._exceptions.dendrite_exception import DendriteException
+from dendrite.browser._common._exceptions.dendrite_exception import DendriteException
+from dendrite.browser.async_api._utils import convert_to_type_spec, to_json_schema
+from dendrite.models.dto.ask_page_dto import AskPageDTO
+
+from ..protocol.page_protocol import DendritePageProtocol
+from ..types import JsonSchema, PydanticModel, TypeSpec
# The timeout interval between retries in milliseconds
TIMEOUT_INTERVAL = [150, 450, 1000]
@@ -135,7 +131,6 @@ async def ask(
Raises:
DendriteException: If the request fails, the exception includes the failure message and a screenshot.
"""
- api_config = self._get_dendrite_browser().api_config
start_time = time.time()
attempt_start = start_time
attempt = -1
@@ -182,13 +177,12 @@ async def ask(
dto = AskPageDTO(
page_information=page_information,
- api_config=api_config,
prompt=entire_prompt,
return_schema=schema,
)
try:
- res = await self._get_browser_api_client().ask_page(dto)
+ res = await self.logic_engine.ask_page(dto)
logger.debug(f"Got response in {time.time() - attempt_start} seconds")
if res.status == "error":
diff --git a/dendrite/async_api/_core/mixin/click.py b/dendrite/browser/async_api/mixin/click.py
similarity index 84%
rename from dendrite/async_api/_core/mixin/click.py
rename to dendrite/browser/async_api/mixin/click.py
index e8b0370..d6460f2 100644
--- a/dendrite/async_api/_core/mixin/click.py
+++ b/dendrite/browser/async_api/mixin/click.py
@@ -1,11 +1,10 @@
-import asyncio
-from typing import Any, Optional
-from dendrite.async_api._api.response.interaction_response import (
- InteractionResponse,
-)
-from dendrite.async_api._core.mixin.get_element import GetElementMixin
-from dendrite.async_api._core.protocol.page_protocol import DendritePageProtocol
-from dendrite._common._exceptions.dendrite_exception import DendriteException
+from typing import Optional
+
+from dendrite.browser._common._exceptions.dendrite_exception import DendriteException
+from dendrite.models.response.interaction_response import InteractionResponse
+
+from ..mixin.get_element import GetElementMixin
+from ..protocol.page_protocol import DendritePageProtocol
class ClickMixin(GetElementMixin, DendritePageProtocol):
diff --git a/dendrite/browser/async_api/mixin/extract.py b/dendrite/browser/async_api/mixin/extract.py
new file mode 100644
index 0000000..b718567
--- /dev/null
+++ b/dendrite/browser/async_api/mixin/extract.py
@@ -0,0 +1,317 @@
+import asyncio
+import time
+from typing import Any, Callable, List, Optional, Type, overload
+
+from loguru import logger
+
+from dendrite.browser.async_api._utils import convert_to_type_spec, to_json_schema
+from dendrite.logic.code.code_session import execute
+from dendrite.models.dto.cached_extract_dto import CachedExtractDTO
+from dendrite.models.dto.extract_dto import ExtractDTO
+from dendrite.models.response.extract_response import ExtractResponse
+from dendrite.models.scripts import Script
+
+from ..manager.navigation_tracker import NavigationTracker
+from ..protocol.page_protocol import DendritePageProtocol
+from ..types import JsonSchema, PydanticModel, TypeSpec
+
+CACHE_TIMEOUT = 5
+
+
+class ExtractionMixin(DendritePageProtocol):
+ """
+ Mixin that provides extraction functionality for web pages.
+
+ This mixin provides various `extract` methods that allow extracting
+ different types of data (e.g., bool, int, float, string, Pydantic models, etc.)
+ from a web page based on a given prompt.
+ """
+
+ @overload
+ async def extract(
+ self,
+ prompt: str,
+ type_spec: Type[bool],
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> bool: ...
+
+ @overload
+ async def extract(
+ self,
+ prompt: str,
+ type_spec: Type[int],
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> int: ...
+
+ @overload
+ async def extract(
+ self,
+ prompt: str,
+ type_spec: Type[float],
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> float: ...
+
+ @overload
+ async def extract(
+ self,
+ prompt: str,
+ type_spec: Type[str],
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> str: ...
+
+ @overload
+ async def extract(
+ self,
+ prompt: Optional[str],
+ type_spec: Type[PydanticModel],
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> PydanticModel: ...
+
+ @overload
+ async def extract(
+ self,
+ prompt: Optional[str],
+ type_spec: JsonSchema,
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> JsonSchema: ...
+
+ @overload
+ async def extract(
+ self,
+ prompt: str,
+ type_spec: None = None,
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> Any: ...
+
+ async def extract(
+ self,
+ prompt: Optional[str],
+ type_spec: Optional[TypeSpec] = None,
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> TypeSpec:
+ """
+ Extract data from a web page based on a prompt and optional type specification.
+        Args:
+            prompt (Optional[str]): The prompt to describe the information to extract.
+            type_spec (Optional[TypeSpec], optional): The type specification for the extracted data.
+            use_cache (bool, optional): Whether to use cached results. Defaults to True.
+            timeout (int, optional): Maximum time in seconds for the entire operation. If use_cache=True,
+                up to 5 seconds will be spent attempting to use cached scripts before falling back to the
+                extraction agent, which will attempt to generate a new script in the remaining time. Defaults to 180.
+
+        Returns:
+            TypeSpec: The extracted data, converted to the specified type when a type_spec is provided, or None if extraction fails.
+ Raises:
+ TimeoutError: If the extraction process exceeds the specified timeout.
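+
+        Example (illustrative; assumes a page object exposing this mixin):
+            >>> price = await page.extract("The product price in USD", float)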
+ """
+ logger.info(f"Starting extraction with prompt: {prompt}")
+
+ json_schema = None
+ if type_spec:
+ json_schema = to_json_schema(type_spec)
+ logger.debug(f"Type specification converted to JSON schema: {json_schema}")
+
+ if prompt is None:
+ prompt = ""
+
+ start_time = time.time()
+ page = await self._get_page()
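+        # Start tracking navigations so page changes during extraction can be detected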
+ navigation_tracker = NavigationTracker(page)
+ navigation_tracker.start_nav_tracking()
+
+ # First try using cached extraction if enabled
+ if use_cache:
+ logger.info("Testing cache")
+ cached_result = await self._try_cached_extraction(prompt, json_schema)
+ if cached_result:
+ return convert_and_return_result(cached_result, type_spec)
+
+        # Cache failed or was disabled; fall back to the extraction agent
+        logger.info(
+            "Using extraction agent, since no cached script was found or the cached scripts failed."
+        )
+ result = await self._extract_with_agent(
+ prompt,
+ json_schema,
+ timeout - (time.time() - start_time),
+ )
+
+ if result:
+ return convert_and_return_result(result, type_spec)
+
+ logger.error(f"Extraction failed after {time.time() - start_time:.2f} seconds")
+ return None
+
+ async def _try_cached_extraction(
+ self,
+ prompt: str,
+ json_schema: Optional[JsonSchema],
+ ) -> Optional[ExtractResponse]:
+ """
+ Attempts to extract data using cached scripts with exponential backoff.
+ Only tries up to 5 most recent scripts.
+
+ Args:
+ prompt: The prompt describing what to extract
+ json_schema: Optional JSON schema for type validation
+
+ Returns:
+ ExtractResponse if successful, None otherwise
+ """
+ page = await self._get_page()
+ dto = CachedExtractDTO(url=page.url, prompt=prompt)
+ scripts = await self.logic_engine.get_cached_scripts(dto)
+ logger.debug(f"Found {len(scripts)} scripts in cache, {scripts}")
+ if len(scripts) == 0:
+ logger.debug(
+ f"No scripts found in cache for prompt: {prompt} in domain: {page.url}"
+ )
+ return None
+
+ async def try_cached_extract():
+ page = await self._get_page()
+ soup = await page._get_soup()
+ # Take at most the last 5 scripts
+ recent_scripts = scripts[-min(5, len(scripts)) :]
+ for script in recent_scripts:
+ res = await test_script(script, str(soup), json_schema)
+ if res is not None:
+ return ExtractResponse(
+ status="success",
+ message="Re-used a preexisting script from cache with the same specifications.",
+ return_data=res,
+ created_script=script.script,
+ )
+
+ return None
+
+ return await _attempt_with_backoff_helper(
+ "cached_extraction",
+ try_cached_extract,
+ CACHE_TIMEOUT,
+ )
+
+ async def _extract_with_agent(
+ self,
+ prompt: str,
+ json_schema: Optional[JsonSchema],
+ remaining_timeout: float,
+ ) -> Optional[ExtractResponse]:
+ """
+ Attempts to extract data using the extraction agent with exponential backoff.
+
+ Args:
+ prompt: The prompt describing what to extract
+ json_schema: Optional JSON schema for type validation
+ remaining_timeout: Maximum time to spend on extraction
+
+ Returns:
+ ExtractResponse if successful, None otherwise
+ """
+
+ async def try_extract_with_agent():
+ page = await self._get_page()
+ page_information = await page.get_page_information(include_screenshot=True)
+ extract_dto = ExtractDTO(
+ page_information=page_information,
+ prompt=prompt,
+ return_data_json_schema=json_schema,
+ use_screenshot=True,
+ )
+
+ res: ExtractResponse = await self.logic_engine.extract(extract_dto)
+
+ if res.status == "impossible":
+ logger.error(f"Impossible to extract data. Reason: {res.message}")
+ return None
+
+ if res.status == "success":
+ logger.success(f"Extraction successful: '{res.message}'")
+ return res
+
+ return None
+
+ return await _attempt_with_backoff_helper(
+ "extraction_agent",
+ try_extract_with_agent,
+ remaining_timeout,
+ )
+
+
+async def _attempt_with_backoff_helper(
+ operation_name: str,
+ operation: Callable,
+ timeout: float,
+ backoff_intervals: List[float] = [0.15, 0.45, 1.0, 2.0, 4.0, 8.0],
+) -> Optional[Any]:
+ """
+ Generic helper function that implements exponential backoff for operations.
+
+ Args:
+ operation_name: Name of the operation for logging
+ operation: Async function to execute
+ timeout: Maximum time to spend attempting the operation
+ backoff_intervals: List of timeouts between attempts
+
+ Returns:
+ The result of the operation if successful, None otherwise
+ """
+ total_elapsed_time = 0
+ start_time = time.time()
+
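+    # Each backoff interval is the target duration of one attempt cycle: run the
+    # operation, then sleep for whatever remains of the interval before retrying.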
+ for i, current_timeout in enumerate(backoff_intervals):
+ if total_elapsed_time >= timeout:
+ logger.error(f"Timeout reached after {total_elapsed_time:.2f} seconds")
+ return None
+
+ request_start_time = time.time()
+ result = await operation()
+ request_duration = time.time() - request_start_time
+
+ if result:
+ return result
+
+ sleep_duration = max(0, current_timeout - request_duration)
+ logger.info(
+ f"{operation_name} attempt {i+1} failed. Sleeping for {sleep_duration:.2f} seconds"
+ )
+ await asyncio.sleep(sleep_duration)
+ total_elapsed_time = time.time() - start_time
+
+ logger.error(
+ f"All {operation_name} attempts failed after {total_elapsed_time:.2f} seconds"
+ )
+ return None
+
+
+def convert_and_return_result(
+ res: ExtractResponse, type_spec: Optional[TypeSpec]
+) -> TypeSpec:
+ converted_res = res.return_data
+ if type_spec is not None:
+ logger.debug("Converting extraction result to specified type")
+ converted_res = convert_to_type_spec(type_spec, res.return_data)
+
+ logger.info("Extraction process completed successfully")
+ return converted_res
+
+
+async def test_script(
+    script: Script,
+    raw_html: str,
+    return_data_json_schema: Any,
+) -> Optional[Any]:
+    try:
+        return execute(script.script, raw_html, return_data_json_schema)
+    except Exception as e:
+        logger.debug(f"Script failed with error: {e}")
+        return None
diff --git a/dendrite/async_api/_core/mixin/fill_fields.py b/dendrite/browser/async_api/mixin/fill_fields.py
similarity index 90%
rename from dendrite/async_api/_core/mixin/fill_fields.py
rename to dendrite/browser/async_api/mixin/fill_fields.py
index 55d5760..fad759f 100644
--- a/dendrite/async_api/_core/mixin/fill_fields.py
+++ b/dendrite/browser/async_api/mixin/fill_fields.py
@@ -1,11 +1,11 @@
import asyncio
from typing import Any, Dict, Optional
-from dendrite.async_api._api.response.interaction_response import (
- InteractionResponse,
-)
-from dendrite.async_api._core.mixin.get_element import GetElementMixin
-from dendrite.async_api._core.protocol.page_protocol import DendritePageProtocol
-from dendrite._common._exceptions.dendrite_exception import DendriteException
+
+from dendrite.browser._common._exceptions.dendrite_exception import DendriteException
+from dendrite.models.response.interaction_response import InteractionResponse
+
+from ..mixin.get_element import GetElementMixin
+from ..protocol.page_protocol import DendritePageProtocol
class FillFieldsMixin(GetElementMixin, DendritePageProtocol):
diff --git a/dendrite/browser/async_api/mixin/get_element.py b/dendrite/browser/async_api/mixin/get_element.py
new file mode 100644
index 0000000..51f8235
--- /dev/null
+++ b/dendrite/browser/async_api/mixin/get_element.py
@@ -0,0 +1,304 @@
+import asyncio
+import time
+from typing import (
+ TYPE_CHECKING,
+ Any,
+ Callable,
+ Dict,
+ List,
+ Literal,
+ Optional,
+ Union,
+ overload,
+)
+
+from bs4 import BeautifulSoup
+from loguru import logger
+
+from .._utils import _get_all_elements_from_selector_soup
+from ..dendrite_element import AsyncElement
+
+if TYPE_CHECKING:
+ from ..dendrite_page import AsyncPage
+
+from dendrite.models.dto.cached_selector_dto import CachedSelectorDTO
+from dendrite.models.dto.get_elements_dto import GetElementsDTO
+
+from ..protocol.page_protocol import DendritePageProtocol
+
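+# Max seconds to spend attempting cached selectors before falling back to the find element agent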
+CACHE_TIMEOUT = 5
+
+
+class GetElementMixin(DendritePageProtocol):
+ async def get_element(
+ self,
+ prompt: str,
+ use_cache=True,
+ timeout=15000,
+ ) -> Optional[AsyncElement]:
+ """
+ Retrieves a single Dendrite element based on the provided prompt.
+
+ Args:
+ prompt (str): The prompt describing the element to be retrieved.
+ use_cache (bool, optional): Whether to use cached results. Defaults to True.
+ timeout (int, optional): Maximum time in milliseconds for the entire operation. If use_cache=True,
+ up to 5000ms will be spent attempting to use cached selectors before falling back to the
+ find element agent for the remaining time. Defaults to 15000 (15 seconds).
+
+        Returns:
+            Optional[AsyncElement]: The retrieved element, or None if no matching element was found within the timeout.
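+
+        Example (illustrative; assumes an AsyncPage exposing this mixin):
+            >>> button = await page.get_element("The primary submit button")
+            >>> if button:
+            ...     await button.click()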
+ """
+ return await self._get_element(
+ prompt,
+ only_one=True,
+ use_cache=use_cache,
+ timeout=timeout / 1000,
+ )
+
+ @overload
+ async def _get_element(
+ self,
+ prompt_or_elements: str,
+ only_one: Literal[True],
+ use_cache: bool,
+ timeout,
+ ) -> Optional[AsyncElement]:
+ """
+ Retrieves a single Dendrite element based on the provided prompt.
+
+ Args:
+            prompt_or_elements (str): The prompt describing the element to be retrieved.
+            only_one (Literal[True]): Indicates that only one element should be retrieved.
+            use_cache (bool): Whether to use cached results.
+            timeout (float): Maximum time in seconds for the entire operation. If use_cache=True,
+                up to 5 seconds will be spent attempting to use cached selectors before falling back to the
+                find element agent for the remaining time.
+
+ Returns:
+ AsyncElement: The retrieved element.
+ """
+
+ @overload
+ async def _get_element(
+ self,
+ prompt_or_elements: str,
+ only_one: Literal[False],
+ use_cache: bool,
+ timeout,
+ ) -> List[AsyncElement]:
+ """
+ Retrieves a list of Dendrite elements based on the provided prompt.
+
+ Args:
+            prompt_or_elements (str): The prompt describing the elements to be retrieved.
+            only_one (Literal[False]): Indicates that multiple elements should be retrieved.
+            use_cache (bool): Whether to use cached results.
+            timeout (float): Maximum time in seconds for the entire operation. If use_cache=True,
+                up to 5 seconds will be spent attempting to use cached selectors before falling back to the
+                find element agent for the remaining time.
+
+ Returns:
+ List[AsyncElement]: A list of retrieved elements.
+ """
+
+ async def _get_element(
+ self,
+ prompt_or_elements: str,
+ only_one: bool,
+ use_cache: bool,
+ timeout: float,
+ ) -> Union[
+ Optional[AsyncElement],
+ List[AsyncElement],
+ ]:
+ """
+ Retrieves Dendrite elements based on the provided prompt, either a single element or a list of elements.
+
+ This method sends a request with the prompt and retrieves the elements based on the `only_one` flag.
+
+ Args:
+            prompt_or_elements (str): The prompt describing the element(s) to retrieve.
+            only_one (bool): Whether to retrieve only one element or a list of elements.
+            use_cache (bool): Whether to use cached results.
+            timeout (float): Maximum time in seconds for the entire operation. If use_cache=True,
+                up to 5 seconds will be spent attempting to use cached selectors before falling back to the
+                find element agent for the remaining time.
+
+        Returns:
+            Union[Optional[AsyncElement], List[AsyncElement]]: The retrieved element or list of elements, or None if none were found.
+ """
+
+ logger.info(f"Getting element for prompt: '{prompt_or_elements}'")
+ start_time = time.time()
+ page = await self._get_page()
+ soup = await page._get_soup()
+
+ if use_cache:
+ cached_elements = await self._try_cached_selectors(
+ page, soup, prompt_or_elements, only_one
+ )
+ if cached_elements:
+ return cached_elements
+
+        # No cached selectors were found, or they failed repeatedly; fall back to the find element agent
+        logger.info(
+            "Using the find element agent to find the requested elements."
+        )
+ res = await try_get_element(
+ self,
+ prompt_or_elements,
+ only_one,
+ remaining_timeout=timeout - (time.time() - start_time),
+ )
+ if res:
+ return res
+
+ logger.error(
+ f"Failed to retrieve elements within the specified timeout of {timeout} seconds"
+ )
+ return None
+
+ async def _try_cached_selectors(
+ self,
+ page: "AsyncPage",
+ soup: BeautifulSoup,
+ prompt: str,
+ only_one: bool,
+ ) -> Union[Optional[AsyncElement], List[AsyncElement]]:
+ """
+ Attempts to retrieve elements using cached selectors with exponential backoff.
+
+ Args:
+ page: The current page object
+ soup: The BeautifulSoup object of the current page
+ prompt: The prompt to search for
+ only_one: Whether to return only one element
+
+ Returns:
+ The found elements if successful, None otherwise
+ """
+ dto = CachedSelectorDTO(url=page.url, prompt=prompt)
+ selectors = await self.logic_engine.get_cached_selectors(dto)
+
+ if len(selectors) == 0:
+ logger.debug("No cached selectors found")
+ return None
+
+ logger.debug("Attempting to use cached selectors with backoff")
+ # Take at most the last 5 selectors
+ recent_selectors = selectors[-min(5, len(selectors)) :]
+        str_selectors = [s.selector for s in recent_selectors]
+
+ async def try_cached_selectors():
+ return await get_elements_from_selectors_soup(
+ page, soup, str_selectors, only_one
+ )
+
+ return await _attempt_with_backoff_helper(
+ "cached_selectors",
+ try_cached_selectors,
+ timeout=CACHE_TIMEOUT,
+ )
+
+
+async def _attempt_with_backoff_helper(
+ operation_name: str,
+ operation: Callable,
+ timeout: float,
+ backoff_intervals: List[float] = [0.15, 0.45, 1.0, 2.0, 4.0, 8.0],
+) -> Optional[Any]:
+ """
+ Generic helper function that implements exponential backoff for operations.
+
+ Args:
+ operation_name: Name of the operation for logging
+ operation: Async function to execute
+ timeout: Maximum time to spend attempting the operation
+ backoff_intervals: List of timeouts between attempts
+
+ Returns:
+ The result of the operation if successful, None otherwise
+ """
+ total_elapsed_time = 0
+ start_time = time.time()
+
+ for i, current_timeout in enumerate(backoff_intervals):
+ if total_elapsed_time >= timeout:
+ logger.error(f"Timeout reached after {total_elapsed_time:.2f} seconds")
+ return None
+
+ request_start_time = time.time()
+ result = await operation()
+ request_duration = time.time() - request_start_time
+
+ if result:
+ return result
+
+ sleep_duration = max(0, current_timeout - request_duration)
+ logger.info(
+ f"{operation_name} attempt {i+1} failed. Sleeping for {sleep_duration:.2f} seconds"
+ )
+ await asyncio.sleep(sleep_duration)
+ total_elapsed_time = time.time() - start_time
+
+ logger.error(
+ f"All {operation_name} attempts failed after {total_elapsed_time:.2f} seconds"
+ )
+ return None
+
+
+async def try_get_element(
+ obj: DendritePageProtocol,
+ prompt_or_elements: Union[str, Dict[str, str]],
+ only_one: bool,
+ remaining_timeout: float,
+) -> Union[Optional[AsyncElement], List[AsyncElement]]:
+
+ async def _try_get_element():
+ page = await obj._get_page()
+ page_information = await page.get_page_information()
+ dto = GetElementsDTO(
+ page_information=page_information,
+ prompt=prompt_or_elements,
+ only_one=only_one,
+ )
+ res = await obj.logic_engine.get_element(dto)
+
+ if res.status == "impossible":
+ logger.error(
+ f"Impossible to get elements for '{prompt_or_elements}'. Reason: {res.message}"
+ )
+ return None
+
+ if res.status == "success":
+ logger.success(f"d[id]: {res.d_id} Selectors:{res.selectors}")
+ if res.selectors is not None:
+ return await get_elements_from_selectors_soup(
+ page, await page._get_previous_soup(), res.selectors, only_one
+ )
+ return None
+
+ return await _attempt_with_backoff_helper(
+ "find_element_agent",
+ _try_get_element,
+ remaining_timeout,
+ )
+
+
+async def get_elements_from_selectors_soup(
+ page: "AsyncPage",
+ soup: BeautifulSoup,
+ selectors: List[str],
+ only_one: bool,
+) -> Union[Optional[AsyncElement], List[AsyncElement]]:
+
+ for selector in reversed(selectors):
+ dendrite_elements = await _get_all_elements_from_selector_soup(
+ selector, soup, page
+ )
+
+ if len(dendrite_elements) > 0:
+ return dendrite_elements[0] if only_one else dendrite_elements
+
+ return None
diff --git a/dendrite/async_api/_core/mixin/keyboard.py b/dendrite/browser/async_api/mixin/keyboard.py
similarity index 90%
rename from dendrite/async_api/_core/mixin/keyboard.py
rename to dendrite/browser/async_api/mixin/keyboard.py
index ee26559..bf4e145 100644
--- a/dendrite/async_api/_core/mixin/keyboard.py
+++ b/dendrite/browser/async_api/mixin/keyboard.py
@@ -1,6 +1,8 @@
-from typing import Any, Union, Literal
-from dendrite.async_api._core.protocol.page_protocol import DendritePageProtocol
-from dendrite._common._exceptions.dendrite_exception import DendriteException
+from typing import Literal, Union
+
+from dendrite.browser._common._exceptions.dendrite_exception import DendriteException
+
+from ..protocol.page_protocol import DendritePageProtocol
class KeyboardMixin(DendritePageProtocol):
diff --git a/dendrite/async_api/_core/mixin/markdown.py b/dendrite/browser/async_api/mixin/markdown.py
similarity index 90%
rename from dendrite/async_api/_core/mixin/markdown.py
rename to dendrite/browser/async_api/mixin/markdown.py
index 01ada25..687db67 100644
--- a/dendrite/async_api/_core/mixin/markdown.py
+++ b/dendrite/browser/async_api/mixin/markdown.py
@@ -1,12 +1,12 @@
-from typing import Optional
-from bs4 import BeautifulSoup
import re
+from typing import Optional
-from dendrite.async_api._core.mixin.extract import ExtractionMixin
-from dendrite.async_api._core.protocol.page_protocol import DendritePageProtocol
-
+from bs4 import BeautifulSoup
from markdownify import markdownify as md
+from ..mixin.extract import ExtractionMixin
+from ..protocol.page_protocol import DendritePageProtocol
+
class MarkdownMixin(ExtractionMixin, DendritePageProtocol):
async def markdown(self, prompt: Optional[str] = None):
diff --git a/dendrite/async_api/_core/mixin/screenshot.py b/dendrite/browser/async_api/mixin/screenshot.py
similarity index 88%
rename from dendrite/async_api/_core/mixin/screenshot.py
rename to dendrite/browser/async_api/mixin/screenshot.py
index c150eb4..200d4a1 100644
--- a/dendrite/async_api/_core/mixin/screenshot.py
+++ b/dendrite/browser/async_api/mixin/screenshot.py
@@ -1,4 +1,4 @@
-from dendrite.async_api._core.protocol.page_protocol import DendritePageProtocol
+from ..protocol.page_protocol import DendritePageProtocol
class ScreenshotMixin(DendritePageProtocol):
diff --git a/dendrite/async_api/_core/mixin/wait_for.py b/dendrite/browser/async_api/mixin/wait_for.py
similarity index 88%
rename from dendrite/async_api/_core/mixin/wait_for.py
rename to dendrite/browser/async_api/mixin/wait_for.py
index 6bd042e..7c60f88 100644
--- a/dendrite/async_api/_core/mixin/wait_for.py
+++ b/dendrite/browser/async_api/mixin/wait_for.py
@@ -1,13 +1,15 @@
import asyncio
import time
+from loguru import logger
-from dendrite.async_api._core.mixin.ask import AskMixin
-from dendrite.async_api._core.protocol.page_protocol import DendritePageProtocol
-from dendrite._common._exceptions.dendrite_exception import PageConditionNotMet
-from dendrite._common._exceptions.dendrite_exception import DendriteException
+from dendrite.browser._common._exceptions.dendrite_exception import (
+ DendriteException,
+ PageConditionNotMet,
+)
-from loguru import logger
+from ..mixin.ask import AskMixin
+from ..protocol.page_protocol import DendritePageProtocol
class WaitForMixin(AskMixin, DendritePageProtocol):
diff --git a/dendrite/async_api/_common/__init__.py b/dendrite/browser/async_api/protocol/__init__.py
similarity index 100%
rename from dendrite/async_api/_common/__init__.py
rename to dendrite/browser/async_api/protocol/__init__.py
diff --git a/dendrite/browser/async_api/protocol/browser_protocol.py b/dendrite/browser/async_api/protocol/browser_protocol.py
new file mode 100644
index 0000000..304064d
--- /dev/null
+++ b/dendrite/browser/async_api/protocol/browser_protocol.py
@@ -0,0 +1,68 @@
+from typing import TYPE_CHECKING, Optional, Protocol, Union
+
+from typing_extensions import Literal
+
+from dendrite.browser.remote import Providers
+
+if TYPE_CHECKING:
+ from ..dendrite_browser import AsyncDendrite
+
+from playwright.async_api import Browser, Download, Playwright
+
+from ..types import PlaywrightPage
+
+
+class BrowserProtocol(Protocol):
+ def __init__(self, settings: Providers) -> None: ...
+
+ async def get_download(
+ self, dendrite_browser: "AsyncDendrite", pw_page: PlaywrightPage, timeout: float
+ ) -> Download:
+ """
+ Retrieves the download event from the browser.
+
+ Returns:
+ Download: The download event.
+
+ Raises:
+ Exception: If there is an issue retrieving the download event.
+ """
+ ...
+
+ async def start_browser(
+ self,
+ playwright: Playwright,
+ pw_options: dict,
+ ) -> Browser:
+ """
+ Starts the browser session.
+
+ Args:
+ playwright: The playwright instance
+ pw_options: Playwright launch options
+
+ Returns:
+ Browser: A Browser instance
+ """
+ ...
+
+ async def configure_context(self, browser: "AsyncDendrite") -> None:
+ """
+ Configures the browser context.
+
+ Args:
+ browser (AsyncDendrite): The browser to configure.
+
+ Raises:
+ Exception: If there is an issue configuring the browser context.
+ """
+ ...
+
+ async def stop_session(self) -> None:
+ """
+ Stops the browser session.
+
+ Raises:
+ Exception: If there is an issue stopping the browser session.
+ """
+ ...
diff --git a/dendrite/async_api/_core/models/download_interface.py b/dendrite/browser/async_api/protocol/download_protocol.py
similarity index 99%
rename from dendrite/async_api/_core/models/download_interface.py
rename to dendrite/browser/async_api/protocol/download_protocol.py
index c38a486..bdb7ba9 100644
--- a/dendrite/async_api/_core/models/download_interface.py
+++ b/dendrite/browser/async_api/protocol/download_protocol.py
@@ -1,6 +1,7 @@
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Union
+
from playwright.async_api import Download
diff --git a/dendrite/browser/async_api/protocol/page_protocol.py b/dendrite/browser/async_api/protocol/page_protocol.py
new file mode 100644
index 0000000..7716352
--- /dev/null
+++ b/dendrite/browser/async_api/protocol/page_protocol.py
@@ -0,0 +1,22 @@
+from typing import TYPE_CHECKING, Protocol
+
+from dendrite.logic import AsyncLogicEngine
+
+if TYPE_CHECKING:
+ from ..dendrite_browser import AsyncDendrite
+ from ..dendrite_page import AsyncPage
+
+
+class DendritePageProtocol(Protocol):
+ """
+ Protocol that specifies the required methods and attributes
+ for the `ExtractionMixin` to work.
+ """
+
+ @property
+ def logic_engine(self) -> AsyncLogicEngine: ...
+
+ @property
+ def dendrite_browser(self) -> "AsyncDendrite": ...
+
+ async def _get_page(self) -> "AsyncPage": ...
diff --git a/dendrite/browser/async_api/types.py b/dendrite/browser/async_api/types.py
new file mode 100644
index 0000000..1703b8c
--- /dev/null
+++ b/dendrite/browser/async_api/types.py
@@ -0,0 +1,15 @@
+from typing import Any, Dict, Literal, Type, TypeVar, Union
+
+from playwright.async_api import Page
+from pydantic import BaseModel
+
+Interaction = Literal["click", "fill", "hover"]
+
+T = TypeVar("T")
+PydanticModel = TypeVar("PydanticModel", bound=BaseModel)
+PrimitiveTypes = Union[Type[bool], Type[int], Type[float], Type[str]]
+JsonSchema = Dict[str, Any]
+TypeSpec = Union[PrimitiveTypes, PydanticModel, JsonSchema]
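+# A TypeSpec may be a primitive type, a Pydantic model class, or a raw JSON schema dict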
+
+PlaywrightPage = Page
diff --git a/dendrite/browser/remote/__init__.py b/dendrite/browser/remote/__init__.py
new file mode 100644
index 0000000..b37ef34
--- /dev/null
+++ b/dendrite/browser/remote/__init__.py
@@ -0,0 +1,8 @@
+from typing import Union
+
+from dendrite.browser.remote.browserbase_config import BrowserbaseConfig
+from dendrite.browser.remote.browserless_config import BrowserlessConfig
+
+Providers = Union[BrowserbaseConfig, BrowserlessConfig]
+
+__all__ = ["Providers", "BrowserbaseConfig", "BrowserlessConfig"]
diff --git a/dendrite/remote/browserbase_config.py b/dendrite/browser/remote/browserbase_config.py
similarity index 99%
rename from dendrite/remote/browserbase_config.py
rename to dendrite/browser/remote/browserbase_config.py
index b526b52..f86b02c 100644
--- a/dendrite/remote/browserbase_config.py
+++ b/dendrite/browser/remote/browserbase_config.py
@@ -1,5 +1,6 @@
import os
from typing import Optional
+
from dendrite.exceptions import MissingApiKeyError
diff --git a/dendrite/remote/browserless_config.py b/dendrite/browser/remote/browserless_config.py
similarity index 88%
rename from dendrite/remote/browserless_config.py
rename to dendrite/browser/remote/browserless_config.py
index 88a4efe..7e3bcc8 100644
--- a/dendrite/remote/browserless_config.py
+++ b/dendrite/browser/remote/browserless_config.py
@@ -1,7 +1,7 @@
import os
from typing import Optional
-from dendrite._common._exceptions.dendrite_exception import MissingApiKeyError
+from dendrite.browser._common._exceptions.dendrite_exception import MissingApiKeyError
class BrowserlessConfig:
diff --git a/dendrite/remote/provider.py b/dendrite/browser/remote/provider.py
similarity index 92%
rename from dendrite/remote/provider.py
rename to dendrite/browser/remote/provider.py
index 8a5135f..fd615b0 100644
--- a/dendrite/remote/provider.py
+++ b/dendrite/browser/remote/provider.py
@@ -1,10 +1,8 @@
from pathlib import Path
from typing import Union
-
-from dendrite.remote import Providers
-from dendrite.remote.browserbase_config import BrowserbaseConfig
-
+from dendrite.browser.remote import Providers
+from dendrite.browser.remote.browserbase_config import BrowserbaseConfig
try:
import tomllib # type: ignore
diff --git a/dendrite/browser/sync_api/__init__.py b/dendrite/browser/sync_api/__init__.py
new file mode 100644
index 0000000..8beebcc
--- /dev/null
+++ b/dendrite/browser/sync_api/__init__.py
@@ -0,0 +1,6 @@
+from loguru import logger
+from .dendrite_browser import Dendrite
+from .dendrite_element import Element
+from .dendrite_page import Page
+
+__all__ = ["Dendrite", "Element", "Page"]
diff --git a/dendrite/sync_api/_common/event_sync.py b/dendrite/browser/sync_api/_event_sync.py
similarity index 91%
rename from dendrite/sync_api/_common/event_sync.py
rename to dendrite/browser/sync_api/_event_sync.py
index 162bb8e..4351eee 100644
--- a/dendrite/sync_api/_common/event_sync.py
+++ b/dendrite/browser/sync_api/_event_sync.py
@@ -1,7 +1,7 @@
import time
import time
-from typing import Generic, Optional, Type, TypeVar, Union, cast
-from playwright.sync_api import Page, Download, FileChooser
+from typing import Generic, Optional, Type, TypeVar
+from playwright.sync_api import Download, FileChooser, Page
Events = TypeVar("Events", Download, FileChooser)
mapping = {Download: "download", FileChooser: "filechooser"}
diff --git a/dendrite/browser/sync_api/_utils.py b/dendrite/browser/sync_api/_utils.py
new file mode 100644
index 0000000..24f2b8a
--- /dev/null
+++ b/dendrite/browser/sync_api/_utils.py
@@ -0,0 +1,123 @@
+import inspect
+from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union
+import tldextract
+from bs4 import BeautifulSoup
+from loguru import logger
+from playwright.sync_api import Error, Frame
+from pydantic import BaseModel
+from dendrite.models.selector import Selector
+from .dendrite_element import Element
+from .types import PlaywrightPage, TypeSpec
+
+if TYPE_CHECKING:
+ from .dendrite_page import Page
+from dendrite.logic.dom.strip import mild_strip_in_place
+from .js import GENERATE_DENDRITE_IDS_IFRAME_SCRIPT
+
+
+def get_domain_w_suffix(url: str) -> str:
+ parsed_url = tldextract.extract(url)
+ if parsed_url.suffix == "":
+ raise ValueError(f"Invalid URL: {url}")
+ return f"{parsed_url.domain}.{parsed_url.suffix}"
+
+
+def expand_iframes(page: PlaywrightPage, page_soup: BeautifulSoup):
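+    # Tag elements inside each child iframe with d-ids, then splice each iframe's
+    # DOM into the page soup in place of the corresponding <iframe> element.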
+
+ def get_iframe_path(frame: Frame):
+ path_parts = []
+ current_frame = frame
+ while current_frame.parent_frame is not None:
+ iframe_element = current_frame.frame_element()
+ iframe_id = iframe_element.get_attribute("d-id")
+ if iframe_id is None:
+ return None
+ path_parts.insert(0, iframe_id)
+ current_frame = current_frame.parent_frame
+ return "|".join(path_parts)
+
+ for frame in page.frames:
+ if frame.parent_frame is None:
+ continue
+ try:
+ iframe_element = frame.frame_element()
+ iframe_id = iframe_element.get_attribute("d-id")
+ if iframe_id is None:
+ continue
+ iframe_path = get_iframe_path(frame)
+ except Error as e:
+ continue
+ if iframe_path is None:
+ continue
+ try:
+ frame.evaluate(
+ GENERATE_DENDRITE_IDS_IFRAME_SCRIPT, {"frame_path": iframe_path}
+ )
+ frame_content = frame.content()
+ frame_tree = BeautifulSoup(frame_content, "lxml")
+ mild_strip_in_place(frame_tree)
+ merge_iframe_to_page(iframe_id, page_soup, frame_tree)
+ except Error as e:
+ continue
+
+
+def merge_iframe_to_page(iframe_id: str, page: BeautifulSoup, iframe: BeautifulSoup):
+ iframe_element = page.find("iframe", {"d-id": iframe_id})
+ if iframe_element is None:
+ logger.debug(f"Could not find iframe with ID {iframe_id} in page soup")
+ return
+ iframe_element.replace_with(iframe)
+
+
+def _get_all_elements_from_selector_soup(
+ selector: str, soup: BeautifulSoup, page: "Page"
+) -> List[Element]:
+ dendrite_elements: List[Element] = []
+ elements = soup.select(selector)
+    for element in elements:
+        frame = page._get_context(element)
+        d_id = element.get("d-id", "")
+        if not d_id:
+            continue
+        if isinstance(d_id, list):
+            d_id = d_id[0]
+        # Build the locator only after d-id has been normalized to a string
+        locator = frame.locator(f"xpath=//*[@d-id='{d_id}']")
+ dendrite_elements.append(
+ Element(d_id, locator, page.dendrite_browser, page._browser_api_client)
+ )
+ return dendrite_elements
+
+
+def get_elements_from_selectors_soup(
+ page: "Page", soup: BeautifulSoup, selectors: List[Selector], only_one: bool
+) -> Union[Optional[Element], List[Element]]:
+ for selector in reversed(selectors):
+ dendrite_elements = _get_all_elements_from_selector_soup(
+ selector.selector, soup, page
+ )
+ if len(dendrite_elements) > 0:
+ return dendrite_elements[0] if only_one else dendrite_elements
+ return None
+
+
+def to_json_schema(type_spec: TypeSpec) -> Dict[str, Any]:
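+    # Accepts a ready-made JSON schema dict, a Pydantic model class, or a primitive type (e.g. bool -> {"type": "boolean"})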
+ if isinstance(type_spec, dict):
+ return type_spec
+ if inspect.isclass(type_spec) and issubclass(type_spec, BaseModel):
+ return type_spec.model_json_schema()
+ if type_spec in (bool, int, float, str):
+ type_map = {bool: "boolean", int: "integer", float: "number", str: "string"}
+ return {"type": type_map[type_spec]}
+ raise ValueError(f"Unsupported type specification: {type_spec}")
+
+
+def convert_to_type_spec(type_spec: TypeSpec, return_data: Any) -> TypeSpec:
+ if isinstance(type_spec, type):
+ if issubclass(type_spec, BaseModel):
+ return type_spec.model_validate(return_data)
+ if type_spec in (str, float, bool, int):
+ return type_spec(return_data)
+ raise ValueError(f"Unsupported type: {type_spec}")
+ if isinstance(type_spec, dict):
+ return return_data
+ raise ValueError(f"Unsupported type specification: {type_spec}")
diff --git a/dendrite/sync_api/_ext_impl/__init__.py b/dendrite/browser/sync_api/browser_impl/__init__.py
similarity index 100%
rename from dendrite/sync_api/_ext_impl/__init__.py
rename to dendrite/browser/sync_api/browser_impl/__init__.py
diff --git a/dendrite/sync_api/_ext_impl/browserbase/__init__.py b/dendrite/browser/sync_api/browser_impl/browserbase/__init__.py
similarity index 100%
rename from dendrite/sync_api/_ext_impl/browserbase/__init__.py
rename to dendrite/browser/sync_api/browser_impl/browserbase/__init__.py
diff --git a/dendrite/sync_api/_ext_impl/browserbase/_client.py b/dendrite/browser/sync_api/browser_impl/browserbase/_client.py
similarity index 96%
rename from dendrite/sync_api/_ext_impl/browserbase/_client.py
rename to dendrite/browser/sync_api/browser_impl/browserbase/_client.py
index 5d862e2..ddc6831 100644
--- a/dendrite/sync_api/_ext_impl/browserbase/_client.py
+++ b/dendrite/browser/sync_api/browser_impl/browserbase/_client.py
@@ -1,10 +1,10 @@
import time
-from pathlib import Path
import time
+from pathlib import Path
from typing import Optional, Union
import httpx
from loguru import logger
-from dendrite._common._exceptions.dendrite_exception import DendriteException
+from dendrite.browser._common._exceptions.dendrite_exception import DendriteException
class BrowserbaseClient:
diff --git a/dendrite/sync_api/_ext_impl/browserbase/_download.py b/dendrite/browser/sync_api/browser_impl/browserbase/_download.py
similarity index 91%
rename from dendrite/sync_api/_ext_impl/browserbase/_download.py
rename to dendrite/browser/sync_api/browser_impl/browserbase/_download.py
index e669ba1..c464c81 100644
--- a/dendrite/sync_api/_ext_impl/browserbase/_download.py
+++ b/dendrite/browser/sync_api/browser_impl/browserbase/_download.py
@@ -1,12 +1,12 @@
-from pathlib import Path
import re
import shutil
-from typing import Union
import zipfile
+from pathlib import Path
+from typing import Union
from loguru import logger
from playwright.sync_api import Download
-from dendrite.sync_api._core.models.download_interface import DownloadInterface
-from dendrite.sync_api._ext_impl.browserbase._client import BrowserbaseClient
+from dendrite.browser.sync_api.browser_impl.browserbase._client import BrowserbaseClient
+from dendrite.browser.sync_api.protocol.download_protocol import DownloadInterface
class BrowserbaseDownload(DownloadInterface):
diff --git a/dendrite/sync_api/_ext_impl/browserbase/_impl.py b/dendrite/browser/sync_api/browser_impl/browserbase/_impl.py
similarity index 78%
rename from dendrite/sync_api/_ext_impl/browserbase/_impl.py
rename to dendrite/browser/sync_api/browser_impl/browserbase/_impl.py
index 453c6b6..60ceaf3 100644
--- a/dendrite/sync_api/_ext_impl/browserbase/_impl.py
+++ b/dendrite/browser/sync_api/browser_impl/browserbase/_impl.py
@@ -1,18 +1,20 @@
from typing import TYPE_CHECKING, Optional
-from dendrite._common._exceptions.dendrite_exception import BrowserNotLaunchedError
-from dendrite.sync_api._core._impl_browser import ImplBrowser
-from dendrite.sync_api._core._type_spec import PlaywrightPage
-from dendrite.remote.browserbase_config import BrowserbaseConfig
+from dendrite.browser._common._exceptions.dendrite_exception import (
+ BrowserNotLaunchedError,
+)
+from dendrite.browser.sync_api.protocol.browser_protocol import BrowserProtocol
+from dendrite.browser.sync_api.types import PlaywrightPage
+from dendrite.browser.remote.browserbase_config import BrowserbaseConfig
if TYPE_CHECKING:
- from dendrite.sync_api._core.dendrite_browser import Dendrite
-from dendrite.sync_api._ext_impl.browserbase._client import BrowserbaseClient
-from playwright.sync_api import Playwright
+ from dendrite.browser.sync_api.dendrite_browser import Dendrite
from loguru import logger
-from dendrite.sync_api._ext_impl.browserbase._download import BrowserbaseDownload
+from playwright.sync_api import Playwright
+from ._client import BrowserbaseClient
+from ._download import BrowserbaseDownload
-class BrowserBaseImpl(ImplBrowser):
+class BrowserbaseImpl(BrowserProtocol):
def __init__(self, settings: BrowserbaseConfig) -> None:
self.settings = settings
diff --git a/dendrite/async_api/_core/__init__.py b/dendrite/browser/sync_api/browser_impl/browserless/__init__.py
similarity index 100%
rename from dendrite/async_api/_core/__init__.py
rename to dendrite/browser/sync_api/browser_impl/browserless/__init__.py
diff --git a/dendrite/sync_api/_ext_impl/browserless/_impl.py b/dendrite/browser/sync_api/browser_impl/browserless/_impl.py
similarity index 70%
rename from dendrite/sync_api/_ext_impl/browserless/_impl.py
rename to dendrite/browser/sync_api/browser_impl/browserless/_impl.py
index 5d888e6..822ed48 100644
--- a/dendrite/sync_api/_ext_impl/browserless/_impl.py
+++ b/dendrite/browser/sync_api/browser_impl/browserless/_impl.py
@@ -1,20 +1,24 @@
import json
from typing import TYPE_CHECKING, Optional
-from dendrite._common._exceptions.dendrite_exception import BrowserNotLaunchedError
-from dendrite.sync_api._core._impl_browser import ImplBrowser
-from dendrite.sync_api._core._type_spec import PlaywrightPage
-from dendrite.remote.browserless_config import BrowserlessConfig
+from dendrite.browser._common._exceptions.dendrite_exception import (
+ BrowserNotLaunchedError,
+)
+from dendrite.browser.sync_api.protocol.browser_protocol import BrowserProtocol
+from dendrite.browser.sync_api.types import PlaywrightPage
+from dendrite.browser.remote.browserless_config import BrowserlessConfig
if TYPE_CHECKING:
- from dendrite.sync_api._core.dendrite_browser import Dendrite
-from dendrite.sync_api._ext_impl.browserbase._client import BrowserbaseClient
-from playwright.sync_api import Playwright
-from loguru import logger
+ from dendrite.browser.sync_api.dendrite_browser import Dendrite
import urllib.parse
-from dendrite.sync_api._ext_impl.browserbase._download import BrowserbaseDownload
+from loguru import logger
+from playwright.sync_api import Playwright
+from dendrite.browser.sync_api.browser_impl.browserbase._client import BrowserbaseClient
+from dendrite.browser.sync_api.browser_impl.browserbase._download import (
+ BrowserbaseDownload,
+)
-class BrowserlessImpl(ImplBrowser):
+class BrowserlessImpl(BrowserProtocol):
def __init__(self, settings: BrowserlessConfig) -> None:
self.settings = settings
diff --git a/dendrite/browser/sync_api/browser_impl/impl_mapping.py b/dendrite/browser/sync_api/browser_impl/impl_mapping.py
new file mode 100644
index 0000000..d1e3d65
--- /dev/null
+++ b/dendrite/browser/sync_api/browser_impl/impl_mapping.py
@@ -0,0 +1,29 @@
+from typing import Dict, Optional, Type
+from dendrite.browser.remote import Providers
+from dendrite.browser.remote.browserbase_config import BrowserbaseConfig
+from dendrite.browser.remote.browserless_config import BrowserlessConfig
+from ..protocol.browser_protocol import BrowserProtocol
+from .browserbase._impl import BrowserbaseImpl
+from .browserless._impl import BrowserlessImpl
+from .local._impl import LocalImpl
+
+IMPL_MAPPING: Dict[Type[Providers], Type[BrowserProtocol]] = {
+ BrowserbaseConfig: BrowserbaseImpl,
+ BrowserlessConfig: BrowserlessImpl,
+}
+SETTINGS_CLASSES: Dict[str, Type[Providers]] = {
+ "browserbase": BrowserbaseConfig,
+ "browserless": BrowserlessConfig,
+}
+
+
+def get_impl(remote_provider: Optional[Providers]) -> BrowserProtocol:
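+    # With no remote provider configured, fall back to launching a local browser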
+ if remote_provider is None:
+ return LocalImpl()
+ try:
+ provider_class = IMPL_MAPPING[type(remote_provider)]
+ except KeyError:
+ raise ValueError(
+ f"No implementation for {type(remote_provider)}. Available providers: {', '.join(map(lambda x: x.__name__, IMPL_MAPPING.keys()))}"
+ )
+ return provider_class(remote_provider)
diff --git a/dendrite/browser/sync_api/browser_impl/local/_impl.py b/dendrite/browser/sync_api/browser_impl/local/_impl.py
new file mode 100644
index 0000000..e995cc1
--- /dev/null
+++ b/dendrite/browser/sync_api/browser_impl/local/_impl.py
@@ -0,0 +1,45 @@
+from pathlib import Path
+from typing import TYPE_CHECKING, Optional, Union, overload
+from loguru import logger
+from typing_extensions import Literal
+from dendrite.browser._common.constants import STEALTH_ARGS
+
+if TYPE_CHECKING:
+ from dendrite.browser.sync_api.dendrite_browser import Dendrite
+import os
+import shutil
+import tempfile
+from playwright.sync_api import (
+ Browser,
+ BrowserContext,
+ Download,
+ Playwright,
+ StorageState,
+)
+from dendrite.browser.sync_api.protocol.browser_protocol import BrowserProtocol
+from dendrite.browser.sync_api.types import PlaywrightPage
+
+
+class LocalImpl(BrowserProtocol):
+
+ def __init__(self) -> None:
+ pass
+
+ def start_browser(
+ self,
+ playwright: Playwright,
+ pw_options: dict,
+ storage_state: Optional[StorageState] = None,
+ ) -> Browser:
+ return playwright.chromium.launch(**pw_options)
+
+ def get_download(
+ self, dendrite_browser: "Dendrite", pw_page: PlaywrightPage, timeout: float
+ ) -> Download:
+ return dendrite_browser._download_handler.get_data(pw_page, timeout)
+
+ def configure_context(self, browser: "Dendrite"):
+ pass
+
+ def stop_session(self):
+ pass
diff --git a/dendrite/sync_api/_core/dendrite_browser.py b/dendrite/browser/sync_api/dendrite_browser.py
similarity index 69%
rename from dendrite/sync_api/_core/dendrite_browser.py
rename to dendrite/browser/sync_api/dendrite_browser.py
index 259841e..6747a19 100644
--- a/dendrite/sync_api/_core/dendrite_browser.py
+++ b/dendrite/browser/sync_api/dendrite_browser.py
@@ -1,46 +1,45 @@
-from abc import ABC, abstractmethod
+import os
import pathlib
import re
-from typing import Any, List, Literal, Optional, Sequence, Union
+from abc import ABC
+from typing import Any, List, Optional, Sequence, Union
from uuid import uuid4
-import os
from loguru import logger
from playwright.sync_api import (
- sync_playwright,
- Playwright,
- BrowserContext,
- FileChooser,
Download,
Error,
+ FileChooser,
FilePayload,
+ StorageState,
+ sync_playwright,
)
-from dendrite.sync_api._api.dto.authenticate_dto import AuthenticateDTO
-from dendrite.sync_api._api.dto.upload_auth_session_dto import UploadAuthSessionDTO
-from dendrite.sync_api._common.event_sync import EventSync
-from dendrite.sync_api._core._impl_browser import ImplBrowser
-from dendrite.sync_api._core._impl_mapping import get_impl
-from dendrite.sync_api._core._managers.page_manager import PageManager
-from dendrite.sync_api._core._type_spec import PlaywrightPage
-from dendrite.sync_api._core.dendrite_page import Page
-from dendrite.sync_api._common.constants import STEALTH_ARGS
-from dendrite.sync_api._core.mixin.ask import AskMixin
-from dendrite.sync_api._core.mixin.click import ClickMixin
-from dendrite.sync_api._core.mixin.extract import ExtractionMixin
-from dendrite.sync_api._core.mixin.fill_fields import FillFieldsMixin
-from dendrite.sync_api._core.mixin.get_element import GetElementMixin
-from dendrite.sync_api._core.mixin.keyboard import KeyboardMixin
-from dendrite.sync_api._core.mixin.screenshot import ScreenshotMixin
-from dendrite.sync_api._core.mixin.wait_for import WaitForMixin
-from dendrite.sync_api._core.mixin.markdown import MarkdownMixin
-from dendrite.sync_api._core.models.authentication import AuthSession
-from dendrite.sync_api._core.models.api_config import APIConfig
-from dendrite.sync_api._api.browser_api_client import BrowserAPIClient
-from dendrite._common._exceptions.dendrite_exception import (
+from dendrite.browser._common._exceptions.dendrite_exception import (
BrowserNotLaunchedError,
DendriteException,
IncorrectOutcomeError,
)
-from dendrite.remote import Providers
+from dendrite.browser._common.constants import STEALTH_ARGS
+from dendrite.browser.sync_api._utils import get_domain_w_suffix
+from dendrite.browser.remote import Providers
+from dendrite.logic.config import Config
+from dendrite.logic import LogicEngine
+from ._event_sync import EventSync
+from .browser_impl.impl_mapping import get_impl
+from .dendrite_page import Page
+from .manager.page_manager import PageManager
+from .mixin import (
+ AskMixin,
+ ClickMixin,
+ ExtractionMixin,
+ FillFieldsMixin,
+ GetElementMixin,
+ KeyboardMixin,
+ MarkdownMixin,
+ ScreenshotMixin,
+ WaitForMixin,
+)
+from .protocol.browser_protocol import BrowserProtocol
+from .types import PlaywrightPage
class Dendrite(
@@ -80,44 +79,32 @@ class Dendrite(
def __init__(
self,
- auth: Optional[Union[str, List[str]]] = None,
- dendrite_api_key: Optional[str] = None,
- openai_api_key: Optional[str] = None,
- anthropic_api_key: Optional[str] = None,
playwright_options: Any = {"headless": False, "args": STEALTH_ARGS},
remote_config: Optional[Providers] = None,
+ config: Optional[Config] = None,
+ auth: Optional[Union[List[str], str]] = None,
):
"""
- Initializes Dendrite with API keys and Playwright options.
+ Initialize Dendrite with optional domain authentication.
Args:
- auth (Optional[Union[str, List[str]]]): The domains on which the browser should try and authenticate.
- dendrite_api_key (Optional[str]): The Dendrite API key. If not provided, it's fetched from the environment variables.
- openai_api_key (Optional[str]): Your own OpenAI API key, provide it, along with other custom API keys, if you wish to use Dendrite without paying for a license.
- anthropic_api_key (Optional[str]): The own Anthropic API key, provide it, along with other custom API keys, if you wish to use Dendrite without paying for a license.
- playwright_options (Any): Options for configuring Playwright. Defaults to running in non-headless mode with stealth arguments.
-
- Raises:
- MissingApiKeyError: If the Dendrite API key is not provided or found in the environment variables.
+ playwright_options: Options for configuring Playwright
+ remote_config: Remote browser provider configuration
+ config: Configuration object
+ auth: List of domains or single domain to load authentication state for
"""
- api_config = APIConfig(
- dendrite_api_key=dendrite_api_key or os.environ.get("DENDRITE_API_KEY"),
- openai_api_key=openai_api_key,
- anthropic_api_key=anthropic_api_key,
- )
self._impl = self._get_impl(remote_config)
- self.api_config = api_config
- self.playwright: Optional[Playwright] = None
- self.browser_context: Optional[BrowserContext] = None
- self._id = uuid4().hex
self._playwright_options = playwright_options
+ self._config = config or Config()
+ auth_url = [auth] if isinstance(auth, str) else auth or []
+ self._auth_domains = [get_domain_w_suffix(url) for url in auth_url]
+ self._id = uuid4().hex
self._active_page_manager: Optional[PageManager] = None
self._user_id: Optional[str] = None
self._upload_handler = EventSync(event_type=FileChooser)
self._download_handler = EventSync(event_type=Download)
self.closed = False
- self._auth = auth
- self._browser_api_client = BrowserAPIClient(api_config, self._id)
+ self._browser_api_client: LogicEngine = LogicEngine(self._config)
@property
def pages(self) -> List[Page]:
@@ -136,10 +123,12 @@ def _get_page(self) -> Page:
active_page = self.get_active_page()
return active_page
- def _get_browser_api_client(self) -> BrowserAPIClient:
+ @property
+ def logic_engine(self) -> LogicEngine:
return self._browser_api_client
- def _get_dendrite_browser(self) -> "Dendrite":
+ @property
+ def dendrite_browser(self) -> "Dendrite":
return self
def __enter__(self):
@@ -148,14 +137,9 @@ def __enter__(self):
def __exit__(self, exc_type, exc_val, exc_tb):
self.close()
- def _get_impl(self, remote_provider: Optional[Providers]) -> ImplBrowser:
+ def _get_impl(self, remote_provider: Optional[Providers]) -> BrowserProtocol:
return get_impl(remote_provider)
- def _get_auth_session(self, domains: Union[str, list[str]]):
- dto = AuthenticateDTO(domains=domains)
- auth_session: AuthSession = self._browser_api_client.authenticate(dto)
- return auth_session
-
def get_active_page(self) -> Page:
"""
Retrieves the currently active page managed by the PageManager.
@@ -268,13 +252,15 @@ def _launch(self):
"""
os.environ["PW_TEST_SCREENSHOT_NO_FONTS_READY"] = "1"
self._playwright = sync_playwright().start()
+ storage_states = []
+ for domain in self._auth_domains:
+ state = self._get_domain_storage_state(domain)
+ if state:
+ storage_states.append(state)
browser = self._impl.start_browser(self._playwright, self._playwright_options)
- if self._auth:
- auth_session = self._get_auth_session(self._auth)
- self.browser_context = browser.new_context(
- storage_state=auth_session.to_storage_state(),
- user_agent=auth_session.user_agent,
- )
+ if storage_states:
+ merged_state = self._merge_storage_states(storage_states)
+ self.browser_context = browser.new_context(storage_state=merged_state)
else:
self.browser_context = (
browser.contexts[0]
@@ -301,27 +287,22 @@ def add_cookies(self, cookies):
def close(self):
"""
- Closes the browser and uploads authentication session data if available.
+ Closes the browser and updates storage states for authenticated domains before cleanup.
- This method stops the Playwright instance, closes the browser context, and uploads any
- stored authentication session data if applicable.
+ This method updates the storage states for authenticated domains, stops the Playwright
+ instance, and closes the browser context.
Returns:
None
Raises:
- Exception: If there is an issue closing the browser or uploading session data.
+ Exception: If there is an issue closing the browser or updating session data.
"""
self.closed = True
try:
- if self.browser_context:
- if self._auth:
- auth_session = self._get_auth_session(self._auth)
- storage_state = self.browser_context.storage_state()
- dto = UploadAuthSessionDTO(
- auth_data=auth_session, storage_state=storage_state
- )
- self._browser_api_client.upload_auth_session(dto)
+ if self.browser_context and self._auth_domains:
+ for domain in self._auth_domains:
+ self.save_auth(domain)
self._impl.stop_session()
self.browser_context.close()
except Error:
@@ -329,9 +310,7 @@ def close(self):
try:
if self._playwright:
self._playwright.stop()
- except AttributeError:
- pass
- except Exception:
+        except Exception:
pass
def _is_launched(self):
@@ -426,3 +405,81 @@ def _get_filechooser(
Exception: If there is an issue uploading files.
"""
return self._upload_handler.get_data(pw_page, timeout=timeout)
+
+ def save_auth(self, url: str) -> None:
+ """
+ Save authentication state for a specific domain.
+
+        Args:
+            url (str): URL of the site to save authentication state for (e.g., "https://github.com"); only the domain is used.
+ """
+ if not self.browser_context:
+ raise DendriteException("Browser context not initialized")
+ domain = get_domain_w_suffix(url)
+ storage_state = self.browser_context.storage_state()
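+        # Keep only the localStorage origins and cookies that belong to the target domain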
+ filtered_state = {
+ "origins": [
+ origin
+ for origin in storage_state.get("origins", [])
+ if domain in origin.get("origin", "")
+ ],
+ "cookies": [
+ cookie
+ for cookie in storage_state.get("cookies", [])
+ if domain in cookie.get("domain", "")
+ ],
+ }
+ self._config.storage_cache.set(
+ {"domain": domain}, StorageState(**filtered_state)
+ )
+
+ def setup_auth(
+ self,
+ url: str,
+ message: str = "Please log in to the website. Once done, press Enter to continue...",
+ ) -> None:
+ """
+ Set up authentication for a specific URL.
+
+ Args:
+ url (str): URL to navigate to for login
+ message (str): Message to show while waiting for user input
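+
+        Example (illustrative):
+            >>> browser = Dendrite()
+            >>> browser.setup_auth("https://github.com")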
+ """
+ domain = get_domain_w_suffix(url)
+ try:
+ self._playwright = sync_playwright().start()
+ browser = self._impl.start_browser(
+ self._playwright, {**self._playwright_options, "headless": False}
+ )
+ self.browser_context = browser.new_context()
+ self._active_page_manager = PageManager(self, self.browser_context)
+ self.goto(url)
+ print(message)
+ input()
+ self.save_auth(domain)
+ finally:
+ self.close()
+
+ def _get_domain_storage_state(self, domain: str) -> Optional[StorageState]:
+ """Get storage state for a specific domain"""
+ return self._config.storage_cache.get({"domain": domain}, index=0)
+
+ def _merge_storage_states(self, states: List[StorageState]) -> StorageState:
+ """Merge multiple storage states into one"""
+ merged = {"origins": [], "cookies": []}
+ seen_origins = set()
+ seen_cookies = set()
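+        # Deduplicate origins by origin URL and cookies by (name, domain, path)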
+ for state in states:
+ for origin in state.get("origins", []):
+ origin_key = origin.get("origin", "")
+ if origin_key not in seen_origins:
+ merged["origins"].append(origin)
+ seen_origins.add(origin_key)
+ for cookie in state.get("cookies", []):
+ cookie_key = (
+ f"{cookie.get('name')}:{cookie.get('domain')}:{cookie.get('path')}"
+ )
+ if cookie_key not in seen_cookies:
+ merged["cookies"].append(cookie)
+ seen_cookies.add(cookie_key)
+ return StorageState(**merged)
diff --git a/dendrite/sync_api/_core/dendrite_element.py b/dendrite/browser/sync_api/dendrite_element.py
similarity index 85%
rename from dendrite/sync_api/_core/dendrite_element.py
rename to dendrite/browser/sync_api/dendrite_element.py
index d73e788..2ef67a6 100644
--- a/dendrite/sync_api/_core/dendrite_element.py
+++ b/dendrite/browser/sync_api/dendrite_element.py
@@ -6,16 +6,17 @@
from typing import TYPE_CHECKING, Optional
from loguru import logger
from playwright.sync_api import Locator
-from dendrite.sync_api._api.browser_api_client import BrowserAPIClient
-from dendrite._common._exceptions.dendrite_exception import IncorrectOutcomeError
+from dendrite.browser._common._exceptions.dendrite_exception import (
+ IncorrectOutcomeError,
+)
+from dendrite.logic import LogicEngine
if TYPE_CHECKING:
- from dendrite.sync_api._core.dendrite_browser import Dendrite
-from dendrite.sync_api._core._managers.navigation_tracker import NavigationTracker
-from dendrite.sync_api._core.models.page_diff_information import PageDiffInformation
-from dendrite.sync_api._core._type_spec import Interaction
-from dendrite.sync_api._api.response.interaction_response import InteractionResponse
-from dendrite.sync_api._api.dto.make_interaction_dto import MakeInteractionDTO
+ from .dendrite_browser import Dendrite
+from dendrite.models.dto.make_interaction_dto import VerifyActionDTO
+from dendrite.models.response.interaction_response import InteractionResponse
+from .manager.navigation_tracker import NavigationTracker
+from .types import Interaction
def perform_action(interaction_type: Interaction):
@@ -40,29 +41,28 @@ def wrapper(self: Element, *args, **kwargs) -> InteractionResponse:
if not expected_outcome:
func(self, *args, **kwargs)
return InteractionResponse(status="success", message="")
- api_config = self._dendrite_browser.api_config
page_before = self._dendrite_browser.get_active_page()
page_before_info = page_before.get_page_information()
+ soup = page_before._get_previous_soup()
+ screenshot_before = page_before_info.screenshot_base64
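+            # Capture the element's HTML from the pre-action soup so the verifier can inspect it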
+ tag_name = soup.find(attrs={"d-id": self.dendrite_id})
func(self, *args, expected_outcome=expected_outcome, **kwargs)
self._wait_for_page_changes(page_before.url)
page_after = self._dendrite_browser.get_active_page()
- page_after_info = page_after.get_page_information()
- page_delta_information = PageDiffInformation(
- page_before=page_before_info, page_after=page_after_info
- )
- dto = MakeInteractionDTO(
+ screenshot_after = page_after.screenshot_manager.take_full_page_screenshot()
+ dto = VerifyActionDTO(
url=page_before.url,
dendrite_id=self.dendrite_id,
interaction_type=interaction_type,
expected_outcome=expected_outcome,
- page_delta_information=page_delta_information,
- api_config=api_config,
+ screenshot_before=screenshot_before,
+ screenshot_after=screenshot_after,
+ tag_name=str(tag_name),
)
- res = self._browser_api_client.make_interaction(dto)
+ res = self._browser_api_client.verify_action(dto)
if res.status == "failed":
raise IncorrectOutcomeError(
- message=res.message,
- screenshot_base64=page_delta_information.page_after.screenshot_base64,
+ message=res.message, screenshot_base64=screenshot_after
)
return res
@@ -84,7 +84,7 @@ def __init__(
dendrite_id: str,
locator: Locator,
dendrite_browser: Dendrite,
- browser_api_client: BrowserAPIClient,
+ browser_api_client: LogicEngine,
):
"""
Initialize a Element.
diff --git a/dendrite/sync_api/_core/dendrite_page.py b/dendrite/browser/sync_api/dendrite_page.py
similarity index 88%
rename from dendrite/sync_api/_core/dendrite_page.py
rename to dendrite/browser/sync_api/dendrite_page.py
index b9cd048..d6f2d01 100644
--- a/dendrite/sync_api/_core/dendrite_page.py
+++ b/dendrite/browser/sync_api/dendrite_page.py
@@ -1,30 +1,30 @@
-import re
import time
import pathlib
+import re
import time
from typing import TYPE_CHECKING, Any, List, Literal, Optional, Sequence, Union
from bs4 import BeautifulSoup, Tag
from loguru import logger
-from playwright.sync_api import FrameLocator, Keyboard, Download, FilePayload
-from dendrite.sync_api._api.browser_api_client import BrowserAPIClient
-from dendrite.sync_api._core._js import GENERATE_DENDRITE_IDS_SCRIPT
-from dendrite.sync_api._core._type_spec import PlaywrightPage
-from dendrite.sync_api._core.dendrite_element import Element
-from dendrite.sync_api._core.mixin.ask import AskMixin
-from dendrite.sync_api._core.mixin.click import ClickMixin
-from dendrite.sync_api._core.mixin.extract import ExtractionMixin
-from dendrite.sync_api._core.mixin.fill_fields import FillFieldsMixin
-from dendrite.sync_api._core.mixin.get_element import GetElementMixin
-from dendrite.sync_api._core.mixin.keyboard import KeyboardMixin
-from dendrite.sync_api._core.mixin.markdown import MarkdownMixin
-from dendrite.sync_api._core.mixin.wait_for import WaitForMixin
-from dendrite.sync_api._core.models.page_information import PageInformation
+from playwright.sync_api import Download, FilePayload, FrameLocator, Keyboard
+from dendrite.logic import LogicEngine
+from dendrite.models.page_information import PageInformation
+from .dendrite_element import Element
+from .js import GENERATE_DENDRITE_IDS_SCRIPT
+from .mixin.ask import AskMixin
+from .mixin.click import ClickMixin
+from .mixin.extract import ExtractionMixin
+from .mixin.fill_fields import FillFieldsMixin
+from .mixin.get_element import GetElementMixin
+from .mixin.keyboard import KeyboardMixin
+from .mixin.markdown import MarkdownMixin
+from .mixin.wait_for import WaitForMixin
+from .types import PlaywrightPage
if TYPE_CHECKING:
- from dendrite.sync_api._core.dendrite_browser import Dendrite
-from dendrite.sync_api._core._managers.screenshot_manager import ScreenshotManager
-from dendrite._common._exceptions.dendrite_exception import DendriteException
-from dendrite.sync_api._core._utils import expand_iframes
+ from .dendrite_browser import Dendrite
+from dendrite.browser._common._exceptions.dendrite_exception import DendriteException
+from ._utils import expand_iframes
+from .manager.screenshot_manager import ScreenshotManager
class Page(
@@ -48,14 +48,14 @@ def __init__(
self,
page: PlaywrightPage,
dendrite_browser: "Dendrite",
- browser_api_client: "BrowserAPIClient",
+ browser_api_client: LogicEngine,
):
self.playwright_page = page
self.screenshot_manager = ScreenshotManager(page)
- self.dendrite_browser = dendrite_browser
self._browser_api_client = browser_api_client
self._last_main_frame_url = page.url
self._last_frame_navigated_timestamp = time.time()
+ self._dendrite_browser = dendrite_browser
self.playwright_page.on("framenavigated", self._on_frame_navigated)
def _on_frame_navigated(self, frame):
@@ -63,6 +63,10 @@ def _on_frame_navigated(self, frame):
self._last_main_frame_url = frame.url
self._last_frame_navigated_timestamp = time.time()
+ @property
+ def dendrite_browser(self) -> "Dendrite":
+ return self._dendrite_browser
+
@property
def url(self):
"""
@@ -86,10 +90,8 @@ def keyboard(self) -> Keyboard:
def _get_page(self) -> "Page":
return self
- def _get_dendrite_browser(self) -> "Dendrite":
- return self.dendrite_browser
-
- def _get_browser_api_client(self) -> BrowserAPIClient:
+ @property
+ def logic_engine(self) -> LogicEngine:
return self._browser_api_client
def goto(
@@ -236,7 +238,7 @@ def _generate_dendrite_ids(self):
return
except Exception as e:
self.playwright_page.wait_for_load_state(state="load", timeout=3000)
- logger.debug(
+ logger.exception(
f"Failed to generate dendrite IDs: {e}, attempt {tries + 1}/3"
)
tries += 1
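
The private getters above become read-only properties. A minimal call-site sketch of the change, assuming `dendrite_browser` is a started sync `Dendrite` instance:

```python
# Hypothetical call site after this rename.
page = dendrite_browser.get_active_page()

engine = page.logic_engine       # was: page._get_browser_api_client()
browser = page.dendrite_browser  # now a property backed by _dendrite_browser
```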
diff --git a/dendrite/sync_api/_core/_js/__init__.py b/dendrite/browser/sync_api/js/__init__.py
similarity index 100%
rename from dendrite/sync_api/_core/_js/__init__.py
rename to dendrite/browser/sync_api/js/__init__.py
diff --git a/dendrite/sync_api/_core/_js/eventListenerPatch.js b/dendrite/browser/sync_api/js/eventListenerPatch.js
similarity index 100%
rename from dendrite/sync_api/_core/_js/eventListenerPatch.js
rename to dendrite/browser/sync_api/js/eventListenerPatch.js
diff --git a/dendrite/async_api/_core/_js/generateDendriteIDs.js b/dendrite/browser/sync_api/js/generateDendriteIDs.js
similarity index 97%
rename from dendrite/async_api/_core/_js/generateDendriteIDs.js
rename to dendrite/browser/sync_api/js/generateDendriteIDs.js
index 1d4b348..d03b8cd 100644
--- a/dendrite/async_api/_core/_js/generateDendriteIDs.js
+++ b/dendrite/browser/sync_api/js/generateDendriteIDs.js
@@ -9,6 +9,7 @@ var hashCode = (str) => {
return hash;
}
+
const getElementIndex = (element) => {
let index = 1;
let sibling = element.previousElementSibling;
@@ -42,7 +43,8 @@ const usedHashes = new Map();
var markHidden = (hidden_element) => {
// Mark the hidden element itself
- hidden
+ hidden_element.setAttribute('data-hidden', 'true');
+
}
document.querySelectorAll('*').forEach((element, index) => {
diff --git a/dendrite/sync_api/_core/_js/generateDendriteIDsIframe.js b/dendrite/browser/sync_api/js/generateDendriteIDsIframe.js
similarity index 100%
rename from dendrite/sync_api/_core/_js/generateDendriteIDsIframe.js
rename to dendrite/browser/sync_api/js/generateDendriteIDsIframe.js
diff --git a/dendrite/async_api/_core/_managers/__init__.py b/dendrite/browser/sync_api/manager/__init__.py
similarity index 100%
rename from dendrite/async_api/_core/_managers/__init__.py
rename to dendrite/browser/sync_api/manager/__init__.py
diff --git a/dendrite/sync_api/_core/_managers/navigation_tracker.py b/dendrite/browser/sync_api/manager/navigation_tracker.py
similarity index 97%
rename from dendrite/sync_api/_core/_managers/navigation_tracker.py
rename to dendrite/browser/sync_api/manager/navigation_tracker.py
index 8735d05..d789796 100644
--- a/dendrite/sync_api/_core/_managers/navigation_tracker.py
+++ b/dendrite/browser/sync_api/manager/navigation_tracker.py
@@ -3,7 +3,7 @@
from typing import TYPE_CHECKING, Dict, Optional
if TYPE_CHECKING:
- from dendrite.sync_api._core.dendrite_page import Page
+ from ..dendrite_page import Page
class NavigationTracker:
diff --git a/dendrite/sync_api/_core/_managers/page_manager.py b/dendrite/browser/sync_api/manager/page_manager.py
similarity index 81%
rename from dendrite/sync_api/_core/_managers/page_manager.py
rename to dendrite/browser/sync_api/manager/page_manager.py
index b8e77d8..52b5782 100644
--- a/dendrite/sync_api/_core/_managers/page_manager.py
+++ b/dendrite/browser/sync_api/manager/page_manager.py
@@ -1,11 +1,11 @@
-from typing import Optional, TYPE_CHECKING
+from typing import TYPE_CHECKING, Optional
from loguru import logger
from playwright.sync_api import BrowserContext, Download, FileChooser
if TYPE_CHECKING:
- from dendrite.sync_api._core.dendrite_browser import Dendrite
-from dendrite.sync_api._core._type_spec import PlaywrightPage
-from dendrite.sync_api._core.dendrite_page import Page
+ from ..dendrite_browser import Dendrite
+from ..dendrite_page import Page
+from ..types import PlaywrightPage
class PageManager:
@@ -15,13 +15,21 @@ def __init__(self, dendrite_browser, browser_context: BrowserContext):
self.active_page: Optional[Page] = None
self.browser_context = browser_context
self.dendrite_browser: Dendrite = dendrite_browser
+ existing_pages = browser_context.pages
+ if existing_pages:
+ for page in existing_pages:
+ client = self.dendrite_browser.logic_engine
+ dendrite_page = Page(page, self.dendrite_browser, client)
+ self.pages.append(dendrite_page)
+ if self.active_page is None:
+ self.active_page = dendrite_page
browser_context.on("page", self._page_on_open_handler)
def new_page(self) -> Page:
new_page = self.browser_context.new_page()
if self.active_page and new_page == self.active_page.playwright_page:
return self.active_page
- client = self.dendrite_browser._get_browser_api_client()
+ client = self.dendrite_browser.logic_engine
dendrite_page = Page(new_page, self.dendrite_browser, client)
self.pages.append(dendrite_page)
self.active_page = dendrite_page
@@ -68,7 +76,7 @@ def _page_on_open_handler(self, page: PlaywrightPage):
page.on("crash", self._page_on_crash_handler)
page.on("download", self._page_on_download_handler)
page.on("filechooser", self._page_on_filechooser_handler)
- client = self.dendrite_browser._get_browser_api_client()
+ client = self.dendrite_browser.logic_engine
dendrite_page = Page(page, self.dendrite_browser, client)
self.pages.append(dendrite_page)
self.active_page = dendrite_page
diff --git a/dendrite/sync_api/_core/_managers/screenshot_manager.py b/dendrite/browser/sync_api/manager/screenshot_manager.py
similarity index 96%
rename from dendrite/sync_api/_core/_managers/screenshot_manager.py
rename to dendrite/browser/sync_api/manager/screenshot_manager.py
index a6f36b1..7f4fd33 100644
--- a/dendrite/sync_api/_core/_managers/screenshot_manager.py
+++ b/dendrite/browser/sync_api/manager/screenshot_manager.py
@@ -1,7 +1,7 @@
import base64
import os
from uuid import uuid4
-from dendrite.sync_api._core._type_spec import PlaywrightPage
+from ..types import PlaywrightPage
class ScreenshotManager:
diff --git a/dendrite/browser/sync_api/mixin/__init__.py b/dendrite/browser/sync_api/mixin/__init__.py
new file mode 100644
index 0000000..046a61c
--- /dev/null
+++ b/dendrite/browser/sync_api/mixin/__init__.py
@@ -0,0 +1,21 @@
+from .ask import AskMixin
+from .click import ClickMixin
+from .extract import ExtractionMixin
+from .fill_fields import FillFieldsMixin
+from .get_element import GetElementMixin
+from .keyboard import KeyboardMixin
+from .markdown import MarkdownMixin
+from .screenshot import ScreenshotMixin
+from .wait_for import WaitForMixin
+
+__all__ = [
+ "AskMixin",
+ "ClickMixin",
+ "ExtractionMixin",
+ "FillFieldsMixin",
+ "GetElementMixin",
+ "KeyboardMixin",
+ "MarkdownMixin",
+ "ScreenshotMixin",
+ "WaitForMixin",
+]
diff --git a/dendrite/sync_api/_core/mixin/ask.py b/dendrite/browser/sync_api/mixin/ask.py
similarity index 93%
rename from dendrite/sync_api/_core/mixin/ask.py
rename to dendrite/browser/sync_api/mixin/ask.py
index ca028f8..57f4a56 100644
--- a/dendrite/sync_api/_core/mixin/ask.py
+++ b/dendrite/browser/sync_api/mixin/ask.py
@@ -2,16 +2,11 @@
import time
from typing import Optional, Type, overload
from loguru import logger
-from dendrite.sync_api._api.dto.ask_page_dto import AskPageDTO
-from dendrite.sync_api._core._type_spec import (
- JsonSchema,
- PydanticModel,
- TypeSpec,
- convert_to_type_spec,
- to_json_schema,
-)
-from dendrite.sync_api._core.protocol.page_protocol import DendritePageProtocol
-from dendrite._common._exceptions.dendrite_exception import DendriteException
+from dendrite.browser._common._exceptions.dendrite_exception import DendriteException
+from dendrite.browser.sync_api._utils import convert_to_type_spec, to_json_schema
+from dendrite.models.dto.ask_page_dto import AskPageDTO
+from ..protocol.page_protocol import DendritePageProtocol
+from ..types import JsonSchema, PydanticModel, TypeSpec
TIMEOUT_INTERVAL = [150, 450, 1000]
@@ -129,7 +124,6 @@ def ask(
Raises:
DendriteException: If the request fails, the exception includes the failure message and a screenshot.
"""
- api_config = self._get_dendrite_browser().api_config
start_time = time.time()
attempt_start = start_time
attempt = -1
@@ -165,12 +159,11 @@ def ask(
entire_prompt = prompt + time_prompt
dto = AskPageDTO(
page_information=page_information,
- api_config=api_config,
prompt=entire_prompt,
return_schema=schema,
)
try:
- res = self._get_browser_api_client().ask_page(dto)
+ res = self.logic_engine.ask_page(dto)
logger.debug(f"Got response in {time.time() - attempt_start} seconds")
if res.status == "error":
logger.warning(
diff --git a/dendrite/sync_api/_core/mixin/click.py b/dendrite/browser/sync_api/mixin/click.py
similarity index 84%
rename from dendrite/sync_api/_core/mixin/click.py
rename to dendrite/browser/sync_api/mixin/click.py
index 097eccb..2f8461b 100644
--- a/dendrite/sync_api/_core/mixin/click.py
+++ b/dendrite/browser/sync_api/mixin/click.py
@@ -1,9 +1,8 @@
-import time
-from typing import Any, Optional
-from dendrite.sync_api._api.response.interaction_response import InteractionResponse
-from dendrite.sync_api._core.mixin.get_element import GetElementMixin
-from dendrite.sync_api._core.protocol.page_protocol import DendritePageProtocol
-from dendrite._common._exceptions.dendrite_exception import DendriteException
+from typing import Optional
+from dendrite.browser._common._exceptions.dendrite_exception import DendriteException
+from dendrite.models.response.interaction_response import InteractionResponse
+from ..mixin.get_element import GetElementMixin
+from ..protocol.page_protocol import DendritePageProtocol
class ClickMixin(GetElementMixin, DendritePageProtocol):
diff --git a/dendrite/browser/sync_api/mixin/extract.py b/dendrite/browser/sync_api/mixin/extract.py
new file mode 100644
index 0000000..e5bf411
--- /dev/null
+++ b/dendrite/browser/sync_api/mixin/extract.py
@@ -0,0 +1,279 @@
+import time
+from typing import Any, Callable, List, Optional, Type, overload
+from loguru import logger
+from dendrite.browser.sync_api._utils import convert_to_type_spec, to_json_schema
+from dendrite.logic.code.code_session import execute
+from dendrite.models.dto.cached_extract_dto import CachedExtractDTO
+from dendrite.models.dto.extract_dto import ExtractDTO
+from dendrite.models.response.extract_response import ExtractResponse
+from dendrite.models.scripts import Script
+from ..manager.navigation_tracker import NavigationTracker
+from ..protocol.page_protocol import DendritePageProtocol
+from ..types import JsonSchema, PydanticModel, TypeSpec
+
+CACHE_TIMEOUT = 5
+
+
+class ExtractionMixin(DendritePageProtocol):
+ """
+ Mixin that provides extraction functionality for web pages.
+
+ This mixin provides various `extract` methods that allow extracting
+ different types of data (e.g., bool, int, float, string, Pydantic models, etc.)
+ from a web page based on a given prompt.
+ """
+
+ @overload
+ def extract(
+ self,
+ prompt: str,
+ type_spec: Type[bool],
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> bool: ...
+
+ @overload
+ def extract(
+ self,
+ prompt: str,
+ type_spec: Type[int],
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> int: ...
+
+ @overload
+ def extract(
+ self,
+ prompt: str,
+ type_spec: Type[float],
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> float: ...
+
+ @overload
+ def extract(
+ self,
+ prompt: str,
+ type_spec: Type[str],
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> str: ...
+
+ @overload
+ def extract(
+ self,
+ prompt: Optional[str],
+ type_spec: Type[PydanticModel],
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> PydanticModel: ...
+
+ @overload
+ def extract(
+ self,
+ prompt: Optional[str],
+ type_spec: JsonSchema,
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> JsonSchema: ...
+
+ @overload
+ def extract(
+ self,
+ prompt: str,
+ type_spec: None = None,
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> Any: ...
+
+ def extract(
+ self,
+ prompt: Optional[str],
+ type_spec: Optional[TypeSpec] = None,
+ use_cache: bool = True,
+ timeout: int = 180,
+ ) -> TypeSpec:
+ """
+ Extract data from a web page based on a prompt and optional type specification.
+ Args:
+ prompt (Optional[str]): The prompt to describe the information to extract.
+ type_spec (Optional[TypeSpec], optional): The type specification for the extracted data.
+ use_cache (bool, optional): Whether to use cached results. Defaults to True.
+            timeout (int, optional): Maximum time in seconds for the entire operation. If use_cache=True,
+                up to 5 seconds are spent replaying cached scripts before falling back to the
+                extraction agent, which uses the remaining time to generate a new script. Defaults to 180.
+
+        Returns:
+            Optional[TypeSpec]: The extracted data converted to the requested type, or None if
+                extraction fails or the timeout is exhausted.
+ """
+ logger.info(f"Starting extraction with prompt: {prompt}")
+ json_schema = None
+ if type_spec:
+ json_schema = to_json_schema(type_spec)
+ logger.debug(f"Type specification converted to JSON schema: {json_schema}")
+ if prompt is None:
+ prompt = ""
+ start_time = time.time()
+ page = self._get_page()
+ navigation_tracker = NavigationTracker(page)
+ navigation_tracker.start_nav_tracking()
+ if use_cache:
+ logger.info("Testing cache")
+ cached_result = self._try_cached_extraction(prompt, json_schema)
+ if cached_result:
+ return convert_and_return_result(cached_result, type_spec)
+        logger.info(
+            "No usable cached script found; falling back to the extraction agent."
+        )
+ result = self._extract_with_agent(
+ prompt, json_schema, timeout - (time.time() - start_time)
+ )
+ if result:
+ return convert_and_return_result(result, type_spec)
+ logger.error(f"Extraction failed after {time.time() - start_time:.2f} seconds")
+ return None
+
+ def _try_cached_extraction(
+ self, prompt: str, json_schema: Optional[JsonSchema]
+ ) -> Optional[ExtractResponse]:
+ """
+ Attempts to extract data using cached scripts with exponential backoff.
+ Only tries up to 5 most recent scripts.
+
+ Args:
+ prompt: The prompt describing what to extract
+ json_schema: Optional JSON schema for type validation
+
+ Returns:
+ ExtractResponse if successful, None otherwise
+ """
+ page = self._get_page()
+ dto = CachedExtractDTO(url=page.url, prompt=prompt)
+ scripts = self.logic_engine.get_cached_scripts(dto)
+ logger.debug(f"Found {len(scripts)} scripts in cache, {scripts}")
+ if len(scripts) == 0:
+ logger.debug(
+ f"No scripts found in cache for prompt: {prompt} in domain: {page.url}"
+ )
+ return None
+
+ def try_cached_extract():
+ page = self._get_page()
+ soup = page._get_soup()
+ recent_scripts = scripts[-min(5, len(scripts)) :]
+ for script in recent_scripts:
+ res = test_script(script, str(soup), json_schema)
+ if res is not None:
+ return ExtractResponse(
+ status="success",
+ message="Re-used a preexisting script from cache with the same specifications.",
+ return_data=res,
+ created_script=script.script,
+ )
+ return None
+
+ return _attempt_with_backoff_helper(
+ "cached_extraction", try_cached_extract, CACHE_TIMEOUT
+ )
+
+ def _extract_with_agent(
+ self, prompt: str, json_schema: Optional[JsonSchema], remaining_timeout: float
+ ) -> Optional[ExtractResponse]:
+ """
+ Attempts to extract data using the extraction agent with exponential backoff.
+
+ Args:
+ prompt: The prompt describing what to extract
+ json_schema: Optional JSON schema for type validation
+ remaining_timeout: Maximum time to spend on extraction
+
+ Returns:
+ ExtractResponse if successful, None otherwise
+ """
+
+ def try_extract_with_agent():
+ page = self._get_page()
+ page_information = page.get_page_information(include_screenshot=True)
+ extract_dto = ExtractDTO(
+ page_information=page_information,
+ prompt=prompt,
+ return_data_json_schema=json_schema,
+ use_screenshot=True,
+ )
+ res: ExtractResponse = self.logic_engine.extract(extract_dto)
+ if res.status == "impossible":
+ logger.error(f"Impossible to extract data. Reason: {res.message}")
+ return None
+ if res.status == "success":
+ logger.success(f"Extraction successful: '{res.message}'")
+ return res
+ return None
+
+ return _attempt_with_backoff_helper(
+ "extraction_agent", try_extract_with_agent, remaining_timeout
+ )
+
+
+def _attempt_with_backoff_helper(
+ operation_name: str,
+ operation: Callable,
+ timeout: float,
+ backoff_intervals: List[float] = [0.15, 0.45, 1.0, 2.0, 4.0, 8.0],
+) -> Optional[Any]:
+ """
+ Generic helper function that implements exponential backoff for operations.
+
+ Args:
+ operation_name: Name of the operation for logging
+        operation: Function to execute
+ timeout: Maximum time to spend attempting the operation
+ backoff_intervals: List of timeouts between attempts
+
+ Returns:
+ The result of the operation if successful, None otherwise
+ """
+ total_elapsed_time = 0
+ start_time = time.time()
+ for i, current_timeout in enumerate(backoff_intervals):
+ if total_elapsed_time >= timeout:
+ logger.error(f"Timeout reached after {total_elapsed_time:.2f} seconds")
+ return None
+ request_start_time = time.time()
+ result = operation()
+ request_duration = time.time() - request_start_time
+ if result:
+ return result
+ sleep_duration = max(0, current_timeout - request_duration)
+ logger.info(
+ f"{operation_name} attempt {i + 1} failed. Sleeping for {sleep_duration:.2f} seconds"
+ )
+ time.sleep(sleep_duration)
+ total_elapsed_time = time.time() - start_time
+ logger.error(
+ f"All {operation_name} attempts failed after {total_elapsed_time:.2f} seconds"
+ )
+ return None
+
+
+def convert_and_return_result(
+ res: ExtractResponse, type_spec: Optional[TypeSpec]
+) -> TypeSpec:
+ converted_res = res.return_data
+ if type_spec is not None:
+ logger.debug("Converting extraction result to specified type")
+ converted_res = convert_to_type_spec(type_spec, res.return_data)
+ logger.info("Extraction process completed successfully")
+ return converted_res
+
+
+def test_script(
+ script: Script, raw_html: str, return_data_json_schema: Any
+) -> Optional[Any]:
+ try:
+ res = execute(script.script, raw_html, return_data_json_schema)
+ return res
+    except Exception as e:
+        logger.debug(f"Script failed with error: {e}")
+        return None
diff --git a/dendrite/sync_api/_core/mixin/fill_fields.py b/dendrite/browser/sync_api/mixin/fill_fields.py
similarity index 90%
rename from dendrite/sync_api/_core/mixin/fill_fields.py
rename to dendrite/browser/sync_api/mixin/fill_fields.py
index 792ab24..4a4880f 100644
--- a/dendrite/sync_api/_core/mixin/fill_fields.py
+++ b/dendrite/browser/sync_api/mixin/fill_fields.py
@@ -1,9 +1,9 @@
import time
from typing import Any, Dict, Optional
-from dendrite.sync_api._api.response.interaction_response import InteractionResponse
-from dendrite.sync_api._core.mixin.get_element import GetElementMixin
-from dendrite.sync_api._core.protocol.page_protocol import DendritePageProtocol
-from dendrite._common._exceptions.dendrite_exception import DendriteException
+from dendrite.browser._common._exceptions.dendrite_exception import DendriteException
+from dendrite.models.response.interaction_response import InteractionResponse
+from ..mixin.get_element import GetElementMixin
+from ..protocol.page_protocol import DendritePageProtocol
class FillFieldsMixin(GetElementMixin, DendritePageProtocol):
diff --git a/dendrite/browser/sync_api/mixin/get_element.py b/dendrite/browser/sync_api/mixin/get_element.py
new file mode 100644
index 0000000..84e2b37
--- /dev/null
+++ b/dendrite/browser/sync_api/mixin/get_element.py
@@ -0,0 +1,251 @@
+import time
+from typing import (
+ TYPE_CHECKING,
+ Any,
+ Callable,
+ Dict,
+ List,
+ Literal,
+ Optional,
+ Union,
+ overload,
+)
+from bs4 import BeautifulSoup
+from loguru import logger
+from .._utils import _get_all_elements_from_selector_soup
+from ..dendrite_element import Element
+
+if TYPE_CHECKING:
+ from ..dendrite_page import Page
+from dendrite.models.dto.cached_selector_dto import CachedSelectorDTO
+from dendrite.models.dto.get_elements_dto import GetElementsDTO
+from ..protocol.page_protocol import DendritePageProtocol
+
+CACHE_TIMEOUT = 5
+
+
+class GetElementMixin(DendritePageProtocol):
+
+ def get_element(
+ self, prompt: str, use_cache=True, timeout=15000
+ ) -> Optional[Element]:
+ """
+ Retrieves a single Dendrite element based on the provided prompt.
+
+ Args:
+ prompt (str): The prompt describing the element to be retrieved.
+ use_cache (bool, optional): Whether to use cached results. Defaults to True.
+ timeout (int, optional): Maximum time in milliseconds for the entire operation. If use_cache=True,
+ up to 5000ms will be spent attempting to use cached selectors before falling back to the
+ find element agent for the remaining time. Defaults to 15000 (15 seconds).
+
+ Returns:
+ Element: The retrieved element.
+ """
+ return self._get_element(
+ prompt, only_one=True, use_cache=use_cache, timeout=timeout / 1000
+ )
+
+ @overload
+ def _get_element(
+ self, prompt_or_elements: str, only_one: Literal[True], use_cache: bool, timeout
+ ) -> Optional[Element]:
+ """
+ Retrieves a single Dendrite element based on the provided prompt.
+
+ Args:
+ prompt (Union[str, Dict[str, str]]): The prompt describing the element to be retrieved.
+ only_one (Literal[True]): Indicates that only one element should be retrieved.
+ use_cache (bool): Whether to use cached results.
+            timeout (float): Maximum time in seconds for the entire operation. If use_cache=True,
+                up to 5 seconds are spent attempting cached selectors before falling back to the
+                find element agent for the remaining time.
+
+ Returns:
+ Element: The retrieved element.
+ """
+
+ @overload
+ def _get_element(
+ self,
+ prompt_or_elements: str,
+ only_one: Literal[False],
+ use_cache: bool,
+ timeout,
+ ) -> List[Element]:
+ """
+ Retrieves a list of Dendrite elements based on the provided prompt.
+
+ Args:
+ prompt (str): The prompt describing the elements to be retrieved.
+ only_one (Literal[False]): Indicates that multiple elements should be retrieved.
+ use_cache (bool): Whether to use cached results.
+            timeout (float): Maximum time in seconds for the entire operation. If use_cache=True,
+                up to 5 seconds are spent attempting cached selectors before falling back to the
+                find element agent for the remaining time.
+
+ Returns:
+ List[Element]: A list of retrieved elements.
+ """
+
+ def _get_element(
+ self, prompt_or_elements: str, only_one: bool, use_cache: bool, timeout: float
+ ) -> Union[Optional[Element], List[Element]]:
+ """
+ Retrieves Dendrite elements based on the provided prompt, either a single element or a list of elements.
+
+ This method sends a request with the prompt and retrieves the elements based on the `only_one` flag.
+
+ Args:
+ prompt_or_elements (Union[str, Dict[str, str]]): The prompt or dictionary of prompts for element retrieval.
+ only_one (bool): Whether to retrieve only one element or a list of elements.
+ use_cache (bool): Whether to use cached results.
+            timeout (float): Maximum time in seconds for the entire operation. If use_cache=True,
+                up to 5 seconds are spent attempting cached selectors before falling back to the
+                find element agent for the remaining time.
+
+ Returns:
+ Union[Element, List[Element], ElementsResponse]: The retrieved element, list of elements, or response object.
+ """
+ logger.info(f"Getting element for prompt: '{prompt_or_elements}'")
+ start_time = time.time()
+ page = self._get_page()
+ soup = page._get_soup()
+ if use_cache:
+ cached_elements = self._try_cached_selectors(
+ page, soup, prompt_or_elements, only_one
+ )
+ if cached_elements:
+ return cached_elements
+ logger.info(
+ "Proceeding to use the find element agent to find the requested elements."
+ )
+ res = try_get_element(
+ self,
+ prompt_or_elements,
+ only_one,
+ remaining_timeout=timeout - (time.time() - start_time),
+ )
+ if res:
+ return res
+ logger.error(
+ f"Failed to retrieve elements within the specified timeout of {timeout} seconds"
+ )
+ return None
+
+ def _try_cached_selectors(
+ self, page: "Page", soup: BeautifulSoup, prompt: str, only_one: bool
+ ) -> Union[Optional[Element], List[Element]]:
+ """
+ Attempts to retrieve elements using cached selectors with exponential backoff.
+
+ Args:
+ page: The current page object
+ soup: The BeautifulSoup object of the current page
+ prompt: The prompt to search for
+ only_one: Whether to return only one element
+
+ Returns:
+ The found elements if successful, None otherwise
+ """
+ dto = CachedSelectorDTO(url=page.url, prompt=prompt)
+ selectors = self.logic_engine.get_cached_selectors(dto)
+ if len(selectors) == 0:
+ logger.debug("No cached selectors found")
+ return None
+ logger.debug("Attempting to use cached selectors with backoff")
+ recent_selectors = selectors[-min(5, len(selectors)) :]
+ str_selectors = list(map(lambda x: x.selector, recent_selectors))
+
+ def try_cached_selectors():
+ return get_elements_from_selectors_soup(page, soup, str_selectors, only_one)
+
+ return _attempt_with_backoff_helper(
+ "cached_selectors", try_cached_selectors, timeout=CACHE_TIMEOUT
+ )
+
+
+def _attempt_with_backoff_helper(
+ operation_name: str,
+ operation: Callable,
+ timeout: float,
+ backoff_intervals: List[float] = [0.15, 0.45, 1.0, 2.0, 4.0, 8.0],
+) -> Optional[Any]:
+ """
+ Generic helper function that implements exponential backoff for operations.
+
+ Args:
+ operation_name: Name of the operation for logging
+        operation: Function to execute
+ timeout: Maximum time to spend attempting the operation
+ backoff_intervals: List of timeouts between attempts
+
+ Returns:
+ The result of the operation if successful, None otherwise
+ """
+ total_elapsed_time = 0
+ start_time = time.time()
+ for i, current_timeout in enumerate(backoff_intervals):
+ if total_elapsed_time >= timeout:
+ logger.error(f"Timeout reached after {total_elapsed_time:.2f} seconds")
+ return None
+ request_start_time = time.time()
+ result = operation()
+ request_duration = time.time() - request_start_time
+ if result:
+ return result
+ sleep_duration = max(0, current_timeout - request_duration)
+ logger.info(
+ f"{operation_name} attempt {i + 1} failed. Sleeping for {sleep_duration:.2f} seconds"
+ )
+ time.sleep(sleep_duration)
+ total_elapsed_time = time.time() - start_time
+ logger.error(
+ f"All {operation_name} attempts failed after {total_elapsed_time:.2f} seconds"
+ )
+ return None
+
+
+def try_get_element(
+ obj: DendritePageProtocol,
+ prompt_or_elements: Union[str, Dict[str, str]],
+ only_one: bool,
+ remaining_timeout: float,
+) -> Union[Optional[Element], List[Element]]:
+
+ def _try_get_element():
+ page = obj._get_page()
+ page_information = page.get_page_information()
+ dto = GetElementsDTO(
+ page_information=page_information,
+ prompt=prompt_or_elements,
+ only_one=only_one,
+ )
+ res = obj.logic_engine.get_element(dto)
+ if res.status == "impossible":
+ logger.error(
+ f"Impossible to get elements for '{prompt_or_elements}'. Reason: {res.message}"
+ )
+ return None
+ if res.status == "success":
+ logger.success(f"d[id]: {res.d_id} Selectors:{res.selectors}")
+ if res.selectors is not None:
+ return get_elements_from_selectors_soup(
+ page, page._get_previous_soup(), res.selectors, only_one
+ )
+ return None
+
+ return _attempt_with_backoff_helper(
+ "find_element_agent", _try_get_element, remaining_timeout
+ )
+
+
+def get_elements_from_selectors_soup(
+ page: "Page", soup: BeautifulSoup, selectors: List[str], only_one: bool
+) -> Union[Optional[Element], List[Element]]:
+ for selector in reversed(selectors):
+ dendrite_elements = _get_all_elements_from_selector_soup(selector, soup, page)
+ if len(dendrite_elements) > 0:
+ return dendrite_elements[0] if only_one else dendrite_elements
+ return None
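
The element side follows the same backoff pattern; a minimal usage sketch, assuming a sync `Page` and an illustrative prompt:

```python
# Cached selectors are tried for up to CACHE_TIMEOUT (5) seconds; the
# find-element agent gets the remainder of the 15-second budget.
button = page.get_element("The blue 'Submit' button", use_cache=True, timeout=15000)
if button is not None:
    button.click()
```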
diff --git a/dendrite/sync_api/_core/mixin/keyboard.py b/dendrite/browser/sync_api/mixin/keyboard.py
similarity index 90%
rename from dendrite/sync_api/_core/mixin/keyboard.py
rename to dendrite/browser/sync_api/mixin/keyboard.py
index e3ed73a..2f1c882 100644
--- a/dendrite/sync_api/_core/mixin/keyboard.py
+++ b/dendrite/browser/sync_api/mixin/keyboard.py
@@ -1,6 +1,6 @@
-from typing import Any, Union, Literal
-from dendrite.sync_api._core.protocol.page_protocol import DendritePageProtocol
-from dendrite._common._exceptions.dendrite_exception import DendriteException
+from typing import Literal, Union
+from dendrite.browser._common._exceptions.dendrite_exception import DendriteException
+from ..protocol.page_protocol import DendritePageProtocol
class KeyboardMixin(DendritePageProtocol):
diff --git a/dendrite/sync_api/_core/mixin/markdown.py b/dendrite/browser/sync_api/mixin/markdown.py
similarity index 88%
rename from dendrite/sync_api/_core/mixin/markdown.py
rename to dendrite/browser/sync_api/mixin/markdown.py
index f094330..193a1bb 100644
--- a/dendrite/sync_api/_core/mixin/markdown.py
+++ b/dendrite/browser/sync_api/mixin/markdown.py
@@ -1,9 +1,9 @@
+import re
from typing import Optional
from bs4 import BeautifulSoup
-import re
-from dendrite.sync_api._core.mixin.extract import ExtractionMixin
-from dendrite.sync_api._core.protocol.page_protocol import DendritePageProtocol
from markdownify import markdownify as md
+from ..mixin.extract import ExtractionMixin
+from ..protocol.page_protocol import DendritePageProtocol
class MarkdownMixin(ExtractionMixin, DendritePageProtocol):
diff --git a/dendrite/sync_api/_core/mixin/screenshot.py b/dendrite/browser/sync_api/mixin/screenshot.py
similarity index 88%
rename from dendrite/sync_api/_core/mixin/screenshot.py
rename to dendrite/browser/sync_api/mixin/screenshot.py
index 3495b4c..5cc621e 100644
--- a/dendrite/sync_api/_core/mixin/screenshot.py
+++ b/dendrite/browser/sync_api/mixin/screenshot.py
@@ -1,4 +1,4 @@
-from dendrite.sync_api._core.protocol.page_protocol import DendritePageProtocol
+from ..protocol.page_protocol import DendritePageProtocol
class ScreenshotMixin(DendritePageProtocol):
diff --git a/dendrite/sync_api/_core/mixin/wait_for.py b/dendrite/browser/sync_api/mixin/wait_for.py
similarity index 87%
rename from dendrite/sync_api/_core/mixin/wait_for.py
rename to dendrite/browser/sync_api/mixin/wait_for.py
index 76cac15..ccc5dfd 100644
--- a/dendrite/sync_api/_core/mixin/wait_for.py
+++ b/dendrite/browser/sync_api/mixin/wait_for.py
@@ -1,10 +1,12 @@
import time
import time
-from dendrite.sync_api._core.mixin.ask import AskMixin
-from dendrite.sync_api._core.protocol.page_protocol import DendritePageProtocol
-from dendrite._common._exceptions.dendrite_exception import PageConditionNotMet
-from dendrite._common._exceptions.dendrite_exception import DendriteException
from loguru import logger
+from dendrite.browser._common._exceptions.dendrite_exception import (
+ DendriteException,
+ PageConditionNotMet,
+)
+from ..mixin.ask import AskMixin
+from ..protocol.page_protocol import DendritePageProtocol
class WaitForMixin(AskMixin, DendritePageProtocol):
diff --git a/dendrite/async_api/_core/models/__init__.py b/dendrite/browser/sync_api/protocol/__init__.py
similarity index 100%
rename from dendrite/async_api/_core/models/__init__.py
rename to dendrite/browser/sync_api/protocol/__init__.py
diff --git a/dendrite/browser/sync_api/protocol/browser_protocol.py b/dendrite/browser/sync_api/protocol/browser_protocol.py
new file mode 100644
index 0000000..f708e61
--- /dev/null
+++ b/dendrite/browser/sync_api/protocol/browser_protocol.py
@@ -0,0 +1,61 @@
+from typing import TYPE_CHECKING, Optional, Protocol, Union
+from typing_extensions import Literal
+from dendrite.browser.remote import Providers
+
+if TYPE_CHECKING:
+ from ..dendrite_browser import Dendrite
+from playwright.sync_api import Browser, Download, Playwright
+from ..types import PlaywrightPage
+
+
+class BrowserProtocol(Protocol):
+
+ def __init__(self, settings: Providers) -> None: ...
+
+ def get_download(
+ self, dendrite_browser: "Dendrite", pw_page: PlaywrightPage, timeout: float
+ ) -> Download:
+ """
+ Retrieves the download event from the browser.
+
+ Returns:
+ Download: The download event.
+
+ Raises:
+ Exception: If there is an issue retrieving the download event.
+ """
+ ...
+
+ def start_browser(self, playwright: Playwright, pw_options: dict) -> Browser:
+ """
+ Starts the browser session.
+
+ Args:
+ playwright: The playwright instance
+ pw_options: Playwright launch options
+
+ Returns:
+ Browser: A Browser instance
+ """
+ ...
+
+ def configure_context(self, browser: "Dendrite") -> None:
+ """
+ Configures the browser context.
+
+ Args:
+ browser (Dendrite): The browser to configure.
+
+ Raises:
+ Exception: If there is an issue configuring the browser context.
+ """
+ ...
+
+ def stop_session(self) -> None:
+ """
+ Stops the browser session.
+
+ Raises:
+ Exception: If there is an issue stopping the browser session.
+ """
+ ...
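
For reference, a hypothetical provider that would satisfy `BrowserProtocol` with a plain local Chromium launch; every name below is illustrative, not part of this patch:

```python
from playwright.sync_api import Browser, Download, Playwright

class LocalChromiumProvider:
    """Illustrative provider satisfying BrowserProtocol."""

    def __init__(self, settings) -> None:
        self._settings = settings

    def get_download(self, dendrite_browser, pw_page, timeout: float) -> Download:
        # Wait for the page's next "download" event (timeout assumed to be ms).
        return pw_page.wait_for_event("download", timeout=timeout)

    def start_browser(self, playwright: Playwright, pw_options: dict) -> Browser:
        return playwright.chromium.launch(**pw_options)

    def configure_context(self, browser) -> None:
        pass  # nothing extra to configure for a local browser

    def stop_session(self) -> None:
        pass  # the browser is torn down with the Playwright context
```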
diff --git a/dendrite/sync_api/_core/models/download_interface.py b/dendrite/browser/sync_api/protocol/download_protocol.py
similarity index 100%
rename from dendrite/sync_api/_core/models/download_interface.py
rename to dendrite/browser/sync_api/protocol/download_protocol.py
diff --git a/dendrite/browser/sync_api/protocol/page_protocol.py b/dendrite/browser/sync_api/protocol/page_protocol.py
new file mode 100644
index 0000000..d12b839
--- /dev/null
+++ b/dendrite/browser/sync_api/protocol/page_protocol.py
@@ -0,0 +1,21 @@
+from typing import TYPE_CHECKING, Protocol
+from dendrite.logic import LogicEngine
+
+if TYPE_CHECKING:
+ from ..dendrite_browser import Dendrite
+ from ..dendrite_page import Page
+
+
+class DendritePageProtocol(Protocol):
+ """
+ Protocol that specifies the required methods and attributes
+ for the `ExtractionMixin` to work.
+ """
+
+ @property
+ def logic_engine(self) -> LogicEngine: ...
+
+ @property
+ def dendrite_browser(self) -> "Dendrite": ...
+
+ def _get_page(self) -> "Page": ...
diff --git a/dendrite/browser/sync_api/types.py b/dendrite/browser/sync_api/types.py
new file mode 100644
index 0000000..de26bef
--- /dev/null
+++ b/dendrite/browser/sync_api/types.py
@@ -0,0 +1,12 @@
+from typing import Any, Dict, Literal, Type, TypeVar, Union
+from playwright.sync_api import Page
+from pydantic import BaseModel
+
+Interaction = Literal["click", "fill", "hover"]
+T = TypeVar("T")
+PydanticModel = TypeVar("PydanticModel", bound=BaseModel)
+PrimitiveTypes = Union[Type[bool], Type[int], Type[float], Type[str]]
+JsonSchema = Dict[str, Any]
+TypeSpec = Union[PrimitiveTypes, PydanticModel, JsonSchema]
+PlaywrightPage = Page
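
`TypeSpec` accepts a primitive type, a Pydantic model class, or a raw JSON schema; a small illustration with made-up values:

```python
from pydantic import BaseModel

class Price(BaseModel):
    amount: float

spec_primitive = bool      # Type[bool]
spec_model = Price         # a PydanticModel subclass
spec_schema = {            # a JsonSchema dict
    "type": "object",
    "properties": {"amount": {"type": "number"}},
}
```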
diff --git a/dendrite/exceptions/__init__.py b/dendrite/exceptions/__init__.py
index fa5ff25..ad0fbf7 100644
--- a/dendrite/exceptions/__init__.py
+++ b/dendrite/exceptions/__init__.py
@@ -1,11 +1,11 @@
-from .._common._exceptions.dendrite_exception import (
+from ..browser._common._exceptions.dendrite_exception import (
BaseDendriteException,
+ BrowserNotLaunchedError,
DendriteException,
IncorrectOutcomeError,
InvalidAuthSessionError,
MissingApiKeyError,
PageConditionNotMet,
- BrowserNotLaunchedError,
)
__all__ = [
diff --git a/dendrite/logic/__init__.py b/dendrite/logic/__init__.py
new file mode 100644
index 0000000..4c2737c
--- /dev/null
+++ b/dendrite/logic/__init__.py
@@ -0,0 +1,4 @@
+from .async_logic_engine import AsyncLogicEngine
+from .sync_logic_engine import LogicEngine
+
+__all__ = ["LogicEngine", "AsyncLogicEngine"]
diff --git a/dendrite/async_api/_dom/__init__.py b/dendrite/logic/ask/__init__.py
similarity index 100%
rename from dendrite/async_api/_dom/__init__.py
rename to dendrite/logic/ask/__init__.py
diff --git a/dendrite/logic/ask/ask.py b/dendrite/logic/ask/ask.py
new file mode 100644
index 0000000..af7e71e
--- /dev/null
+++ b/dendrite/logic/ask/ask.py
@@ -0,0 +1,231 @@
+import re
+from typing import List
+
+import json_repair
+from jsonschema import validate
+from openai.types.chat.chat_completion_content_part_param import (
+ ChatCompletionContentPartParam,
+)
+
+from dendrite.logic.config import Config
+from dendrite.logic.llm.agent import Agent, Message
+from dendrite.models.dto.ask_page_dto import AskPageDTO
+from dendrite.models.response.ask_page_response import AskPageResponse
+
+from .image import segment_image
+
+
+async def ask_page_action(ask_page_dto: AskPageDTO, config: Config) -> AskPageResponse:
+ image_segments = segment_image(
+ ask_page_dto.page_information.screenshot_base64, segment_height=2000
+ )
+
+ agent = Agent(config.llm_config.get("ask_page_agent"))
+ scrolled_to_segment_i = 0
+ content = generate_ask_page_prompt(ask_page_dto, image_segments)
+ messages: List[Message] = [
+ {"role": "user", "content": content},
+ ]
+
+ max_iterations = len(image_segments) + 5
+ iteration = 0
+ while iteration < max_iterations:
+ iteration += 1
+
+ text = await agent.call_llm(messages)
+ messages.append(
+ {
+ "role": "assistant",
+ "content": text,
+ }
+ )
+
+ json_pattern = r"```json(.*?)```"
+
+ if not text:
+ continue
+
+ json_matches = re.findall(json_pattern, text, re.DOTALL)
+
+ if len(json_matches) == 0:
+ continue
+
+ extracted_json = json_matches[0].strip()
+ data_dict = json_repair.loads(extracted_json)
+
+ if not isinstance(data_dict, dict):
+ content = "Your message doesn't contain a correctly formatted json object, try again."
+ messages.append({"role": "user", "content": content})
+ continue
+
+ if "scroll_down" in data_dict:
+ next = scrolled_to_segment_i + 1
+ if next < len(image_segments):
+ content = generate_scroll_prompt(image_segments, next)
+ else:
+ content = "You cannot scroll any further."
+ messages.append({"role": "user", "content": content})
+ continue
+
+ elif "return_data" in data_dict and "description" in data_dict:
+ return_data = data_dict["return_data"]
+ try:
+ if ask_page_dto.return_schema:
+ validate(instance=return_data, schema=ask_page_dto.return_schema)
+ except Exception as e:
+ err_message = "Your return data doesn't match the requested return json schema, try again. Exception: {e}"
+ messages.append(
+ {
+ "role": "user",
+ "content": err_message,
+ }
+ )
+ continue
+
+ return AskPageResponse(
+ status="success",
+ return_data=data_dict["return_data"],
+ description=data_dict["description"],
+ )
+
+ elif "error" in data_dict:
+ was_blocked = data_dict.get("was_blocked_by_recaptcha", False)
+ return AskPageResponse(
+ status="error",
+ return_data=data_dict["error"],
+ description=f'{data_dict["error"]}, was_blocked_by_recaptcha: {was_blocked}',
+ )
+
+ else:
+ err_message = (
+ "Your message doesn't contain a correctly formatted action, try again."
+ )
+ messages.append(
+ {
+ "role": "user",
+ "content": err_message,
+ }
+ )
+
+ return AskPageResponse(
+ status="error",
+ return_data="Scrolled through the entire page without finding the requested data.",
+ description="",
+ )
+
+
+def generate_ask_page_prompt(
+ ask_page_dto: AskPageDTO, image_segments: list, scrolled_to_segment_i: int = 0
+) -> List[ChatCompletionContentPartParam]:
+ # Generate scroll down hint based on number of segments
+ scroll_down_hint = (
+ ""
+ if len(image_segments) == 1
+ else """
+
+If you think you need to scroll further down, output an object with the key "scroll_down" and nothing else:
+
+Action Message:
+[Short reasoning first]
+```json
+{
+ "scroll_down": true
+}
+```
+
+You can keep scrolling down, noting important details, until you are ready to return the requested data, which you would do in a separate message."""
+ )
+
+ # Get return schema prompt
+ return_schema_prompt = (
+ str(ask_page_dto.return_schema)
+ if ask_page_dto.return_schema
+ else "No schema specified by the user"
+ )
+
+ # Construct the main prompt content
+ content: List[ChatCompletionContentPartParam] = [
+ {
+ "type": "text",
+ "text": f"""Please look at the page and return data that matches the requested schema and prompt.
+
+
+{ask_page_dto.prompt}
+
+
+
+{return_schema_prompt}
+
+
+Look at the viewport and decide on the next action:
+
+If you can solve the prompt and return the requested data from the viewport, output a message with triple backticks and 'json' like in the example below. Make sure `return_data` matches the requested return schema:
+
+Action Message:
+[Short reasoning first]
+```json
+{{
+ "description": "E.g There is a red button with the text 'get started' positoned underneath the title 'welcome!'",
+ "return_data": {{"element_exists": true, "foo": "bar"}},
+}}
+```
+
+Remember, `return_data` should be json that matches the structure of the requested json schema if available. Don't forget to include a description.{scroll_down_hint}
+
+In case you think the data is not available on the current page and the task does not describe how to handle the non-available data, or the page is blocked by a captcha puzzle or similar, output a json with a short error message, like this:
+
+Action Message:
+[Short reasoning first.]
+```json
+{{
+ "error": "reason why the task cannot be completed here",
+ "was_blocked_by_recaptcha": true/false
+}}
+```
+
+Here is a screenshot of the viewport:""",
+ },
+ {
+ "type": "image_url",
+ "image_url": {
+ "url": f"data:image/jpeg;base64,{image_segments[scrolled_to_segment_i]}"
+ },
+ },
+ ]
+
+ return content
+
+
+def generate_scroll_prompt(
+ image_segments: list, next_segment: int
+) -> List[ChatCompletionContentPartParam]:
+ """
+ Generates the prompt for scrolling to next segment.
+
+ Args:
+ image_segments: List of image segments
+ next_segment: Index of next segment
+
+ Returns:
+ List of message content blocks
+ """
+ last_segment_reminder = (
+ " You won't be able to scroll further now."
+ if next_segment == len(image_segments) - 1
+ else ""
+ )
+
+ content = [
+ {
+ "type": "text",
+ "text": f"""You have scrolled down. You are viewing segment {next_segment+1}/{len(image_segments)}.{last_segment_reminder} Here is the new viewport:""",
+ },
+ {
+ "type": "image_url",
+ "image_url": {
+ "url": f"data:image/jpeg;base64,{image_segments[next_segment]}"
+ },
+ },
+ ]
+
+ return content
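
A hedged sketch of driving `ask_page_action` directly; `page_information` is assumed to come from `Page.get_page_information()`, and the prompt and schema are illustrative:

```python
import asyncio

from dendrite.logic.ask.ask import ask_page_action
from dendrite.logic.config import Config
from dendrite.models.dto.ask_page_dto import AskPageDTO

async def main(page_information):
    dto = AskPageDTO(
        page_information=page_information,
        prompt="Is a cookie banner visible?",
        return_schema={"type": "object",
                       "properties": {"visible": {"type": "boolean"}}},
    )
    res = await ask_page_action(dto, Config())
    print(res.status, res.return_data)
```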
diff --git a/dendrite/logic/ask/image.py b/dendrite/logic/ask/image.py
new file mode 100644
index 0000000..6a61566
--- /dev/null
+++ b/dendrite/logic/ask/image.py
@@ -0,0 +1,35 @@
+import base64
+import io
+from typing import List
+
+from loguru import logger
+from PIL import Image
+
+
+def segment_image(
+ base64_image: str,
+ segment_height: int = 7900,
+) -> List[str]:
+ if len(base64_image) < 100:
+ raise Exception("Failed to segment image since it is too small / glitched.")
+
+ image_data = base64.b64decode(base64_image)
+ image = Image.open(io.BytesIO(image_data))
+ width, height = image.size
+ segments = []
+
+ for i in range(0, height, segment_height):
+ # Define the box for cropping (left, upper, right, lower)
+ box = (0, i, width, min(i + segment_height, height))
+ segment = image.crop(box)
+
+ # Convert RGBA to RGB if necessary
+ if segment.mode == "RGBA":
+ segment = segment.convert("RGB")
+
+ buffer = io.BytesIO()
+ segment.save(buffer, format="JPEG")
+ segment_data = buffer.getvalue()
+ segments.append(base64.b64encode(segment_data).decode())
+
+ return segments
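
Usage is straightforward; a sketch assuming a full-page screenshot saved to disk (filename illustrative):

```python
import base64

from dendrite.logic.ask.image import segment_image

with open("full_page.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

# ask_page_action() splits into 2000px strips; the default here is 7900px.
strips = segment_image(b64, segment_height=2000)
print(f"split into {len(strips)} JPEG segments")
```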
diff --git a/dendrite/logic/async_logic_engine.py b/dendrite/logic/async_logic_engine.py
new file mode 100644
index 0000000..38915bd
--- /dev/null
+++ b/dendrite/logic/async_logic_engine.py
@@ -0,0 +1,42 @@
+from typing import List, Optional, Protocol
+
+from dendrite.logic.ask import ask
+from dendrite.logic.config import Config
+from dendrite.logic.extract import extract
+from dendrite.logic.get_element import get_element
+from dendrite.logic.verify_interaction import verify_interaction
+from dendrite.models.dto.ask_page_dto import AskPageDTO
+from dendrite.models.dto.cached_extract_dto import CachedExtractDTO
+from dendrite.models.dto.cached_selector_dto import CachedSelectorDTO
+from dendrite.models.dto.extract_dto import ExtractDTO
+from dendrite.models.dto.get_elements_dto import GetElementsDTO
+from dendrite.models.dto.make_interaction_dto import VerifyActionDTO
+from dendrite.models.response.ask_page_response import AskPageResponse
+from dendrite.models.response.extract_response import ExtractResponse
+from dendrite.models.response.get_element_response import GetElementResponse
+from dendrite.models.response.interaction_response import InteractionResponse
+from dendrite.models.scripts import Script
+from dendrite.models.selector import Selector
+
+
+class AsyncLogicEngine:
+ def __init__(self, config: Config):
+ self._config = config
+
+ async def get_element(self, dto: GetElementsDTO) -> GetElementResponse:
+ return await get_element.get_element(dto, self._config)
+
+ async def get_cached_selectors(self, dto: CachedSelectorDTO) -> List[Selector]:
+ return await get_element.get_cached_selector(dto, self._config)
+
+ async def get_cached_scripts(self, dto: CachedExtractDTO) -> List[Script]:
+ return await extract.get_cached_scripts(dto, self._config)
+
+ async def extract(self, dto: ExtractDTO) -> ExtractResponse:
+ return await extract.extract(dto, self._config)
+
+ async def verify_action(self, dto: VerifyActionDTO) -> InteractionResponse:
+ return await verify_interaction.verify_action(dto, self._config)
+
+ async def ask_page(self, dto: AskPageDTO) -> AskPageResponse:
+ return await ask.ask_page_action(dto, self._config)
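
The engine is a thin async facade over the `logic` subpackage; a minimal wiring sketch, with the DTO construction elided:

```python
from dendrite.logic import AsyncLogicEngine
from dendrite.logic.config import Config

async def run(extract_dto):
    # extract_dto: an ExtractDTO built from Page.get_page_information()
    engine = AsyncLogicEngine(Config())
    res = await engine.extract(extract_dto)
    print(res.status, res.return_data)
```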
diff --git a/dendrite/sync_api/_api/__init__.py b/dendrite/logic/cache/__init__.py
similarity index 100%
rename from dendrite/sync_api/_api/__init__.py
rename to dendrite/logic/cache/__init__.py
diff --git a/dendrite/logic/cache/file_cache.py b/dendrite/logic/cache/file_cache.py
new file mode 100644
index 0000000..b56bc18
--- /dev/null
+++ b/dendrite/logic/cache/file_cache.py
@@ -0,0 +1,179 @@
+import json
+import threading
+from hashlib import md5
+from pathlib import Path
+from typing import (
+ Any,
+ Dict,
+ Generic,
+ List,
+ Mapping,
+ Type,
+ TypeVar,
+ Union,
+ Optional,
+ overload,
+)
+
+from pydantic import BaseModel
+
+T = TypeVar("T", bound=Union[BaseModel, Mapping[Any, Any]])
+
+
+class FileCache(Generic[T]):
+ def __init__(
+ self, model_class: Type[T], filepath: Union[str, Path] = "./cache.json"
+ ):
+ self.filepath = Path(filepath)
+ self.model_class = model_class
+ self.lock = threading.RLock()
+ self.cache: Dict[str, List[T]] = {}
+
+ # Create file if it doesn't exist
+ if not self.filepath.exists():
+ self.filepath.parent.mkdir(parents=True, exist_ok=True)
+ self._save_cache({})
+ else:
+ self._load_cache()
+
+ def _load_cache(self) -> None:
+ """Load cache from file into memory"""
+ with self.lock:
+ try:
+ json_string = self.filepath.read_text()
+ raw_dict = json.loads(json_string)
+
+ # Convert each entry based on model_class type
+ self.cache = {}
+ for k, v_list in raw_dict.items():
+ if not isinstance(v_list, list):
+ v_list = [v_list] # Convert old single-value format to list
+
+ self.cache[k] = []
+ for v in v_list:
+ if issubclass(self.model_class, BaseModel):
+ self.cache[k].append(
+ self.model_class.model_validate_json(json.dumps(v))
+ )
+ else:
+ # For any Mapping type (dict, TypedDict, etc)
+ self.cache[k].append(v)
+ except (json.JSONDecodeError, FileNotFoundError):
+ self.cache = {}
+
+ def _save_cache(self, cache_dict: Dict[str, List[T]]) -> None:
+ """Save cache to file"""
+ with self.lock:
+ # Convert entries based on their type
+ serializable_dict = {}
+ for k, v_list in cache_dict.items():
+ serializable_dict[k] = []
+ for v in v_list:
+ if isinstance(v, BaseModel):
+ serializable_dict[k].append(json.loads(v.model_dump_json()))
+ elif isinstance(v, Mapping):
+ serializable_dict[k].append(
+ dict(v)
+ ) # Convert any Mapping to dict
+ else:
+ raise ValueError(f"Unsupported type for cache value: {type(v)}")
+
+ self.filepath.write_text(json.dumps(serializable_dict, indent=2))
+
+ @overload
+ def get(
+ self, key: Union[str, Dict[str, str]], index: None = None
+ ) -> Optional[List[T]]: ...
+
+ @overload
+ def get(self, key: Union[str, Dict[str, str]], index: int) -> Optional[T]: ...
+
+ def get(
+ self, key: Union[str, Dict[str, str]], index: Optional[int] = None
+ ) -> Union[T, List[T], None]:
+ """
+ Get cached values for a key. If index is provided, returns that specific item.
+ If index is None, returns the full list of items.
+ Returns None if key doesn't exist or index is out of range.
+ """
+ hashed_key = self.hash(key)
+ values = self.cache.get(hashed_key, [])
+
+ if index is not None:
+ return values[index] if 0 <= index < len(values) else None
+ return values if values else None
+
+ def set(self, key: Union[str, Dict[str, str]], values: Union[T, List[T]]) -> None:
+ """
+ Replace all values for a key with new value(s).
+ If a single value is provided, it will be wrapped in a list.
+ """
+ hashed_key = self.hash(key)
+ with self.lock:
+ if isinstance(values, list):
+ self.cache[hashed_key] = values
+ else:
+ self.cache[hashed_key] = [values]
+ self._save_cache(self.cache)
+
+ def append(self, key: Union[str, Dict[str, str]], value: T) -> None:
+ """
+ Append a single value to the list of values for a key.
+ Creates a new list if the key doesn't exist.
+ """
+ hashed_key = self.hash(key)
+ with self.lock:
+ if hashed_key not in self.cache:
+ self.cache[hashed_key] = []
+ self.cache[hashed_key].append(value)
+ self._save_cache(self.cache)
+
+ def delete(self, key: str, index: Optional[int] = None) -> None:
+ """
+ Delete cached value(s). If index is provided, only that item is deleted.
+ If index is None, all items for the key are deleted.
+ """
+ hashed_key = self.hash(key)
+ with self.lock:
+ if hashed_key in self.cache:
+ if index is not None and 0 <= index < len(self.cache[hashed_key]):
+ del self.cache[hashed_key][index]
+ if not self.cache[hashed_key]: # Remove key if list is empty
+ del self.cache[hashed_key]
+ else:
+ del self.cache[hashed_key]
+ self._save_cache(self.cache)
+
+ def hash(self, key: Union[str, Dict]) -> str:
+ """
+ Create a deterministic hash from a string or dictionary.
+ Handles nested structures and different value types.
+ """
+
+ def normalize_value(v):
+ if isinstance(v, dict):
+ return self.hash(v)
+ elif isinstance(v, (list, tuple)):
+ return "[" + ",".join(normalize_value(x) for x in v) + "]"
+ elif v is None:
+ return "null"
+ elif isinstance(v, bool):
+ return str(v).lower()
+ else:
+ return str(v).strip()
+
+ if isinstance(key, dict):
+ try:
+ # Sort by normalized string keys
+ sorted_pairs = [
+ f"{str(k).strip()}∴{normalize_value(v)}" # Using a rare Unicode character as delimiter
+ for k, v in sorted(key.items(), key=lambda x: str(x[0]).strip())
+ ]
+ key = "❘".join(sorted_pairs) # Using another rare Unicode character
+ except Exception as e:
+ raise ValueError(f"Failed to process dictionary key: {e}")
+
+ try:
+ return md5(str(key).encode("utf-8")).hexdigest()
+ except Exception as e:
+ raise ValueError(f"Failed to create hash: {e}")
diff --git a/dendrite/sync_api/_api/dto/__init__.py b/dendrite/logic/code/__init__.py
similarity index 100%
rename from dendrite/sync_api/_api/dto/__init__.py
rename to dendrite/logic/code/__init__.py
diff --git a/dendrite/logic/code/code_session.py b/dendrite/logic/code/code_session.py
new file mode 100644
index 0000000..fcb7300
--- /dev/null
+++ b/dendrite/logic/code/code_session.py
@@ -0,0 +1,166 @@
+import json # Important to keep since it is used inside the scripts
+import re # Important to keep since it is used inside the scripts
+import sys
+import traceback
+from datetime import datetime # Important to keep since it is used inside the scripts
+from typing import Any, List, Optional
+
+from bs4 import BeautifulSoup
+from jsonschema import validate
+from loguru import logger
+
+from ..dom.truncate import truncate_long_string
+
+
+class InterpreterError(Exception):
+ pass
+
+
+def custom_exec(
+ cmd,
+ globals=None,
+ locals=None,
+):
+ try:
+ exec(cmd, globals, locals)
+ except SyntaxError as err:
+ error_class = err.__class__.__name__
+ detail = err.args[0]
+ line_number = err.lineno
+ except Exception as err:
+ error_class = err.__class__.__name__
+ detail = err.args[0]
+ cl, exc, tb = sys.exc_info()
+ line_number = traceback.extract_tb(tb)[-1][1]
+ else:
+ return
+
+ traceback_desc = traceback.format_exc()
+ raise InterpreterError(
+ f"{error_class} at line {line_number}. Detail: {detail}. Exception: {traceback_desc}"
+ )
+
+
+class CodeSession:
+ def __init__(self):
+ self.local_vars = {"soup": None, "html_string": "", "datetime": datetime}
+
+ def get_local_var(self, name: str) -> Any:
+ try:
+ return self.local_vars[name]
+ except Exception as e:
+ return f"Error: Couldn't get local var with name {name}. Exception: {e}"
+
+ def add_variable(self, name: str, value: Any):
+ self.local_vars[name] = value
+
+ def exec_code(
+ self,
+ code: str,
+ soup: Optional[BeautifulSoup] = None,
+ html_string: Optional[str] = None,
+ ):
+ try:
+ self.local_vars["soup"] = soup
+ self.local_vars["html_string"] = html_string
+ self.local_vars["datetime"] = datetime
+
+ copied_vars = self.local_vars.copy()
+
+ try:
+ exec(code, globals(), copied_vars)
+ except SyntaxError as err:
+ error_class = err.__class__.__name__
+ detail = err.args[0]
+ line_number = err.lineno
+ raise InterpreterError(
+ "%s at line %d, detail: %s" % (error_class, line_number, detail)
+ )
+ except Exception as err:
+ error_class = err.__class__.__name__
+ detail = err.args[0]
+ _, _, tb = sys.exc_info()
+ line_number = traceback.extract_tb(tb)[-1][1]
+ traceback_desc = traceback.format_exc()
+ raise InterpreterError(
+ "%s at line %d, detail: %s"
+ % (error_class, line_number, traceback_desc)
+ )
+
+ created_vars = {
+ k: v for k, v in copied_vars.items() if k not in self.local_vars
+ }
+
+ self.local_vars = copied_vars
+ return created_vars
+
+ except Exception as e:
+ raise Exception(f"Code failed to run. Exception: {e}")
+
+    def validate_response(self, return_data_json_schema: Any, response_data: Any):
+        if return_data_json_schema is not None:
+            validate(instance=response_data, schema=return_data_json_schema)
+
+ def llm_readable_exec_res(
+ self, variables, prompt: str, attempts: int, max_attempts: int
+ ):
+ response = "Code executed.\n\n"
+
+ if len(variables) == 0:
+ response += "No new variables were created."
+ else:
+ response += "Newly created variables:"
+ for var_name, var_value in variables.items():
+ show_length = 600 if var_name == "response_data" else 300
+
+ try:
+ if var_value is None:
+ str_value = "None"
+ else:
+ str_value = str(var_value)
+
+ except Exception as e:
+                logger.error(
+                    f"Error converting to string for display: {e},\nvar_name: {var_name} | var_value: {var_value}"
+                )
+ str_value = ""
+
+ truncated = truncate_long_string(
+ str_value, max_len_end=show_length, max_len_start=show_length
+ )
+ extra_info = ""
+ if isinstance(var_value, List):
+ extra_info = f"\n{var_name}'s length is {len(var_value)}."
+ response += f"\n\n`{var_name}={truncated}`{extra_info}"
+
+ response += f"\n\nDo these variables match the expected values? Remember, this is what the user asked for:\n\n{prompt}\n\nIf not, try again and remember, if one approach fails several times you might need to reinspect the DOM and try a different approach. You have {max_attempts - attempts} attempts left to try and complete the task. If you are happy with the results, output a success message."
+
+ return response
+
+
+def execute(script: str, raw_html: str, return_data_json_schema) -> Any:
+ code_session = CodeSession()
+ soup = BeautifulSoup(raw_html, "lxml")
+
+    created_variables = code_session.exec_code(script, soup, raw_html)
+
+    if "response_data" not in created_variables:
+        raise Exception("No return data available for this script.")
+
+    response_data = created_variables["response_data"]
+
+    try:
+        code_session.validate_response(return_data_json_schema, response_data)
+    except Exception as e:
+        raise Exception(f"Failed to validate response data. Exception: {e}")
+
+    return response_data
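+
+
+# Illustrative usage (a sketch, not called by the library itself): run a
+# generated snippet against a made-up HTML fixture and read back response_data.
+if __name__ == "__main__":
+    demo_html = "<html><body><p class='price'>$5</p></body></html>"
+    demo_script = "response_data = soup.select_one('p.price').get_text()"
+    print(execute(demo_script, demo_html, None))  # -> $5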
diff --git a/dendrite/logic/config.py b/dendrite/logic/config.py
new file mode 100644
index 0000000..b69daf7
--- /dev/null
+++ b/dendrite/logic/config.py
@@ -0,0 +1,27 @@
+from pathlib import Path
+from typing import Optional, Union
+
+from playwright.async_api import StorageState
+
+from dendrite.logic.cache.file_cache import FileCache
+from dendrite.logic.llm.config import LLMConfig
+from dendrite.models.scripts import Script
+from dendrite.models.selector import Selector
+
+
+class Config:
+ def __init__(
+ self,
+ root_path: Union[str, Path] = ".dendrite",
+ cache_path: Union[str, Path] = "cache",
+ auth_session_path: Union[str, Path] = "auth",
+ llm_config: Optional[LLMConfig] = None,
+ ):
+ self.cache_path = root_path / Path(cache_path)
+ self.llm_config = llm_config or LLMConfig()
+ self.extract_cache = FileCache(Script, self.cache_path / "extract.json")
+ self.element_cache = FileCache(Selector, self.cache_path / "get_element.json")
+ self.storage_cache = FileCache(
+ StorageState, self.cache_path / "storage_state.json"
+ )
+ self.auth_session_path = root_path / Path(auth_session_path)
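+
+
+# Illustrative usage (a sketch; constructing Config may touch the local
+# .dendrite directory through the FileCache instances it creates):
+if __name__ == "__main__":
+    config = Config()
+    print(config.cache_path)  # .dendrite/cache
+    print(config.auth_session_path)  # .dendrite/auth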
diff --git a/dendrite/sync_api/_api/response/__init__.py b/dendrite/logic/dom/__init__.py
similarity index 100%
rename from dendrite/sync_api/_api/response/__init__.py
rename to dendrite/logic/dom/__init__.py
diff --git a/dendrite/logic/dom/css.py b/dendrite/logic/dom/css.py
new file mode 100644
index 0000000..8555df4
--- /dev/null
+++ b/dendrite/logic/dom/css.py
@@ -0,0 +1,185 @@
+from typing import Optional
+
+from bs4 import BeautifulSoup, Tag
+from loguru import logger
+
+
+def find_css_selector(ele: Tag, soup: BeautifulSoup) -> str:
+ logger.debug(f"Finding selector for element: {ele.name} with attrs: {ele.attrs}")
+
+ # Check for inherently unique elements
+ if ele.name in ["html", "head", "body"]:
+ return ele.name
+
+ # List of attributes to check for unique selectors
+ priority_attrs = [
+ "id",
+ "name",
+ "data-testid",
+ "data-cy",
+ "data-qa",
+ "aria-label",
+ "aria-labelledby",
+ "for",
+ "href",
+ "alt",
+ "title",
+ "role",
+ "placeholder",
+ ]
+
+ # Try attrs
+ for attr in priority_attrs:
+ if attr_selector := check_unique_attribute(ele, soup, attr, ele.name):
+ return attr_selector
+
+ # Try class combinations
+ if class_selector := find_unique_class_combination(ele, soup):
+ return class_selector
+
+ # If still not unique, use parent selector with nth-child
+ parent_selector = find_selector_with_parent(ele, soup)
+
+ return parent_selector
+
+
+def check_unique_attribute(
+ ele: Tag, soup: BeautifulSoup, attr: str, tag_name: str
+) -> str:
+ attr_value = ele.get(attr)
+ if attr_value:
+ attr_value = css_escape(attr_value)
+ attr = css_escape(attr)
+ selector = f'{css_escape(tag_name)}[{attr}="{attr_value}"]'
+ if check_if_selector_successful(selector, soup, True):
+ return selector
+ return ""
+
+
+def find_unique_class_combination(ele: Tag, soup: BeautifulSoup) -> str:
+ classes = ele.get("class", [])
+
+ if isinstance(classes, str):
+ classes = [classes]
+
+ if not classes:
+ return ""
+
+ tag_name = css_escape(ele.name)
+
+ # Try single classes first
+ for cls in classes:
+ selector = f"{tag_name}.{css_escape(cls)}"
+ if check_if_selector_successful(selector, soup, True):
+ return selector
+
+ # If single classes don't work, try the full combination
+ full_selector = f"{tag_name}{'.'.join([''] + [css_escape(c) for c in classes])}"
+ if check_if_selector_successful(full_selector, soup, True):
+ return full_selector
+
+ return ""
+
+
+def find_selector_with_parent(ele: Tag, soup: BeautifulSoup) -> str:
+ parent = ele.find_parent()
+ if parent is None or parent == soup:
+ return f"{css_escape(ele.name)}"
+
+ parent_selector = find_css_selector(parent, soup)
+ siblings_of_same_type = parent.find_all(ele.name, recursive=False)
+
+ if len(siblings_of_same_type) == 1:
+ return f"{parent_selector} > {css_escape(ele.name)}"
+ else:
+ index = position_in_node_list(ele, parent)
+ return f"{parent_selector} > {css_escape(ele.name)}:nth-child({index})"
+
+
+def position_in_node_list(element: Tag, parent: Tag):
+ for index, child in enumerate(parent.find_all(recursive=False)):
+ if child == element:
+ return index + 1
+ return -1
+
+
+# https://github.com/mathiasbynens/CSS.escape
+def css_escape(value):
+ if len(str(value)) == 0:
+ raise TypeError("`CSS.escape` requires an argument.")
+
+ string = str(value)
+ length = len(string)
+ result = ""
+ first_code_unit = ord(string[0]) if length > 0 else None
+
+ if length == 1 and first_code_unit == 0x002D:
+ return "\\" + string
+
+ for index in range(length):
+ code_unit = ord(string[index])
+
+ if code_unit == 0x0000:
+ result += "\uFFFD"
+ continue
+
+ if (
+ (0x0001 <= code_unit <= 0x001F)
+ or code_unit == 0x007F
+ or (index == 0 and 0x0030 <= code_unit <= 0x0039)
+ or (
+ index == 1
+ and 0x0030 <= code_unit <= 0x0039
+ and first_code_unit == 0x002D
+ )
+ ):
+ result += "\\" + format(code_unit, "x") + " "
+ continue
+
+ if (
+ code_unit >= 0x0080
+ or code_unit == 0x002D
+ or code_unit == 0x005F
+ or 0x0030 <= code_unit <= 0x0039
+ or 0x0041 <= code_unit <= 0x005A
+ or 0x0061 <= code_unit <= 0x007A
+ ):
+ result += string[index]
+ continue
+
+ result += "\\" + string[index]
+
+ return result
+
+
+def check_if_selector_successful(
+ selector: str,
+ bs4: BeautifulSoup,
+ only_one: bool,
+) -> Optional[str]:
+
+ els = None
+ try:
+ els = bs4.select(selector)
+ except Exception as e:
+ logger.warning(f"Error selecting {selector}: {e}")
+
+ if els:
+ if only_one and len(els) == 1:
+ return selector
+ elif not only_one and len(els) >= 1:
+ return selector
+
+ return None
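+
+
+# Illustrative usage (a sketch): ask for a unique selector for an element in a
+# made-up fixture. The id attribute wins since it is first in priority_attrs.
+if __name__ == "__main__":
+    demo_soup = BeautifulSoup(
+        '<html><body><div id="main"><a class="nav" href="/a">A</a></div></body></html>',
+        "html.parser",
+    )
+    demo_div = demo_soup.find("div")
+    assert isinstance(demo_div, Tag)
+    print(find_css_selector(demo_div, demo_soup))  # div[id="main"]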
diff --git a/dendrite/logic/dom/strip.py b/dendrite/logic/dom/strip.py
new file mode 100644
index 0000000..fb4dc43
--- /dev/null
+++ b/dendrite/logic/dom/strip.py
@@ -0,0 +1,158 @@
+import copy
+from typing import List, Union, overload
+
+from bs4 import BeautifulSoup, Comment, Doctype, Tag
+
+
+def mild_strip(soup: Tag, keep_d_id: bool = True) -> BeautifulSoup:
+ new_soup = BeautifulSoup(str(soup), "html.parser")
+ _mild_strip(new_soup, keep_d_id)
+ return new_soup
+
+
+def mild_strip_in_place(soup: BeautifulSoup, keep_d_id: bool = True) -> None:
+ _mild_strip(soup, keep_d_id)
+
+
+def _mild_strip(soup: BeautifulSoup, keep_d_id: bool = True) -> None:
+ for element in soup(text=lambda text: isinstance(text, Comment)):
+ element.extract()
+
+ # for text in soup.find_all(text=lambda text: isinstance(text, NavigableString)):
+ # if len(text) > 200:
+ # text.replace_with(text[:200] + f"... [{len(text)-200} more chars]")
+
+ for tag in soup(
+ ["head", "script", "style", "path", "polygon", "defs", "svg", "br", "Doctype"]
+ ):
+ tag.extract()
+
+ for element in soup.contents:
+ if isinstance(element, Doctype):
+ element.extract()
+
+ # for tag in soup.find_all(True):
+ # tag.attrs = {
+ # attr: (value[:100] if isinstance(value, str) else value)
+ # for attr, value in tag.attrs.items()
+ # }
+ # if keep_d_id == False:
+ # del tag["d-id"]
+ for tag in soup.find_all(True):
+ if tag.attrs.get("is-interactable-d_id") == "true":
+ continue
+
+ tag.attrs = {
+ attr: (value[:100] if isinstance(value, str) else value)
+ for attr, value in tag.attrs.items()
+ }
+ if keep_d_id == False:
+ del tag["d-id"]
+
+ # if browser != None:
+ # for elem in list(soup.descendants):
+ # if isinstance(elem, Tag) and not browser.element_is_visible(elem):
+ # elem.extract()
+
+
+@overload
+def shorten_attr_val(value: str, limit: int = 50) -> str: ...
+
+
+@overload
+def shorten_attr_val(value: List[str], limit: int = 50) -> List[str]: ...
+
+
+def shorten_attr_val(
+ value: Union[str, List[str]], limit: int = 50
+) -> Union[str, List[str]]:
+ if isinstance(value, str):
+ return value[:limit]
+
+ char_count = sum(map(len, value))
+ if char_count <= limit:
+ return value
+
+ while len(value) > 1 and char_count > limit:
+ char_count -= len(value.pop())
+
+ if len(value) == 1:
+ return value[0][:limit]
+
+ return value
+
+
+def clear_attrs(element: Tag):
+
+ salient_attributes = [
+ "d-id",
+ "class",
+ "id",
+ "type",
+ "alt",
+ "aria-describedby",
+ "aria-label",
+ "contenteditable",
+ "aria-role",
+ "input-checked",
+ "label",
+ "name",
+ "option_selected",
+ "placeholder",
+ "readonly",
+ "text-value",
+ "title",
+ "value",
+ "href",
+ "role",
+ "action",
+ "method",
+ ]
+ attrs = {
+ attr: shorten_attr_val(value, limit=200)
+ for attr, value in element.attrs.items()
+ if attr in salient_attributes
+ }
+ element.attrs = attrs
+
+
+def strip_soup(soup: BeautifulSoup) -> BeautifulSoup:
+ # Create a copy of the soup to avoid modifying the original
+ stripped_soup = BeautifulSoup(str(soup), "html.parser")
+
+ for tag in stripped_soup(
+ [
+ "head",
+ "script",
+ "style",
+ "path",
+ "polygon",
+ "defs",
+ "br",
+ "Doctype",
+ ] # add noscript?
+ ):
+ tag.extract()
+
+ # Remove comments
+ comments = stripped_soup.find_all(text=lambda text: isinstance(text, Comment))
+ for comment in comments:
+ comment.extract()
+
+ # Clear non-salient attributes
+ for element in stripped_soup.find_all(True):
+ if isinstance(element, Doctype):
+ element.extract()
+ else:
+ clear_attrs(element)
+
+ return stripped_soup
+
+
+def remove_hidden_elements(soup: BeautifulSoup):
+ # data-hidden is added by DendriteBrowser when an element is not visible
+ new_soup = copy.copy(soup)
+ elems = new_soup.find_all(attrs={"data-hidden": True})
+ for elem in elems:
+ elem.extract()
+ return new_soup
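+
+
+# Illustrative usage (a sketch): strip a noisy made-up fragment down to its
+# salient structure; head/script go away and only salient attrs survive.
+if __name__ == "__main__":
+    noisy = BeautifulSoup(
+        "<html><head><script>x()</script></head><body>"
+        '<div class="c" style="color:red" data-x="1"><a href="/a">A</a></div>'
+        "</body></html>",
+        "html.parser",
+    )
+    print(strip_soup(noisy))  # <html><body><div class="c"><a href="/a">A</a></div></body></html>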
diff --git a/dendrite/logic/dom/truncate.py b/dendrite/logic/dom/truncate.py
new file mode 100644
index 0000000..fa1bfd8
--- /dev/null
+++ b/dendrite/logic/dom/truncate.py
@@ -0,0 +1,73 @@
+import re
+
+
+def truncate_long_string(
+ val: str,
+ max_len_start: int = 150,
+ max_len_end: int = 150,
+ trucate_desc: str = "chars truncated for readability",
+):
+ return (
+ val
+ if len(val) < max_len_start + max_len_end
+ else val[:max_len_start]
+ + f"... [{len(val)-max_len_start-max_len_end} {trucate_desc}] ..."
+ + val[-max_len_end:]
+ )
+
+
+def truncate_long_string_w_words(
+ val: str,
+ max_len_start: int = 150,
+ max_len_end: int = 150,
+ trucate_desc: str = "words truncated for readability",
+ show_more_words_for_longer_val: bool = True,
+):
+ if len(val) < max_len_start + max_len_end:
+ return val
+ else:
+        if show_more_words_for_longer_val:
+            max_len_start += int(len(val) / 100)
+            max_len_end += int(len(val) / 100)
+
+ truncate_start_pos = max_len_start
+ steps_taken_start = 0
+ while (
+ truncate_start_pos > 0
+ and val[truncate_start_pos] not in [" ", "\n"]
+ and steps_taken_start < 20
+ ):
+ truncate_start_pos -= 1
+ steps_taken_start += 1
+
+ truncate_end_pos = len(val) - max_len_end
+ steps_taken_end = 0
+ while (
+ truncate_end_pos < len(val)
+ and val[truncate_end_pos] not in [" ", "\n"]
+ and steps_taken_end < 20
+ ):
+ truncate_end_pos += 1
+ steps_taken_end += 1
+
+ if steps_taken_start >= 20 or steps_taken_end >= 20:
+ # Return simple truncation if we've looped further than 20 chars
+ return truncate_long_string(val, max_len_start, max_len_end, trucate_desc)
+ else:
+ return (
+ val[:truncate_start_pos]
+ + f" [...{len(val[truncate_start_pos:truncate_end_pos].split())} {trucate_desc}...] "
+ + val[truncate_end_pos:]
+ )
+
+
+def remove_excessive_whitespace(text: str, max_whitespaces=1):
+ return re.sub(r"\s{2,}", " " * max_whitespaces, text)
+
+
+def truncate_and_remove_whitespace(text, max_len_start=100, max_len_end=100):
+ return truncate_long_string_w_words(
+ remove_excessive_whitespace(text),
+ max_len_start=max_len_start,
+ max_len_end=max_len_end,
+ )
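+
+
+# Illustrative usage (a sketch): both truncators keep the head and the tail of
+# a long string and note in the middle how much was cut.
+if __name__ == "__main__":
+    long_text = "word " * 200
+    print(truncate_long_string(long_text, max_len_start=20, max_len_end=20))
+    print(truncate_and_remove_whitespace(long_text, max_len_start=20, max_len_end=20))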
diff --git a/dendrite/sync_api/_common/__init__.py b/dendrite/logic/extract/__init__.py
similarity index 100%
rename from dendrite/sync_api/_common/__init__.py
rename to dendrite/logic/extract/__init__.py
diff --git a/dendrite/logic/extract/cache.py b/dendrite/logic/extract/cache.py
new file mode 100644
index 0000000..36aab75
--- /dev/null
+++ b/dendrite/logic/extract/cache.py
@@ -0,0 +1,55 @@
+from datetime import datetime
+from typing import Any, List, Optional, Tuple
+from urllib.parse import urlparse
+
+from loguru import logger
+
+from dendrite.logic.cache.file_cache import FileCache
+from dendrite.logic.code.code_session import execute
+from dendrite.logic.config import Config
+from dendrite.models.dto.cached_extract_dto import CachedExtractDTO
+from dendrite.models.scripts import Script
+
+
+def save_script(code: str, prompt: str, url: str, cache: FileCache[Script]):
+ domain = urlparse(url).netloc
+ script = Script(
+ url=url, domain=domain, script=code, created_at=datetime.now().isoformat()
+ )
+ cache.append({"prompt": prompt, "domain": domain}, script)
+
+
+def get_scripts(
+ prompt: str, url: str, cache: FileCache[Script]
+) -> Optional[List[Script]]:
+ domain = urlparse(url).netloc
+ return cache.get({"prompt": prompt, "domain": domain})
+
+
+async def get_working_cached_script(
+ prompt: str, raw_html: str, url: str, return_data_json_schema: Any, config: Config
+) -> Optional[Tuple[Script, Any]]:
+
+    if len(url) == 0:
+        raise Exception("A URL must be specified so the domain can be derived")
+
+ scripts = get_scripts(prompt, url, config.extract_cache)
+ if scripts is None or len(scripts) == 0:
+ return None
+ logger.debug(
+ f"Found {len(scripts)} scripts in cache | Prompt: {prompt} in domain: {url}"
+ )
+
+ for script in scripts:
+ try:
+ res = execute(script.script, raw_html, return_data_json_schema)
+ return script, res
+ except Exception as e:
+ logger.debug(
+ f"Script failed with error: {str(e)} | Prompt: {prompt} in domain: {url}"
+ )
+ continue
+
+ raise Exception(
+ f"No working script found in cache even though {len(scripts)} scripts were available | Prompt: '{prompt}' in domain: '{url}'"
+ )
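+
+
+# Illustrative round trip (a sketch; this writes to the local .dendrite cache):
+#
+#   from dendrite.logic.config import Config
+#
+#   config = Config()
+#   save_script("response_data = soup.title.string", "get the page title",
+#               "https://example.com/page", config.extract_cache)
+#   scripts = get_scripts("get the page title", "https://example.com/page",
+#                         config.extract_cache)
+#   assert scripts and scripts[0].domain == "example.com"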
diff --git a/dendrite/logic/extract/compress_html.py b/dendrite/logic/extract/compress_html.py
new file mode 100644
index 0000000..f7fb3fc
--- /dev/null
+++ b/dendrite/logic/extract/compress_html.py
@@ -0,0 +1,490 @@
+import re
+import time
+from collections import Counter
+from typing import List, Optional, Tuple, TypedDict, Union
+
+from bs4 import BeautifulSoup, NavigableString, PageElement
+from bs4.element import Tag
+
+from dendrite.logic.dom.truncate import (
+ truncate_and_remove_whitespace,
+ truncate_long_string_w_words,
+)
+from dendrite.logic.llm.token_count import token_count
+
+MAX_REPEATING_ELEMENT_AMOUNT = 6
+
+
+class FollowableListInfo(TypedDict):
+ expanded_elements: List[Tag]
+ amount: int
+ parent_element_d_id: str
+ first_element_d_id: str
+
+
+class CompressHTML:
+ def __init__(
+ self,
+ root_soup: Union[BeautifulSoup, Tag],
+ ids_to_expand: List[str] = [],
+ compression_multiplier: float = 1,
+ exclude_dendrite_ids=False,
+ max_token_size: int = 80000,
+ max_size_per_element: int = 6000,
+ focus_on_text=False,
+ ) -> None:
+ if exclude_dendrite_ids == True:
+ for tag in root_soup.find_all():
+ if "d-id" in tag.attrs:
+ del tag["d-id"]
+
+ self.orginal_size = len(str(root_soup))
+ self.root = BeautifulSoup(str(root_soup), "html.parser")
+ self.original_root = BeautifulSoup(str(root_soup), "html.parser")
+ self.ids_to_expand = ids_to_expand
+ self.expand_crawlable_list = False
+ self.compression_multiplier = compression_multiplier
+ self.lists_with_followable_urls: List[FollowableListInfo] = []
+ self.max_token_size = max_token_size
+ self.max_size_per_element = max_size_per_element
+ self.focus_on_text = focus_on_text
+ self.search_terms = []
+
+ def get_lists_with_followable_urls(self):
+ return self.lists_with_followable_urls
+
+ def _remove_consecutive_newlines(self, text: str, max_newlines=1):
+ cleaned_text = re.sub(r"\n{2,}", "\n" * max_newlines, text)
+ return cleaned_text
+
+ def _parent_is_explicitly_expanded(self, tag: Tag) -> bool:
+ for tag in tag.parents:
+ if tag.get("d-id", None) in self.ids_to_expand:
+ return True
+ return False
+
+ def _should_expand_anyways(self, tag: Tag) -> bool:
+ curr_id = tag.get("d-id", None)
+
+ if curr_id in self.ids_to_expand:
+ return True
+
+ tag_descendants = [
+ descendant for descendant in tag.descendants if isinstance(descendant, Tag)
+ ]
+ for tag in tag_descendants:
+ id = tag.get("d-id", None)
+ if id in self.ids_to_expand:
+ return True
+
+ for parent in tag.parents:
+ id = parent.get("d-id", None)
+ if id in self.ids_to_expand:
+ return True
+ # Expand the children of expanded elements if the expanded element isn't too big
+ if len(str(parent)) > 4000:
+ return False
+
+ return False
+
+ def clear_attrs(self, element: Tag, unique_class_names: List[str]):
+ attrs = {}
+ class_attr = element.get("class", [])
+ salient_attributes = [
+ "type" "alt",
+ "aria-describedby",
+ "aria-label",
+ "aria-role",
+ "input-checked",
+ "label",
+ "name",
+ "option_selected",
+ "placeholder",
+ "readonly",
+ "text-value",
+ "title",
+ "value",
+ "href",
+ ]
+
+ attrs = {
+ attr: (str(value)[:100] if len(str(value)) > 100 else str(value))
+ for attr, value in element.attrs.items()
+ if attr in salient_attributes
+ }
+
+ if class_attr:
+ if isinstance(class_attr, str):
+ class_attr = class_attr.split(" ")
+
+ class_name_len = 0
+ class_max_len = 200
+ classes_to_show = []
+ for class_name in class_attr:
+ if class_name_len + len(class_name) < class_max_len:
+ classes_to_show.append(class_name)
+ class_name_len += len(class_name)
+
+ if len(classes_to_show) > 0:
+ attrs = {**attrs, "class": " ".join(classes_to_show)}
+
+ id = element.get("id")
+ d_id = element.get("d-id")
+
+ if isinstance(id, str):
+ attrs = {**attrs, "id": id}
+
+ if d_id:
+ attrs = {**attrs, "d-id": d_id}
+
+ element.attrs = attrs
+
+ def extract_crawlable_list(
+ self, repeating_element_sequence_ids: List[str], amount_repeating_left: int
+ ):
+ items: List[Tag] = []
+ parent_element_d_id: str = ""
+ first_element_d_id = repeating_element_sequence_ids[0]
+
+ for d_id in repeating_element_sequence_ids:
+
+ el = self.original_root.find(attrs={"d-id": str(d_id)})
+ if (
+ parent_element_d_id == ""
+ and isinstance(el, Tag)
+ and isinstance(el.parent, Tag)
+ ):
+ parent_element_d_id = str(el.parent.get("d-id", ""))
+
+ original = BeautifulSoup(str(el), "html.parser")
+ link = original.find("a")
+ if link and isinstance(original, Tag):
+ items.append(original)
+
+ if (
+ len(items) == len(repeating_element_sequence_ids)
+ and len(items) >= MAX_REPEATING_ELEMENT_AMOUNT
+ and parent_element_d_id != ""
+ ):
+ self.lists_with_followable_urls.append(
+ {
+ "amount": len(items) + amount_repeating_left,
+ "expanded_elements": items,
+ "parent_element_d_id": parent_element_d_id,
+ "first_element_d_id": first_element_d_id,
+ }
+ )
+
+ def get_html_display(self) -> str:
+ def collapse(element: PageElement) -> str:
+ chars_to_keep = 2000 if self.focus_on_text else 100
+
+ if isinstance(element, Tag):
+ if element.get("d-id", "") == "-1":
+ return ""
+
+ text = element.get_text()
+ if text:
+ element.attrs["is-compressed"] = "true"
+ element.attrs["d-id"] = str(element.get("d-id", ""))
+ element.clear()
+ element.append(
+ truncate_and_remove_whitespace(
+ text, max_len_start=chars_to_keep, max_len_end=chars_to_keep
+ )
+ )
+ return str(element)
+ else:
+ return ""
+ elif isinstance(element, NavigableString):
+ return truncate_and_remove_whitespace(
+ element, max_len_start=chars_to_keep, max_len_end=chars_to_keep
+ )
+ else:
+ return ""
+
+ start_time = time.time()
+ class_names = [
+ name for tag in self.root.find_all() for name in tag.get("class", [])
+ ]
+
+ counts = Counter(class_names)
+ unique_class_names = [name for name, count in counts.items() if count == 1]
+
+ def get_repeating_element_info(el: Tag) -> Tuple[str, List[str]]:
+ return (
+ el.name,
+ [el.name for el in el.children if isinstance(el, Tag)],
+ )
+
+ def is_repeating_element(
+ previous_element_info: Optional[Tuple[str, List[str]]], element: Tag
+ ) -> bool:
+ if previous_element_info:
+ repeat_element_info = get_repeating_element_info(element)
+ return (
+ previous_element_info == repeat_element_info
+ and element.name != "div"
+ )
+
+ return False
+
+ # children_size += token_count(str(child))
+ # if children_size > 400:
+ # children_left = {}
+ # for c in child.next_siblings:
+ # if isinstance(c, Tag):
+ # if c.name in children_left:
+ # children_left[c.name] += 1
+ # else:
+ # children_left[c.name] = 0
+ # desc = ""
+ # for c_name in children_left.keys():
+ # desc = f"{children_left[c_name]} {c_name} tag(s) truncated for readability"
+ # child.replace_with(f"[...{desc}...]")
+ # break
+
+ def traverse(tag: Union[BeautifulSoup, Tag]):
+ previous_element_info: Optional[Tuple[str, List[str]]] = None
+ repeating_element_sequence_ids = []
+ has_placed_truncation = False
+ same_element_repeat_amount: int = 0
+
+ tag_children = (child for child in tag.children if isinstance(child, Tag))
+
+ total_token_size = 0
+ for index, child in enumerate(tag_children):
+
+ total_token_size += len(str(child))
+ # if total_token_size > self.max_size_per_element * 4 and index > 60:
+ # names = {}
+ # for next_sibling in child.next_siblings:
+ # if isinstance(next_sibling, Tag):
+ # if next_sibling.name in names:
+ # names[next_sibling.name] += 1
+ # else:
+ # names[next_sibling.name] = 1
+
+ # removable = [sib for sib in child.next_siblings]
+ # for sib in removable:
+ # try:
+ # sib.replace_with("")
+ # except:
+ # print("failed to replace sib: ", str(sib))
+
+ # truncation_message = []
+ # for element_name, amount_hidden in names.items():
+ # truncation_message.append(
+ # f"{amount_hidden} `{element_name}` element(s)"
+ # )
+
+ # child.replace_with(
+ # f"[...{','.join(truncation_message)} hidden for readablity ...]"
+ # )
+ # break
+
+ repeating_element_sequence_ids.append(child.get("d-id", "None"))
+
+ if is_repeating_element(previous_element_info, child):
+ same_element_repeat_amount += 1
+
+ if (
+ same_element_repeat_amount > MAX_REPEATING_ELEMENT_AMOUNT
+ and self._parent_is_explicitly_expanded(child) == False
+ ):
+ amount_repeating = 0
+ if isinstance(child, Tag):
+ for sibling in child.next_siblings:
+ if isinstance(sibling, Tag) and is_repeating_element(
+ previous_element_info, sibling
+ ):
+ amount_repeating += 1
+
+ if has_placed_truncation == False and amount_repeating >= 1:
+ child.replace_with(
+ f"[...{amount_repeating} repeating `{child.name}` elements collapsed for readability...]"
+ )
+ has_placed_truncation = True
+
+ self.extract_crawlable_list(
+ repeating_element_sequence_ids, amount_repeating
+ )
+
+ if self.expand_crawlable_list == True:
+ for d_id in repeating_element_sequence_ids:
+ sequence_element = self.root.find(
+ attrs={"d-id": str(d_id)}
+ )
+
+ if isinstance(sequence_element, Tag):
+ original = BeautifulSoup(
+ str(
+ self.original_root.find(
+ attrs={"d-id": str(d_id)}
+ )
+ ),
+ "html.parser",
+ )
+ links = original.find_all("a")
+ for link in links:
+
+ self.ids_to_expand.append(
+ str(link.get("d-id", "None"))
+ )
+ sequence_element.replace_with(original)
+ traverse(sequence_element)
+
+ repeating_element_sequence_ids = []
+ else:
+ child.replace_with("")
+ continue
+
+ else:
+ has_placed_truncation = False
+ previous_element_info = get_repeating_element_info(child)
+ same_element_repeat_amount = 0
+
+ # If a parent is expanded, allow larger element until collapsing
+ compression_mod = self.compression_multiplier
+ if self._parent_is_explicitly_expanded(child):
+ compression_mod = 0.5
+
+ if len(str(child)) < self.orginal_size // 300 * compression_mod:
+ if self._should_expand_anyways(child):
+ traverse(child)
+ else:
+ chars_to_keep = 2000 if self.focus_on_text else 80
+ truncated_text = truncate_long_string_w_words(
+ child.get_text().replace("\n", ""),
+ max_len_start=chars_to_keep,
+ max_len_end=chars_to_keep,
+ )
+ if truncated_text.strip():
+ child.attrs = {
+ "is-compressed": "true",
+ "d-id": str(child.get("d-id", "")),
+ }
+ child.string = truncated_text
+ else:
+ child.replace_with("")
+ elif len(str(child)) > self.orginal_size // 10 * compression_mod:
+ traverse(child)
+ else:
+ if self._should_expand_anyways(child):
+ traverse(child)
+ else:
+ replacement = collapse(child)
+ child.replace_with(BeautifulSoup(replacement, "html.parser"))
+
+ # total_token_size += len(str(child))
+ # print("total_token_size: ", total_token_size)
+
+ # if total_token_size > 2000:
+ # next_element_tags = [
+ # sibling.name for sibling in child.next_siblings if isinstance(sibling, Tag)]
+ # child.replace_with(
+ # f"[...{', '.join(next_element_tags)} tags collapsed for readability...]")
+
+ def remove_double_nested(soup):
+ for tag in soup.find_all(True):
+ # If a tag only contains a single child of the same type
+ if len(tag.find_all(True, recursive=False)) == 1 and isinstance(
+ tag.contents[0], Tag
+ ):
+ child_tag = tag.contents[0]
+ # move the contents of the child tag up to the parent
+ tag.clear()
+ tag.extend(child_tag.contents)
+ if len(tag.find_all(True, recursive=False)) == 1 and isinstance(
+ tag.contents[0], Tag
+ ):
+ remove_double_nested(tag)
+
+ return soup
+
+ def is_effectively_empty(element):
+ if element.name and not element.attrs:
+ if not element.contents or all(
+ isinstance(child, NavigableString) and len(child.strip()) < 3
+ for child in element.contents
+ ):
+ return True
+ return False
+
+ start_time = time.time()
+ for i in range(10):
+ for element in self.root.find_all(is_effectively_empty):
+ element.decompose()
+
+ for tag in self.root.find_all():
+ self.clear_attrs(tag, unique_class_names)
+
+ if len(str(self.root)) < 1500:
+ return self.root.prettify()
+
+ # print("time: ", end_time - start_time)
+
+ # remove_double_nested(self.root)
+ # clean_attributes(root, keep_dendrite_id=False)
+ traverse(self.root)
+ # print("traverse time: ", end_time - start_time)
+
+ return self.root.prettify()
+
+ def get_compression_level(self) -> Tuple[str, int]:
+ if self.orginal_size > 100000:
+ return "4/4 (Extremely compressed)", 4
+ elif self.orginal_size > 40000:
+ return "3/4 (Very compressed)", 3
+ elif self.orginal_size > 4000:
+ return "2/4 (Slightly compressed)", 2
+ elif self.orginal_size > 400:
+ return "1/4 (Very mild compression)", 1
+ else:
+ return "0/4 (no compression)", 0
+
+ async def compress(self, search_terms: List[str] = []) -> str:
+ iterations = 0
+ pretty = ""
+ self.search_terms = search_terms
+
+ while token_count(pretty) > self.max_token_size or pretty == "":
+ iterations += 1
+ if iterations > 5:
+ break
+ compression_level_desc, _ = self.get_compression_level()
+ # Show elements with relevant search terms more
+ if len(self.search_terms) > 0:
+
+ def contains_text(element):
+ if element:
+ # Check only direct text content, not including nested elements
+ direct_text = "".join(
+ child
+ for child in element.children
+ if isinstance(child, NavigableString)
+ ).lower()
+ return any(
+ term.lower() in direct_text for term in self.search_terms
+ )
+ return False
+
+ matching_elements = self.original_root.find_all(contains_text)
+ for element in matching_elements:
+ print(f"Element contains search word: {str(element)[:400]}")
+ d_id = element.get("d-id")
+ if d_id:
+ self.ids_to_expand.append(d_id)
+
+ # print("old: ", self.orginal_size)
+ md = self.get_html_display()
+ md = self._remove_consecutive_newlines(md)
+ pretty = BeautifulSoup(md, "html.parser").prettify()
+ end = time.time()
+ # print("pretty: ", pretty)
+ # print("new: ", token_count(pretty))
+ # print("took: ", end - start)
+ # print("compression_level: ", compression_level_desc)
+ self.compression_multiplier *= 2
+
+ return pretty
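+
+
+# Illustrative usage (a sketch): compress a small made-up page. Tiny inputs are
+# returned almost as-is since get_html_display short-circuits below 1500 chars.
+if __name__ == "__main__":
+    import asyncio
+
+    demo_root = BeautifulSoup(
+        "<html><body>"
+        + "<div class='row'><a href='/x'>item</a></div>" * 20
+        + "</body></html>",
+        "html.parser",
+    )
+    compressor = CompressHTML(demo_root, max_token_size=500)
+    print(asyncio.run(compressor.compress()))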
diff --git a/dendrite/logic/extract/extract.py b/dendrite/logic/extract/extract.py
new file mode 100644
index 0000000..e44664f
--- /dev/null
+++ b/dendrite/logic/extract/extract.py
@@ -0,0 +1,156 @@
+import asyncio
+import hashlib
+from typing import List, Optional
+from urllib.parse import urlparse
+
+from loguru import logger
+
+from dendrite.logic.config import Config
+from dendrite.logic.extract.cache import get_scripts, get_working_cached_script
+from dendrite.logic.extract.extract_agent import ExtractAgent
+from dendrite.models.dto.cached_extract_dto import CachedExtractDTO
+from dendrite.models.dto.extract_dto import ExtractDTO
+from dendrite.models.response.extract_response import ExtractResponse
+from dendrite.models.scripts import Script
+
+
+async def get_cached_scripts(dto: CachedExtractDTO, config: Config) -> List[Script]:
+ return get_scripts(dto.prompt, dto.url, config.extract_cache) or []
+
+
+async def test_cache(
+ extract_dto: ExtractDTO, config: Config
+) -> Optional[ExtractResponse]:
+ try:
+
+ cached_script_res = await get_working_cached_script(
+ extract_dto.combined_prompt,
+ extract_dto.page_information.raw_html,
+ extract_dto.page_information.url,
+ extract_dto.return_data_json_schema,
+ config,
+ )
+
+ if cached_script_res is None:
+ return None
+
+ script, script_exec_res = cached_script_res
+ return ExtractResponse(
+ status="success",
+ message="Re-used a preexisting script from cache with the same specifications.",
+ return_data=script_exec_res,
+ created_script=script.script,
+ )
+
+ except Exception as e:
+ return ExtractResponse(
+ status="failed",
+ message=str(e),
+ )
+
+
+class InMemoryLockManager:
+ # Class-level dictionaries to keep track of locks and events
+ locks = {}
+ events = {}
+ global_lock = asyncio.Lock()
+
+ def __init__(
+ self,
+ extract_page_dto: ExtractDTO,
+ ):
+ self.key = self.generate_key(extract_page_dto)
+
+ def generate_key(self, extract_page_dto: ExtractDTO) -> str:
+ domain = urlparse(extract_page_dto.page_information.url).netloc
+ key_data = f"{domain}:{extract_page_dto.combined_prompt}"
+ return hashlib.sha256(key_data.encode()).hexdigest()
+
+ async def acquire_lock(self, timeout: int = 60) -> bool:
+ async with InMemoryLockManager.global_lock:
+ if self.key in InMemoryLockManager.locks:
+ # Lock is already acquired
+ return False
+ else:
+ # Acquire the lock
+ InMemoryLockManager.locks[self.key] = True
+ return True
+
+ async def release_lock(self):
+ async with InMemoryLockManager.global_lock:
+ InMemoryLockManager.locks.pop(self.key, None)
+ InMemoryLockManager.events.pop(self.key, None)
+
+ async def publish(self, message: str):
+ async with InMemoryLockManager.global_lock:
+ event = InMemoryLockManager.events.get(self.key)
+ if event:
+ event.set()
+
+ async def subscribe(self):
+ async with InMemoryLockManager.global_lock:
+ if self.key not in InMemoryLockManager.events:
+ InMemoryLockManager.events[self.key] = asyncio.Event()
+ # No need to assign to self.event; return the event instead
+ return InMemoryLockManager.events[self.key]
+
+ async def wait_for_notification(
+ self, event: asyncio.Event, timeout: float = 1600.0
+ ) -> bool:
+ try:
+ await asyncio.wait_for(event.wait(), timeout)
+ return True
+ except asyncio.TimeoutError as e:
+ logger.error(f"Timeout error: {e}")
+ return False
+ finally:
+ # Clean up event
+ async with InMemoryLockManager.global_lock:
+ InMemoryLockManager.events.pop(self.key, None)
+
+
+async def extract(extract_page_dto: ExtractDTO, config: Config) -> ExtractResponse:
+
+ lock_manager = InMemoryLockManager(extract_page_dto)
+ lock_acquired = await lock_manager.acquire_lock()
+
+ if lock_acquired:
+ return await generate_script(extract_page_dto, lock_manager, config)
+ else:
+ res = await wait_for_script_generation(extract_page_dto, lock_manager, config)
+
+ if res:
+ return res
+ # Else create a working script since page is different
+ extract_agent = ExtractAgent(extract_page_dto.page_information, config=config)
+ res = await extract_agent.write_and_run_script(extract_page_dto)
+ return res
+
+
+async def generate_script(
+ extract_page_dto: ExtractDTO, lock_manager: InMemoryLockManager, config: Config
+) -> ExtractResponse:
+ try:
+ extract_agent = ExtractAgent(extract_page_dto.page_information, config=config)
+ res = await extract_agent.write_and_run_script(extract_page_dto)
+ await lock_manager.publish("done")
+ return res
+ except Exception as e:
+ await lock_manager.publish("failed")
+ raise e
+ finally:
+ await lock_manager.release_lock()
+
+
+async def wait_for_script_generation(
+ extract_page_dto: ExtractDTO, lock_manager: InMemoryLockManager, config: Config
+) -> Optional[ExtractResponse]:
+ event = await lock_manager.subscribe()
+ logger.info("Waiting for script to be generated")
+ notification_received = await lock_manager.wait_for_notification(event)
+
+ # If script was created after waiting
+ if notification_received:
+ res = await test_cache(extract_page_dto, config)
+ if res:
+ return res
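+
+
+# Concurrency sketch (illustrative): two identical extract() calls race; the
+# first acquires the in-memory lock and generates a script, while the second
+# subscribes and, once notified, replays the freshly cached script via
+# test_cache instead of paying for a second LLM round trip.
+#
+#   results = await asyncio.gather(
+#       extract(dto, config),
+#       extract(dto, config),  # same prompt + domain -> same lock key
+#   )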
diff --git a/dendrite/logic/extract/extract_agent.py b/dendrite/logic/extract/extract_agent.py
new file mode 100644
index 0000000..6173351
--- /dev/null
+++ b/dendrite/logic/extract/extract_agent.py
@@ -0,0 +1,289 @@
+import json
+import re
+import sys
+from typing import List, Union
+
+from bs4 import BeautifulSoup
+
+from dendrite import logger
+
+from dendrite.logic.config import Config
+from dendrite.logic.dom.strip import mild_strip
+from dendrite.logic.extract.cache import save_script
+from dendrite.logic.extract.prompts import (
+ LARGE_HTML_CHAR_TRUNCATE_LEN,
+ create_script_prompt_segmented_html,
+)
+from dendrite.logic.extract.scroll_agent import ScrollAgent
+from dendrite.logic.get_element.hanifi_search import get_expanded_dom
+from dendrite.logic.llm.agent import Agent, Message
+from dendrite.logic.llm.token_count import token_count
+from dendrite.models.dto.extract_dto import ExtractDTO
+from dendrite.models.page_information import PageInformation
+from dendrite.models.response.extract_response import ExtractResponse
+
+from ..ask.image import segment_image
+from ..code.code_session import CodeSession
+
+
+class ExtractAgent(Agent):
+ def __init__(self, page_information: PageInformation, config: Config) -> None:
+ super().__init__(config.llm_config.get("extract_agent"))
+ self.page_information = page_information
+ self.soup = BeautifulSoup(page_information.raw_html, "lxml")
+        self.messages = []
+        self.current_segment = 0
+        self.config = config
+        # Set up-front so a JSON-only conversation can't hit an AttributeError
+        # before any python block has been generated and executed.
+        self.generated_script = ""
+        self.response_data = None
+
+ async def write_and_run_script(
+ self, extract_page_dto: ExtractDTO
+ ) -> ExtractResponse:
+ mild_soup = mild_strip(self.soup)
+
+ segments = segment_image(
+ extract_page_dto.page_information.screenshot_base64, segment_height=4000
+ )
+
+ scroll_agent = ScrollAgent(
+ self.page_information, llm_config=self.config.llm_config
+ )
+ scroll_result = await scroll_agent.scroll_through_page(
+ extract_page_dto.combined_prompt,
+ image_segments=segments,
+ )
+
+ if scroll_result.status == "error":
+ return ExtractResponse(
+ status="impossible",
+ message=str(scroll_result.message),
+ )
+
+ if scroll_result.status == "loading":
+ return ExtractResponse(
+ status="loading",
+ message="This page is still loading. Please wait a bit longer.",
+ )
+
+ expanded_html = None
+
+ if scroll_result.element_to_inspect_html:
+ combined_prompt = (
+ "Get these elements (make sure you only return element that you are confident that these are the correct elements, it's OK to not select any elements):\n- "
+ + "\n- ".join(scroll_result.element_to_inspect_html)
+ )
+ expanded = await get_expanded_dom(
+ mild_soup, combined_prompt, self.config.llm_config
+ )
+ if expanded:
+ expanded_html = expanded[0]
+
+ if expanded_html:
+ return await self.code_script_from_found_expanded_html_tags(
+ extract_page_dto, expanded_html
+ )
+
+ raise Exception("Failed to extract data from the page") # TODO: skriv bättre
+
+ def segment_large_tag(self, tag):
+ segments = []
+ current_segment = ""
+ current_tokens = 0
+ for line in tag.split("\n"):
+ line_tokens = token_count(line)
+ if current_tokens + line_tokens > 4000:
+ segments.append(current_segment)
+ current_segment = line
+ current_tokens = line_tokens
+ else:
+ current_segment += line + "\n"
+ current_tokens += line_tokens
+ if current_segment:
+ segments.append(current_segment)
+ return segments
+
+ async def code_script_from_found_expanded_html_tags(
+ self, extract_page_dto: ExtractDTO, expanded_html
+ ):
+
+ agent_logger = logger.bind(scope="extract", step="generate_code")
+
+ user_prompt = create_script_prompt_segmented_html(
+ extract_page_dto.combined_prompt,
+ expanded_html,
+ self.page_information.url,
+ )
+ # agent_logger.debug(f"User prompt created: {user_prompt[:100]}...")
+
+ messages: List[Message] = [
+ {"role": "user", "content": user_prompt},
+ ]
+
+ iterations = 0
+ max_retries = 10
+
+ for iterations in range(max_retries):
+ agent_logger.debug(f"Code generation | Iteration: {iterations}")
+
+ text = await self.call_llm(messages)
+ messages.append({"role": "assistant", "content": text})
+
+ json_pattern = r"```json(.*?)```"
+ code_pattern = r"```python(.*?)```"
+
+ if text is None:
+ content = "Error: Failed to generate content."
+ messages.append({"role": "user", "content": content})
+ continue
+
+ json_matches = re.findall(json_pattern, text, re.DOTALL)
+ code_matches = re.findall(code_pattern, text, re.DOTALL)
+
+ if len(json_matches) + len(code_matches) > 1:
+ content = "Error: Please output only one action at a time (either JSON or Python code, not both)."
+ messages.append({"role": "user", "content": content})
+ continue
+
+ if code_matches:
+ self.generated_script = code_matches[0].strip()
+ result = await self._handle_code_match(
+ code_matches[0].strip(),
+ messages,
+ iterations,
+ max_retries,
+ extract_page_dto,
+ agent_logger,
+ )
+
+ messages.extend(result)
+ continue
+
+ elif json_matches:
+ result = self._handle_json_match(json_matches[0], expanded_html)
+ if isinstance(result, ExtractResponse):
+ save_script(
+ self.generated_script,
+ extract_page_dto.combined_prompt,
+ self.page_information.url,
+ cache=self.config.extract_cache,
+ )
+ return result
+ elif isinstance(result, list):
+ messages.extend(result)
+ continue
+ else:
+ # If neither code nor json matches found, send error message
+ content = "Error: Could not find valid code or JSON in the assistant's response."
+ messages.append({"role": "user", "content": content})
+ continue
+
+ # agent_logger.warning("Failed to create script after retrying several times")
+ return ExtractResponse(
+ status="failed",
+ message="Failed to create script after retrying several times.",
+ return_data=None,
+ created_script=self.generated_script,
+ )
+
+ async def _handle_code_match(
+ self,
+ generated_script: str,
+ messages: List[Message],
+ iterations,
+ max_retries,
+ extract_page_dto: ExtractDTO,
+ agent_logger,
+ ) -> List[Message]:
+ temp_code_session = CodeSession()
+
+ try:
+ variables = temp_code_session.exec_code(
+ generated_script, self.soup, self.page_information.raw_html
+ )
+
+ if "response_data" not in variables:
+ return [
+ {
+ "role": "user",
+ "content": "Error: You need to add the variable 'response_data'",
+ }
+ ]
+
+ self.response_data = variables["response_data"]
+
+ if extract_page_dto.return_data_json_schema:
+ temp_code_session.validate_response(
+ extract_page_dto.return_data_json_schema, self.response_data
+ )
+
+ llm_readable_exec_res = temp_code_session.llm_readable_exec_res(
+ variables,
+ extract_page_dto.combined_prompt,
+ iterations,
+ max_retries,
+ )
+
+ return [{"role": "user", "content": llm_readable_exec_res}]
+
+ except Exception as e:
+ return [{"role": "user", "content": f"Error: {str(e)}"}]
+
+ def _handle_json_match(
+ self, json_str: str, expanded_html: str
+ ) -> Union[ExtractResponse, List[Message]]:
+ try:
+ data_dict = json.loads(json_str)
+
+ if "request_more_html" in data_dict:
+ return self._handle_more_html_request(expanded_html)
+
+ if "error" in data_dict:
+ raise Exception(data_dict["error"])
+
+ if "success" in data_dict:
+ return ExtractResponse(
+ status="success",
+ message=data_dict["success"],
+ return_data=self.response_data,
+ created_script=self.generated_script,
+ )
+ return [
+ {
+ "role": "user",
+ "content": "Error: JSON response does not specify a valid action.",
+ }
+ ]
+
+ except Exception as e:
+ return [{"role": "user", "content": f"Error: {str(e)}"}]
+
+ def _handle_more_html_request(self, expanded_html: str) -> List[Message]:
+
+ if LARGE_HTML_CHAR_TRUNCATE_LEN * (self.current_segment + 1) >= len(
+ expanded_html
+ ):
+ return [{"role": "user", "content": "There is no more HTML to show."}]
+
+ self.current_segment += 1
+ start = LARGE_HTML_CHAR_TRUNCATE_LEN * self.current_segment
+ end = min(
+ LARGE_HTML_CHAR_TRUNCATE_LEN * (self.current_segment + 1),
+ len(expanded_html),
+ )
+
+ content = (
+ f"""Here is more of the HTML:\n```html\n{expanded_html[start:end]}\n```"""
+ )
+
+ if len(expanded_html) > end:
+ content += (
+ "\nThere is still more HTML to see. You can request more if needed."
+ )
+ else:
+ content += "\nThis is the end of the HTML content."
+
+ return [{"role": "user", "content": content}]
diff --git a/dendrite/logic/extract/prompts.py b/dendrite/logic/extract/prompts.py
new file mode 100644
index 0000000..35c1037
--- /dev/null
+++ b/dendrite/logic/extract/prompts.py
@@ -0,0 +1,230 @@
+def get_script_prompt(final_compressed_html: str, prompt: str, current_url: str):
+ return f"""Compressed HTML:
+{final_compressed_html}
+
+Please look at the HTML DOM above and use execute_code to accomplish the user's task.
+
+Don't use the attributes 'is-compressed' and 'd-id' inside your script.
+
+Prefer using soup.select() over soup.find_all().
+
+If you are asked to fetch text from an article or similar, it's generally a good idea to find the element(s) containing the article text and extract the text from those. You'll also need to remove any unwanted text that isn't article text.
+
+All elements with the attribute is-compressed="true" are collapsed and may contain hidden elements. If you need to use an element that is compressed you have to call expand_html_further, example:
+
+expand_html_further({{"prompt": "I need to understand the structure of at least one product to create a script that fetches each product, since all the products are compressed I'll expand the first two. I'll also expand the pagination controls since they are relevant for the task.", "d-ids_to_expand": "3uy9v2, 3uy9d2, -29ahd"}})
+
+When scraping a list of items make sure at least one of the items is fully expanded to understand each item's structure before you code. You don't need to expand all items if you can see that there is a repeating structure.
+
+Your code must be a full implementation that solves the user's task.
+
+Try to make your scripts as general as possible. They should work for different pages with a similar html structure if possible. No hard-coded values that'll only work for the page above.
+
+Finally, the script must contain a variable called 'response_data'. This variable is sent back to the user and it must match the specification inside their prompt listed below.
+
+Current URL: {current_url}
+User's Prompt:
+{prompt}"""
+
+
+def expand_futher_prompt(
+ compressed_html: str,
+ max_iterations: int,
+ iterations: int,
+ reasoning_prompt: str,
+ question: str,
+):
+ return f"""{compressed_html}
+
+Please look at the compressed HTML above and output a comma-separated list of elements that need to be de-compressed so that the task can be solved.
+
+Task: '{question}'
+
+Every element with the attribute is-compressed="true" can be de-compressed. Compressed elements may contain hidden elements such as anchor tags and buttons, so it's really important that elements relevant to the task are expanded.
+
+You'll get max {max_iterations} iterations to explore the HTML DOM Tree.
+
+You are currently on iteration {iterations}. Try to expand the DOM in relevant places at least three times.
+
+{reasoning_prompt}
+
+It's really important that you expand ALL the elements you believe could be useful for the task! However, in situations where you have repeating elements, such as product elements in a product list or sections of paragraphs in an article, you only need to expand a few of the repeating elements to be able to understand the others' structure.
+
+Now you may output:
+- Ids to inspect further prefixed by some short reasoning (Don't expand irrelevant elements and avoid outputting many IDs since that increases the token size of the HTML preview)
+- "Done" once every relevant element is expanded.
+- An error message if the task is too vague or not possible to complete. A common use-case for the error message is when a page loads incorrectly and none of the task's data is available.
+
+See the examples below to see each outputs format:
+
+EXAMPLE OUTPUT
+Reasoning: Most of the important elements are expanded, but I still need to understand the article's headings' HTML structure. To do this I'll expand the first section heading with the text 'hello kitty' and the d-id adh2ia. I'll also expand the related infobox with the id -s29as. By expanding these I'll be able to understand all the article's titles.
+Ids: adh2ia, -s29as
+END EXAMPLE OUTPUT
+
+EXAMPLE OUTPUT
+Reasoning: To understand the structure of the compressed product cards in the product list I'll expand the first three with the d-ids -7ap2j1, -7ap288 and -7ap2au. I'll also expand the pagination controls at the bottom of the product list since pagination can be useful for the task; this includes the page buttons for '1', '2' and '3' with the d-ids j02ajd, j20had, j9dwh9 and the 'next page' button with the id j9dwss.
+Ids: -7ap2j1, -7ap288, -7ap2au, j02ajd, j20had, j9dwh9, j9dwss
+END EXAMPLE OUTPUT
+
+EXAMPLE OUTPUT
+Done
+END EXAMPLE OUTPUT
+
+EXAMPLE OUTPUT
+Error: I don't understand what is meant by 'extract the page text', this page is completely empty.
+END EXAMPLE OUTPUT"""
+
+
+def generate_prompt_extract_compressed_html(
+ combined_prompt: str,
+ expanded_html: str,
+ current_url: str,
+):
+ return f"""You are a web scraping agent that runs one action at a time by outputting a message with either elements to decompress, code to run or a status message. Never run several actions in the same message.
+
+Code a bs4 or regex script that can solve the task listed below for the webpage I'll specify below. First, inspect relevant areas of the DOM.
+
+
+{combined_prompt}
+
+
+Here is a compressed version of the webpage's HTML:
+
+```html
+{expanded_html}
+```
+
+
+Important: Every element with the attribute `is-compressed="true"` is compressed – compressed elements may contain hidden elements such as anchor tags and buttons, so you need to decompress them to fully understand their structure before you write a script!
+
+Below are your available functions and how to use them:
+
+Start by outputting one or more d-ids of elements you'd like to decompress before you write a script. Focus on decompressing elements that look relevant to the task. If possible, expand one d-id at a time. Output in a format like this:
+
+[Short reasoning first.]
+```json
+{{
+ "d-ids": ["xxx", "yyy"]
+}}
+```
+
+Once you have decompressed the DOM at least once in separate messages and have a good enough understanding of the page's structure, write some python code to extract the required data using bs4 or regex. `from datetime import datetime` is available.
+
+Your code will be run inside exec() so don't use a return statement, just create variables.
+
+To scrape information from the current page use the predefined variable `html_string` (all the page's html as a string) or `soup` (current page's root's bs4 object). Don't use 'd-id' and 'is_compressed' in your script since these are temporary. Use selectors native to the site.
+
+The script must contain a variable called 'response_data' and its structure must match the task listed above.
+
+Don't return a response_data with hardcoded values that only work for the current page. The script must be general and work for similar pages with the same structure.
+
+Unless specified otherwise, raise an exception if an expected value cannot be extracted.
+
+The current URL is: {current_url}
+
+Here's how you can do it in a message:
+
+[Do some reasoning first]
+```python
+# Simple bs4 code that fetches all the page's hrefs
+response_data = [a.get('href') for a in soup.find_all('a')] # Uses the predefined soup variable
+```
+
+If the task isn't possible to complete (maybe because the task is too vague, the page contains an error or the page failed to load) don't try to create a script with many assumptions. Instead, output an error like this:
+
+```json
+{{
+ "error": "error message"
+}}
+```
+
+Once you've created and run a script and you are happy with response_data, output a short success message (max one paragraph) containing JSON like this; the response_data will automatically be returned to the user once you send this message, so you don't need to output it:
+
+```json
+{{
+ "success": "Write one-two sentences about how your the script works and how you ended up with the result you got."
+}}
+```
+
+Don't include both the python code and json object in the same message.
+
+Be sure that the script has been executed and you have seen the response_data in a previous message before you output the success message."""
+
+
+LARGE_HTML_CHAR_TRUNCATE_LEN = 40000
+
+
+def create_script_prompt_segmented_html(
+ combined_prompt: str,
+ expanded_html: str,
+ current_url: str,
+):
+ if len(expanded_html) / 4 > LARGE_HTML_CHAR_TRUNCATE_LEN:
+ html_prompt = f"""```html
+ {expanded_html[:LARGE_HTML_CHAR_TRUNCATE_LEN]}
+```
+This HTML is truncated to {LARGE_HTML_CHAR_TRUNCATE_LEN} characters since it was too large. If you need to see more of the HTML, output a message like this:
+```json
+{{
+ "request_more_html": true
+}}
+```
+"""
+ else:
+ html_prompt = f"""```html
+ {expanded_html}
+```
+"""
+
+ return f"""You are a web scraping agent that analyzes HTML and writes Python scripts to extract data. Your task is to solve the following request for the webpage specified below.
+
+
+{combined_prompt}
+
+
+Current URL: {current_url}
+
+Here is a truncated version of the HTML that focuses on relevant parts of the webpage (some elements have been replaced with their text contents):
+{html_prompt}
+
+Instructions:
+1. Analyze the provided HTML segments carefully.
+
+2. Use bs4 or regex. `from datetime import datetime` is available.
+- Your code will be run inside exec() so don't use a return statement, just create variables.
+- To scrape information from the current page use the predefined variable `html_string` (all the page's html as a string) or `soup` (current page's root's bs4 object). Don't use 'd-id' and 'is_compressed' in your script since these are temporary. Use selectors native to the site.
+- The script must contain a variable called 'response_data' and its structure must match the task listed above.
+- Don't return a response_data with hardcoded values that only work for the current page. The script must be general and work for similar pages with the same structure.
+- Unless specified otherwise, raise an exception if an expected value cannot be extracted.
+
+3. Output your Python script in this format:
+[Do some reasoning first]
+```python
+# Simple bs4 code that fetches all the page's hrefs
+response_data = [a.get('href') for a in soup.find_all('a')] # Uses the predefined soup variable
+```
+
+Don't output an explanation of the script after the code. Just do some short reasoning before.
+
+4. If the task isn't possible to complete, output an error message like this:
+```json
+{{
+ "error": "Detailed error message explaining why the task can't be completed"
+}}
+```
+
+5. Once you've successfully created and run a script, seen that the output is correct and you're happy with it, output a short success message:
+```json
+{{
+ "success": "Brief explanation of how your script works and how you arrived at the result"
+}}
+```
+Remember:
+- Only output one action at a time (element index to expand, Python code, or status message).
+- Don't include both Python code and JSON objects in the same message.
+- Ensure the script has been executed and you've seen the `response_data` before sending the success message.
+- Do short reasoning before you output an action, max one-two sentences.
+- Never include a success message in the same output as your Python code. Always output the success message after you've seen the result of your code.
+
+You may now begin by analyzing the HTML or requesting to expand specific elements if needed."""
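+
+
+# Illustrative check (a sketch): oversized HTML is served in
+# LARGE_HTML_CHAR_TRUNCATE_LEN-sized slices, and the prompt only advertises the
+# "request_more_html" action when the input crosses the threshold.
+if __name__ == "__main__":
+    big_html = "<div>" + "x" * (LARGE_HTML_CHAR_TRUNCATE_LEN * 4) + "</div>"
+    prompt = create_script_prompt_segmented_html(
+        "grab the text", big_html, "https://example.com"
+    )
+    assert "request_more_html" in prompt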
diff --git a/dendrite/logic/extract/scroll_agent.py b/dendrite/logic/extract/scroll_agent.py
new file mode 100644
index 0000000..9038557
--- /dev/null
+++ b/dendrite/logic/extract/scroll_agent.py
@@ -0,0 +1,232 @@
+import json
+import re
+from abc import ABC, abstractmethod
+from dataclasses import dataclass
+from typing import List, Literal, Optional
+
+from loguru import logger
+from openai.types.chat.chat_completion_content_part_param import (
+ ChatCompletionContentPartParam,
+)
+
+from dendrite.logic.llm.agent import Agent, Message
+from dendrite.logic.llm.config import LLMConfig
+from dendrite.models.page_information import PageInformation
+
+ScrollActionStatus = Literal["done", "scroll_down", "loading", "error"]
+
+
+@dataclass
+class ScrollResult:
+ element_to_inspect_html: List[str]
+ segment_index: int
+ status: ScrollActionStatus
+ message: Optional[str] = None
+
+
+class ScrollRes(ABC):
+ @abstractmethod
+ def parse(self, data_dict: dict, segment_i: int) -> Optional[ScrollResult]:
+ pass
+
+
+class ElementPromptsAction(ScrollRes):
+ def parse(self, data_dict: dict, segment_i: int) -> Optional[ScrollResult]:
+ if "element_to_inspect_html" in data_dict:
+
+ status = (
+ "scroll_down"
+ if not data_dict.get("continue_scrolling", False)
+ else "done"
+ )
+
+ return ScrollResult(data_dict["element_to_inspect_html"], segment_i, status)
+ return None
+
+
+class LoadingAction(ScrollRes):
+ def parse(self, data_dict: dict, segment_i: int) -> Optional[ScrollResult]:
+ if data_dict.get("is_loading", False):
+ return ScrollResult([], segment_i, "loading")
+ return None
+
+
+class ErrorRes(ScrollRes):
+ def parse(self, data_dict: dict, segment_i: int) -> Optional[ScrollResult]:
+ if "error" in data_dict:
+ return ScrollResult(
+ [],
+ segment_i,
+ "error",
+ data_dict["error"],
+ )
+ return None
+
+
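+# Dispatch sketch (illustrative): ScrollAgent tries each ScrollRes in order and
+# takes the first non-None result, e.g.
+#
+#   ErrorRes().parse({"error": "page is empty"}, segment_i=0)
+#   -> ScrollResult([], 0, "error", message="page is empty")
+
+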
+class ScrollAgent(Agent):
+ def __init__(self, page_information: PageInformation, llm_config: LLMConfig):
+ super().__init__(llm_config.get("scroll_agent"))
+ self.page_information = page_information
+ self.choices: List[ScrollRes] = [
+ ElementPromptsAction(),
+ LoadingAction(),
+ ErrorRes(),
+ ]
+
+ self.logger = logger.bind(agent="scroll_agent")
+
+ async def scroll_through_page(
+ self,
+ combined_prompt: str,
+ image_segments: List[str],
+ ) -> ScrollResult:
+ messages = [self.create_initial_message(combined_prompt, image_segments[0])]
+ all_elements_to_inspect_html = []
+ current_segment = 0
+
+ while current_segment < len(image_segments):
+ data_dict = await self.process_segment(messages)
+
+            for choice in self.choices:
+                result = choice.parse(data_dict, current_segment)
+                if result:
+                    if result.element_to_inspect_html:
+                        all_elements_to_inspect_html.extend(
+                            result.element_to_inspect_html
+                        )
+                    if result.status != "scroll_down":
+                        # Terminal status: hand back everything gathered so far.
+                        return ScrollResult(
+                            all_elements_to_inspect_html,
+                            current_segment,
+                            result.status,
+                            result.message,
+                        )
+                    break
+
+ if self.should_continue_scrolling(
+ data_dict, current_segment, len(image_segments)
+ ):
+ current_segment += 1
+ scroll_message = self.create_scroll_message(
+ image_segments[current_segment]
+ )
+ messages.append(scroll_message)
+ else:
+ break
+
+ return ScrollResult(all_elements_to_inspect_html, current_segment, "done")
+
+ async def process_segment(self, messages: List[Message]) -> dict:
+
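+        # Ask the LLM for its next action and parse the single fenced JSON
+        # object out of its reply. Protocol violations are recorded on the
+        # message history and raised to the caller.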
+ text = await self.call_llm(messages)
+ messages.append({"role": "assistant", "content": text})
+
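+        # The agent is instructed to wrap exactly one action in a ```json fence.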
+ json_pattern = r"```json(.*?)```"
+
+ json_matches = re.findall(json_pattern, text, re.DOTALL)
+
+ if len(json_matches) > 1:
+ logger.warning("Agent output multiple actions in one message")
+ error_message = "Error: Please output only one action at a time."
+ messages.append({"role": "user", "content": error_message})
+ raise Exception(error_message)
+ elif json_matches:
+ return json.loads(json_matches[0].strip())
+
+ error_message = "No valid JSON found in the response"
+ logger.error(error_message)
+ messages.append({"role": "user", "content": error_message})
+ raise Exception(error_message)
+
+ def create_initial_message(self, combined_prompt: str, first_image: str) -> Message:
+ content: List[ChatCompletionContentPartParam] = [
+ {
+ "type": "text",
+ "text": f"""You are a web scraping agent that can code scripts to solve the web scraping tasks listed below for the webpage I'll specify. Before we start coding, we need to inspect the html of the page closer.
+
+This is the web scraping task:
+
+{combined_prompt}
+
+
+Analyze the viewport and decide on the next action:
+
+1. Identify elements that we want to inspect more closely so we can write the script. Do this by outputting a message with a list of prompts for finding the relevant element(s).
+
+Output as few elements as possible, but enough to gain a proper understanding of the DOM for our script.
+
+If a list of items needs to be extracted, consider getting a few unique examples of items from the list that differ slightly, so we can create code that accounts for their differences. Avoid listing several elements that are very similar, since we can infer the structure of the rest from one or two of them.
+
+Don't request several different parts of one relevant element; just get the whole element, since it's easier to inspect it in one piece.
+
+Avoid selecting very large elements that contain a lot of HTML, since they can be overwhelming to inspect.
+
+Always be specific about the element you mean: don't write 'get an item', write 'get the item with the text "Item Name"'.
+
+Here's an example of a good output:
+[Short reasoning first, max one paragraph]
+```json
+{{
+ "element_to_inspect_html": ["The small container containing the weekly amount of downloads, labeled 'Weekly Downloads'", "The element containing the main body of article text, the title is 'React Router DOM'."],
+ "continue_scrolling": true/false (only scroll down if you think more relevant elements are further down the page, only do this if you need to)
+}}
+```
+
+2. If you can't see relevant elements just yet, but you think more data might be available further down the page, output:
+[Short reasoning first, max one paragraph]
+```json
+{{
+ "scroll_down": true
+}}
+```
+
+3. This page was first loaded {round(self.page_information.time_since_frame_navigated, 2)} second(s) ago. If the page is blank or the data is not available on the current page, it could be because the page is still loading. If you believe this is the case, output:
+[Short reasoning first, max one paragraph]
+```json
+{{
+ "is_loading": true
+}}
+```
+
+4. If the data is not available on the current page and the task does not describe how to handle missing data, or there seems to be some kind of mistake, output a JSON object with a short error message, like this:
+[Short reasoning first, max one paragraph]
+```json
+{{
+ "error": "This page doesn't contain any package data, welcome page for 'dendrite.systems', it won't be possible to code a script to extract the requested data.",
+ "was_blocked_by_recaptcha": true/false
+}}
+```
+
+Continue scrolling and accumulating element prompts until you feel like we have enough elements to inspect to create an excellent script.
+
+Important: Only output one json object per message.
+
+Below is a screenshot of the current page. If it looks blank or empty, it could still be loading; if you believe that's the case, don't guess what elements to inspect, respond with is_loading.""",
+ },
+ {
+ "type": "image_url",
+ "image_url": {"url": f"data:image/jpeg;base64,{first_image}"},
+ },
+ ]
+
+ msg: Message = {"role": "user", "content": content}
+ return msg
+
+ def create_scroll_message(self, image: str) -> Message:
+ return {
+ "role": "user",
+ "content": [
+ {"type": "text", "text": "Scrolled down, here is the new viewport:"},
+ {
+ "type": "image_url",
+ "image_url": {
+ "url": f"data:image/jpeg;base64,{image}",
+ },
+ },
+ ],
+ }
+
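+    # Scroll only while the agent asked for it and there are segments left.
+    # "scroll_down" is the dedicated action; "continue_scrolling" piggybacks
+    # on an element-prompts action.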
+ def should_continue_scrolling(
+ self, data_dict: dict, current_index: int, total_segments: int
+ ) -> bool:
+ return (
+ "scroll_down" in data_dict or data_dict.get("continue_scrolling", False)
+ ) and current_index + 1 < total_segments
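+
+# Usage sketch (hypothetical wiring -- how PageInformation, LLMConfig and the
+# screenshots are produced depends on the caller):
+#
+#   agent = ScrollAgent(page_information, llm_config)
+#   result = await agent.scroll_through_page(task_prompt, image_segments)
+#   if result.status == "done":
+#       # result.element_to_inspect_html holds the prompts used to locate
+#       # the elements whose HTML should be expanded next.
+#       ...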
diff --git a/dendrite/sync_api/_core/__init__.py b/dendrite/logic/get_element/__init__.py
similarity index 100%
rename from dendrite/sync_api/_core/__init__.py
rename to dendrite/logic/get_element/__init__.py
diff --git a/dendrite/logic/get_element/agents/prompts/__init__.py b/dendrite/logic/get_element/agents/prompts/__init__.py
new file mode 100644
index 0000000..b6ffe51
--- /dev/null
+++ b/dendrite/logic/get_element/agents/prompts/__init__.py
@@ -0,0 +1,199 @@
+SEGMENT_PROMPT = """You are an agent that is given the task to find candidate elements that match the element that the user is looking for. You will get multiple segments of the html of the page and a description of the element that the user is looking for.
+The description can be the text that the element contains, the type of element. You might get both short and long descriptions.
+Don't only look for the exact match of the text.
+
+Look at aria-label attributes if there are any, as they are helpful in identifying elements.
+
+You will get the information in the following format:
+
+<description>
+DESCRIPTION
+</description>
+
+<segment>
+HTML CONTENT
+</segment>
+...
+<segment>
+HTML CONTENT
+</segment>
+
+Each element will have an attribute called d-id, which you should refer to when you find the elements the user is looking for. There might be multiple elements that fit the user's request; if so, include multiple d_ids.
+If you've selected an element, you should NOT also select another element that is a child of it.
+Be sure to include a reason for why you selected the elements that you did. Think step by step: what made you choose this element over the others?
+Your response should include 2-3 sentences of reasoning and a code block containing JSON (including the backticks); the reason value below is just a placeholder. Always include a sentence of reasoning in the output:
+
+```json
+{
+ "reason": ,
+ "d_id": ["125292", "9541ad"],
+ "status": "success"
+}
+```
+
+If no element seems to match the user's request, or you think the page is still loading, output the following, again with 2-3 sentences of reasoning:
+
+```json
+{
+ "reason": ,
+ "status": "failed" or "loading"
+}
+```
+
+Here are some examples to help you understand the task (your response is the content under "ASSISTANT:"):
+
+Example 1:
+
+USER: Can you get the d_id of the element that matches this description?
+
+<description>
+pull requests count
+</description>
+
+<segment>
+<a d-id="235508" href="/pulls">
+<span d-id="235510">
+Pull requests
+</span>
+<span class="Counter" d-id="235512">
+14
+</span>
+</a>
+</segment>
+
+ASSISTANT:
+
+```json
+{
+ "reason": "I selected this element because it has the class Counter and is a number next to the pull requests text.",
+ "d_id": ["235512"],
+ "status": "success"
+}
+```
+
+Example 2:
+
+USER: Can you get the d_id of the element that matches this description?
+
+<description>
+search bar
+</description>
+
+
+