🔨 Note: This repository is still in development. Contributions and feedback are welcome!
- Requirements:
pip install -r requirements.txt
- Install browser:
python -m playwright install
- This project uses Playwright to control the browser. The command above installs Playwright's default browsers; to install only one, pass its name (for example, python -m playwright install chromium).
- Write your environment variables in a .env file (see .env.test)
- Install OmniParser
- For webpage analysis, we use the OmniParser model from Hugging Face. You'll need to host it via an API locally.
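The .env step above can be illustrated with a small, standard-library-only sketch of reading KEY=VALUE lines into the process environment. This is not necessarily how the project loads its configuration (it may use a library such as python-dotenv), and the variable names EXAMPLE_API_KEY and EXAMPLE_API_URL are hypothetical placeholders, not settings this project is known to require; check .env.test for the real names.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines from a .env-style file into a dict."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def apply_env(values):
    """Copy parsed values into os.environ without overriding existing ones."""
    for key, value in values.items():
        os.environ.setdefault(key, value)


# Demo with a temporary file; both variable names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        '# sample\nEXAMPLE_API_KEY="abc123"\nEXAMPLE_API_URL=http://localhost:9000\n',
        encoding="utf-8",
    )
    env_values = load_env_file(env_path)
    apply_env(env_values)
```

Using setdefault means variables already exported in the shell take precedence over the file, which is a common convention for .env loaders.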
- Finding issues on a GitHub repo
- Finding live events
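from pydantic import BaseModel

# WebScraper is the scraper class provided by this repository.
task = "Find 2 recent issues from the PyTorch repository."

class IssueModel(BaseModel):
    date: str
    title: str
    author: str
    description: str

class OutputModel(BaseModel):
    issues: list[IssueModel]

# No start URL is given for this task, so None is passed in its place.
scraper = WebScraper(task, None, OutputModel)
scraper.run()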
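from pydantic import BaseModel

# WebScraper is the scraper class provided by this repository.
start_url = "https://in.bookmyshow.com/"
task = "Find 5 events happening in Bangalore this week."

class EventsModel(BaseModel):
    name: str
    date: str
    location: str

class OutputModel(BaseModel):
    events: list[EventsModel]

# Here the scraper starts from an explicit URL.
scraper = WebScraper(task, start_url, OutputModel)
scraper.run()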
Server:
pip install "fastapi[all]"
uvicorn server:app --reload
Client:
import requests
# uvicorn listens on 127.0.0.1:8000 by default; 0.0.0.0 is a bind address, not a destination.
url = "http://127.0.0.1:8000/scrape"
payload = {
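    "start_url": "http://example.com",
    "task": "Scrape the website for data",
    # Python type objects and ... cannot be serialized to JSON, so field types
    # are sent as strings here. This wire format is an assumption; check
    # server.py for the form the server actually expects.
    "schema": {
        "title": "str",
        "description": "str"
    }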
}
response = requests.post(url, json=payload)
print(response.status_code)
print(response.json())
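Because the request body is sent as JSON, the schema has to travel as plain JSON values: Python type objects and the Ellipsis literal are rejected by the json module. The sketch below shows one way a schema sent as type-name strings could be turned back into Python types and used to check a record. The string format, the TYPE_NAMES mapping, and the parse_schema and validate_record helpers are all hypothetical illustrations, not this project's server API.

```python
import json

# Hypothetical mapping from JSON-friendly type names to Python types.
TYPE_NAMES = {"str": str, "int": int, "float": float, "bool": bool}


def parse_schema(schema):
    """Turn {"field": "type name"} into {"field": python_type}."""
    parsed = {}
    for field, type_name in schema.items():
        if type_name not in TYPE_NAMES:
            raise ValueError(f"unsupported type for {field!r}: {type_name!r}")
        parsed[field] = TYPE_NAMES[type_name]
    return parsed


def validate_record(record, fields):
    """Check that a record has every field with the expected type."""
    for field, expected in fields.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected):
            raise TypeError(f"{field} should be {expected.__name__}")
    return record


payload = {
    "start_url": "http://example.com",
    "task": "Scrape the website for data",
    "schema": {"title": "str", "description": "str"},
}

# Serializing works because the payload contains only JSON-compatible values.
wire = json.dumps(payload)

# What a receiver could do with the decoded schema.
fields = parse_schema(json.loads(wire)["schema"])
record = validate_record({"title": "Example", "description": "A page"}, fields)
```

A server built on Pydantic could pass the resulting types to a dynamically created model instead of the hand-written check; the standard-library version is used here only to keep the sketch self-contained.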
💡 Tip: For a hosted solution with a lightning-fast, Zig-based browser, worldwide proxy support, and a job queuing system, check out onequery.app.
In the works
- ✅ Basic functionality
- 🛠️ Testing
- 🛠️ Documentation
(needs to be revised)
graph TD;
A[Text Query] --> B[WebLLM];
B --> C[Browser Instructions];
C --> D[Browser Execution];
D --> E[OmniParser];
E --> F[Screenshot & Structured Info];
F --> G[AI];
C --> G;
G --> H[JSON Output];
- Browser: Playwright
- Vision model: OmniParser
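The flow in the diagram above can be read as a chain of function calls. The sketch below mirrors the node order only: every function is a hypothetical stub with made-up return values, standing in for the real WebLLM, Playwright, OmniParser, and AI steps, and none of the names come from this repository.

```python
import json
from dataclasses import dataclass


@dataclass
class PageState:
    """Screenshot plus the structured info extracted from it."""
    screenshot: bytes
    elements: list


def plan_instructions(query):
    """Stub for the WebLLM step: text query -> browser instructions."""
    return [f"search: {query}"]


def execute_in_browser(instructions):
    """Stub for browser execution; returns fake screenshot bytes."""
    return b"fake-screenshot-bytes"


def parse_page(screenshot):
    """Stub for the OmniParser step: screenshot -> structured info."""
    return PageState(
        screenshot=screenshot,
        elements=[{"type": "text", "content": "Example"}],
    )


def extract_output(instructions, page):
    """Stub for the final AI step: instructions + page info -> JSON string."""
    found = [element["content"] for element in page.elements]
    return json.dumps({"instructions": instructions, "found": found})


def run_pipeline(query):
    """Wire the stages together in the order shown in the diagram."""
    instructions = plan_instructions(query)
    screenshot = execute_in_browser(instructions)
    page = parse_page(screenshot)
    # The instructions feed the final step directly as well (the C --> G edge).
    return extract_output(instructions, page)


result = json.loads(run_pipeline("Find 2 recent issues"))
```

Keeping each stage behind its own function makes it possible to swap a stub for a real implementation one step at a time.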