Screenshot API for AI Agents: Self-Hosted, No API Keys
Give your agents eyes — without rate limits or third-party dependencies.
Agent calls POST /api/convert/url → reads the resulting PNG → passes the image to a vision model → acts on what it sees. Self-hosted Openkova keeps this loop fast, private, and free of usage caps.
Why AI agents need screenshots
Modern agent frameworks — LangChain, AutoGen, CrewAI, Magentic-One — increasingly use multimodal models. An agent that can see a web page can verify that a form submitted correctly, detect UI regressions, extract structured data from a rendered table, or complete browser-based tasks without relying purely on HTML parsing.
Screenshots are more reliable than HTML for agents because:
- Rendered output reflects JavaScript execution, CSS layout, and lazy-loaded content
- Vision models are trained on visual content, not raw DOM strings
- A PNG is a stable, language-agnostic interchange format
Why self-hosted matters for agent workflows
SaaS screenshot APIs charge per request. An agent that takes 50 screenshots per task run and runs 100 times a day generates 5,000 requests daily, or roughly 150,000 a month. At per-request pricing, that volume becomes a recurring cost that grows with every agent you add.
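To put numbers on that, here is a minimal sketch of the arithmetic. The price per 1,000 requests is a hypothetical placeholder, not a quote from any provider; substitute your vendor's actual rate.

```python
def daily_requests(screenshots_per_run: int, runs_per_day: int) -> int:
    """Total screenshot requests one agent generates per day."""
    return screenshots_per_run * runs_per_day


def monthly_cost(requests_per_day: int, price_per_1000: float, days: int = 30) -> float:
    """Estimated monthly bill at a flat per-request rate (hypothetical pricing)."""
    return requests_per_day * days * price_per_1000 / 1000


per_day = daily_requests(screenshots_per_run=50, runs_per_day=100)
print(per_day)       # 5000
print(per_day * 30)  # 150000 requests in a 30-day month

# ASSUMPTION: $2.00 per 1,000 requests is an illustrative figure only.
print(monthly_cost(per_day, price_per_1000=2.00))  # 300.0
```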
More importantly, agents often need to screenshot pages that a SaaS API cannot reach:
- Internal tools on a private VPN
- Staging environments not exposed to the public internet
- Localhost during development
- Authenticated dashboards where you'd need to pass session cookies
A self-hosted API running on the same network has access to all of these.
Calling Openkova from a Python agent
import json

import httpx


async def screenshot_url(url: str) -> str:
    """Capture a URL and return the path of the resulting PNG."""
    file_path = None
    async with httpx.AsyncClient(timeout=60) as client:
        async with client.stream(
            "POST",
            "http://localhost:3000/api/convert/url",
            json={"url": url, "depth": 1},
        ) as response:
            response.raise_for_status()
            # The endpoint streams Server-Sent Events; the final "done"
            # event carries the path of the generated file.
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    event = json.loads(line[6:])
                    if event.get("type") == "done":
                        file_path = event["filePath"]
    if file_path is None:
        raise RuntimeError(f"no 'done' event received for {url}")
    return file_path  # path inside the container

Note that the function returns a file path, not the image bytes. In practice, mount a shared volume so the agent can read the PNG directly from the filesystem, or add a GET /api/files/:id endpoint behind a reverse proxy to serve files over HTTP.
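A minimal sketch of the shared-volume approach: translate the container-side path into the host-side mount and read the bytes. The mount locations and the helper names to_host_path and read_png are assumptions made for illustration; adjust them to match your own volume configuration.

```python
from pathlib import Path, PurePosixPath

# ASSUMPTION: the container writes to /data and that directory is mounted
# on the agent's host at ./openkova-data. Both paths are hypothetical.
CONTAINER_ROOT = PurePosixPath("/data")
HOST_ROOT = Path("./openkova-data")


def to_host_path(container_path: str) -> Path:
    """Translate a path reported by the container into the host-side mount."""
    # relative_to raises ValueError if the path is not under CONTAINER_ROOT.
    relative = PurePosixPath(container_path).relative_to(CONTAINER_ROOT)
    return HOST_ROOT.joinpath(*relative.parts)


def read_png(container_path: str) -> bytes:
    """Read the screenshot from the shared volume and check the PNG signature."""
    data = to_host_path(container_path).read_bytes()
    if not data.startswith(b"\x89PNG\r\n\x1a\n"):  # 8-byte PNG signature
        raise ValueError(f"{container_path} is not a PNG file")
    return data
```

With this in place, the agent can call read_png on the path returned by screenshot_url and hand the bytes straight to a vision model.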
Calling from a Node.js / TypeScript agent
async function screenshotUrl(url: string): Promise<string> {
  const res = await fetch('http://localhost:3000/api/convert/url', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, depth: 1 }),
  });
  if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  let filePath = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // An SSE line can be split across chunks, so keep the unfinished tail.
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? '';
    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const event = JSON.parse(line.slice(6));
        if (event.type === 'done') filePath = event.filePath;
      }
    }
  }
  return filePath;
}
Using screenshots with a vision model
Once you have the PNG, pass it to a multimodal model. Example with the Anthropic Claude API:
import base64

import anthropic

# Async client, so the API call does not block the event loop.
client = anthropic.AsyncAnthropic()


async def describe_page(url: str) -> str:
    # The path must be readable by this process (for example via a shared volume).
    png_path = await screenshot_url(url)
    with open(png_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
    message = await client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "What is shown on this web page? Summarize the main content.",
                },
            ],
        }],
    )
    return message.content[0].text
Architecture for high-frequency agents
For agents that take many screenshots per minute, a single Openkova instance may become the bottleneck. Options for scaling:
- Horizontal scaling — run 2–4 Openkova instances behind a round-robin load balancer (Nginx, Traefik, or your cloud provider's ALB)
- Queue-based dispatch — put screenshot requests in a queue (Redis, BullMQ, or a simple Postgres table) and have workers pull from it
- Dedicated containers per agent — in Kubernetes, give each agent pod its own Openkova sidecar container for full isolation
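The queue-based option can be sketched in a few lines. This version uses an in-process asyncio.Queue as a stand-in for Redis, BullMQ, or a Postgres table, and the name capture_all is illustrative. It assumes a capture coroutine that takes a URL and returns a file path.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Any coroutine function that maps a URL to a file path.
Capture = Callable[[str], Awaitable[str]]


async def capture_all(urls: List[str], capture: Capture, workers: int = 4) -> Dict[str, str]:
    """Run `capture` over `urls` with at most `workers` requests in flight."""
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: Dict[str, str] = {}

    async def worker() -> None:
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to do
            results[url] = await capture(url)

    # Each worker pulls the next URL as soon as it finishes the previous one.
    await asyncio.gather(*(worker() for _ in range(workers)))
    return results
```

With the screenshot_url function from the Python section, the call would look like asyncio.run(capture_all(urls, screenshot_url, workers=4)). The worker count caps how many Chromium captures run at once, which is the resource that actually limits throughput.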
SSE streaming and agent feedback loops
Openkova returns Server-Sent Events during conversion. For crawled URLs (depth 1–2), your agent receives progress events as each page is captured. This is useful for agents that need to know which sub-pages were processed before acting on the results.
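A minimal sketch of how an agent might separate progress events from the final result. It assumes the stream uses data: lines carrying JSON objects with the progress and done types described in this section; the function name summarize_events is illustrative, not part of Openkova.

```python
import json
from typing import Iterable, List, Optional, Tuple


def summarize_events(lines: Iterable[str]) -> Tuple[List[str], Optional[str]]:
    """Collect progress messages and the final file path from SSE lines."""
    progress: List[str] = []
    file_path: Optional[str] = None
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separators and non-data SSE fields
        event = json.loads(line[len("data: "):])
        if event.get("type") == "progress":
            progress.append(event["message"])
        elif event.get("type") == "done":
            file_path = event["filePath"]
    return progress, file_path
```

The agent can inspect the progress list to confirm which sub-pages were captured before it opens the file at the returned path.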
# SSE event types
{"type": "progress", "message": "Capturing https://example.com"}
{"type": "progress", "message": "Capturing https://example.com/about"}
{"type": "done", "filePath": "/data/abc123.png", "fileCount": 2}
Frequently asked questions
Why do AI agents need a screenshot API?
Multimodal agents see pages as images. A screenshot API gives any agent — regardless of language — a consistent HTTP interface to capture any URL or HTML as a PNG for vision model input.
What is the difference between a screenshot API and Playwright for agents?
Playwright is a browser automation library with official bindings for Node.js, Python, Java, and .NET. Your agent must run in one of those runtimes, embed the library, and manage Chromium directly. A screenshot API is language-agnostic: any Python, Go, or bash agent can call it over HTTP. This matters in multi-agent systems where agents run in different runtimes.
Does Openkova have rate limits for high-frequency agent workflows?
No. Self-hosted means no usage caps. The only ceiling is your server capacity — CPU and memory for concurrent Chromium instances. Scale horizontally with multiple containers behind a load balancer.
Can Openkova screenshot internal or authenticated pages?
Yes. It runs on your own infrastructure and can reach any URL your server can reach — including localhost, VPN-only services, and staging environments. SaaS screenshot APIs cannot access private networks.
Get started: Deploy Openkova with Docker — or see the API reference for the full endpoint and SSE event spec.