Screenshot API for AI Agents: Self-Hosted, No API Keys

Give your agents eyes — without rate limits or third-party dependencies.

The pattern: Agent calls POST /api/convert/url → receives the path to the rendered PNG → passes the image to a vision model → acts on what it sees. Self-hosted Openkova keeps this loop fast, private, and free of usage caps.

Why AI agents need screenshots

Modern agent frameworks — LangChain, AutoGen, CrewAI, Magentic-One — increasingly use multimodal models. An agent that can see a web page can verify that a form submitted correctly, detect UI regressions, extract structured data from a rendered table, or complete browser-based tasks without relying purely on HTML parsing.

Screenshots are more reliable than HTML for agents because:

Why self-hosted matters for agent workflows

SaaS screenshot APIs have per-request pricing. An agent that takes 50 screenshots per task run, running 100 times a day, generates 5,000 requests daily. At typical SaaS prices, that cost accumulates fast.

More importantly, agents often need to screenshot pages that a SaaS API cannot reach, such as localhost, VPN-only services, and staging environments.

A self-hosted API running on the same network has access to all of these.

Calling Openkova from a Python agent

import json
from typing import Optional

import httpx

async def screenshot_url(url: str) -> Optional[str]:
    """Return the path of the PNG captured for a URL, or None if no "done" event arrived."""
    async with httpx.AsyncClient(timeout=60) as client:
        async with client.stream(
            "POST",
            "http://localhost:3000/api/convert/url",
            json={"url": url, "depth": 1},
        ) as response:
            response.raise_for_status()
            file_path = None
            # The endpoint streams Server-Sent Events; payload lines start with "data: ".
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    event = json.loads(line[6:])
                    if event.get("type") == "done":
                        file_path = event["filePath"]
            return file_path  # path inside the container

In practice, mount a shared volume so the agent can read the PNG directly from the filesystem, or add a GET /api/files/:id endpoint behind a reverse proxy to serve files over HTTP.
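With a shared volume, the path reported in the "done" event is the container's view of the file, so the agent has to translate it to its own mount point. A minimal sketch of that translation follows; the two root directories and the `to_host_path` helper are assumptions for illustration, not part of Openkova.

```python
from pathlib import Path, PurePosixPath

# Hypothetical mount points: adjust both to match your own volume configuration.
CONTAINER_ROOT = PurePosixPath("/data")
HOST_ROOT = Path("/mnt/openkova-data")


def to_host_path(
    container_path: str,
    container_root: PurePosixPath = CONTAINER_ROOT,
    host_root: Path = HOST_ROOT,
) -> Path:
    """Map a path reported by the container onto the shared volume as the agent sees it."""
    # relative_to raises ValueError when the path does not start with container_root.
    relative = PurePosixPath(container_path).relative_to(container_root)
    return host_root.joinpath(*relative.parts)
```

The agent can then open `to_host_path(file_path)` instead of the raw container path.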

Calling from a Node.js / TypeScript agent

async function screenshotUrl(url: string): Promise<string> {
  const res = await fetch('http://localhost:3000/api/convert/url', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, depth: 1 }),
  });
  if (!res.ok || !res.body) throw new Error(`Screenshot request failed: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  let filePath = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // A chunk can end mid-line, so keep the unfinished tail in the buffer.
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? '';
    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const event = JSON.parse(line.slice(6));
        if (event.type === 'done') filePath = event.filePath;
      }
    }
  }
  return filePath; // path inside the container
}

Using screenshots with a vision model

Once you have the PNG, pass it to a multimodal model. Example with the Anthropic Claude API:

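import base64

import anthropic

# Async client, so the API call does not block the event loop.
client = anthropic.AsyncAnthropic()

async def describe_page(url: str) -> str:
    # Assumes the container's output directory is mounted at the same path for the agent.
    png_path = await screenshot_url(url)
    with open(png_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    message = await client.messages.create(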
        model="claude-opus-4-8",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "What is shown on this web page? Summarize the main content."
                }
            ],
        }],
    )
    return message.content[0].text
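If a capture fails, the file on disk may not be a valid image, so it can help to check the bytes before spending a model call on them. The sketch below validates the standard 8-byte PNG signature and builds the same image content block used in the example above; `png_to_image_block` is a hypothetical helper name.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first 8 bytes of every PNG file


def png_to_image_block(png_bytes: bytes) -> dict:
    """Build a base64 image content block, rejecting data that is not a PNG."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("data does not start with the PNG signature")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": base64.standard_b64encode(png_bytes).decode("utf-8"),
        },
    }
```

This only checks the file header; it does not guarantee that the rest of the image decodes.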

Architecture for high-frequency agents

For agents that take many screenshots per minute, a single Openkova instance may become the bottleneck. One option for scaling is to run multiple containers behind a load balancer.
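One client-side way to approach this, sketched under assumptions rather than taken from Openkova's documentation, is to rotate requests across several instances and cap how many are in flight at once. `ScreenshotPool` and the `send` callback are hypothetical names; `send` stands in for whatever function actually performs the HTTP call against a given base URL.

```python
import asyncio
import itertools
from typing import Awaitable, Callable, Iterable

# send(base_url, page_url) performs the actual request and returns the file path.
SendFn = Callable[[str, str], Awaitable[str]]


class ScreenshotPool:
    """Round-robin over several instances with a cap on in-flight requests."""

    def __init__(self, base_urls: Iterable[str], max_in_flight: int = 4) -> None:
        urls = list(base_urls)
        if not urls:
            raise ValueError("at least one base URL is required")
        self._urls = itertools.cycle(urls)
        # Create the pool inside a running event loop on older Python versions.
        self._limit = asyncio.Semaphore(max_in_flight)

    async def capture(self, page_url: str, send: SendFn) -> str:
        async with self._limit:  # wait here while max_in_flight calls are active
            base_url = next(self._urls)
            return await send(base_url, page_url)
```

A real load balancer in front of the containers makes the rotation unnecessary, but the semaphore is still useful to keep an agent from overwhelming Chromium with concurrent captures.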

SSE streaming and agent feedback loops

Openkova returns Server-Sent Events during conversion. For crawled URLs (depth 1–2), your agent receives progress events as each page is captured. This is useful for agents that need to know which sub-pages were processed before acting on the results.

# SSE event types
{"type": "progress", "message": "Capturing https://example.com"}
{"type": "progress", "message": "Capturing https://example.com/about"}
{"type": "done", "filePath": "/data/abc123.png", "fileCount": 2}
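To act on the progress events rather than discard them, an agent can collect them alongside the final result. The sketch below assumes the events arrive as `data: `-prefixed lines carrying the JSON shapes listed above; `ConversionResult` and `collect_events` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class ConversionResult:
    progress: List[str] = field(default_factory=list)  # one message per captured page
    file_path: Optional[str] = None
    file_count: Optional[int] = None


def collect_events(lines: Iterable[str]) -> ConversionResult:
    """Fold a stream of SSE lines into progress messages plus the final result."""
    result = ConversionResult()
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separators and any non-data lines
        event = json.loads(line[len("data: "):])
        if event.get("type") == "progress":
            result.progress.append(event.get("message", ""))
        elif event.get("type") == "done":
            result.file_path = event.get("filePath")
            result.file_count = event.get("fileCount")
    return result
```

The agent can compare `len(result.progress)` with `result.file_count` to confirm that every sub-page it expected was captured before it acts on the output.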

Frequently asked questions

Why do AI agents need a screenshot API?

Multimodal agents see pages as images. A screenshot API gives any agent — regardless of language — a consistent HTTP interface to capture any URL or HTML as a PNG for vision model input.

What is the difference between a screenshot API and Playwright for agents?

Playwright is a browser automation library with official bindings for Node.js, Python, Java, and .NET. Using it directly means every agent has to bundle a binding and manage Chromium itself. A screenshot API is language-agnostic: any agent, including Go or bash, can call it over HTTP. This matters in multi-agent systems where agents run in different runtimes.

Does Openkova have rate limits for high-frequency agent workflows?

No. Self-hosted means no usage caps. The only ceiling is your server capacity — CPU and memory for concurrent Chromium instances. Scale horizontally with multiple containers behind a load balancer.

Can Openkova screenshot internal or authenticated pages?

Yes. It runs on your own infrastructure and can reach any URL your server can reach — including localhost, VPN-only services, and staging environments. SaaS screenshot APIs cannot access private networks.

Get started: Deploy Openkova with Docker — or see the API reference for the full endpoint and SSE event spec.