Wafer documentation

Wafer is an AI security gateway. It sits between your app and the model and applies guardrails to every request and response — by changing one line of code.

Introduction

Point your existing OpenAI / Anthropic / Gemini / Mistral SDK at a project's gateway URL, keep your own provider key (BYOK — Wafer never stores it), and Wafer enforces your policy at the edge. For Cloudflare Workers AI env.AI bindings, a small wrapper instruments calls in-process (see Workers AI).

Wafer runs on the Cloudflare edge. Cheap checks run inline in single-digit milliseconds; model-based checks run in parallel. Fail-open by default — a guardrail error never takes down your app.

Quickstart

Sign in to the console and create a project — you get a gateway URL and a default policy.
Point your SDK's base_url at it (keep your provider key):

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_OPENAI_KEY",
    base_url="https://wafersecurity.ai/p/<project>/openai",
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello from Wafer"}],
)

The console shows live logs immediately. A fresh project stays in a focused setup view until its first request arrives, then unlocks tuning.

Providers & gateway URLs

Every project URL is uniform: https://wafersecurity.ai/p/<project>/<provider>. OpenAI, Anthropic, Mistral and Gemini use their native endpoints/SDKs; the rest are OpenAI-compatible (use the OpenAI SDK pointed at the URL).

Provider	Base URL	Wire format
OpenAI	/p/<project>/openai	OpenAI-compatible
Anthropic	/p/<project>/anthropic	Anthropic (native)
Google Gemini	/p/<project>/gemini	Gemini (native)
Mistral	/p/<project>/mistral	OpenAI-compatible
Groq	/p/<project>/groq	OpenAI-compatible
DeepSeek	/p/<project>/deepseek	OpenAI-compatible
xAI (Grok)	/p/<project>/xai	OpenAI-compatible
Together AI	/p/<project>/together	OpenAI-compatible
OpenRouter	/p/<project>/openrouter	OpenAI-compatible
Perplexity	/p/<project>/perplexity	OpenAI-compatible
Fireworks	/p/<project>/fireworks	OpenAI-compatible

Gemini uses the native generateContent API via the google-genai SDK (http_options.base_url); Anthropic uses its messages API. Your provider key passes through unchanged.
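
As a small illustration of the uniform URL scheme, the helper below builds a gateway base URL from a project and a provider slug. The function name and the validation step are assumptions made for this sketch; only the URL pattern and the slugs come from the table above.

```python
# Provider slugs copied from the table above.
PROVIDERS = {
    "openai", "anthropic", "gemini", "mistral", "groq", "deepseek",
    "xai", "together", "openrouter", "perplexity", "fireworks",
}


def gateway_base_url(project: str, provider: str) -> str:
    """Return the gateway base URL for a project/provider pair.

    Hypothetical helper: rejects slugs that are not in the table.
    """
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider slug: {provider!r}")
    return f"https://wafersecurity.ai/p/{project}/{provider}"


print(gateway_base_url("my-app", "groq"))
# → https://wafersecurity.ai/p/my-app/groq
```

For the OpenAI-compatible providers, the returned string is what you would pass as base_url to the OpenAI SDK, as in the Quickstart.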

Guardrails

Each guardrail has an action (block, redact, flag, or off), set per project in the console (or via the CLI / admin API).

Guardrail	Catches	Default
Secrets	API keys, private keys (input + output)	block
PII	email, card, SSN, phone	redact
Blocklist	your own terms	block
Prompt injection	jailbreak / injection attempts	flag
LLM judge	nuanced policy (see below)	off

Tiering keeps latency low: Tier-0 regex (secrets/PII/blocklist) runs inline and applies redactions; Tier-1 (injection) and Tier-2 (judge) run concurrently. If Tier-0 already blocks, the model checks are skipped.
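
The tiering described above can be sketched as follows. This is not Wafer's implementation: the regex patterns, the function names, and the stand-in model checks are all assumptions chosen to show the control flow (inline regex pass first, then the two model checks concurrently, skipped entirely when Tier-0 blocks).

```python
import asyncio
import re

# Illustrative patterns only; the real Tier-0 rules are not documented here.
SECRET = re.compile(r"sk-[A-Za-z0-9]{20,}")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def tier0(text):
    """Inline regex pass: block on a secret, otherwise redact emails."""
    if SECRET.search(text):
        return "block", text
    cleaned, hits = EMAIL.subn("[REDACTED_EMAIL]", text)
    return ("redact" if hits else "allow"), cleaned


async def injection_check(text):
    """Stand-in for the Tier-1 model call (hypothetical heuristic)."""
    await asyncio.sleep(0)
    return "flag" if "ignore previous instructions" in text.lower() else "allow"


async def judge_check(text):
    """Stand-in for the Tier-2 LLM judge; always allows in this sketch."""
    await asyncio.sleep(0)
    return "allow"


async def evaluate(text):
    decision, cleaned = tier0(text)
    if decision == "block":
        return decision, cleaned  # model checks are skipped
    # Tier-1 and Tier-2 run concurrently on the redacted text.
    results = await asyncio.gather(injection_check(cleaned), judge_check(cleaned))
    for outcome in ("block", "flag"):
        if outcome in results:
            return outcome, cleaned
    return decision, cleaned


print(asyncio.run(evaluate("email me at jane@acme.com")))
# → ('redact', 'email me at [REDACTED_EMAIL]')
```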

Streaming & latency. Streamed (SSE) responses pass through token by token: Wafer never buffers a completion to inspect it, and output guardrails apply incrementally as tokens flow. On the streaming path, only guardrails that can block hold back the first token; flag checks run alongside the stream and never delay it. The inline Tier-0 checks add under a millisecond, and Wafer runs at the edge, on the same network as your Workers.

Block signal. A blocked request returns HTTP 403 with headers x-wafer-blocked: 1 and x-wafer-categories (e.g. pii,injection); the JSON body matches your provider's native error shape, so existing SDK error handling catches it. On a stream — whose headers are already sent — the block arrives in-band as a final data: event carrying the same blocked flag and categories. With the env.AI wrapper a block throws WaferBlockedError exposing the same categories. One handler, keyed on the categories, covers every path.
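
A minimal sketch of the non-streaming half of that handler, assuming you already have the HTTP status and response headers in hand. The function name is hypothetical; the status code and header names are the ones listed above. The in-band streaming event is not covered because its exact JSON shape is not specified here.

```python
from typing import List, Mapping, Optional


def blocked_categories(status: int, headers: Mapping[str, str]) -> Optional[List[str]]:
    """Return the guardrail categories if this is a Wafer block, else None."""
    # Header names are case-insensitive, so normalize before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    if status != 403 or lowered.get("x-wafer-blocked") != "1":
        return None
    raw = lowered.get("x-wafer-categories", "")
    return [part.strip() for part in raw.split(",") if part.strip()]


print(blocked_categories(403, {"X-Wafer-Blocked": "1", "X-Wafer-Categories": "pii,injection"}))
# → ['pii', 'injection']
```

With the OpenAI Python SDK, the status and headers of a failed call are typically available on the raised APIStatusError as err.status_code and err.response.headers, so the same function can be called from an except block.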

Profiles. Define named posture overrides under one project and pick one per request with the x-wafer-profile header (or the Workers AI wrapper's profile option) — for example a strict, latency-tight interrupt profile and a looser enrich profile, sharing one project, one set of keys, and one log stream. A profile overrides only the guardrails (and cache) you name; everything else inherits the project default, and an unknown or absent profile falls back to it. Manage them with wafer profile <project> set <name> '<json>'.
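
The inheritance rule can be illustrated with a plain dictionary merge. The key names and the two example profiles are assumptions for this sketch (the actual profile JSON schema is not shown here); the default actions are taken from the guardrail table.

```python
# Project defaults, taken from the guardrail table (key names are assumed).
PROJECT_DEFAULT = {
    "secrets": "block",
    "pii": "redact",
    "blocklist": "block",
    "injection": "flag",
    "judge": "off",
}

# Hypothetical profiles: each one names only the guardrails it overrides.
PROFILES = {
    "interrupt": {"injection": "block"},
    "enrich": {"pii": "flag"},
}


def effective_policy(profile_name=None):
    """Merge a profile over the project default; unknown or absent falls back."""
    overrides = PROFILES.get(profile_name, {})
    return {**PROJECT_DEFAULT, **overrides}


print(effective_policy("interrupt")["injection"])  # → block
print(effective_policy("interrupt")["pii"])        # → redact
```

On the client side, the OpenAI Python SDK accepts per-request headers through the extra_headers argument, for example extra_headers={"x-wafer-profile": "interrupt"} on a create(...) call.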

Batch. For Mistral and Gemini batch jobs, Wafer inspects the uploaded JSONL file — applying secrets, PII and blocklist guardrails to every request line before the file reaches the provider. A blocked line rejects the upload (with the block signal above); redactions are applied in place. Model-based checks don't apply to async batch.
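
As a rough sketch of line-by-line batch screening, the function below walks a JSONL payload, rejects the whole upload on the first line that trips a block rule, and redacts string values in place otherwise. The patterns, names, and error type are assumptions for illustration, not Wafer's actual rules.

```python
import json
import re

SECRET = re.compile(r"sk-[A-Za-z0-9]{20,}")      # hypothetical secret pattern
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # simplified email pattern


def redact(value):
    """Recursively redact emails in every string inside a parsed JSON value."""
    if isinstance(value, str):
        return EMAIL.sub("[REDACTED_EMAIL]", value)
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    return value


def screen_jsonl(payload: str) -> str:
    """Return the redacted JSONL, or raise ValueError if any line is blocked."""
    cleaned = []
    for lineno, line in enumerate(payload.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        if SECRET.search(line):
            raise ValueError(f"line {lineno} blocked: secrets")
        cleaned.append(json.dumps(redact(json.loads(line))))
    return "\n".join(cleaned) + "\n"
```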

LLM judge

Use a model to classify prompts/responses against a plain-language policy — for rules regex can't express ("no medical or legal advice", topic adherence). Configure it under Guardrails → LLM judge: enable, flag or block, input/output/both, and the policy text. It runs at the edge via Workers AI. For env.AI binding traffic, the judge runs through your own binding (no extra round-trip).

Semantic cache

Return a stored response for near-identical prompts and skip the model call entirely. Enable it per project with a similarity threshold and TTL. Cache is isolated per project. Backed by Vectorize; responses are keyed by an embedding of the (post-guardrail) prompt.

Scoping. Isolate cached answers further with the x-wafer-cache-scope header — e.g. an episode or document id — so a cached answer is never served across scopes or users within the project. For per-item Q&A, set the scope to that item's id, set the TTL to your content-freshness window, and keep the threshold high (the 0.95 default) for answer correctness. The cache serves non-streaming responses, so it complements a streamed path rather than replacing it.
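
To make the interaction of threshold, TTL, and scope concrete, here is a toy in-memory version. It is a sketch under stated assumptions: embeddings are supplied by the caller, lookup is a brute-force cosine scan rather than Vectorize, and the class name and the one-hour TTL are hypothetical. Only the 0.95 threshold comes from the text above.

```python
import math
import time


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ScopedSemanticCache:
    def __init__(self, threshold=0.95, ttl_seconds=3600.0):
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self._entries = {}  # scope -> list of (embedding, response, stored_at)

    def put(self, scope, embedding, response, now=None):
        stored_at = time.time() if now is None else now
        self._entries.setdefault(scope, []).append((embedding, response, stored_at))

    def get(self, scope, embedding, now=None):
        """Return the most similar unexpired response in this scope, or None."""
        now = time.time() if now is None else now
        best, best_sim = None, self.threshold
        for stored_emb, response, stored_at in self._entries.get(scope, []):
            if now - stored_at > self.ttl_seconds:
                continue  # past the freshness window
            sim = cosine(stored_emb, embedding)
            if sim >= best_sim:
                best, best_sim = response, sim
        return best
```

Because entries are bucketed by scope before any similarity comparison, a near-identical prompt under a different scope can never return another scope's answer, which is the property the x-wafer-cache-scope header is described as providing.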

Rate limits & budgets

Cap requests per minute and a daily token budget per project. Both are enforced by a per-project Durable Object (strongly consistent) and reject excess traffic with 429 + Retry-After. Token spend is read from each response's usage.
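
On the client side, a 429 can be handled by honoring Retry-After before trying again. The helper below is a sketch with a hypothetical name; it accepts both forms HTTP allows for that header (a number of seconds or an HTTP date), since the form Wafer sends is not stated here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, default=1.0, now=None):
    """Return how long to wait, in seconds, based on a Retry-After header."""
    lowered = {key.lower(): value for key, value in headers.items()}
    value = lowered.get("retry-after")
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(retry_after_seconds({"Retry-After": "30"}))
# → 30.0
```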

Set a spend limit in USD per day and per month. Wafer estimates each request's cost from the model's list price and rejects traffic once a cap is reached, which gives you a hard ceiling so a project can't run away with your bill. Like every limit, it fails open: if the limit check itself errors, the request proceeds.
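
The cost estimate is simple arithmetic over token counts. In the sketch below the model name and prices are made up for illustration, and the function names are hypothetical; the only idea taken from the text is "tokens times list price, compared against a cap".

```python
# Hypothetical list prices in USD per one million tokens: (input, output).
LIST_PRICE_PER_MTOK = {"example-model": (0.50, 1.50)}


def estimate_cost_usd(model, prompt_tokens, completion_tokens):
    """Estimate one request's cost from token counts and list prices."""
    price_in, price_out = LIST_PRICE_PER_MTOK[model]
    return (prompt_tokens * price_in + completion_tokens * price_out) / 1_000_000


def cap_reached(spent_usd, cap_usd):
    """True once accumulated spend has reached the cap."""
    return spent_usd >= cap_usd


# 2,000 input tokens and 1,000 output tokens:
# (2000 * 0.50 + 1000 * 1.50) / 1,000,000 = 0.0025 USD
cost = estimate_cost_usd("example-model", prompt_tokens=2_000, completion_tokens=1_000)
```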

Turn on retries to ride out transient provider failures (429 and 5xx) with automatic backoff, and set an optional fallback model on the same provider for Wafer to try when retries are still failing — so a provider hiccup doesn't take your app down.

Analytics & telemetry

The console shows decisions, cache hit rate, latency (p50/p95), 24h traffic, top guardrails and live logs — for both proxied and binding traffic. By default Wafer captures full request/response telemetry per log: model, tokens, request & response content, decision, findings, latency, status, IP and country. Click any log row for the full detail. Content is stored post-guardrail — secrets and PII are redacted before they're written.

Privacy controls: content capture is a per-project setting (Settings → Log request & response content). Turn it off to store metadata only (decision, categories, latency, tokens). The Workers AI wrapper logs metadata by default — set log: "content" to include content, or log: "off" to disable.

Decision webhook. Stream every guardrail decision to an external sink — including the native heystack.dev integration for observability and root-cause analysis. Events carry decision metadata and guardrail categories only, never request or response content, and are sent best-effort off the request path. Configure it in Settings → Limits, or wafer webhook <project> <url>.

Audit export. Export a project's request logs as CSV or JSON for audit and reporting: GET /admin/projects/<id>/logs/export?format=csv, the Export CSV button in the Logs tab, or wafer export <project> csv.

API keys

Programmatic access (CLI, agents, the Workers AI wrapper) uses Wafer API keys. Create them from API keys in the console header (also in a project's Agents tab); they're shown once, scoped to your account, and revocable. Use a key with wafer login, or set it as WAFER_API_KEY; it is sent as a Bearer token.

Console

At console.wafersecurity.ai: projects, guardrail config, limits, analytics, logs, a guardrail playground (test policy with no model call), an Agents tab (prompts, skills, CLI, API keys), and settings. Deep-linkable — URLs reflect the project and tab.

CLI

Agent-native CLI over the admin API. Every command supports --json.

npm i -g @wafersecurity/cli
wafer login                          # paste a key (or: export WAFER_API_KEY=...)

wafer projects                       # list
wafer guardrails set my-app pii redact
wafer cache my-app on
wafer ratelimit my-app 60
wafer test my-app "email me at jane@acme.com"   # run guardrails, no model call
wafer logs my-app --limit 20
wafer init my-app                    # print gateway base URLs

Agent skills

Install Wafer skills into your coding agent (Claude Code, Cursor, …) with the skills CLI:

npx skills add wafersecurity/wafer-skills --all

Includes wafer, wafer-integrate, wafer-guardrails, wafer-cli and wafer-workers-ai.

Cloudflare Workers AI (env.AI)

env.AI.run(...) is an in-process binding — the gateway can't intercept it. Instrument it in your Worker with @wafersecurity/workers-ai. One line, no per-call changes:

import { withWafer } from "@wafersecurity/workers-ai";

const handler = {
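  async fetch(req, env, ctx) {
    // Example prompt; in a real Worker, build messages from the request.
    const messages = [{ role: "user", content: "Hello from Wafer" }];
    // unchanged: env.AI is auto-wrapped
    return Response.json(await env.AI.run("@cf/meta/llama-3.3-70b-instruct", { messages }));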
  },
};
export default withWafer(handler);   // reads WAFER_PROJECT + WAFER_API_KEY

withWafer wraps every handler (fetch, queue, scheduled, tail), so env.AI is guarded in cron jobs and queue consumers too, and no handlers are dropped. Tier-0 guardrails run locally, in-process, so they add no network round-trip; the LLM judge runs via your own binding; calls are logged to Wafer (metadata only by default). The wrapper fails open. There is no truly zero-code option for bindings: this one-line wrapper is the minimum.

Streaming: for { stream: true } responses, chunks pass through with zero added latency while the full response is captured and logged after the stream ends — both the gateway and the wrapper record complete streamed request/response telemetry. Block-action guardrails still cut the stream on a hit; mid-stream redaction isn't applied (tokens are already sent).

Speech-to-text: transcription models (Whisper, Deepgram) have their transcript screened by the input guardrails before it reaches a downstream model — so a spoken prompt-injection or blocklisted phrase is caught on the voice channel. A blocked transcript throws WaferBlockedError; a redacted one rewrites the returned text.

Fail-open / fail-closed

Default is fail-open: if a guardrail errors or times out, the request proceeds (so Wafer never takes down your app). Switch a project to fail-closed in Settings to block on guardrail failure instead.
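
The difference between the two modes comes down to what happens in the error path. A minimal sketch, assuming a guardrail is an async function that returns a decision string; the wrapper name, the timeout value, and the decision strings are hypothetical.

```python
import asyncio


async def guarded(check, text, *, timeout_s=0.05, fail_open=True):
    """Run one guardrail check; on error or timeout, apply the failure policy."""
    try:
        return await asyncio.wait_for(check(text), timeout_s)
    except Exception:
        # Covers both a raised error and asyncio's timeout error.
        return "allow" if fail_open else "block"


async def broken_check(text):
    """Hypothetical guardrail that always errors."""
    raise RuntimeError("guardrail backend unavailable")


print(asyncio.run(guarded(broken_check, "hello")))                   # → allow
print(asyncio.run(guarded(broken_check, "hello", fail_open=False)))  # → block
```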

Admin API

Authenticate with Authorization: Bearer <WAFER_API_KEY> (or a Clerk session) against https://wafersecurity.ai/admin.

Method	Path	Purpose
GET	/projects	List projects
POST	/projects	Create
GET/PUT/DELETE	/projects/:id	Read / update policy / delete
GET	/projects/:id/logs	Recent logs
GET	/projects/:id/analytics	Analytics
POST	/projects/:id/test	Run guardrails on text
GET/POST/DELETE	/keys	Manage API keys
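
A minimal sketch of calling these endpoints with only the Python standard library. The helper name and the placeholder project id and key are assumptions; the base URL, the Bearer header, and the paths come from this page. Response shapes are not documented here, so the sketch stops at building the request.

```python
import json
import urllib.request
from typing import Optional

ADMIN_BASE = "https://wafersecurity.ai/admin"


def admin_request(method: str, path: str, api_key: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) an authenticated admin API request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {api_key}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(ADMIN_BASE + path, data=data,
                                  headers=headers, method=method)


# Placeholders: substitute a real project id and Wafer API key.
logs_req = admin_request("GET", "/projects/PROJECT_ID/logs", "YOUR_WAFER_API_KEY")
export_req = admin_request("GET", "/projects/PROJECT_ID/logs/export?format=csv",
                           "YOUR_WAFER_API_KEY")

# To send one: urllib.request.urlopen(logs_req).read()
```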