OpenRouter + Datadog Observability Setup Guide
Estimated read time: 6 minutes
In the world of traditional software engineering, deploying an API without logging is malpractice. You wouldn't ship a database query without knowing how long it takes, or an HTTP endpoint without tracking its status codes.
Yet, we see AI applications shipped to production every day where the core logic—the LLM call—is a black box. We treat it like magic: we send a prompt, we get an answer, and we cross our fingers.
As a Site Reliability Engineer, this terrifies me.
Why Observability Matters
If you are routing requests through OpenRouter, you already have a powerful advantage: a unified interface for 100+ models. But the real game-changer is observability. By coupling OpenRouter's broadcast capability with Datadog LLM Observability, you can treat your AI features like any other production dependency: measurable, traceable, and debuggable.
Observability answers three critical questions: What happened? Why did it happen? What do I do about it? When you're managing LLM costs that can spiral out of control in minutes, this becomes a business imperative.
The Integration
The architecture is simple but effective. OpenRouter acts as a middleware that can asynchronously "broadcast" telemetry data—request traces, token counts, costs, and latencies—directly to your Datadog instance. This happens out-of-band, so it adds zero latency to your user-facing requests.
User Request → Your App → OpenRouter → LLM Response
                              ↓
                         [Broadcast]
                              ↓
                           Datadog
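Nothing about your application code changes: you call OpenRouter's OpenAI-compatible chat completions endpoint as usual, and the broadcast to Datadog happens entirely on OpenRouter's side. Here is a minimal Python sketch of such a request (the model name is illustrative):

```python
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_chat_request(api_key: str, model: str, prompt: str) -> tuple[str, dict, bytes]:
    """Assemble the URL, headers, and JSON body for an OpenRouter chat call."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return OPENROUTER_URL, headers, body

# To actually send it (requires a real key and network access):
# import os, urllib.request
# url, headers, body = build_chat_request(
#     os.environ["OPENROUTER_API_KEY"], "openai/gpt-4-turbo", "Hello")
# req = urllib.request.Request(url, data=body, headers=headers)
# print(urllib.request.urlopen(req).read())
```

No tracing SDK appears anywhere in this code; that is the whole point of the broadcast approach.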
Setup Guide (5 Minutes)
You don't need to install new SDKs or wrap your code in complex tracing logic. Here's the exact process:
1. Generate a Datadog API Key
Navigate to your Datadog dashboard:
- Click on Organization Settings (bottom left corner)
- Go to API Keys
- Click + New API Key
- Give it a descriptive name: openrouter-broadcast-key
- Copy the key—you'll need it in the next step
2. Configure OpenRouter
Head over to OpenRouter Settings > Broadcast.
You should see a list of supported integrations. Toggle Enable Broadcast to ON.
3. Connect the Pipes
Click the edit icon next to Datadog and fill in these required fields:
API Key: [Your Datadog API Key from Step 1]
ML App: production-chatbot (or your service name)
Site URL: [Check your Datadog URL bar]
Important: The Site URL depends on your Datadog region:
- US (us5.datadoghq.com) → Use https://api.us5.datadoghq.com
- US (app.datadoghq.com) → Use https://api.datadoghq.com
- EU (eu.datadoghq.com) → Use https://api.datadoghq.eu
Check your browser's address bar to determine your region.
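If you manage configuration in code, it can help to resolve the endpoint programmatically rather than by hand. A small lookup covering the sites listed above (extend it if your organization uses another Datadog region):

```python
# Map the Datadog site hostname you see in your browser to the API
# endpoint OpenRouter needs. Covers the regions listed above.
DATADOG_API_ENDPOINTS = {
    "us5.datadoghq.com": "https://api.us5.datadoghq.com",
    "app.datadoghq.com": "https://api.datadoghq.com",
    "eu.datadoghq.com": "https://api.datadoghq.eu",
}

def site_url_for(browser_host: str) -> str:
    """Return the API endpoint for a Datadog site hostname, or raise if unknown."""
    try:
        return DATADOG_API_ENDPOINTS[browser_host]
    except KeyError:
        raise ValueError(f"Unrecognized Datadog site: {browser_host!r}")
```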
4. Verify
Click Test Connection. You should see a green checkmark. If it fails, double-check:
- The Datadog API Key is valid
- The Site URL matches your region
- The API Key has permission to write logs
Why This Matters for Production
Once the data starts flowing, you move from "guessing" to "engineering". Here's what you get out of the box:
1. Cost Attribution
Datadog will track the exact cost of every request. You can break this down by model, by user, or by feature.
SRE Critical: Set up a monitor to alert you if your hourly spend spikes by 200%. Catch infinite loops or abusive traffic before the monthly bill arrives. LLM costs can 10x in minutes if a function gets stuck in a retry loop.
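Datadog's monitors can enforce this threshold for you, but the logic is simple enough to sketch. A minimal spend-spike check, assuming you can pull hourly cost totals from your metrics (the 200% threshold mirrors the alert described above):

```python
def spend_spike(hourly_costs: list[float], spike_pct: float = 200.0) -> bool:
    """True if the latest hour's spend exceeds the trailing average by spike_pct percent.

    hourly_costs: oldest-to-newest hourly spend totals; the last entry is
    the hour being checked against the baseline of all earlier hours.
    """
    *history, current = hourly_costs
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = sum(history) / len(history)
    return current > baseline * (1 + spike_pct / 100)
```

With a $1/hour baseline, this fires only once the current hour crosses $3, which is exactly the kind of jump a stuck retry loop produces.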
2. Latency Waterfalls
Is gpt-4-turbo feeling slow today? Is claude-3.5-sonnet outperforming on speed? The traces show you the full breakdown:
- Time to First Token (TTFT) - How quickly the model starts responding
- Total Generation Time - Total request duration
- Token Throughput - Tokens generated per second
Use this data to dynamically route traffic. If a model's P99 latency breaches your SLA, failover to a faster, smaller model. Example: Route verbose requests to gpt-4-turbo, factual requests to gpt-3.5-turbo.
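A failover router built on that idea fits in a few lines. This is a sketch, not a complete implementation: the model names and SLA value are illustrative, and in practice the P99 figures would come from your Datadog latency metrics rather than a hardcoded dict:

```python
def pick_model(latency_p99_ms: dict[str, float],
               preference: list[str],
               sla_ms: float) -> str:
    """Return the first preferred model whose P99 latency is within the SLA.

    Falls back to the overall fastest known model if none qualify.
    """
    for model in preference:
        if latency_p99_ms.get(model, float("inf")) <= sla_ms:
            return model
    return min(latency_p99_ms, key=latency_p99_ms.get)
```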
3. Quality & Error Tracking
When a request fails, you need to know if it was:
- A timeout
- A rate limit
- A content policy violation
- An overloaded provider
Datadog captures the full error trace and categorizes failures automatically.
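If you also want to bucket failures in your own application logs, a simple categorizer over status codes and error messages covers the four cases above. The exact codes and message substrings vary by provider, so treat these as assumptions to tune against what you actually see in traces:

```python
def categorize_failure(status: int, message: str = "") -> str:
    """Bucket an LLM gateway failure into one of the categories above."""
    msg = message.lower()
    if status == 408 or "timeout" in msg or "timed out" in msg:
        return "timeout"
    if status == 429:
        return "rate_limit"
    if "policy" in msg or "moderation" in msg:
        return "content_policy"
    if status in (502, 503) or "overloaded" in msg:
        return "overloaded_provider"
    return "unknown"
```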
Pro Tip: Create a custom metric that tracks "hallucination rate" by comparing model outputs to a ground truth. Use this to inform model selection decisions.
Practical Datadog Queries
Once your data is flowing, here are some powerful queries to monitor:
Track average request cost by model:
avg:openrouter_broadcast.request.cost{*} by {model}
Alert on error rate spike:
avg:openrouter_broadcast.request.error_rate{*} > 0.05
Monitor latency percentiles:
pct99:openrouter_broadcast.request.duration_ms{*}
Cost per user (if you pass user_id in headers):
avg:openrouter_broadcast.request.cost{*} by {user_id}
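For custom metrics like the hallucination rate suggested earlier, you can submit a gauge through Datadog's v2 metrics intake (POST /api/v2/series). This sketch only builds the payload; the metric name and tags are illustrative, and sending it requires your API key in the DD-API-KEY header:

```python
import time

def metric_payload(metric: str, value: float, tags: list[str]) -> dict:
    """Build a Datadog v2 series payload for one gauge data point."""
    return {
        "series": [{
            "metric": metric,
            "type": 3,  # 3 = gauge in the v2 series API
            "points": [{"timestamp": int(time.time()), "value": value}],
            "tags": tags,
        }]
    }
```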
Conclusion
Observability is the art of asking questions about your system from the outside. With OpenRouter and Datadog, you can finally answer: "Is my AI application healthy?"
It transforms the magic black box into a reliable, engineered component. For an SRE, that's the only way to ship.
Next steps:
- Set up the integration (5 minutes)
- Generate sample traffic and verify data flows
- Create alerts for cost spikes and error rates
- Build dashboards for daily monitoring
Your future self will thank you when production stabilizes and you're sleeping through the night.