OpenRouter + Datadog Observability
In the world of traditional software engineering, deploying an API without logging is malpractice. You wouldn't ship a database query without knowing how long it takes, or an HTTP endpoint without tracking its status codes.
Yet, we see AI applications shipped to production every day where the core logic—the LLM call—is a black box. We treat it like magic: we send a prompt, we get an answer, and we cross our fingers.
As a Site Reliability Engineer, this terrifies me.
If you are routing requests through OpenRouter, you already have a powerful advantage: a unified interface for 100+ models. But the real game-changer is observability. By coupling OpenRouter's Broadcast feature with Datadog LLM Observability, you can treat your AI features like any other production dependency: measurable, traceable, and debuggable.
Here is how to set it up, and more importantly, why you should.
The Integration
The architecture is simple but effective. OpenRouter acts as middleware that can asynchronously "broadcast" telemetry (request traces, token counts, costs, and latencies) directly to your Datadog instance. This happens out-of-band, so it adds zero latency to your user-facing requests.
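To make that concrete: your application code stays exactly as it is. Here is a minimal sketch of a typical call, using the OpenAI Python SDK pointed at OpenRouter's OpenAI-compatible endpoint (the model name is just an example). Nothing in it knows Datadog exists:

```python
import os

from openai import OpenAI

# Your existing client code does not change. Pointing the OpenAI SDK at
# OpenRouter's OpenAI-compatible endpoint is the whole "integration" on
# the application side; Broadcast happens server-side at OpenRouter.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

completion = client.chat.completions.create(
    model="openai/gpt-4o",  # example model; any OpenRouter model ID works
    messages=[{"role": "user", "content": "Say hello."}],
)
print(completion.choices[0].message.content)
```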
Setup Guide (5 Minutes)
You don't need to install new SDKs or wrap your code in complex tracing logic.
1. Generate a Datadog API Key
Navigate to your Datadog dashboard: Organization Settings > API Keys.
Create a new key specifically for this integration (e.g., openrouter-broadcast-key).
2. Configure OpenRouter
Head over to OpenRouter Settings > Broadcast.
Toggle Enable Broadcast to ON.
3. Connect the Pipes
Click the edit icon next to Datadog and fill in the details:
- API Key: The key you just created.
- ML App: A logical name for your service (e.g., production-chatbot or content-engine).
- Site URL: This defaults to https://api.us5.datadoghq.com. Check your Datadog URL bar: if you are on app.datadoghq.com, use https://api.datadoghq.com; if you are on us3.datadoghq.com, adjust accordingly.
4. Verify
Click Test Connection. If it turns green, you are live.
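If you want a second check beyond the green light, fire a smoke-test request from your own code and look for the matching trace in Datadog under LLM Observability, filtered by your ML App name. A minimal sketch; the id and usage fields come from OpenRouter's OpenAI-compatible response schema:

```python
import os
import requests

# Send one request, then cross-reference its id and token usage against
# the trace that shows up in Datadog's LLM Observability view.
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "openai/gpt-4o-mini",  # arbitrary choice for the test
        "messages": [{"role": "user", "content": "Broadcast smoke test"}],
    },
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
print(data["id"], data["usage"])
```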
Why This Matters for Production
Once the data starts flowing, you move from "guessing" to "engineering". Here is what you get out of the box:
1. Cost Attribution
Datadog will track the exact cost of every request. You can break this down by model, by user (if you tag requests with a user identifier), or by feature.
- SRE Take: Set up a monitor to alert you if your hourly spend spikes by 200%. Catch infinite loops or abusive traffic before the monthly bill arrives.
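One pattern for per-user attribution, sketched below: send an end-user identifier with each request via the OpenAI-compatible user field. Whether that identifier surfaces as a tag on your Datadog traces depends on what the Broadcast payload includes, so verify it in your own account before building dashboards on it:

```python
import os
import requests

def complete(prompt: str, user_id: str) -> str:
    """Tag each completion with the end-user's ID so cost can be sliced
    per user downstream. A pattern to verify, not a guarantee of how
    the resulting trace is tagged."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "openai/gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
            "user": user_id,  # attribution key for per-user cost breakdown
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```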
2. Latency Waterfalls
Is gpt-4 feeling slow today? Is claude-3.5-sonnet outperforming on speed? The traces show you the full breakdown: Time to First Token (TTFT) and total generation time.
- SRE Take: Use this data to dynamically route traffic. If a model's P99 latency breaches your SLA, fail over to a faster, smaller model.
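One way to wire this up, assuming OpenRouter's model-fallback routing via the models parameter: OpenRouter retries down the list if the primary errors, while the ordering itself is your own logic, fed by whatever your P99 dashboards tell you. The ordering below is purely illustrative:

```python
import os
import requests

# Error-triggered fallback is handled by OpenRouter via the "models"
# list; SLA-driven reordering of that list is application logic you
# would derive from your Datadog latency data.
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "models": [
            "anthropic/claude-3.5-sonnet",  # primary
            "openai/gpt-4o-mini",           # smaller, faster fallback
        ],
        "messages": [{"role": "user", "content": "Summarize this incident."}],
    },
    timeout=30,
)
resp.raise_for_status()
```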
3. Quality & Error Tracking
When a request fails, you need to know if it was a timeout, a rate limit, or a content policy violation. Datadog captures the full error trace.
- SRE Take: Don't just retry blindly. Analyze the error rates per provider. If one provider is unstable, you have the data to justify switching routing priorities.
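A sketch of what "don't retry blindly" looks like in code, using standard HTTP status semantics (429 for rate limits, 5xx for provider-side failures); the retry policy and helper name are illustrative:

```python
import time
import requests

def call_with_classified_retry(payload: dict, api_key: str, retries: int = 2) -> dict:
    """Classify the failure before acting: back off on transient errors,
    fail fast on client errors where a retry cannot succeed."""
    for attempt in range(retries + 1):
        resp = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json=payload,
            timeout=30,
        )
        if resp.ok:
            return resp.json()
        if resp.status_code == 429 or resp.status_code >= 500:
            # Transient (rate limit or provider instability): back off.
            time.sleep(2 ** attempt)
            continue
        # Other 4xx (bad request, content policy): retrying won't help,
        # so fail fast and let the trace tell the story.
        resp.raise_for_status()
    raise RuntimeError("exhausted retries on transient errors")
```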
Conclusion
Observability is the art of asking questions about your system from the outside. With OpenRouter and Datadog, you can finally answer: "Is my AI application healthy?"
It turns the magic black box into a reliable, engineered component. And for an SRE, that is the only way to ship.