OpenRouter + Datadog Observability Setup Guide
Estimated read time: 6 minutes
In the world of traditional software engineering, deploying an API without logging is malpractice. You wouldn't ship a database query without knowing how long it takes, or an HTTP endpoint without tracking its status codes.
Yet, we see AI applications shipped to production every day where the core logic—the LLM call—is a black box. We treat it like magic: we send a prompt, we get an answer, and we cross our fingers.
As a Site Reliability Engineer, this terrifies me.
Why Observability Matters
If you are routing requests through OpenRouter, you already have a powerful advantage: a unified interface for 100+ models. But the real game-changer is observability. By coupling OpenRouter's broadcast capability with Datadog LLM Observability, you can treat your AI features like any other production dependency: measurable, traceable, and debuggable.
Observability answers three critical questions: What happened? Why did it happen? What do I do about it? When you're managing LLM costs that can spiral out of control in minutes, this becomes a business imperative.
The Integration
The architecture is simple but effective. OpenRouter acts as a middleware that can asynchronously "broadcast" telemetry data—request traces, token counts, costs, and latencies—directly to your Datadog instance. This happens out-of-band, so it adds zero latency to your user-facing requests.
User Request → Your App → OpenRouter → LLM Response
                              ↓
                         [Broadcast]
                              ↓
                           Datadog
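Nothing about your application code changes: you call OpenRouter's OpenAI-compatible chat completions endpoint as usual, and the broadcast to Datadog happens entirely on OpenRouter's side. Here is a minimal Python sketch of such a request (the model name is illustrative):

```python
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_chat_request(api_key: str, model: str, prompt: str) -> tuple[str, dict, bytes]:
    """Assemble the URL, headers, and JSON body for an OpenRouter chat call."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return OPENROUTER_URL, headers, body

# To actually send it (requires a real key and network access):
# import os, urllib.request
# url, headers, body = build_chat_request(
#     os.environ["OPENROUTER_API_KEY"], "openai/gpt-4-turbo", "Hello")
# req = urllib.request.Request(url, data=body, headers=headers)
# print(urllib.request.urlopen(req).read())
```

No tracing SDK appears anywhere in this code; that is the whole point of the broadcast approach.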
Setup Guide (5 Minutes)
You don't need to install new SDKs or wrap your code in complex tracing logic. Here's the exact process:
1. Generate a Datadog API Key
Navigate to your Datadog dashboard:
- Click on Organization Settings (bottom left corner)
- Go to API Keys
- Click + New API Key
- Give it a descriptive name: openrouter-broadcast-key
- Copy the key—you'll need it in the next step
2. Configure OpenRouter
Head over to OpenRouter Settings > Broadcast.
You should see a list of supported integrations. Toggle Enable Broadcast to ON.
3. Connect the Pipes
Click the edit icon next to Datadog and fill in these required fields:
API Key: [Your Datadog API Key from Step 1]
ML App: production-chatbot (or your service name)
Site URL: [Check your Datadog URL bar]
Important: The Site URL depends on your Datadog region:
- US (us5.datadoghq.com) → Use https://api.us5.datadoghq.com
- US (app.datadoghq.com) → Use https://api.datadoghq.com
- EU (eu.datadoghq.com) → Use https://api.datadoghq.eu
Check your browser's address bar to determine your region.
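If you manage configuration in code, it can help to resolve the endpoint programmatically rather than by hand. A small lookup covering the sites listed above (extend it if your organization uses another Datadog region):

```python
# Map the Datadog site hostname you see in your browser to the API
# endpoint OpenRouter needs. Covers the regions listed above.
DATADOG_API_ENDPOINTS = {
    "us5.datadoghq.com": "https://api.us5.datadoghq.com",
    "app.datadoghq.com": "https://api.datadoghq.com",
    "eu.datadoghq.com": "https://api.datadoghq.eu",
}

def site_url_for(browser_host: str) -> str:
    """Return the API endpoint for a Datadog site hostname, or raise if unknown."""
    try:
        return DATADOG_API_ENDPOINTS[browser_host]
    except KeyError:
        raise ValueError(f"Unrecognized Datadog site: {browser_host!r}")
```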
4. Verify
Click Test Connection. You should see a green checkmark. If it fails, double-check:
- The Datadog API Key is valid
- The Site URL matches your region
- The API Key has permission to write logs
Why This Matters for Production
Once the data starts flowing, you move from "guessing" to "engineering". Here's what you get out of the box:
1. Cost Attribution
Datadog will track the exact cost of every request. You can break this down by model, by user, or by feature.
SRE Critical: Set up a monitor to alert you if your hourly spend spikes by 200%. Catch infinite loops or abusive traffic before the monthly bill arrives. LLM costs can 10x in minutes if a function gets stuck in a retry loop.
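Datadog's monitors can enforce this threshold for you, but the logic is simple enough to sketch. A minimal spend-spike check, assuming you can pull hourly cost totals from your metrics (the 200% threshold mirrors the alert described above):

```python
def spend_spike(hourly_costs: list[float], spike_pct: float = 200.0) -> bool:
    """True if the latest hour's spend exceeds the trailing average by spike_pct percent.

    hourly_costs: oldest-to-newest hourly spend totals; the last entry is
    the hour being checked against the baseline of all earlier hours.
    """
    *history, current = hourly_costs
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = sum(history) / len(history)
    return current > baseline * (1 + spike_pct / 100)
```

With a $1/hour baseline, this fires only once the current hour crosses $3, which is exactly the kind of jump a stuck retry loop produces.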
2. Latency Waterfalls
Is gpt-4-turbo feeling slow today? Is claude-3.5-sonnet outperforming on speed? The traces show you the full breakdown:
- Time to First Token (TTFT) - How quickly the model starts responding
- Total Generation Time - Total request duration
- Token Throughput - Tokens generated per second
Use this data to dynamically route traffic. If a model's P99 latency breaches your SLA, failover to a faster, smaller model. Example: Route verbose requests to gpt-4-turbo, factual requests to gpt-3.5-turbo.
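A failover router built on that idea fits in a few lines. This is a sketch, not a complete implementation: the model names and SLA value are illustrative, and in practice the P99 figures would come from your Datadog latency metrics rather than a hardcoded dict:

```python
def pick_model(latency_p99_ms: dict[str, float],
               preference: list[str],
               sla_ms: float) -> str:
    """Return the first preferred model whose P99 latency is within the SLA.

    Falls back to the overall fastest known model if none qualify.
    """
    for model in preference:
        if latency_p99_ms.get(model, float("inf")) <= sla_ms:
            return model
    return min(latency_p99_ms, key=latency_p99_ms.get)
```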
3. Quality & Error Tracking
When a request fails, you need to know if it was:
- A timeout
- A rate limit
- A content policy violation
- An overloaded provider
Datadog captures the full error trace and categorizes failures automatically.
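If you also want to bucket failures in your own application logs, a simple categorizer over status codes and error messages covers the four cases above. The exact codes and message substrings vary by provider, so treat these as assumptions to tune against what you actually see in traces:

```python
def categorize_failure(status: int, message: str = "") -> str:
    """Bucket an LLM gateway failure into one of the categories above."""
    msg = message.lower()
    if status == 408 or "timeout" in msg or "timed out" in msg:
        return "timeout"
    if status == 429:
        return "rate_limit"
    if "policy" in msg or "moderation" in msg:
        return "content_policy"
    if status in (502, 503) or "overloaded" in msg:
        return "overloaded_provider"
    return "unknown"
```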
Pro Tip: Create a custom metric that tracks "hallucination rate" by comparing model outputs to a ground truth. Use this to inform model selection decisions.
Practical Datadog Queries
Once your data is flowing, here are some powerful queries to monitor:
Track average request cost by model:
avg:openrouter_broadcast.request.cost{*} by {model}
Alert on error rate spike:
avg:openrouter_broadcast.request.error_rate{*} > 0.05
Monitor latency percentiles:
pct99:openrouter_broadcast.request.duration_ms{*}
Cost per user (if you pass user_id in headers):
avg:openrouter_broadcast.request.cost{*} by {user_id}
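For custom metrics like the hallucination rate suggested earlier, you can submit a gauge through Datadog's v2 metrics intake (POST /api/v2/series). This sketch only builds the payload; the metric name and tags are illustrative, and sending it requires your API key in the DD-API-KEY header:

```python
import time

def metric_payload(metric: str, value: float, tags: list[str]) -> dict:
    """Build a Datadog v2 series payload for one gauge data point."""
    return {
        "series": [{
            "metric": metric,
            "type": 3,  # 3 = gauge in the v2 series API
            "points": [{"timestamp": int(time.time()), "value": value}],
            "tags": tags,
        }]
    }
```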
Conclusion
Observability is the art of asking questions about your system from the outside. With OpenRouter and Datadog, you can finally answer: "Is my AI application healthy?"
It transforms the magic black box into a reliable, engineered component. For an SRE, that's the only way to ship.
Next steps:
- Set up the integration (5 minutes)
- Generate sample traffic and verify data flows
- Create alerts for cost spikes and error rates
- Build dashboards for daily monitoring
Your future self will thank you when production stabilizes and you're sleeping through the night.