From AWS Support to AI Infrastructure: 5 Years of Building Systems
Five years ago I started at Amazon Web Services as a Cloud Support Engineer. Today I'm deploying production LLMs, running open-source projects with hundreds of thousands of users, and building AI developer tools independently. The path between those two points wasn't planned, but looking back, every stage built directly on the last.
This is a career reflection — what each role taught me, the transitions that mattered, and why I eventually went independent.
AWS: Learning How Systems Actually Break (Mar 2021 – Dec 2022)
My first real engineering role was on AWS Cloud Support, working with Fortune 500 companies on their networking infrastructure. VPCs, Transit Gateways, load balancers, WAF configurations, multi-region failover designs. I handled 50+ escalations per month across compute, networking, and serverless workloads.
The thing nobody tells you about support engineering: you learn how systems fail faster than in any other role. Every day, someone's production environment is broken and they need it fixed now. You're reading CloudFormation templates you've never seen before, tracing packet flows through VPC peering connections, and figuring out why a NAT Gateway is dropping traffic — all on a live call with an engineering team that's losing money every minute.
What I took from AWS:
Systems thinking. Infrastructure is not a collection of services. It's an interconnected system where a misconfigured security group can cascade into a full outage. You learn to think in dependency graphs, not feature lists.
Communication under pressure. Enterprise customers don't care about your root cause analysis methodology. They care about their production being down. I learned to communicate status, timeline, and impact clearly while simultaneously debugging.
Deep networking knowledge. VPC design, hybrid cloud architecture, DNS resolution chains, TLS termination strategies. This knowledge compounds. I still use it every time I design infrastructure.
ApexCrypto → Bakkt: SRE in Regulated Finance (Jan 2023 – Dec 2023)
I left AWS to join ApexCrypto as a Site Reliability Engineer. Three months later, Bakkt acquired Apex, and I continued the role at Bakkt through December 2023.
Crypto trading infrastructure is a different beast from general cloud support. The stakes are financial, the regulations are strict, and the systems can't go down during market hours. The key difference from AWS: I was now responsible for keeping systems running, not just helping other people fix theirs.
What I built:
Observability from scratch. I built Datadog dashboards and tuned monitors that caught 90% of production issues before they escalated to customer impact. The monitoring stack became the team's first line of defense.
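For flavor, here's a minimal sketch of what defining one of those monitors in code can look like with the Datadog API. The service name, metric, query, and thresholds are placeholders, not the actual production monitors:

```python
import os

from datadog import initialize, api

# Placeholder credentials, query, and thresholds -- illustrative only.
initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

api.Monitor.create(
    type="metric alert",
    query="avg(last_5m):sum:trace.http.request.errors{service:trading-api}.as_rate() > 5",
    name="trading-api error rate elevated",
    message="Error rate on trading-api is above threshold. @pagerduty",
    tags=["team:sre", "managed-by:code"],
    options={
        "thresholds": {"critical": 5, "warning": 3},
        "notify_no_data": True,
        "no_data_timeframe": 10,
    },
)
```

Keeping monitor definitions in code rather than clicking them together in the UI meant they were versioned and reviewable — the same auditability story as the Terraform work below.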
Infrastructure as Code at scale. Standardized Kubernetes service onboarding using Terraform and Helm. New services went from a week-long manual setup process to a one-day automated pipeline. Every change was auditable, which mattered for compliance.
Incident automation. Wrote Python Lambda functions for EC2 auto-healing that eliminated 75% of manual incident response. When an instance started misbehaving, the system would detect it, drain connections, terminate it, and spin up a replacement — all before PagerDuty woke anyone up.
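The core of that loop is simple enough to sketch. This is an illustrative version, not the production function — the SNS payload shape and the environment variable are assumptions:

```python
import json
import os

import boto3

elbv2 = boto3.client("elbv2")
autoscaling = boto3.client("autoscaling")

# Hypothetical wiring: the target group ARN comes from the environment, and the
# triggering alarm publishes a small JSON payload carrying the instance ID.
TARGET_GROUP_ARN = os.environ["TARGET_GROUP_ARN"]


def handler(event, context):
    """Drain and replace an EC2 instance flagged as unhealthy by an alarm."""
    # Assumes a custom SNS payload like {"instance_id": "i-0abc..."}; a raw
    # CloudWatch alarm notification would need its dimensions parsed instead.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    instance_id = message["instance_id"]

    # 1. Drain: deregister from the target group so no new connections arrive;
    #    in-flight requests get the deregistration delay to finish.
    elbv2.deregister_targets(
        TargetGroupArn=TARGET_GROUP_ARN,
        Targets=[{"Id": instance_id}],
    )
    waiter = elbv2.get_waiter("target_deregistered")
    waiter.wait(TargetGroupArn=TARGET_GROUP_ARN, Targets=[{"Id": instance_id}])

    # 2. Replace: terminate through the Auto Scaling group without decrementing
    #    desired capacity, so the ASG launches a fresh instance on its own.
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=False,
    )
    return {"replaced": instance_id}
```

Terminating through the Auto Scaling group rather than the EC2 API directly is what makes the replacement automatic: the group sees it's below desired capacity and launches a new instance without any further orchestration.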
The biggest lesson from SRE in fintech: reliability is a feature. In a crypto trading platform, every minute of downtime is measurable in dollars lost. You learn to make decisions that optimize for uptime over everything else.
I also supported the Webull crypto launch — a 2M+ user migration that tested every assumption we'd made about our infrastructure capacity.
DigitizedLLC: Deploying LLMs in Production (Jan 2024 – Dec 2024)
After Bakkt, I moved to DigitizedLLC as an LLM Observability Engineer. This was the transition point — from traditional infrastructure to AI infrastructure.
The role was about deploying production language models (Claude 3.5 Sonnet, Llama 3.2) via Amazon Bedrock and building the monitoring systems to keep them reliable. LLMs in production are fundamentally different from traditional APIs:
- Non-deterministic outputs. The same input can produce different outputs. You can't write traditional test assertions.
- Cost scales with usage in unpredictable ways. A single prompt with a large context window can cost 100x a simple query.
- Latency varies wildly. Time to first token depends on model load, prompt length, and provider capacity.
- Quality degrades silently. A model can start producing worse outputs without any error or status code change.
I built observability frameworks that tracked token usage, cost per request, latency distributions, and output quality metrics. The result: 63% reduction in mean time to resolution for LLM-related incidents, and 50% improvement in data handling efficiency.
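To make that concrete, here's a minimal sketch of the per-request instrumentation involved: wrap the Bedrock call, time it, and derive token and cost metrics from the response. The pricing table is a placeholder and the exact fields shipped to the backend differed, but the shape is the same:

```python
import json
import time

import boto3

bedrock = boto3.client("bedrock-runtime")

# Placeholder pricing (USD per 1K tokens) -- real rates vary by model and
# region, so treat these numbers purely as an illustration.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


def invoke_with_telemetry(model_id: str, request_body: dict) -> dict:
    """Call a Bedrock model and return its output plus per-request metrics."""
    start = time.monotonic()
    response = bedrock.invoke_model(modelId=model_id, body=json.dumps(request_body))
    latency_s = time.monotonic() - start

    payload = json.loads(response["body"].read())

    # Bedrock reports token counts in the response headers (names as I used
    # them at the time -- worth re-checking against current docs).
    headers = response["ResponseMetadata"]["HTTPHeaders"]
    input_tokens = int(headers.get("x-amzn-bedrock-input-token-count", 0))
    output_tokens = int(headers.get("x-amzn-bedrock-output-token-count", 0))

    metrics = {
        "model_id": model_id,
        "latency_s": round(latency_s, 3),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "estimated_cost_usd": round(
            input_tokens / 1000 * PRICE_PER_1K["input"]
            + output_tokens / 1000 * PRICE_PER_1K["output"],
            6,
        ),
    }
    # In the real framework these metrics were shipped to the monitoring
    # backend; here they're just returned alongside the model output.
    return {"output": payload, "metrics": metrics}
```

Everything else — cost dashboards, latency percentiles, quality scoring — builds on having this kind of per-request record in the first place.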
The key insight from this role: LLM observability is fundamentally different from traditional APM. You're not just tracking latency and errors. You're tracking semantic quality, cost curves, and model behavior over time. The tools I built at DigitizedLLC directly informed how I think about AI infrastructure today.
Going Independent (Jan 2025 – Present)
In January 2025, I went independent. I'm now building AutomateHub.dev, maintaining PeakofEloquence.org (490K+ monthly users), and contributing to the AI developer ecosystem through open-source projects like the Shopify MCP Server and OpenRAG.
Why independent? A few reasons:
The stack I care about moves too fast for traditional employment. MCP, RAG frameworks, edge-deployed AI, LLM orchestration — these technologies are evolving weekly. Working independently lets me build with the latest tools without waiting for a company's technology adoption process.
Open source is the best portfolio. My Shopify MCP Server is listed on 5+ MCP registries. PeakofEloquence.org serves 490K+ monthly users. These projects say more about my capabilities than any job title.
The economics work. Cloudflare Workers, R2, and D1 make it possible to run production services for nearly nothing. I'm running a platform with 490K+ users for under $50/month in infrastructure costs. That changes the math on what's viable as an independent builder.
What I'd Tell Someone Starting Out
Learn how things break, not just how they work. AWS Support taught me more about systems design than any course or certification. When you've debugged 500 different production failures, you develop intuition for what will go wrong before it does.
SRE skills transfer everywhere. Observability, incident response, infrastructure as code, capacity planning — these aren't just SRE skills. They're the foundation of running any production system, including AI systems.
Build in public. Every project I built while employed informed what I could build independently. PeakofEloquence started as a side project. The MCP Server came from curiosity about the protocol. Ship things, even small things, and let the results speak.
The AI infrastructure wave is real and early. We're in the first inning of deploying LLMs in production. Most companies don't have observability for their AI systems. Most developers don't know how to design tools for AI agents. If you have infrastructure skills, the intersection with AI is where the highest leverage is right now.
I'm currently open to full-time platform engineering, AI infrastructure, and SRE roles — remote preferred. If you're building something interesting, reach out at admin@rezajafar.com.
Next Reads
How PeakofEloquence.org Scaled to 490K Monthly Users
The technical story behind scaling an open-source education platform to 490K+ monthly active users across 15+ countries — edge computing, Kubernetes, and lessons from unexpected viral growth.
What I Learned Monitoring LLMs in Production for a Year
Practical lessons from deploying and monitoring production LLMs — why traditional APM fails, what metrics actually matter, and how to build observability for non-deterministic systems.