
How we build

Engineering org, infrastructure, change management, and how we treat incidents.

Stack

Moonage runs on Cloudflare end to end.

Layer                    Choice
Edge + compute           Cloudflare Workers
Stateful coordination    Durable Objects
Relational storage       D1 (per-workspace, regionally pinned for enterprise)
Object storage           R2
Memory + retrieval       Workers + R2 + SuperMemory
Auth + sessions          Workers + KV
Site, handbook, docs     Next.js 16 + Tailwind v4 on OpenNext for Cloudflare
Sandboxed execution      Cloudflare Sandbox SDK

There is no legacy provider behind this stack. New services land directly on Workers.

Why Cloudflare

Three reasons, in priority order.

  1. One control plane, one trust boundary. Compute, storage, and network policy share a single auth and audit fabric. Fewer seams mean less drift between intent and reality.
  2. Regional residency without a second cloud. Enterprise customers pin a dedicated D1 to their region. Subprocessors honor the same boundary. We do not need a multi-cloud strategy to deliver multi-region.
  3. Edge-first economics. Cold starts are measured in single-digit milliseconds. We stop paying for warming, and the latency budget for an agent run shrinks accordingly.

The longer story lives at /blog/why-we-chose-cloudflare.

Engineering principles

  • Ship to production first, talk about it second. No long-running feature branches. Trunk-based, behind flags where needed.
  • Tests are the brief. A change without a test is a change we don't trust. The test is how we describe the intended behavior.
  • Observability before optimization. Don't tune what you can't see.
  • The code review is part of the work. A thorough review is faster than a fast revert.
  • Cost is a constraint, not an afterthought. Cloudflare bills are visible to engineering, not buried in finance.
  • Boring on purpose. We adopt new tools when they earn it. We do not chase frameworks for fun.

How a change reaches production

One shape, with two named exceptions.

  1. Open an issue with a brief on the change — what it does, what it touches, how it gets verified.
  2. Land a PR with peer review and tests passing. Branch protection enforces this in GitHub.
  3. Deploy via the standard pipeline. Manual deploys outside the pipeline are allowed but remain visible and reviewable.

Exceptions:

  • Incident mitigation. A fix may bypass review to stop the bleeding. The follow-up postmortem must add the missing tests.
  • Documentation. Doc-only changes can ship without a secondary reviewer.
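
The flow and its two exceptions can be read as a small decision rule. The sketch below is one possible encoding, not a description of the actual pipeline: the `Change` shape, the `evaluateChange` function, and the reason strings are hypothetical names introduced for illustration. It also assumes that the documentation exception waives only the reviewer requirement, and that an incident mitigation may skip the brief, the review, and the tests as long as the postmortem follow-up is recorded.

```typescript
// Hypothetical model of the change flow; all names here are assumptions.
type ChangeKind = "standard" | "incident-mitigation" | "docs-only";

interface Change {
  kind: ChangeKind;
  hasIssueBrief: boolean; // step 1: issue with a brief on the change
  peerReviewed: boolean;  // step 2: peer review on the PR
  testsPassing: boolean;  // step 2: tests passing
}

interface GateResult {
  allowed: boolean;
  reasons: string[];   // why the change is blocked, if it is
  followUps: string[]; // obligations created by using an exception
}

function evaluateChange(change: Change): GateResult {
  const reasons: string[] = [];
  const followUps: string[] = [];

  // Exception 1: mitigation may bypass review, but it creates a follow-up.
  if (change.kind === "incident-mitigation") {
    followUps.push("write the postmortem and add the missing tests");
    return { allowed: true, reasons, followUps };
  }

  if (!change.hasIssueBrief) reasons.push("missing issue brief");
  if (!change.testsPassing) reasons.push("tests are not passing");

  // Exception 2: doc-only changes do not need a reviewer (assumed scope).
  if (change.kind === "standard" && !change.peerReviewed) {
    reasons.push("missing peer review");
  }

  return { allowed: reasons.length === 0, reasons, followUps };
}
```

Returning the reasons and follow-ups, rather than a bare boolean, keeps the exception visible: a bypassed review shows up as an explicit obligation instead of disappearing.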

Incidents

A small, opinionated definition.

  • An incident is anything that requires unplanned human intervention to keep the system honest.
  • We mitigate first, prevent recurrence second.
  • Every incident gets a postmortem, regardless of severity. The cost of a short doc is lower than the cost of forgetting why we made the fix.

The on-call engineer leads the response. Synchronous coordination happens in #incidents. The postmortem is written async within five business days, shared internally, and linked from the changelog when the fix lands.
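
The five-business-day window is simple date arithmetic. The sketch below shows one way to compute the due date. It assumes the clock starts on the day of the incident, counts Monday to Friday as business days, ignores public holidays, and works in UTC. The function names and the constant are hypothetical, introduced only for this example.

```typescript
// Hypothetical helper: business days are Mon-Fri in UTC, holidays ignored.
const POSTMORTEM_WINDOW_BUSINESS_DAYS = 5;

// Returns a new Date that is `days` business days after `start`.
function addBusinessDays(start: Date, days: number): Date {
  const result = new Date(start.getTime());
  let remaining = days;
  while (remaining > 0) {
    result.setUTCDate(result.getUTCDate() + 1);
    const weekday = result.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (weekday !== 0 && weekday !== 6) {
      remaining -= 1;
    }
  }
  return result;
}

// Assumption: the window is counted from the incident date.
function postmortemDueDate(incidentAt: Date): Date {
  return addBusinessDays(incidentAt, POSTMORTEM_WINDOW_BUSINESS_DAYS);
}
```

Under these assumptions, an incident on Wednesday 2026-05-06 gives a due date of Wednesday 2026-05-13, because the intervening weekend does not count.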

Engineer learning ladder

How an engineer at Moonage builds operational fluency.

  1. First incident. Own the response under the on-call's wing. Write the postmortem.
  2. First twenty postmortems read. For each one, annotate what the fix could have been had the problem been caught earlier.
  3. Monthly PR feedback as a primary reviewer. Build the review muscle.
  4. Quarterly authoring of a design doc. Land a non-trivial system change end-to-end.
  5. Cross-team review. Review another team's design docs and incident write-ups.

The point is not the ladder. The point is to make sure nobody is stranded in their first incident.

Page review

This page is reviewed every six months by the engineering lead. Last review: 2026-05.