Requirements

Scope, operating modes, deployment postures, and the R-* requirements catalog.

Status: Draft, v1 scope. Related: design.md, roadmap.md. Supersedes the requirements portion of simian-agent.md.

1. Objective & Scope

1.1 What Simian Agent is

Simian Agent is an open-source, AI-native chaos engineering orchestrator — a “Chaos Monkey for AI.” It exists to inject controlled, structurally meaningful failures into Kubernetes workloads so that downstream automated SRE agents can be exercised, evaluated, and improved.

It is not a generic chaos framework. Its differentiator is the dual operating model: it can be driven directly by an upstream caller through a standardized tool interface, and it can also operate fully autonomously — discovering topology, drafting an attack plan, and executing it under bounded safety constraints.

1.2 v1 scope

In scope:

  • The two operating modes (directed and autonomous) sharing a single execution substrate.
  • The provisioned-namespace deployment posture (Simian owns the target namespace).
  • A pluggable LLM provider interface, with Gemini as the default implementation.
  • The Online Boutique microservices suite as the default System Under Test (SUT).
  • Chaos Mesh and LitmusChaos as the underlying fault engines.
  • The “Red Phone” outbound notification bridge for downstream SRE agent activation.
  • Structured scenario data export so an external evaluation harness can grade SRE agent responses against Simian’s ground-truth inputs/outputs.

Out of scope for v1 (architecturally accommodated, not implemented):

  • The external-workload deployment posture (real staging / production targets).
  • Approval gates, change-window calendars, business-hours enforcement.
  • Observed-baseline ingestion from external observability stacks (Prometheus/Datadog/etc. as authoritative SLO sources).
  • Multi-tenant isolation between concurrent Simian operators.
  • Cross-cluster fault orchestration.

1.3 Non-goals

  • Replacing existing chaos tools’ breadth. Simian aims for AI-driven intent, not feature parity with Chaos Mesh’s full surface.
  • Acting as a defender, observer, or healer. Simian is purely the Red Team.
  • Targeting non-Kubernetes runtimes (VMs, serverless, on-prem bare metal) in v1.

2. Operating Modes

Simian exposes two modes that share the same execution substrate (the Fault Executor — see design.md §3).

2.1 Directed mode

An external caller — human, upstream agent, CI job — submits a high-level intent through Simian’s MCP tool interface. Simian translates the intent into one or more engine-specific fault manifests, runs safety validation, and applies them.

Example intents:

  • “Add 250ms latency to paymentservice for 5 minutes.”
  • “Kill one redis-cart replica.”
  • “Fill /tmp on the cartservice pod to 95% for 2 minutes.”

Directed mode is synchronous from the caller’s perspective up to the point of apply: the call returns once the manifest is applied (or rejected). After that, the caller can poll or stream the fault’s status.
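
As an illustration of the translation step, the first example intent could map to a Chaos Mesh NetworkChaos resource along the following lines. This is a minimal sketch, not the project's actual translator: the function name LatencyManifest, the namespace "boutique", and the app=paymentservice label selector are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LatencyManifest builds a Chaos Mesh NetworkChaos resource as a generic map,
// the shape a dynamic client would apply. The function itself is hypothetical;
// the namespace and label values passed in main are illustrative assumptions.
func LatencyManifest(namespace, app, latency, duration string) map[string]any {
	return map[string]any{
		"apiVersion": "chaos-mesh.org/v1alpha1",
		"kind":       "NetworkChaos",
		"metadata": map[string]any{
			"name":      app + "-latency",
			"namespace": namespace,
		},
		"spec": map[string]any{
			"action": "delay",
			"mode":   "all",
			"selector": map[string]any{
				"namespaces":     []string{namespace},
				"labelSelectors": map[string]string{"app": app},
			},
			"delay":    map[string]any{"latency": latency},
			"duration": duration,
		},
	}
}

func main() {
	// "Add 250ms latency to paymentservice for 5 minutes."
	m := LatencyManifest("boutique", "paymentservice", "250ms", "5m")
	out, err := json.MarshalIndent(m, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Building the manifest as a generic map rather than a typed struct matches the CRD/dynamic-client approach that R-FAULT-01 asks for.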

2.2 Autonomous mode

Simian is pointed at a set of eligible namespaces. On each cycle it:

  1. Verifies the cluster baseline is healthy.
  2. Discovers topology, replica counts, dependency graph, and current metrics for each eligible namespace.
  3. Drafts an attack plan — an ordered sequence of fault manifests with rationale and a hypothesis about expected impact.
  4. Submits the plan to the Fault Executor for validation and execution under the autonomous budget.
  5. Optionally dispatches a Red Phone incident page after each fault is applied.

The plan is always emitted and audit-logged before execution, even when approval is automatic. This provides three things: a dry-run mode at no extra cost (simian plan), a single chokepoint for safety enforcement, and a record of why the agent chose each fault.
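
The ordering of steps 1 to 4, and the rule that the plan is audit-logged before anything executes, can be sketched as below. All type and field names (Cycle, Plan, DryRun, and so on) are hypothetical, and the optional page dispatch of step 5 is left out to keep the sketch short.

```go
package main

import (
	"errors"
	"fmt"
)

// Plan is a simplified, hypothetical attack plan.
type Plan struct {
	Faults     []string // ordered fault identifiers
	Rationale  string
	Hypothesis string
}

// Cycle holds each step as a function value so the ordering is explicit
// and every step can be faked in tests.
type Cycle struct {
	BaselineHealthy func() bool
	Discover        func() (string, error)
	Draft           func(topology string) (Plan, error)
	Audit           func(event string)
	Execute         func(p Plan) error
	DryRun          bool // plan only, in the spirit of "simian plan"
}

// ErrUnhealthyBaseline means the cycle was skipped before discovery.
var ErrUnhealthyBaseline = errors.New("baseline unhealthy: cycle skipped")

// Run performs one cycle. The plan is audit-logged before execution,
// including in dry-run mode, where execution never happens.
func (c Cycle) Run() error {
	if !c.BaselineHealthy() {
		return ErrUnhealthyBaseline
	}
	topology, err := c.Discover()
	if err != nil {
		return fmt.Errorf("discover: %w", err)
	}
	plan, err := c.Draft(topology)
	if err != nil {
		return fmt.Errorf("draft: %w", err)
	}
	c.Audit(fmt.Sprintf("plan: %d fault(s); rationale: %s", len(plan.Faults), plan.Rationale))
	if c.DryRun {
		return nil
	}
	return c.Execute(plan)
}

func main() {
	c := Cycle{
		BaselineHealthy: func() bool { return true },
		Discover:        func() (string, error) { return "redis-cart: 1 replica", nil },
		Draft: func(topology string) (Plan, error) {
			return Plan{Faults: []string{"pod-kill/redis-cart"}, Rationale: "single replica"}, nil
		},
		Audit:   func(event string) { fmt.Println("audit:", event) },
		Execute: func(p Plan) error { fmt.Println("execute:", p.Faults); return nil },
	}
	if err := c.Run(); err != nil {
		fmt.Println("cycle failed:", err)
	}
}
```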

2.3 Mode selection & coexistence

Both modes are always available; mode is not a deploy-time setting. A single running Simian instance can serve directed-mode MCP calls while also running an autonomous loop. Both flow through the same Fault Executor, so the active-fault budget, lease tracking, and audit log are unified.

3. Deployment Postures

3.1 Provisioned-namespace posture (v1 focus)

Simian creates the target namespace and deploys the SUT into it. The namespace is owned end-to-end by Simian: created, populated, attacked, and torn down by the same system.

Implications:

  • Baseline is knowable. Simian deployed the workloads, so it knows their expected ready state.
  • Recovery is trivial. Any unrecoverable failure is resolved by deleting and recreating the namespace.
  • No real users. Synthetic load is generated by the SUT itself (Online Boutique includes a load generator).
  • No real SLOs. Synthetic SLOs exist only to give the downstream SRE agent something to detect.
  • Safety story reduces to: “do not escape the namespace Simian created.”

3.2 External-workload posture (v2, out of v1 scope)

Simian targets workloads it did not create — real staging or, with explicit opt-in, production. The architecture must remain compatible with this posture, but no v1 requirement depends on it.

Implications (informational only for v1):

  • Baseline must be ingested from an external observability source.
  • Per-workload opt-in (not just per-namespace) is required.
  • Approval gates, change windows, and explicit blast-radius caps become mandatory.
  • Audit becomes compliance-grade, not just diagnostic.
  • Success criteria include “did not damage real systems” alongside “did inject the intended fault.”

3.3 Posture marking convention

In the sections below, requirements that apply to only one posture are marked [provisioned] or [external]. Unmarked requirements apply to both.

4. Functional Requirements

4.1 Eligibility & Scoping

  • R-SCOPE-01: Namespaces must opt in to Simian’s reach via a Kubernetes annotation (key: simian.chaos/eligible, value: "true"). Namespaces without this annotation are invisible to Simian regardless of mode.
  • R-SCOPE-02: An eligible namespace may further restrict targets via additional annotations (e.g., simian.chaos/exclude-workloads: "frontend,checkout"). Excluded workloads must never receive a fault.
  • R-SCOPE-03: Simian’s chaos ServiceAccount must be RBAC-bound only to eligible namespaces. Annotations express intent; RBAC enforces capability. Both must agree before a fault can be applied.
  • R-SCOPE-04: Any fault manifest targeting a non-eligible namespace, or an excluded workload within an eligible namespace, must be rejected by the Fault Executor before it reaches the driver.
  • R-SCOPE-05 [provisioned]: Simian’s provisioner must, when creating an eligible namespace, also create the RoleBinding granting the chaos ServiceAccount access into that namespace. The provisioner is the only Simian component with cluster-scoped privilege to create namespaces or RoleBindings.
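
The checks in R-SCOPE-01 through R-SCOPE-04 could be combined into a single pre-driver gate, roughly as follows. This is a sketch under assumptions: the function name CheckTarget is hypothetical, and the RBAC binding is passed in as a boolean rather than looked up from the cluster.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Annotation keys taken from R-SCOPE-01 and R-SCOPE-02.
const (
	EligibleKey = "simian.chaos/eligible"
	ExcludeKey  = "simian.chaos/exclude-workloads"
)

// CheckTarget returns nil only when the namespace has opted in, the chaos
// ServiceAccount is RBAC-bound to it, and the workload is not excluded.
// How rbacBound is determined is outside this sketch.
func CheckTarget(nsAnnotations map[string]string, rbacBound bool, workload string) error {
	if workload == "" {
		return errors.New("workload name is required")
	}
	if nsAnnotations[EligibleKey] != "true" {
		return fmt.Errorf("namespace lacks %s=\"true\"", EligibleKey)
	}
	if !rbacBound {
		return errors.New("chaos ServiceAccount is not RBAC-bound to the namespace")
	}
	for _, name := range strings.Split(nsAnnotations[ExcludeKey], ",") {
		if strings.TrimSpace(name) == workload {
			return fmt.Errorf("workload %q is excluded", workload)
		}
	}
	return nil
}

func main() {
	ns := map[string]string{
		EligibleKey: "true",
		ExcludeKey:  "frontend,checkout",
	}
	fmt.Println(CheckTarget(ns, true, "cartservice")) // allowed
	fmt.Println(CheckTarget(ns, true, "frontend"))    // rejected: excluded
	fmt.Println(CheckTarget(ns, false, "cartservice")) // rejected: no RBAC binding
}
```

Checking the annotation and the RBAC binding separately mirrors the "both must agree" rule in R-SCOPE-03.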

4.2 Provisioning (provisioned posture)

  • R-PROV-01 [provisioned]: Simian must programmatically provision, manage, and tear down the Online Boutique microservices suite (or other declared SUT) inside an isolated, annotated namespace.
  • R-PROV-02 [provisioned]: Provisioning must establish a verifiable steady-state baseline before fault injection is permitted. Baseline is satisfied when all declared workloads pass liveness and readiness probes and synthetic load is flowing.
  • R-PROV-03 [provisioned]: Provisioning must fail closed: if baseline cannot be established within a configured timeout, the namespace is marked unhealthy and chaos is blocked until a human or the operator restarts provisioning.
  • R-PROV-04 [provisioned]: Teardown must remove the namespace and all created RoleBindings cleanly, including any active fault resources.

4.3 Topology Discovery

  • R-DISC-01: In autonomous mode, Simian must build a per-namespace topology snapshot containing: workloads (Deployments/StatefulSets/DaemonSets), replica counts, services, dependency edges (where derivable from service mesh, env vars, or annotations), and current pod status.
  • R-DISC-02: Topology discovery must complete using only read-only Kubernetes API calls; the Discoverer never mutates cluster state.
  • R-DISC-03: When a service mesh (Istio, Linkerd) is present, Simian should consume its telemetry/topology APIs to enrich the dependency graph. Absence of a service mesh must not block discovery — fall back to label/annotation/env-var heuristics.
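
One possible form of the env-var fallback in R-DISC-03 is to treat any environment value of the form host:port as a dependency edge when the host matches a known Service name. The sketch below assumes that convention; the Edge type, the EdgesFromEnv function, and the sample env vars in main are illustrative, not taken from the design.

```go
package main

import (
	"fmt"
	"net"
	"sort"
)

// Edge is a directed dependency: From calls To.
type Edge struct{ From, To string }

// EdgesFromEnv infers edges from env vars. env maps workload name to its
// env vars; services is the set of Service names in the namespace.
func EdgesFromEnv(env map[string]map[string]string, services map[string]bool) []Edge {
	seen := map[Edge]bool{}
	var edges []Edge
	for workload, vars := range env {
		for _, value := range vars {
			host, _, err := net.SplitHostPort(value)
			if err != nil || !services[host] || host == workload {
				continue // not host:port, not a known Service, or a self-reference
			}
			e := Edge{From: workload, To: host}
			if !seen[e] {
				seen[e] = true
				edges = append(edges, e)
			}
		}
	}
	// Sort for a deterministic snapshot regardless of map iteration order.
	sort.Slice(edges, func(i, j int) bool {
		if edges[i].From != edges[j].From {
			return edges[i].From < edges[j].From
		}
		return edges[i].To < edges[j].To
	})
	return edges
}

func main() {
	env := map[string]map[string]string{
		"checkoutservice": {"PAYMENT_SERVICE_ADDR": "paymentservice:50051", "PORT": "5050"},
		"cartservice":     {"REDIS_ADDR": "redis-cart:6379"},
	}
	services := map[string]bool{"paymentservice": true, "redis-cart": true}
	for _, e := range EdgesFromEnv(env, services) {
		fmt.Printf("%s -> %s\n", e.From, e.To)
	}
}
```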

4.4 LLM-Driven Planning

  • R-LLM-01: The LLM provider must be pluggable behind a single Go interface. Gemini is the default and only required v1 implementation.
  • R-LLM-02: In autonomous mode, the LLM is responsible for: ranking discovered vulnerabilities, drafting the attack plan, writing per-step rationale, and writing the natural-language incident page. It must not have direct authority to apply faults.
  • R-LLM-03: In directed mode, the LLM translates the high-level intent into a structured FaultManifest, optionally consulting read-only MCP tools for context. It must not have direct authority to apply faults.
  • R-LLM-04: The LLM’s available tool surface is restricted to read-only operations (topology queries, log reads, metric reads via MCP). Write operations are reachable only by the LLM emitting structured output that the Fault Executor then validates and applies.
  • R-LLM-05: All LLM outputs that flow into the executor must be JSON-schema-validated. On schema-invalid output, Simian retries once with the validation error fed back as a correction prompt; on second failure the cycle (autonomous) or call (directed) fails.
  • R-LLM-06: When the LLM provider is unreachable or returns a timeout, directed mode returns an error to the caller; autonomous mode skips the cycle, logs the failure, and waits for the next tick. There is no rule-based fallback that bypasses the LLM.
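
The retry-once rule in R-LLM-05 might look like the following. This is a sketch only: the Provider interface, the ProviderFunc adapter, and the three-field FaultManifest are hypothetical, and the hand-written validate function stands in for real JSON Schema validation.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Provider is a hypothetical minimal LLM provider interface.
type Provider interface {
	Complete(prompt string) (string, error)
}

// ProviderFunc adapts a plain function to Provider.
type ProviderFunc func(prompt string) (string, error)

func (f ProviderFunc) Complete(prompt string) (string, error) { return f(prompt) }

// FaultManifest is a reduced stand-in for the structured LLM output.
type FaultManifest struct {
	Engine    string `json:"engine"`
	Kind      string `json:"kind"`
	Namespace string `json:"namespace"`
}

// validate is a placeholder for JSON Schema validation: it rejects
// unknown fields and missing required fields.
func validate(raw string) (FaultManifest, error) {
	var m FaultManifest
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&m); err != nil {
		return FaultManifest{}, err
	}
	if m.Engine == "" || m.Kind == "" || m.Namespace == "" {
		return FaultManifest{}, errors.New("engine, kind, and namespace are required")
	}
	return m, nil
}

// Draft asks the provider for a manifest and retries exactly once,
// feeding the validation error back as a correction prompt.
func Draft(p Provider, prompt string) (FaultManifest, error) {
	raw, err := p.Complete(prompt)
	if err != nil {
		return FaultManifest{}, fmt.Errorf("provider: %w", err) // no rule-based fallback
	}
	m, verr := validate(raw)
	if verr == nil {
		return m, nil
	}
	raw, err = p.Complete(prompt + "\n\nPrevious output was invalid: " + verr.Error())
	if err != nil {
		return FaultManifest{}, fmt.Errorf("provider: %w", err)
	}
	if m, verr = validate(raw); verr != nil {
		return FaultManifest{}, fmt.Errorf("schema-invalid after one retry: %w", verr)
	}
	return m, nil
}

func main() {
	calls := 0
	flaky := ProviderFunc(func(prompt string) (string, error) {
		calls++
		if calls == 1 {
			return "not json", nil
		}
		return `{"engine":"chaos-mesh","kind":"PodChaos","namespace":"boutique"}`, nil
	})
	m, err := Draft(flaky, "Kill one redis-cart replica.")
	fmt.Println(m, err, "calls:", calls)
}
```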

4.5 Fault Injection

  • R-FAULT-01: Simian must expose the full set of Chaos Mesh fault resources to the agent — not a curated subset. This includes (non-exhaustive) NetworkChaos, PodChaos, IOChaos, StressChaos, TimeChaos, KernelChaos, DNSChaos, HTTPChaos, JVMChaos, BlockChaos, and the cloud-provider chaos resources (AWSChaos, GCPChaos, AzureChaos). Implementation must integrate at the CRD/dynamic-client layer (apply arbitrary chaos-mesh.org/v1alpha1 resources), not via per-fault-type Go wrappers, so newly-released or custom Chaos Mesh resource types are usable without Simian code changes.
  • R-FAULT-02: Simian must support the full LitmusChaos experiment catalog, including community ChaosHub experiments. Implementation must use Litmus’s ChaosEngine / ChaosExperiment CRDs generically rather than per-experiment Go wrappers, so any installed experiment is immediately addressable.
  • R-FAULT-03: All fault manifests, regardless of source mode or engine, flow through the Fault Executor for safety validation, audit, lease registration, and lifecycle management. No code path may apply a fault by bypassing the Executor.
  • R-FAULT-04: Every applied fault must carry a hard duration cap (default 15 minutes, configurable per fault but bounded by an installation-wide ceiling). The cap is enforced both by the Chaos Mesh/Litmus duration field and by an in-process lease reaper.
  • R-FAULT-05: If the Simian process terminates, all active faults must self-heal within the lease ceiling. The system must not leave faults running indefinitely after agent failure.
  • R-FAULT-06: Simian must classify each available fault by blast-radius tier: namespace (contained to the targeted namespace), node (affects the underlying Kubernetes node — e.g. KernelChaos, PhysicalMachineChaos), or external (affects resources outside the cluster — e.g. AWSChaos, GCPChaos, AzureChaos, or a NetworkChaos/DNSChaos whose effective target leaves the cluster). Where a fault type can be either contained or external depending on its spec (notably DNSChaos and NetworkChaos), classification is performed per-spec by the Fault Executor, not by fault-type alone. Default v1 policy permits the namespace and node tiers; the external tier requires explicit per-installation opt-in via configuration.
  • R-FAULT-07: The Fault Executor must dynamically discover the catalog of fault types actually installed in the cluster (Chaos Mesh CRDs present, Litmus experiments installed) at startup and on configurable refresh. The catalog — including each fault’s CRD schema, parameters, and blast-radius tier — must be exposed to the LLM via an MCP read tool so the agent can only propose faults that exist and are permitted.
  • R-FAULT-08 (Litmus workflows): When a plan contains ordered, conditional, or parallel fault sequences and the target engine is Litmus, Simian must be able to materialize the sequence as a Litmus workflow (ChaosSchedule / workflow CRD) rather than as N independent applies, so that orchestration, dependencies, and rollback benefit from Litmus’s native primitives.
  • R-FAULT-09 (Litmus probes): Simian must support attaching Litmus probes (Cmd, HTTP, K8s, Prometheus) to chaos experiments. Probe results must be captured in the audit log and in the scenario export so external evaluators can use them as ground-truth signals about whether the fault produced the predicted symptom.
  • R-FAULT-10 (ChaosHub): The Litmus integration must consume experiments from configured Litmus ChaosHubs (community and private), not from a Simian-shipped experiment list. Hub configuration is part of installation; newly-published hub experiments become available to the LLM via R-FAULT-07’s catalog refresh.
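
Two of the executor-side rules above, the duration cap in R-FAULT-04 and the tier policy in R-FAULT-06, are simple enough to sketch directly. The names Policy, EffectiveDuration, and Permit are hypothetical, the 30-minute ceiling is an invented example value, and clamping an over-long request (rather than rejecting it) is an assumption the requirements do not settle.

```go
package main

import (
	"fmt"
	"time"
)

// Tier is the blast-radius tier from R-FAULT-06.
type Tier string

const (
	TierNamespace Tier = "namespace"
	TierNode      Tier = "node"
	TierExternal  Tier = "external"
)

// Policy holds installation-wide limits.
type Policy struct {
	DefaultDuration time.Duration // applied when a fault requests none
	Ceiling         time.Duration // installation-wide upper bound
	AllowExternal   bool          // explicit per-installation opt-in
}

// DefaultPolicy uses the 15-minute default from R-FAULT-04. The ceiling
// value is an assumption chosen for illustration.
func DefaultPolicy() Policy {
	return Policy{DefaultDuration: 15 * time.Minute, Ceiling: 30 * time.Minute}
}

// EffectiveDuration applies the default, then clamps to the ceiling.
func (p Policy) EffectiveDuration(requested time.Duration) time.Duration {
	if requested <= 0 {
		requested = p.DefaultDuration
	}
	if requested > p.Ceiling {
		return p.Ceiling
	}
	return requested
}

// Permit enforces the default v1 tier policy: namespace and node are
// allowed; external needs opt-in; anything unrecognized is rejected.
func (p Policy) Permit(t Tier) error {
	switch t {
	case TierNamespace, TierNode:
		return nil
	case TierExternal:
		if p.AllowExternal {
			return nil
		}
		return fmt.Errorf("tier %q requires explicit opt-in", t)
	default:
		return fmt.Errorf("unknown tier %q", t)
	}
}

func main() {
	p := DefaultPolicy()
	fmt.Println(p.EffectiveDuration(0))             // default applied
	fmt.Println(p.EffectiveDuration(2 * time.Hour)) // clamped to ceiling
	fmt.Println(p.Permit(TierExternal))             // rejected without opt-in
}
```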

4.6 Outbound Event Bridge (“Red Phone”)

  • R-PAGE-01: Simian must include an optional outbound notification subsystem. When enabled, it converts the structured fault outcome into a natural-language incident page suitable for prompting downstream SRE agents.
  • R-PAGE-02: Pages must be dispatchable via push webhook; future transports (streaming queues, MCP push) should be accommodated by the dispatcher interface.
  • R-PAGE-03: Page generation must support multiple linguistic styles (e.g., “direct/deterministic” and “symptoms-only/exploratory”) to stress-test how downstream agents interpret incident framing.
  • R-PAGE-04: Page dispatch is best-effort and asynchronous: a failed page must never roll back, abort, or block an applied fault.
  • R-PAGE-05: Outbound webhooks must support an authentication scheme (HMAC signature at minimum) and a retry policy with bounded attempts.
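
For R-PAGE-05, an HMAC-SHA256 signature over the page body plus a bounded retry loop could be sketched as follows. The function names, the hex encoding of the signature, and the absence of backoff between attempts are assumptions made to keep the example short.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// ErrUnreachable is a sample transport error used in the demo below.
var ErrUnreachable = errors.New("webhook unreachable")

// Sign returns the hex-encoded HMAC-SHA256 of body under key.
func Sign(key, body []byte) string {
	mac := hmac.New(sha256.New, key)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify recomputes the signature and compares in constant time.
func Verify(key, body []byte, signatureHex string) bool {
	got, err := hex.DecodeString(signatureHex)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, key)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

// Dispatch calls send up to maxAttempts times and reports how many
// attempts were made. A real dispatcher would add backoff between tries.
func Dispatch(send func() error, maxAttempts int) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = send(); err == nil {
			return attempt, nil
		}
	}
	return maxAttempts, fmt.Errorf("page not delivered after %d attempt(s): %w", maxAttempts, err)
}

func main() {
	key := []byte("example-signing-key")
	body := []byte(`{"page":"checkout latency is elevated"}`)
	sig := Sign(key, body)
	fmt.Println("valid:", Verify(key, body, sig))

	attempts, err := Dispatch(func() error { return ErrUnreachable }, 3)
	fmt.Println(attempts, err)
}
```

Because dispatch is best-effort (R-PAGE-04), a caller would log the returned error and move on rather than touch the applied fault.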

4.7 Auditing & Observability

  • R-AUDIT-01: Every fault must emit structured audit records at: plan generation, executor receipt, validation outcome, apply success/failure, lease heartbeat events, page dispatch attempts, and clear/expiry.
  • R-AUDIT-02: The audit timeline must be queryable in chronological order for any incident or fault UID.
  • R-AUDIT-03: Simian must export Prometheus-compatible metrics covering: cycles run, faults applied, faults rejected (by reason), pages dispatched, LLM call latency and failure rate, lease reaper actions.
  • R-AUDIT-04: Logs must be structured (JSON) with consistent field names; LLM prompt/response payloads must be capturable at a configurable verbosity (off by default; on for debugging).
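
A timeline query in the sense of R-AUDIT-02 reduces to filtering records by fault UID and sorting by timestamp. The sketch below assumes an in-memory slice of records; the AuditRecord fields, the stage names, and the NewRecord helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

// AuditRecord is a hypothetical structured audit entry.
type AuditRecord struct {
	Time     time.Time `json:"time"`
	FaultUID string    `json:"fault_uid"`
	Stage    string    `json:"stage"` // e.g. "plan", "validate", "apply", "clear"
}

// NewRecord builds a record from a Unix timestamp in seconds.
func NewRecord(unixSeconds int64, faultUID, stage string) AuditRecord {
	return AuditRecord{Time: time.Unix(unixSeconds, 0).UTC(), FaultUID: faultUID, Stage: stage}
}

// Timeline returns the records for one fault UID, oldest first.
// The stable sort keeps emission order for records with equal timestamps.
func Timeline(records []AuditRecord, faultUID string) []AuditRecord {
	var out []AuditRecord
	for _, r := range records {
		if r.FaultUID == faultUID {
			out = append(out, r)
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Time.Before(out[j].Time) })
	return out
}

func main() {
	records := []AuditRecord{
		NewRecord(300, "fault-a", "clear"),
		NewRecord(100, "fault-a", "plan"),
		NewRecord(150, "fault-b", "plan"),
		NewRecord(200, "fault-a", "apply"),
	}
	for _, r := range Timeline(records, "fault-a") {
		line, err := json.Marshal(r)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(line))
	}
}
```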

4.8 Scenario Data Export

Simian does not grade or score SRE agents — that work is delegated to the user’s existing evaluation harness. Simian’s job is to expose the structured inputs and outputs of each chaos cycle in a form an external harness can consume.

  • R-EXPORT-01: Each chaos cycle (autonomous mode) and each directed-mode call must produce a scenario record capturing both inputs (planned faults, applied faults with full parameter sets, target workloads, dispatched page text, pre-fault baseline snapshot, LLM rationale and hypothesis) and outputs (probe results, observed metric deltas during the fault window, lease/clear events, inbound SRE agent responses, time-to-recovery if observable).
  • R-EXPORT-02: The scenario record schema must be stable, versioned, and JSON-serializable, and must be consumable by external evaluation harnesses without requiring Simian-specific code in the consumer.
  • R-EXPORT-03: Scenario records must be addressable both as: (a) a final per-scenario document written to a configured sink (filesystem path, object storage, webhook), and (b) a streaming event feed during the cycle, for harnesses that want incremental signals rather than waiting for cycle completion.
  • R-EXPORT-04: Simian must not include scoring, grading, pass/fail, or scorecard rendering logic. Pass/fail criteria, K-of-N enforcement, RCA accuracy weighting, and similar policies belong in the external harness consuming the records.
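
To make the shape of a scenario record concrete, here is one possible reduced schema with an explicit version field. Every field name and the "v1" version string are assumptions; the real schema would carry the full input and output sets listed in R-EXPORT-01. Note that the record carries probe results but no scoring fields, in line with R-EXPORT-04.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SchemaVersion is a hypothetical version tag for the record format.
const SchemaVersion = "v1"

type AppliedFault struct {
	Kind       string            `json:"kind"`
	Target     string            `json:"target"`
	Parameters map[string]string `json:"parameters,omitempty"`
}

type ProbeResult struct {
	Name   string `json:"name"`
	Passed bool   `json:"passed"`
}

type Inputs struct {
	AppliedFaults []AppliedFault `json:"applied_faults"`
	PageText      string         `json:"page_text,omitempty"`
	Rationale     string         `json:"rationale,omitempty"`
	Hypothesis    string         `json:"hypothesis,omitempty"`
}

type Outputs struct {
	ProbeResults []ProbeResult `json:"probe_results"`
	// Nil when time-to-recovery was not observable.
	TimeToRecoverySeconds *float64 `json:"time_to_recovery_seconds,omitempty"`
}

type ScenarioRecord struct {
	SchemaVersion string  `json:"schema_version"`
	ScenarioID    string  `json:"scenario_id"`
	Mode          string  `json:"mode"` // "directed" or "autonomous"
	Inputs        Inputs  `json:"inputs"`
	Outputs       Outputs `json:"outputs"`
}

// Encode stamps the current schema version and serializes the record.
func Encode(r ScenarioRecord) ([]byte, error) {
	r.SchemaVersion = SchemaVersion
	return json.Marshal(r)
}

// Decode parses a record and rejects versions this sketch does not know.
func Decode(data []byte) (ScenarioRecord, error) {
	var r ScenarioRecord
	if err := json.Unmarshal(data, &r); err != nil {
		return ScenarioRecord{}, err
	}
	if r.SchemaVersion != SchemaVersion {
		return ScenarioRecord{}, fmt.Errorf("unsupported schema_version %q", r.SchemaVersion)
	}
	return r, nil
}

func main() {
	rec := ScenarioRecord{
		ScenarioID: "scenario-001",
		Mode:       "directed",
		Inputs: Inputs{AppliedFaults: []AppliedFault{
			{Kind: "NetworkChaos", Target: "paymentservice", Parameters: map[string]string{"latency": "250ms"}},
		}},
		Outputs: Outputs{ProbeResults: []ProbeResult{{Name: "checkout-http", Passed: false}}},
	}
	data, err := Encode(rec)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```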

5. Non-Functional Requirements

  • R-NFR-01: Simian ships as a single Go binary with subcommands. Container images are reproducible and minimal.
  • R-NFR-02: Simian deploys as two Kubernetes workloads: a long-running controller (chaos ServiceAccount, namespace-scoped) and a privileged provisioner (provisioner ServiceAccount, cluster-scoped for namespaces and RoleBindings only).
  • R-NFR-03: The controller process must survive transient cluster API errors via standard client-go retry/backoff. It must not exit on transient LLM provider errors.
  • R-NFR-04: All inter-component contracts (LLM provider, MCP tool interface, fault driver, event dispatcher) must be Go interfaces with mock implementations available for tests.
  • R-NFR-05: The autonomous loop must enforce per-cycle budget caps: max faults per cycle, min cooldown between faults, max concurrent active faults, max severity tier. Defaults must be conservative and configurable via Helm values.
  • R-NFR-06: Container image, Helm chart, and binary release pipelines must produce signed artifacts. Supply chain hygiene (SBOM, provenance attestation) is required from day 1.
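
The budget caps named in R-NFR-05 can be expressed as a single admission check that runs before each autonomous fault. In this sketch the Budget and State types are hypothetical, the default values are invented placeholders rather than the project's real Helm defaults, and severity tiers are modeled as integers where 0 is the lowest.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Budget mirrors the four caps listed in R-NFR-05.
type Budget struct {
	MaxFaultsPerCycle int
	MinCooldown       time.Duration
	MaxConcurrent     int
	MaxSeverityTier   int // assumed ordering: 0 is the lowest tier
}

// State is the executor's view at the moment a fault is proposed.
type State struct {
	AppliedThisCycle int
	ActiveFaults     int
	// SinceLastFault is the time elapsed since the previous fault was
	// applied; callers pass a large value when there was none.
	SinceLastFault time.Duration
}

// ConservativeDefaults returns placeholder values chosen for illustration.
func ConservativeDefaults() Budget {
	return Budget{MaxFaultsPerCycle: 1, MinCooldown: 5 * time.Minute, MaxConcurrent: 1, MaxSeverityTier: 0}
}

// Allow returns nil when one more fault of the given tier fits the budget.
func (b Budget) Allow(s State, severityTier int) error {
	switch {
	case s.AppliedThisCycle >= b.MaxFaultsPerCycle:
		return errors.New("max faults per cycle reached")
	case s.ActiveFaults >= b.MaxConcurrent:
		return errors.New("max concurrent active faults reached")
	case s.SinceLastFault < b.MinCooldown:
		return fmt.Errorf("cooldown: %s remaining", b.MinCooldown-s.SinceLastFault)
	case severityTier > b.MaxSeverityTier:
		return fmt.Errorf("severity tier %d exceeds cap %d", severityTier, b.MaxSeverityTier)
	}
	return nil
}

func main() {
	b := ConservativeDefaults()
	fmt.Println(b.Allow(State{SinceLastFault: 10 * time.Minute}, 0))
	fmt.Println(b.Allow(State{SinceLastFault: 1 * time.Minute}, 0))
	fmt.Println(b.Allow(State{AppliedThisCycle: 1, SinceLastFault: 10 * time.Minute}, 0))
}
```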

6. Security Requirements

  • R-SEC-01: Simian must enforce a two-tier privilege split: the provisioner is the only component capable of creating namespaces or RoleBindings; the chaos controller can only mutate Chaos Mesh / Litmus CRDs within already-bound namespaces.
  • R-SEC-02: No Simian component may hold cluster-admin equivalent privilege. RBAC manifests must be declarative and reviewable.
  • R-SEC-03: LLM prompts and responses must be screened for prompt-injection patterns when running in environments where untrusted intent could enter the directed-mode interface. (v1 may rely on Google Cloud’s Model Armor or equivalent; the screening interface itself is required from day 1.)
  • R-SEC-04: No outbound network calls beyond: the configured LLM provider, the configured Red Phone webhooks, the Kubernetes API server, and (optionally) MCP servers explicitly listed in configuration.
  • R-SEC-05: Secrets (LLM provider API keys, webhook signing keys) must be sourced from Kubernetes Secrets or external secret managers; never from environment variables embedded in container images.

7. Out-of-Scope Clarifications

The following are commonly assumed but explicitly not required of Simian v1:

  • Healing or remediating the faults it injects (that is the downstream SRE agent’s job).
  • Coordinating with humans on call — no PagerDuty/Opsgenie integration in v1.
  • Cross-cluster orchestration — one Simian, one cluster.
  • Persistent state across restarts beyond the audit log; the active-fault registry is rebuilt from cluster state on startup.
  • A web UI — operations are CLI, MCP, and Kubernetes-native.
  • Grading or scoring SRE agent responses. Simian exposes the structured scenario records needed for evaluation; pass/fail logic, K-of-N enforcement, RCA accuracy weighting, and scorecards live in the external evaluation harness.

8. Glossary

  • SUT — System Under Test. The application topology Simian attacks.
  • Eligible namespace — A namespace annotated simian.chaos/eligible: "true" and bound to the chaos ServiceAccount via RBAC.
  • Directed mode — Externally-driven fault submission via the MCP tool interface.
  • Autonomous mode — Self-directed cycles where Simian discovers, plans, and executes faults under a budget.
  • Attack plan — Ordered sequence of fault manifests with rationale, emitted by the LLM in autonomous mode.
  • Fault Executor — The single chokepoint that validates, audits, applies, and lifecycles every fault. See design.md §3.
  • Red Phone — The outbound event bridge that converts fault outcomes into natural-language incident pages.
  • Provisioned posture — Deployment mode where Simian created the target namespace and SUT (v1 focus).
  • External posture — Deployment mode where Simian targets pre-existing workloads it did not create (v2).
  • Scenario record — Stable, versioned, JSON-serializable document describing the inputs and outputs of one chaos cycle (or one directed-mode call). Consumed by external evaluation harnesses.
  • Blast-radius tier — Per-fault classification: namespace (contained), node (affects host kernel/machine), or external (affects resources outside the cluster). The namespace and node tiers are permitted by default in v1; the external tier requires explicit per-installation opt-in (see R-FAULT-06).
  • Probe (Litmus) — Cmd, HTTP, K8s, or Prometheus check attached to a chaos experiment that produces a structured pass/fail signal about the fault’s predicted symptom.