Known limitations
This page is the canonical place to land if a fault “applied successfully” but didn’t appear to do anything, or if a SUT pod refuses to come up after enabling Envoy injection.
GKE Dataplane V2 silently breaks Chaos Mesh’s NetworkChaos
Chaos Mesh installs a netem qdisc on the pod’s eth0, which we verified is present at the kernel level. But Dataplane V2 routes pod-to-pod traffic through eBPF maps that bypass the tc qdisc layer, so the latency / loss never gets applied. The Sent ... pkt counter on the qdisc stays flat. This is a Chaos Mesh + Cilium incompatibility, not a Simian bug.
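One way to confirm this symptom is to sample the qdisc's packet counter twice while traffic is flowing and check that it has not moved. The sketch below assumes you can run `tc -s qdisc show dev eth0` inside the target pod's network namespace; how you get there is up to you. The helper names `sent_pkts` and `qdisc_is_flat` are hypothetical and not part of Simian or Chaos Mesh.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# sent_pkts: read `tc -s qdisc show dev eth0` output on stdin and print the
# packet count from the first "Sent <bytes> bytes <pkts> pkt ..." line.
# Exits non-zero when no such line exists. If several qdiscs are stacked on
# the device, filter the input down to the netem entry first.
sent_pkts() {
  awk '$1 == "Sent" && $5 == "pkt" { print $4; found = 1; exit }
       END { if (!found) exit 1 }'
}

# qdisc_is_flat BEFORE AFTER: succeed when the counter did not advance
# between the two samples, i.e. traffic is not traversing the qdisc.
qdisc_is_flat() {
  [ "$1" -eq "$2" ]
}

# Usage (inside the pod's network namespace, while sending traffic):
#   before=$(tc -s qdisc show dev eth0 | sent_pkts)
#   sleep 10
#   after=$(tc -s qdisc show dev eth0 | sent_pkts)
#   qdisc_is_flat "$before" "$after" && echo "netem is being bypassed"
```

A flat counter while pod-to-pod requests succeed is the signature described above; a rising counter points at a different problem.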
References: chaos-mesh#3302, cilium#19975 — both open since 2022, no fix in sight.
Workarounds shipped:
- The network-policy engine handles partition-style chaos. Works on DPv2.
- The envoy-fault engine handles HTTP-layer delay + abort via an injected Envoy sidecar. Works on DPv2 (subject to the limitation immediately below).
For non-network chaos, PodChaos / StressChaos / TimeChaos / IOChaos / JVMChaos continue to work fine on Dataplane V2. See DPv2-compatible chaos engines for the full design rationale.
Envoy injection breaks gRPC kubelet probes
This is why the chart default is sutInjection.envoyFaults: false.
The current Envoy injection model intercepts ALL inbound TCP on the SUT-declared service ports via iptables PREROUTING REDIRECT to Envoy’s listener (port 15006). Envoy speaks HTTP at L7; it does not understand gRPC health-probe payloads.
| Workload probe type | Behavior with Envoy injection |
|---|---|
| HTTP httpGet probes (e.g. Online Boutique frontend) | ✅ Works: Envoy responds to the probe |
| TCP tcpSocket probes (e.g. redis-cart) | ✅ Works: Envoy accepts the TCP handshake |
| gRPC grpc: probes on a redirected port (most Online Boutique services) | ❌ Probe fails → kubelet kills the container → CrashLoopBackOff |
| gRPC grpc: probes on a NON-redirected port | ✅ Works: no interception |
For Online Boutique specifically, --no-envoy-faults=false (i.e. injection on) leaves 9 of 12 deployments crash-looping. Until probe rewriting (Istio’s pilot-agent style) or an outbound-only redirect mode is implemented, the rule of thumb is: only enable Envoy injection for SUTs whose probes you’ve audited as HTTP-only or TCP-only.
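A probe audit can be partly automated. The sketch below lists every gRPC probe declared in a namespace's Deployments so the ports can be compared against the SUT's service ports. It assumes `kubectl` and `jq` are installed; the function name `grpc_probes` is hypothetical, and the namespace in the usage line is the one used in the examples on this page.

```shell
#!/bin/sh
# Hypothetical helper for illustration only; requires jq.

# grpc_probes: read `kubectl get deployments -o json` on stdin and print one
# "<deployment><TAB><port>" line per distinct gRPC probe port. It inspects
# liveness, readiness and startup probes of every container.
grpc_probes() {
  jq -r '
    .items[]
    | .metadata.name as $deploy
    | .spec.template.spec.containers[]
    | (.livenessProbe, .readinessProbe, .startupProbe)
    | select(.grpc != null)
    | "\($deploy)\t\(.grpc.port)"
  ' | sort -u
}

# Usage:
#   kubectl get deployments -n boutique-1 -o json | grpc_probes
```

Empty output means no Deployment in the namespace declares a gRPC probe. Any line whose port is also a service port falls into the failing row of the table above.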
Cheap escape hatch: exclude probe ports from interception
If a workload’s probe port is different from its service port (e.g. a service on 8080 with a probe on 8081), you can exempt the probe port from the iptables redirect — kubelet’s probe traffic bypasses Envoy entirely while service traffic still goes through:
# SUT-wide: exclude port 8081 from interception for every Deployment
simian sut deploy --namespace boutique-1 --no-envoy-faults=false \
--envoy-exclude-port=8081
# Per-workload: only this Deployment exempts the listed ports
spec:
  template:
    metadata:
      annotations:
        simian.chaos/envoy-exclude-ports: "8081,9090"
Or declare it on the SUT itself by implementing the EnvoyExcludePortsProvider interface (see pkg/sut/sut.go). The three layers merge.
Caveat: when probe port equals service port (Online Boutique’s situation for most workloads), exempting the port also disables fault injection against that workload. Trade-off: “no CrashLoopBackOff” vs “no fault injection on this workload.” For SUTs that need both, the full probe-rewriter (forthcoming) is the proper fix.
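To make the merge of the three layers (CLI flag, pod-template annotation, SUT-provided list) concrete, here is a minimal sketch that treats the merge as a plain set union. That semantics is an assumption; the actual behavior lives in the Simian source. The function name `merge_exclude_ports` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper for illustration only. ASSUMPTION: "merge" means the
# union of all layers, deduplicated.

# merge_exclude_ports LIST...: take any number of comma-separated port lists
# (empty lists allowed) and print their union as one comma-separated list,
# sorted numerically.
merge_exclude_ports() {
  printf '%s\n' "$@" \
    | tr ',' '\n' \
    | awk 'NF { print $1 }' \
    | sort -n -u \
    | paste -s -d, -
}

# Usage: flag value, annotation value, SUT-declared value
#   merge_exclude_ports "8081" "8081,9090" ""
```

Under this assumption, a port listed in any one layer is exempt from the redirect, so the caveat above applies to every port in the merged list.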
Workaround for arbitrary workloads
Deploy the SUT with the default (--no-envoy-faults=true), then hand-author a small Deployment whose probes you control (HTTP httpGet or TCP tcpSocket), add the Envoy sidecar + iptables init + bootstrap ConfigMap from pkg/sut/envoy/ to it, apply the EnvoyHttpDelay / EnvoyHttpAbort chaos against that pod’s label selector, and measure with curl through the Envoy listener port (15006). See Using the chaos engines for the simian chaos invocation pattern.
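The measurement step can be sketched with curl's built-in timers: take one timing before applying the chaos and one after, then compare. `request_seconds` and `delay_observed` are hypothetical helper names, and `POD_IP` in the usage lines stands for the address of your hand-authored pod.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# request_seconds URL: print the total request time in seconds, using curl's
# --write-out timer. The response body is discarded.
request_seconds() {
  curl -s -o /dev/null -w '%{time_total}\n' "$1"
}

# delay_observed BASELINE FAULTED EXPECTED: succeed when the faulted request
# took at least EXPECTED seconds longer than the baseline request.
delay_observed() {
  awk -v b="$1" -v f="$2" -v e="$3" 'BEGIN { exit !((f - b) >= e) }'
}

# Usage (15006 is the Envoy listener port):
#   base=$(request_seconds "http://$POD_IP:15006/")
#   ... apply the EnvoyHttpDelay chaos ...
#   faulted=$(request_seconds "http://$POD_IP:15006/")
#   delay_observed "$base" "$faulted" 2 && echo "delay fault is active"
```

A single pair of samples is noisy; repeating the measurement a few times and comparing medians gives a more trustworthy signal.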
Chaos Mesh on GKE Standard with Node Auto-Provisioning
The chaos-daemon DaemonSet won’t land on NAP-provisioned nodes without (a) the right default-compute-class-non-daemonset label on the chaos-mesh namespace and (b) a cloud.google.com/compute-class:NoSchedule toleration. Without both, NetworkChaos / IOChaos reconciliation fails with cannot find daemonIP on node ....
This is an install-time concern, not a Simian bug — but it affects every chaos-mesh-using install on GKE NAP. Documented in the README’s “Known cluster-side gotchas” section.
Autonomous LLM bias toward chaos-mesh
Without --hypothesis-hint, the LLM almost never picks the new network-policy or envoy-fault engines because chaos-mesh has 12+ catalog entries versus 1 for network-policy and 2 for envoy-fault. Possible mitigations: (a) tier-policy filtering, (b) explicit per-engine “weight” in the catalog, (c) prompt rule that encourages cross-engine plans. Not blocking; the hypothesis-hint workaround is reliable.
Metrics provider deferred
get_metrics returns {"configured":false,"reason":"metrics provider not configured (deferred); see roadmap.md M3 risks."}. The hook is wired; a real provider lands in a later milestone.