r9y-map

Project maintained by r9y-dev Hosted on GitHub Pages — Theme by mattgraham

Fault injection

Definition: Fault injection is a testing technique that deliberately introduces errors into a system to evaluate its resilience, error handling, and recovery mechanisms. Instead of waiting for real failures in production, engineers simulate faults under controlled conditions.

Examples of Fault Injection

Network latency injection: Artificially delay requests between services to test timeout handling and retries
Service crash simulation: Kill a microservice instance to validate failover and load balancing
Database failure: Drop connections or return errors to test fallback logic
Resource exhaustion: Limit CPU or memory to observe degradation behavior
Corrupted data injection: Feed invalid or malformed inputs to exercise input validation and error handling
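The latency and error examples above can be sketched at the application level as a wrapper around a dependency call. This is a minimal illustration, not a production tool: `inject_faults`, `InjectedFault`, `with_fallback`, and `fetch_profile` are hypothetical names invented for this example.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Stands in for a real dependency error (hypothetical name)."""


def inject_faults(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Delay every call by latency_s and fail a fraction error_rate of calls."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # latency injection
            if rng.random() < error_rate:
                # error injection: fail before the real call happens
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


def with_fallback(call, fallback):
    """The error handling under test: use the fallback if the call fails."""
    try:
        return call()
    except InjectedFault:
        return fallback


@inject_faults(error_rate=1.0)  # every call fails, so the fallback must be used
def fetch_profile():
    return {"name": "alice"}


print(with_fallback(fetch_profile, fallback={"name": "guest"}))
# prints {'name': 'guest'}
```

Passing `sleep` and `rng` as parameters is a design choice that keeps the injected behavior deterministic in tests.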

Steps for conducting a chaos engineering experiment:

Define steady state: Identify the specific metrics (e.g., latency, throughput) that you will monitor and establish a baseline for them.
Formulate a hypothesis: Write a single testable statement, for example, ‘Deleting this container pod will not affect user login’. Hypotheses are generally derived from customer user journeys, which are turned into test scenarios.
Use a controlled environment: Although one chaos engineering principle states that experiments should run in production, start small. Run the experiment in a non-production environment first, learn and adjust, and then gradually expand the scope to production.
Inject failures: Cause disruption either directly in the system (e.g., deleting a VM, stopping a database instance) or indirectly in its environment (e.g., deleting a network route, adding a firewall rule).
Automate experiment execution: Automation makes chaos engineering repeatable and scalable. This includes automated fault injection (e.g., as part of a CI/CD pipeline) and automated rollback mechanisms.
Derive actionable insights: The primary objective of chaos engineering is to uncover system vulnerabilities and thereby improve resilience. Analyze the experimental results, identify weaknesses and areas for improvement, and share the findings with the relevant teams to inform later experiments and system enhancements.
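The steps above can be sketched as a small experiment runner. This is an illustration under stated assumptions: the `system` dictionary is a toy stand-in for a real service, and `run_experiment`, `ExperimentResult`, and the helper functions are hypothetical names, not part of any chaos engineering tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float
    observed: float
    hypothesis_held: bool


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    tolerance: float,
) -> ExperimentResult:
    """Compare a steady-state metric before and during an injected fault."""
    baseline = measure()      # define steady state
    inject()                  # inject the failure
    try:
        observed = measure()  # observe the system under fault
    finally:
        rollback()            # always restore the system
    held = abs(observed - baseline) <= tolerance
    return ExperimentResult(baseline, observed, held)


# Toy stand-in for a real service: logins succeed while any replica is up.
system = {"replicas": 3}


def login_success_rate() -> float:
    return 1.0 if system["replicas"] > 0 else 0.0


def kill_one_replica() -> None:
    system["replicas"] -= 1


def restore_replicas() -> None:
    system["replicas"] = 3


# Hypothesis: deleting one replica will not affect user login.
result = run_experiment(
    login_success_rate, kill_one_replica, restore_replicas, tolerance=0.01
)
print(result)
```

In a real experiment, `measure` would query a monitoring system, and a failed hypothesis would be recorded as a finding to act on.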

Best Practices:

Set up observability tools beforehand
Define steady state clearly: Use measurable indicators (latency, error rate, throughput)
Have rollback mechanisms: Always be able to stop the experiment instantly
Document learnings: Convert findings into system improvements
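One way to make "stop the experiment instantly" concrete is a guardrail that aborts the experiment when a metric crosses a limit, paired with a revert step that always runs. The sketch below is illustrative only: `fault`, `check_guardrail`, and `ExperimentAborted` are hypothetical names, and the error-rate readings are made up.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


class ExperimentAborted(Exception):
    """Raised when a guardrail metric crosses its abort limit."""


@contextmanager
def fault(apply: Callable[[], None], revert: Callable[[], None]) -> Iterator[None]:
    """Apply a fault for the duration of the with-block, then always revert it."""
    apply()
    try:
        yield
    finally:
        revert()


def check_guardrail(error_rate: float, limit: float) -> None:
    """Abort the experiment if the observed error rate exceeds the limit."""
    if error_rate > limit:
        raise ExperimentAborted(
            f"error rate {error_rate:.0%} exceeds limit {limit:.0%}"
        )


state = {"fault_active": False}
aborted = False
try:
    with fault(
        lambda: state.update(fault_active=True),
        lambda: state.update(fault_active=False),
    ):
        for observed_error_rate in (0.01, 0.02, 0.20):  # made-up readings
            check_guardrail(observed_error_rate, limit=0.05)
except ExperimentAborted as exc:
    aborted = True
    print("aborted:", exc)

print("fault still active:", state["fault_active"])
# prints: fault still active: False
```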

Tools and Products for Fault injection:

1. Gremlin:

Enterprise-grade platform for infrastructure, application, and network fault injection

2. AWS Fault Injection Simulator:

Managed service for running controlled fault experiments on AWS workloads (since renamed AWS Fault Injection Service)

3. Azure Chaos Studio:

Managed fault injection and resilience testing platform for Azure apps

4. Chaos Mesh:

Kubernetes-native fault injection (network, pod, I/O failures)

5. Chaos Monkey:

Randomly terminates instances to test resilience

6. Toxiproxy:

TCP proxy that simulates latency, bandwidth limits, timeouts, and connection resets between services

7. Chaos Toolkit:

Declarative chaos experiments defined in JSON or YAML (recommended by Google Cloud)
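To give a sense of what a declarative experiment looks like, the sketch below builds a Chaos Toolkit style experiment as a Python dictionary and prints it as JSON. The field layout follows the Chaos Toolkit experiment format as I understand it, so check it against the official documentation before use. The URL, pod name, and titles are placeholders invented for this example.

```python
import json

# Hypothetical experiment: the URL and pod name are placeholders.
experiment = {
    "version": "1.0.0",
    "title": "Login stays available when one pod is deleted",
    "description": "Illustrative experiment with placeholder targets.",
    "steady-state-hypothesis": {
        "title": "Login endpoint responds with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "login-responds",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/login",
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "delete-one-pod",
            "provider": {
                "type": "process",
                "path": "kubectl",
                "arguments": "delete pod login-0",
            },
        }
    ],
    "rollbacks": [],
}

document = json.dumps(experiment, indent=2)
print(document)
```

Saved to a file such as `experiment.json`, a document like this would typically be executed with `chaos run experiment.json`.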

Prerequisites:

Observability in place: the system must be measurable before faults are injected
Stable baseline: normal system behavior is clearly defined
Ability to limit the blast radius
Safety mechanisms: the ability to deploy and roll back quickly
An engineering culture that supports experimentation
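Limiting the blast radius can be as simple as deterministically selecting a small percentage of targets for the experiment. The sketch below is one possible approach, not a prescribed method: `in_blast_radius` and the host names are hypothetical.

```python
import hashlib


def in_blast_radius(target_id: str, percent: float, experiment: str = "exp-1") -> bool:
    """Return True if this target falls in the chosen percentage (0 to 100).

    Hashing the experiment name together with the target ID gives each target
    a stable bucket, so the same targets are selected on every run.
    """
    digest = hashlib.sha256(f"{experiment}:{target_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0 .. 9999
    return bucket < percent * 100


hosts = [f"host-{i}" for i in range(1000)]  # made-up target names
selected = [h for h in hosts if in_blast_radius(h, percent=10)]
print(f"{len(selected)} of {len(hosts)} hosts selected for fault injection")
```

Starting with a small percentage and raising it gradually matches the advice to expand the scope of an experiment step by step.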

Next:

TBC

Related terms:

Chaos Engineering
Resilience Testing
Fault Tolerance
High Availability
Circuit Breaker Pattern
Retry and Backoff Strategies
Observability (logs, metrics, traces)
