Terraflakes
Repeatedly apply and destroy a Terraform environment with the goal of finding flakes, treating the
Terraform configuration as a black box.
Status: EARLY BETA code
Usage
- Run terraflakes:
$ cd /path/to/terraform-environment
$ terraflakes
[time passes...]
[terraflakes terminates]
- Verify by hand that everything is cleaned up; if not, recover accordingly:
$ terraform plan
Recommendation
Since terraflakes uses terraform and is normally long-running, it makes sense to run it
within a tmux session, so that in case the connection to the host is interrupted, terraflakes and
terraform can keep going and eventually terminate gracefully.
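As a minimal sketch of this recommendation, the helper below starts terraflakes in a detached tmux session. The function name, the session name and the TMUX_BIN override are illustrative assumptions, not part of terraflakes; the override exists only so that the helper can be dry-run without tmux installed.

```shell
# Hypothetical helper: start terraflakes inside a detached tmux session.
# TMUX_BIN is an assumption of this sketch, used only for dry runs (TMUX_BIN=echo).
start_terraflakes_session() {
  session="$1"   # tmux session name (illustrative)
  workdir="$2"   # directory of the Terraform environment
  shift 2        # remaining arguments are passed to terraflakes unchanged
  "${TMUX_BIN:-tmux}" new-session -d -s "$session" -c "$workdir" terraflakes "$@"
}

# Usage (not executed here):
#   start_terraflakes_session flakes /path/to/terraform-environment
#   tmux attach -t flakes     # reattach later; detach again with Ctrl-b d
```

If the SSH connection drops, the tmux server keeps running on the host, so the session can be reattached after reconnecting.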
Options
--repeat N repeat the apply/destroy sequence N times (see also --max-duration) [default: 10]
--max-duration D stop after duration D (upper limit to --repeat) [default: 1h]
Secrets handling
Terraflakes calls terraform, so you will probably have to make secrets available to it, for example:
$ summon terraflakes ...
$ aws-vault exec PROFILE -- terraflakes ...
...
Interrupting Terraflakes
A single SIGINT (keyboard: Ctrl-C) or a SIGTERM sent to terraflakes will also be received by the
underlying terraform. Both processes will perform a graceful shutdown.
A second SIGINT or SIGTERM will terminate terraform immediately, probably leaving an inconsistent
state. Don't do that!
A graceful shutdown does not mean that you will be left with a destroyed workspace: run
terraform plan, assess the output and clean up accordingly.
Safety first
Terraflakes attempts to make its default behavior as safe as possible, but it requires the user's
collaboration.
Choose the value of --max-duration wisely. It must not exceed the lifetime of the cloud
credentials needed to perform terraform operations.
At the beginning of each cycle, terraflakes looks at --max-duration and enters the apply/destroy
sequence only if it has enough time left (based on the statistics gathered so far) to run
said sequence. This reduces (but does not remove!) the risk of stale locks or inconsistent
state when the credentials expire mid-flight.
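The check described above can be pictured with the following sketch. It is not the actual terraflakes implementation: the function name is hypothetical, and how terraflakes derives its per-cycle estimate from the collected statistics is not documented here, so the estimate is simply taken as an input. All values are whole seconds.

```shell
# Sketch only: the real estimator used by terraflakes may differ.
# Succeeds (exit status 0) if another apply/destroy cycle fits in the time budget.
has_time_for_cycle() {
  elapsed="$1"        # seconds spent since terraflakes started
  max_duration="$2"   # value of --max-duration, converted to seconds
  estimate="$3"       # assumed estimate for one apply/destroy cycle, in seconds
  [ $(( elapsed + estimate )) -le "$max_duration" ]
}

# Usage: with a 1h budget (3600 s) and cycles estimated at 500 s,
# a cycle may start at 3000 s elapsed but not at 3300 s.
#   has_time_for_cycle 3000 3600 500 && echo "start another cycle"
```

Because the estimate comes from past cycles, a slower-than-usual cycle can still overrun the budget, which is why the risk is reduced but not removed.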
Examples
The examples have a parametric failure rate, by default 30% (see
examples/tf/random-failure.sh).
Warnings
Double-check that you are invoking it in the correct directory and correct workspace. DO NOT RUN IN
YOUR PRODUCTION ENVIRONMENT.
The recommended approach is to use a dedicated cloud account for testing (so that it is impossible
for the tests and the apply/destroy to spill into production).
2. Leftover resources cost money
If the final terraform destroy is not called or fails midway (e.g. cloud credentials
expire during a terraform operation, a bug in terraflakes, a bug in something else), this tool will
leave resources in your cloud, for which you will have to pay.
Consider a fallback mechanism to ensure no cloud resources are left around, such as a script that
wraps terraflakes and sends an alarm in case of non-zero exit status.
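One possible shape for such a wrapper is sketched below. The function name and the alarm message are assumptions; the echo to stderr stands in for whatever notification mechanism you actually use (mail, chat webhook, pager).

```shell
# Hypothetical wrapper: run a command and raise an alarm on non-zero exit status.
# Replace the echo with your own notification mechanism.
run_with_alarm() {
  status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "ALARM: '$*' exited with status $status; check for leftover cloud resources" >&2
  fi
  return "$status"
}

# Usage (not executed here):
#   run_with_alarm terraflakes --max-duration 1h
```

The wrapper returns the original exit status, so it can itself be called from a scheduler or CI job that reacts to failures.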
Some failures cannot be recovered by a terraform destroy. For example, if you have
shared state and the credentials expire mid-operation, terraform will not release the state lock
and will punt to human intervention (force-unlock).
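A manual recovery from a stale lock might look like the sketch below, run from the affected Terraform environment after renewing the credentials. The helper name and the TERRAFORM_BIN override are assumptions made for illustration (the override only allows a dry run); the lock ID is the one that terraform prints in its state lock error message.

```shell
# Hypothetical recovery helper: release a stale state lock, then inspect what is left.
# TERRAFORM_BIN is an assumption of this sketch, used only for dry runs.
recover_stale_lock() {
  lock_id="$1"   # lock ID reported by terraform in the lock error message
  tf="${TERRAFORM_BIN:-terraform}"
  "$tf" force-unlock "$lock_id" && "$tf" plan
}

# Usage (not executed here):
#   recover_stale_lock LOCK_ID
```

Only force-unlock a lock that you are sure is stale, that is, when no other terraform process is still operating on the same state.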
Build and install
go build
- Copy the generated terraflakes executable to a directory in your $PATH.
License
This code is released under the MIT license.
References