# gVisor Runtime Tests
These tests execute language runtime test suites inside gVisor. They serve as
high-level integration tests for the various runtimes.
## Runtime Test Components
The runtime tests have the following components:
-   [`images`][runtime-images] - These are Docker images for each language
    runtime we test. The images contain all the particular runtime tests, and
    whatever other libraries or utilities are required to run the tests.
-   `proctor` - This is a binary that acts as an agent inside the container and
    provides a uniform command-line API to list and run the various language
    tests.
-   `runner` - This is the test entrypoint invoked by `bazel run`. This binary
    spawns Docker (using the `runsc` runtime) and runs the language image with
    the `proctor` binary mounted.
-   `exclude` - Holds a CSV file for each language runtime containing the full
    path of tests that should be excluded from running, along with a reason for
    exclusion.
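
For illustration, an exclude entry pairs a test's full path with a free-form
reason. A minimal sketch of what one of these CSV files might contain, reusing
the PHP test name from the example invocation below (the header row and reason
text are illustrative assumptions, not the exact schema):

```
test name,reason
ext/standard/tests/file/bug60120.phpt,Flaky (hypothetical example entry)
```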
## Testing Locally
The following `make` targets will run an entire runtime test suite locally.

Note: The Java runtime tests take 1+ hours with 16 cores.
Language | Version | Running the test suite
-------- | ------- | ---------------------------------
Go       | 1.22    | `make go1.22-runtime-tests`
Java     | 21      | `make java21-runtime-tests`
NodeJS   | 22.2.0  | `make nodejs22.2.0-runtime-tests`
Php      | 8.3.7   | `make php8.3.7-runtime-tests`
Python   | 3.12.3  | `make python3.12.3-runtime-tests`
You can modify the runtime test behavior by passing in the following `make`
variables:

-   `RUNTIME_TESTS_FILTER`: Comma-separated list of tests to run, even if
    otherwise excluded. Useful to debug single failing test cases.
-   `RUNTIME_TESTS_PER_TEST_TIMEOUT`: Modify the per-test timeout. Useful when
    debugging a test that has a tendency to get stuck, in order to make it fail
    faster.
-   `RUNTIME_TESTS_RUNS_PER_TEST`: Number of times to run each test. Useful to
    find flaky tests.
-   `RUNTIME_TESTS_FLAKY_IS_ERROR`: Boolean indicating whether tests found to
    be flaky (i.e. running them multiple times has sometimes succeeded and
    sometimes failed) should be considered a test suite failure (`true`) or a
    success (`false`).
-   `RUNTIME_TESTS_FLAKY_SHORT_CIRCUIT`: If `true`, when running tests multiple
    times and a test has been found flaky (i.e. running it multiple times has
    succeeded at least once and failed at least once), exit immediately rather
    than running all `RUNTIME_TESTS_RUNS_PER_TEST` attempts.
Example invocation:

```
$ make php8.3.7-runtime-tests \
    RUNTIME_TESTS_FILTER=ext/standard/tests/file/bug60120.phpt \
    RUNTIME_TESTS_PER_TEST_TIMEOUT=10s \
    RUNTIME_TESTS_RUNS_PER_TEST=100
```
## Clean Up
Sometimes when runtime tests fail, or when the testing container itself crashes
unexpectedly, the containers are not removed or sometimes do not even exit.
This can cause some docker commands like `docker system prune` to hang forever.
Here are some helpful commands (should be executed in order):

```
docker ps -a                   # Lists all docker containers; useful when investigating hanging containers.
docker kill $(docker ps -a -q) # Kills all running containers.
docker rm $(docker ps -a -q)   # Removes all exited containers.
docker system prune            # Removes unused data.
```
## Updating Runtime Tests
To bump the version of an existing runtime test:
1.  Update the Docker image with the new runtime version. Rename the
    `Dockerfile` directory and update any packages or download URLs to point to
    the new version. Test building the image with
    `docker build images/runtimes/<new_runtime>` (a sketch of this step follows
    the list).
2.  Update the `runtime_test` target. The `name` field must be the directory
    name for the `Dockerfile` created in Step 1.
3.  Update the Buildkite pipeline.
4.  Run the tests and triage any failures. Some language tests are flaky (or
    never pass at all); other failures may indicate a gVisor bug or a
    divergence from Linux behavior.
5.  Update the exclude file by renaming it with the right version and adding
    any failing tests to it, each with a reason.
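
As a sketch of Step 1, suppose we are bumping the Go runtime from 1.22 to 1.23
(the version numbers here are placeholders for illustration):

```
# Rename the Dockerfile directory to match the new version.
git mv images/runtimes/go1.22 images/runtimes/go1.23
# After editing the Dockerfile to download the new runtime version,
# verify that the image still builds.
docker build images/runtimes/go1.23
```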
### Cleaning up exclude files
Usually when the runtime is updated, a lot has changed. Tests may have been
deleted, modified (fixed or broken) or added. After you have an exclude list
from Step 5 above with which all runtime tests pass, it is useful to clean up
the exclude files with the following steps:
-   Check for the existence of tests in the runtime image. See how each runtime
    lists all its tests (see the `ListTests()` implementations in the
    `proctor/lib` directory). Then you can compare against that list and remove
    any excluded tests that no longer exist.
-   Run all excluded tests with runc (native) for each runtime. If a test
    fails, we can consider it broken. Such tests should be marked with `Broken
    test` in the reason column. They don't provide a compatibility-gap signal
    for gVisor, so we can happily ignore them. Some tests which were previously
    broken may now be unbroken; for those, the reason field should be cleared.
-   Run all the unbroken and non-flaky tests on runsc (gVisor). If a test is
    now passing, it should be removed from the exclude list. This effectively
    increases our testing surface: the test used to fail, now it passes, so
    something was fixed in between, and enabling it is equivalent to adding a
    regression test for that fix.
-   Some tests are excluded and marked flaky. Run these tests 100 times on
    runsc (gVisor); see the example after this list. If a test does not flake,
    you can remove it from the exclude list.
-   Finally, close all corresponding bugs for tests that are now passing. These
    bugs are stale.
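
For the flakiness check, one way to hammer a single excluded test is to combine
the `make` variables described earlier (the target and test name below are
placeholders):

```
make go1.22-runtime-tests \
    RUNTIME_TESTS_FILTER=<excluded_test_name> \
    RUNTIME_TESTS_RUNS_PER_TEST=100 \
    RUNTIME_TESTS_FLAKY_SHORT_CIRCUIT=true
```

With `RUNTIME_TESTS_FLAKY_SHORT_CIRCUIT=true`, a mixed pass/fail result exits
immediately and confirms the flake; 100 straight passes suggest the test can
come off the exclude list.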
Creating new runtime tests for an entirely new language is similar to the
above, except that Step 1 is a bit harder: you have to figure out how to
download and run the language tests in a Docker container. Once you have that,
you must also implement the `proctor/TestRunner` interface for that language,
so that proctor can list and run the tests in the image you created.