xpd
⚠ Warning: this is still a proof of concept with many security and
reliability tradeoffs; not meant to be used in production.
See https://github.com/0x2b3bfa0/terraform-provider-xpd for user-friendly testing instructions.
Backlog
Infrastructure
Allowing instances to destroy themselves
With the current implementation, instances can't destroy all the accompanying
resources, because of interdependencies. For example, after deleting a security
group, it's impossible to issue more API calls because there is no network
connection.
Possible solutions include:
- Using cloud-native templates like AWS CloudFormation, Google Cloud Deployment
  Manager and Azure Resource Manager templates to let providers destroy
  everything.
- Leaving cheap or free resources in the cloud, and running a garbage
  collector on every invocation to delete resources from past tasks.
- Requiring users to explicitly delete resources after each task. This approach
  is convenient with the launch/harvest lifecycle, but not for the CML runner.
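The garbage collector option could look roughly like the following sketch. It assumes every resource is tagged with the identifier of the task that created it; the `Resource` type, the `Collect` function and the `del` callback are hypothetical names used for illustration, not part of the current implementation.

```go
package main

import "fmt"

// Resource is a hypothetical, provider-agnostic record of a cloud resource
// tagged with the identifier of the task that created it.
type Resource struct {
	ID     string
	TaskID string
}

// Collect deletes every resource whose task is not in the active set.
// It keeps going after a failed deletion, so that one stuck resource does not
// block the rest, and returns the IDs it deleted plus any errors.
func Collect(all []Resource, active map[string]bool, del func(Resource) error) (deleted []string, errs []error) {
	for _, r := range all {
		if active[r.TaskID] {
			continue // still in use by a running task
		}
		if err := del(r); err != nil {
			errs = append(errs, fmt.Errorf("deleting %s: %w", r.ID, err))
			continue
		}
		deleted = append(deleted, r.ID)
	}
	return deleted, errs
}

func main() {
	leftovers := []Resource{
		{ID: "bucket-1", TaskID: "task-old"},
		{ID: "key-pair-2", TaskID: "task-current"},
	}
	active := map[string]bool{"task-current": true}
	deleted, errs := Collect(leftovers, active, func(r Resource) error {
		// A real implementation would call the provider API here.
		fmt.Println("would delete", r.ID)
		return nil
	})
	fmt.Println(deleted, errs)
}
```

Because the collector runs from the user's machine on a later invocation, it does not suffer from the instance cutting off its own network access.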
Environment isolation and reproducibility
Machine images offered by providers have lots of quirks and don't include any of
the helper tools we need to offer a good user experience, like the agent.
Custom images are the only alternative to provisioning instances on the fly, but
it would be unwise to force users to run tasks in a fixed environment and,
especially, to commit to building and maintaining a stable and secure reference
image.
Responsiveness-wise, the most appropriate solution would be using containers or
lightweight virtual machines with user-specified images, and shipping some
default general-purpose images with our custom machine images in order to
reduce load times.
Optimization
Resources' read/create/delete operations should be carried out in parallel when
possible in order to speed up the user-facing operations. The only sane way of
doing this is by representing the resources as a directed acyclic graph and
walking it as Terraform does.
https://pkg.go.dev/github.com/hashicorp/terraform/dag
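As a rough illustration of the idea, here is a minimal parallel graph walk written with the standard library only. It is not Terraform's implementation; the `Graph` type, the `Walk` function and the resource names are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// Graph maps each resource name to the names of the resources it depends on.
// The resource names used in this sketch are hypothetical.
type Graph map[string][]string

// Walk calls visit once per node, running independent nodes in parallel and
// starting a node only after all of its dependencies have finished.
// It assumes the graph is acyclic; a cycle would make it block forever.
func Walk(g Graph, visit func(name string) error) error {
	done := make(map[string]chan struct{}, len(g))
	for name := range g {
		done[name] = make(chan struct{})
	}
	// Reject dependencies on nodes that are not part of the graph.
	for name, deps := range g {
		for _, dep := range deps {
			if _, ok := done[dep]; !ok {
				return fmt.Errorf("%s depends on unknown resource %s", name, dep)
			}
		}
	}

	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
		failed   = make(chan struct{})
	)
	for name, deps := range g {
		wg.Add(1)
		go func(name string, deps []string) {
			defer wg.Done()
			defer close(done[name]) // unblock dependents even when skipped
			for _, dep := range deps {
				<-done[dep]
			}
			select {
			case <-failed:
				return // an earlier node failed; skip nodes that have not started
			default:
			}
			if err := visit(name); err != nil {
				once.Do(func() {
					firstErr = err
					close(failed)
				})
			}
		}(name, deps)
	}
	wg.Wait()
	return firstErr
}

func main() {
	g := Graph{
		"network":        nil,
		"key-pair":       nil,
		"security-group": {"network"},
		"instance":       {"security-group", "key-pair"},
	}
	err := Walk(g, func(name string) error {
		fmt.Println("creating", name)
		return nil
	})
	fmt.Println("error:", err)
}
```

Deletion would walk the same graph with the edges reversed, so that dependents are removed before the resources they rely on.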
Security
Key generation algorithms
Azure does not support ED25519 SSH keys, and generating 4096-bit RSA keys from
a low-entropy source is not a good idea. With
https://github.com/cloudflare/gokey, key generation should be good enough, but
we should switch to ED25519 as soon as it's universally supported.
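A minimal sketch of choosing the key algorithm per provider follows. It uses the operating system's random source through `crypto/rand` rather than gokey's derivation, and the `GenerateKey` and `KeyType` functions, as well as the `ed25519Supported` flag, are hypothetical names introduced for illustration.

```go
package main

import (
	"crypto"
	"crypto/ed25519"
	"crypto/rand"
	"crypto/rsa"
	"fmt"
)

// GenerateKey returns an ED25519 key when the target provider accepts one,
// and falls back to a 4096-bit RSA key otherwise.
// Assumption: the caller knows whether the provider supports ED25519.
func GenerateKey(ed25519Supported bool) (crypto.Signer, error) {
	if ed25519Supported {
		_, private, err := ed25519.GenerateKey(rand.Reader)
		if err != nil {
			return nil, err
		}
		return private, nil
	}
	private, err := rsa.GenerateKey(rand.Reader, 4096)
	if err != nil {
		return nil, err
	}
	return private, nil
}

// KeyType describes the algorithm of a generated key, for logging or tests.
func KeyType(key crypto.Signer) string {
	switch public := key.Public().(type) {
	case ed25519.PublicKey:
		return "ed25519"
	case *rsa.PublicKey:
		return fmt.Sprintf("rsa-%d", public.N.BitLen())
	default:
		return "unknown"
	}
}

func main() {
	key, err := GenerateKey(true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(KeyType(key))
}
```

Passing `false` takes the RSA branch, which is noticeably slower because 4096-bit key generation has to search for large primes.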
SSH communications
Machines' SSH ports are accessible from the whole internet. While this should
not be an issue as long as the OpenSSH server is well written, some companies
won't allow this just because of the log noise of SSH brute force attacks.
Moreover, there is no easy way of verifying the server identity without having
it use a known host key. Trust on first use is no longer a viable option for
this use case.
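One way to picture the known host key approach is the sketch below: the client generates the host key before the instance exists, records its fingerprint, and later refuses any server that presents a different key. How the private key reaches the instance is left out on purpose, and `NewHostKey`, `Fingerprint` and `VerifyHostKey` are hypothetical helpers, not existing code.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"encoding/binary"
	"fmt"
)

// sshString appends s to buf in SSH wire format:
// a big-endian uint32 length followed by the bytes.
func sshString(buf, s []byte) []byte {
	var length [4]byte
	binary.BigEndian.PutUint32(length[:], uint32(len(s)))
	buf = append(buf, length[:]...)
	return append(buf, s...)
}

// Fingerprint returns the OpenSSH-style SHA256 fingerprint of an ED25519
// public key, in the same format that ssh-keygen -l prints.
func Fingerprint(public ed25519.PublicKey) string {
	blob := sshString(nil, []byte("ssh-ed25519"))
	blob = sshString(blob, public)
	sum := sha256.Sum256(blob)
	return "SHA256:" + base64.RawStdEncoding.EncodeToString(sum[:])
}

// NewHostKey generates the host key on the client side, so that its
// fingerprint is known before the first connection.
func NewHostKey() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// VerifyHostKey returns an error unless the key presented by the server
// matches the pinned fingerprint.
func VerifyHostKey(pinned string, presented ed25519.PublicKey) error {
	if got := Fingerprint(presented); got != pinned {
		return fmt.Errorf("host key mismatch: got %s, want %s", got, pinned)
	}
	return nil
}

func main() {
	public, _, err := NewHostKey()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	pinned := Fingerprint(public)
	fmt.Println("pinned:", pinned)
	fmt.Println("verify:", VerifyHostKey(pinned, public))
}
```

With golang.org/x/crypto/ssh, the same pinning can be expressed by passing `ssh.FixedHostKey` as the client's `HostKeyCallback`.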
Ideally, communications should be carried over a channel that requires upfront
mutual authentication, like WireGuard. This awesome blog post describes a really
interesting approach: https://fly.io/blog/ssh-and-user-mode-ip-wireguard
Secrets
Currently, secrets are provisioned through the cloud-init user data script, in
plain text and against the security recommendations of every cloud provider. At
the very least, they should be encrypted with age using the SSH key.
Writing the decrypted results to a systemd EnvironmentFile should be enough.
Development
Ideas