xpd

Version: v0.0.0-...-1a23c2c
Published: Nov 6, 2021 License: Apache-2.0


Warning: this is still a proof of concept with many security and reliability tradeoffs; not meant to be used in production.

See https://github.com/0x2b3bfa0/terraform-provider-xpd for user-friendly testing instructions.

Backlog

Infrastructure
Allowing instances to destroy themselves

With the current implementation, instances can't destroy all the accompanying resources, because of interdependencies. For example, after deleting a security group, it's impossible to issue more API calls because there is no network connection.

Possible solutions include:

  • Using cloud-native templates like AWS CloudFormation, Google Cloud Deployment Manager and Azure Resource Manager templates to let providers destroy everything.

  • Leaving cheap or free resources in the cloud, and running a garbage collector on every invocation to delete resources left over from past tasks.

  • Requiring users to explicitly delete resources after each task. This approach is convenient with the launch/harvest lifecycle, but not for the CML runner.
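The garbage collector option could look roughly like the sketch below. It is only an illustration: the `Resource` type, the `Cloud` interface, the `Collect` function and the in-memory `fakeCloud` are hypothetical names that do not exist in this module, and the sketch assumes every resource is tagged with the identifier of the task that created it.

```go
package main

import (
	"fmt"
	"sort"
)

// Resource is a hypothetical record for a resource left in the cloud.
type Resource struct {
	ID     string
	TaskID string // identifier of the task that created the resource (assumed tag)
}

// Cloud is a hypothetical, minimal provider interface.
type Cloud interface {
	List() ([]Resource, error)
	Delete(id string) error
}

// Collect deletes every resource whose task is not listed in active.
// Deletion failures are collected instead of aborting the run, so that
// a later invocation can retry them.
func Collect(c Cloud, active map[string]bool) ([]string, []error) {
	resources, err := c.List()
	if err != nil {
		return nil, []error{fmt.Errorf("list resources: %w", err)}
	}
	var deleted []string
	var errs []error
	for _, r := range resources {
		if active[r.TaskID] {
			continue // still in use by a running task
		}
		if err := c.Delete(r.ID); err != nil {
			errs = append(errs, fmt.Errorf("delete %s: %w", r.ID, err))
			continue
		}
		deleted = append(deleted, r.ID)
	}
	return deleted, errs
}

// fakeCloud is an in-memory stand-in for a real provider, for illustration only.
type fakeCloud struct{ resources map[string]Resource }

func (f *fakeCloud) List() ([]Resource, error) {
	out := make([]Resource, 0, len(f.resources))
	for _, r := range f.resources {
		out = append(out, r)
	}
	// Sort by ID so that the listing order is deterministic.
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out, nil
}

func (f *fakeCloud) Delete(id string) error {
	if _, ok := f.resources[id]; !ok {
		return fmt.Errorf("resource %s not found", id)
	}
	delete(f.resources, id)
	return nil
}

func main() {
	cloud := &fakeCloud{resources: map[string]Resource{
		"bucket-1": {ID: "bucket-1", TaskID: "old-task"},
		"sg-1":     {ID: "sg-1", TaskID: "old-task"},
		"sg-2":     {ID: "sg-2", TaskID: "current-task"},
	}}
	deleted, errs := Collect(cloud, map[string]bool{"current-task": true})
	fmt.Println("deleted:", deleted, "errors:", len(errs))
}
```

Collecting errors rather than stopping at the first one fits the idea of running the collector on every invocation: whatever fails now is simply picked up again next time.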

Environment isolation and reproducibility

Machine images offered by providers have lots of quirks and don't include any of the helper tools we need to offer a good user experience, like the agent.

Custom images are the only alternative to provisioning instances on the fly, but it would be unwise to force users to run tasks in a fixed environment and, especially, to commit to building and maintaining a stable and secure reference image.

Responsiveness-wise, the most appropriate solution would be using containers or lightweight virtual machines with user-specified images, with some default general-purpose images bundled into our custom machine images to reduce load times.

Optimization

Resources' read/create/delete operations should be carried out in parallel whenever possible to speed up user-facing operations. The only sane way of doing this is to represent the resources as a directed acyclic graph and walk it as Terraform does.

https://pkg.go.dev/github.com/hashicorp/terraform/dag

Security
Key generation algorithms

Azure does not support ED25519 SSH keys, and generating RSA 4096 keys with low entropy is not a good idea. Using https://github.com/cloudflare/gokey should be good enough for now, but we should switch to ED25519 as soon as it's universally supported.

SSH communications

Machines' SSH ports are accessible from the whole internet. While this should not be an issue as long as the OpenSSH server is well written, some companies won't allow this just because of the log noise of SSH brute force attacks.

Moreover, there is no easy way of verifying the server identity without having it use a known host key. Trust on first use is no longer a viable option for this use case.

Ideally, communications should be carried over a channel that requires upfront mutual authentication, like WireGuard. This awesome blog post describes a really interesting approach: https://fly.io/blog/ssh-and-user-mode-ip-wireguard

Secrets

Currently, secrets are provisioned through the cloud-init user data script, in plain text and against the security recommendations of every cloud provider. At the very least, they should be encrypted with age using the SSH key.

Using a systemd EnvironmentFile with the decryption results should be enough.

Development
Ideas
