xpd

v0.0.0-...-1a23c2c (Latest) | Published: Nov 6, 2021 | License: Apache-2.0

README

`xpd`

⚠ Warning: this is still a proof of concept with many security and reliability tradeoffs; not meant to be used in production.

See https://github.com/0x2b3bfa0/terraform-provider-xpd for user-friendly testing instructions.

Backlog

Infrastructure

Allowing instances to destroy themselves

With the current implementation, instances can't destroy all of their accompanying resources because of interdependencies. For example, once an instance deletes its security group, it loses its network connection and can't issue any further API calls.

Possible solutions include:

Using cloud-native templates like AWS CloudFormation, Google Cloud Deployment Manager and Azure Resource Manager templates to let providers destroy everything.
Leaving cheap or free resources in the cloud, and running a garbage collector on every invocation to delete resources left over from past tasks.
Requiring users to explicitly delete resources after each task. This approach is convenient with the launch/harvest lifecycle, but not for the CML runner.

Environment isolation and reproducibility

Machine images offered by providers have lots of quirks and don't include any of the helper tools we need to offer a good user experience, like the agent.

Custom images are the only alternative to provisioning instances on the fly, but it would be unwise to force users to run tasks in a fixed environment and, especially, to commit to building and maintaining a stable and secure reference image.

Responsiveness-wise, the most appropriate solution would be using containers or lightweight virtual machines with user-specified images, and shipping some default general-purpose images with our custom machine images in order to reduce load times.

Optimization

Resources' read/create/delete operations should be carried out in parallel when possible in order to speed up user-facing operations. The only sane way of doing this is by representing the resources as a directed acyclic graph and walking it as Terraform does.

https://pkg.go.dev/github.com/hashicorp/terraform/dag

Security

Key generation algorithms

Azure does not support Ed25519 SSH keys, and generating 4096-bit RSA keys with low entropy is not a good idea. With https://github.com/cloudflare/gokey, the result should be good enough, but we should switch to Ed25519 as soon as it's universally supported.

SSH communications

Machines' SSH ports are accessible from the whole internet. While this should not be an issue as long as the OpenSSH server is well written, some companies won't allow this just because of the log noise of SSH brute force attacks.

Moreover, there is no easy way of verifying the server identity without having it use a known host key. Trust on first use is no longer a viable option for this use case.
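If the client generated the host key itself and injected it at provisioning time, it could pin the key's fingerprint instead of trusting the first connection. The sketch below only shows the comparison step, with hypothetical `fingerprint` and `verifyHostKey` helpers; it computes the OpenSSH-style SHA-256 fingerprint of a public key in SSH wire format. Hooking it into an SSH client's host key callback is not shown.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"fmt"
)

// fingerprint returns the OpenSSH-style SHA-256 fingerprint ("SHA256:" plus
// unpadded base64) of a public key given in SSH wire format.
func fingerprint(wireKey []byte) string {
	sum := sha256.Sum256(wireKey)
	return "SHA256:" + base64.RawStdEncoding.EncodeToString(sum[:])
}

// verifyHostKey reports whether the key presented by the server matches the
// fingerprint recorded when the key was generated.
func verifyHostKey(presented []byte, expected string) bool {
	actual := fingerprint(presented)
	return subtle.ConstantTimeCompare([]byte(actual), []byte(expected)) == 1
}

func main() {
	// Placeholder bytes; a real caller would pass the marshaled public key.
	hostKey := []byte("placeholder for an SSH wire-format public key")
	expected := fingerprint(hostKey) // recorded when the key is generated
	fmt.Println(expected, verifyHostKey(hostKey, expected))
}
```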

Ideally, communications should be carried over a channel that requires upfront mutual authentication, like WireGuard. This awesome blog post describes a really interesting approach: https://fly.io/blog/ssh-and-user-mode-ip-wireguard

Secrets

Currently, secrets are provisioned through the cloud-init user data script, in plain text and against the security recommendations of every cloud provider. At the very least, they should be encrypted with age using the SSH key.

Using a systemd EnvironmentFile with the decryption results should be enough.
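The last step, writing the decrypted secrets in a form that systemd can load through EnvironmentFile, might look like the sketch below. Decryption with age is not shown, and `renderEnvironmentFile` is a hypothetical helper. To stay on the safe side of systemd's quoting rules, it single-quotes every value and rejects values containing a single quote or a newline instead of trying to escape them.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// validName matches conventional environment variable names.
var validName = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// renderEnvironmentFile formats secrets as NAME='value' lines, sorted by name.
// Values containing a single quote or a newline are rejected rather than
// escaped, which is a simplification made for this sketch.
func renderEnvironmentFile(secrets map[string]string) (string, error) {
	names := make([]string, 0, len(secrets))
	for name := range secrets {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic output

	var b strings.Builder
	for _, name := range names {
		value := secrets[name]
		if !validName.MatchString(name) {
			return "", fmt.Errorf("invalid variable name %q", name)
		}
		if strings.ContainsAny(value, "'\n") {
			return "", fmt.Errorf("value of %s contains a quote or a newline", name)
		}
		fmt.Fprintf(&b, "%s='%s'\n", name, value)
	}
	return b.String(), nil
}

func main() {
	out, err := renderEnvironmentFile(map[string]string{
		"EXAMPLE_TOKEN": "not a real secret",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

The resulting file should be written with restrictive permissions (for example 0600) so that only the service's user can read it.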

Development

Ideas

Refactor orchestrator events & status as separate data sources
Escape rclone connection strings properly https://github.com/rclone/rclone/issues/4996#issuecomment-778628943
Chain attributes read to the outer task structs with pointers
Rename regions to locations so it's valid for zones and node selectors
Refactor API error checks to use err.(native.Type).Code type assertions
Migrate GCP API to https://github.com/googleapis/google-cloud-go
Consider using https://github.com/openlyinc/pointy on every provider

Directories

Path	Synopsis
task
amazon
amazon/client
amazon/resources
google
google/client
google/resources
kubernetes
kubernetes/client
kubernetes/resources
microsoft
microsoft/client
microsoft/resources
universal
universal/ssh