Step Auto Scaling Group (ASG) Deployer a.k.a. Asgard
Deploy your 12-factor applications to AWS easily and securely with the Step Auto Scaling Group (ASG) Deployer, a.k.a. Asgard.
Asgard's goals/requirements/features are:
- Ephemeral Blue/Green: create new instances, wait for them to become healthy, delete old instances.
- Declarative: describe what a successful release looks like, not how to deploy it.
- Scalable: can scale both vertically (larger instances) and horizontally (more instances).
- Secure: resources are verified to ensure that they cannot be used accidentally or maliciously.
- Fail Gracefully: recover from failures and roll back with minimal or no impact to users.
- Configuration Parity: minimize divergence between production, staging and development environments by keeping releases as similar as possible.
- Cattle not Pets: treat compute instances as disposable and ephemeral.
- No Deployer Configuration: no configuration and minimal setup needed to get Asgard up and running.
- Multi Account: one deployer for all AWS accounts.
Getting Started
Asgard consists of an AWS Lambda function (with an IAM role) and an AWS Step Function. You can bootstrap these into AWS with:
git pull # pull down new code
./scripts/bootstrap
Testing Asgard with deploy-test
Asgard includes a test project deploy-test that has one service, web: an nginx server mounted behind an Elastic Load Balancer (ELB) and a load balancer target group. The service instances have a security group and an instance profile.
To create the AWS resources for deploy-test:
./scripts/geo apply resources/deploy-test-resources.rb
Note: you will also have to tag the latest Ubuntu release AMI so that Asgard is allowed to use it (see Resources below).
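One way to do this with the AWS CLI is sketched below; the AMI ID is a placeholder for the Ubuntu image you want to use, and the Name and DeployWith tags match what the release file and the Resources section expect:
# placeholder AMI ID; use the ID of the Ubuntu image you want to deploy
aws ec2 create-tags --resources ami-1234567 \
  --tags Key=Name,Value=ubuntu Key=DeployWith,Value=step-asg-deployer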
A deploy-test release file, deployer-test-release.json, looks like:
{
"project_name": "coinbase/deploy-test",
"config_name": "development",
"subnets": ["test_private_subnet_a", "test_private_subnet_b"],
"ami": "ubuntu",
"user_data": "{{USER_DATA_FILE}}",
"services": {
"web": {
"instance_type": "t2.nano",
"security_groups": ["ec2::coinbase/deploy-test::development"],
"elbs": ["coinbase-deploy-test-web-elb"],
"profile": "coinbase-deploy-test",
"target_groups": ["coinbase-deploy-test-web-tg"]
}
}
}
The user data for the release is stored in the file deployer-test-release.json.userdata:
#cloud-config
repo_update: true
repo_upgrade: all
packages:
- docker.io
runcmd:
- docker run -d --restart always --name test_server -p 8000:80 nginx
To build a release for deploy-test and send it to Asgard, we use the step-asg-deployer executable:
step-asg-deployer deploy deploy-test-release.json
Asgard then:
- validates the sent release and any referenced resources.
- creates a new auto-scaling group for web, which is configured to start an nginx server.
- waits for the EC2 instances in the web ASG to become healthy behind the ELB and target group. Healthy means that the health checks for both the ELB and the target group pass.
- once healthy, terminates the old ASG and its instances.
Asgard Release
An Asgard release is a request to deploy a Project-Configuration where:
- A Project is a code base, typically named org/name.
- A Configuration is the environment the project is being deployed into, e.g. development or production.
Each release can define 1-to-many Services; each service is a logical group of servers, e.g. web or worker, that maps to a single auto-scaling group (ASG).
When Asgard is sent a release, it moves it through a state machine:
- Validate: validates that the release is correct.
- Lock: grabs a lock on the project-configuration.
- ValidateResources: validates resources w.r.t. the project, configuration and service using them.
- Deploy: creates an ASG and other resources for each service.
- CheckHealthy: checks whether the new instances are healthy w.r.t. their ASGs, ELBs and target groups. If any instances are seen terminating, the release is halted immediately.
- CleanUpSuccess: if the release was a success, deletes the old ASGs.
- CleanUpFailure: if the release failed, deletes the new ASGs.
- ReleaseLockFailure: releases the lock, then fails.
At each of these states it is possible to fail and then move towards a failure state. The typical failures are:
- BadReleaseError: The release sent was invalid because either its structure was incorrect, its values were invalid, or its resources were invalid.
- LockExistsError: Could not grab the lock because either another deploy for the project-configuration is currently going out, or a previous deploy left a lock in place.
- DeployError: Unable to create a new ASG or resource.
- HaltError: Halt was detected or instances were found terminating.
- TimeoutError: The deploy took too long and failed.
The end states are:
- Success: the release went as planned.
- FailureClean: the release was unsuccessful, but cleanup succeeded, so AWS was left in a good state.
- FailureDirty: the release was unsuccessful and cleanup failed, so AWS was left in a bad state. This should never happen; if it does, alert and file a bug.
- It is possible not to end in one of these states if the state machine is incorrect. This is very bad; if it happens, alert and file a bug.
Resources
A release uses resources that must exist and be configured correctly to be used for the project-configuration-service being deployed.
A release must have:
- an AMI defined with the ami key, which can be either a Name tag or an AMI ID, e.g. ami-1234567
- Subnets defined with the subnets key, which is a list of either Name tags or subnet IDs, e.g. subnet-1234567
Both of the above resources MUST have a DeployWith tag equal to step-asg-deployer.
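For example, a subnet referenced by its Name tag could be opted in like this (the subnet ID is a placeholder):
# placeholder subnet ID; repeat for each subnet listed in the release
aws ec2 create-tags --resources subnet-1234567 \
  --tags Key=Name,Value=test_private_subnet_a Key=DeployWith,Value=step-asg-deployer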
Services can have:
- Security Groups defined with the security_groups key, which is a list of security group Name tags
- Elastic Load Balancers defined with the elbs key, which is a list of ELB names
- Application Load Balancer Target Groups defined with the target_groups key, which is a list of target group Name tags
All of the above resources MUST be tagged with the ProjectName, ConfigName and ServiceName of the release to ensure that resources are assigned correctly.
Services can also have an Instance Profile defined by the profile key, which is an instance profile Name tag. The role's path MUST be equal to /<project_name>/<config_name>/<service_name>/.
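As a rough sketch, opting in the deploy-test web service's security group and creating a role with the required path might look like the following; the security group ID, role name and assume-role policy document are placeholders, and the path is assumed to map the project name directly into path segments:
# tag the security group so Asgard will accept it for coinbase/deploy-test development web
aws ec2 create-tags --resources sg-1234567 \
  --tags Key=Name,Value=ec2::coinbase/deploy-test::development \
  Key=ProjectName,Value=coinbase/deploy-test Key=ConfigName,Value=development Key=ServiceName,Value=web

# create the role for the coinbase-deploy-test instance profile with the required path
aws iam create-role --role-name coinbase-deploy-test \
  --path /coinbase/deploy-test/development/web/ \
  --assume-role-policy-document file://ec2-assume-role.json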
Scale
Asgard makes it easy to scale both vertically and horizontally. To scale deploy-test we add to the release:
{ ...
"services": {
"web": { ...
"instance_type": "c4.xlarge",
"ebs_volume_size": 20,
"ebs_volume_type": "gp2",
"ebs_device_name": "/dev/sda1",
"autoscaling": {
"min_size": 3,
"max_size": 5,
"spread": 0.2,
"max_terms": 1,
"policies": [
{
"type": "cpu_scale_up",
"threshold" : 25,
"scaling_adjustment": 2
},
{
"type": "cpu_scale_down",
"threshold" : 15,
"scaling_adjustment": -1
}
]
}
}
}
}
instance_type is the EC2 instance type for the service.
ebs_volume_size, ebs_volume_type and ebs_device_name define the attached EBS volume: its size in GB, its type, and the device name it is attached as.
The autoscaling key defines the horizontal scaling of a service:
- all calculations are bounded by min_size and max_size.
- the desired_capacity is equal to the min_size or the capacity of the previously launched service.
- the actual number of instances launched is desired_capacity * (1 + spread).
- to be deemed healthy, the service must have at least desired_capacity * (1 - spread) healthy instances.
- if the number of terminating instances exceeds max_terms (default 0), the release immediately halts.
The policies defined above increase the desired_capacity by 2 instances if CPU usage goes above 25%, and reduce it by 1 instance if it drops below 15%.
Both spread and max_terms are useful when launching many instances, because the number of cloud errors increases with scale.
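As a rough illustration using the release above: with min_size 3 and spread 0.2, the desired_capacity starts at 3, up to 3 * (1 + 0.2) = 3.6 instances (roughly 4, depending on how Asgard rounds) are launched, and the service is deemed healthy once at least 3 * (1 - 0.2) = 2.4 (roughly 2 to 3) instances are healthy; with max_terms 1, the release tolerates at most one terminating instance before halting.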
User Data
Do not put sensitive data into user data. Asgard does not treat user data as secure information; it is difficult to secure with IAM and it is very limited in size. We recommend using Vault, AWS Parameter Store, or KMS-encrypted S3 authenticated by a service's instance profile.
The user_data in the release is the plain text sent to each instance at initialization. Asgard will replace some strings with information about the release, project, config and service, e.g.:
...
write_files:
- path: /
content: |
{{RELEASE_ID}}
{{PROJECT_NAME}}
{{CONFIG_NAME}}
{{SERVICE_NAME}}
Asgard will replace {{PROJECT_NAME}} with the name of the project and {{SERVICE_NAME}} with the name of the service. This can be useful for service-specific configuration and logging.
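For the deploy-test release above, the rendered content would look roughly like this (the release ID is generated per release and shown here as a placeholder):
write_files:
  - path: /
    content: |
      <generated-release-id>
      coinbase/deploy-test
      development
      web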
If user_data is equal to {{USER_DATA_FILE}} and the release is deployed with step-asg-deployer, the value will be replaced with the contents of <release_file>.userdata, e.g. deployer-test-release.json.userdata.
Timeout
A release can have a timeout, which is how long in seconds the release will wait for its services to become healthy. By default the timeout is 10 minutes; the maximum value is around a year (31556926 seconds), since that is how long a step function can run.
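For example, to give a release 30 minutes, you could add the following (assuming timeout sits at the top level of the release file, alongside project_name and services):
{ ...
  "timeout": 1800,
  "services": { ... }
}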
Lifecycle
AWS provides Auto Scaling Group Lifecycle Hooks to detect and react to auto-scaling events. You can add the lifecycle hooks to the ASGs with:
{ ...
"lifecycle": {
"termhook" : {
"transition": "autoscaling:EC2_INSTANCE_TERMINATING",
"role": "asg_lifecycle_hooks",
"sns": "asg_lifecycle_hooks",
"heartbeat_timeout": 300
}
}
}
These can be used to gracefully shut down instances, which is necessary if a service has long-running jobs, e.g. a worker service.
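As a sketch of graceful shutdown, an instance that has drained its work could complete the hook itself; the hook name matches the termhook example above, while the ASG name is a placeholder:
# tell the ASG this instance has finished shutting down cleanly
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name termhook \
  --auto-scaling-group-name <asg-name> \
  --instance-id "$(curl -s http://169.254.169.254/latest/meta-data/instance-id)" \
  --lifecycle-action-result CONTINUE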
Halt
Asgard supports manually stopping a release while it is being deployed. Just execute:
step-asg-deployer halt deploy-test-release.json
This will:
- Find the currently running deploy for the project configuration
- Write a halt file to S3
- Wait for Asgard to detect the halt file and fail the deploy
Halt does not guarantee that the release will not be deployed; if executed too late, the release may still succeed.
DO NOT use Stop execution on the Asgard step function, as it will not clean up resources and will leave AWS in a bad state.
Security
Deployers are critical pieces of infrastructure, as they can be used to compromise the software they deploy. As such, we take security around the step-asg-deployer very seriously and try to answer the following questions:
- Authentication: Who can deploy?
- Authorization: What can be deployed?
- Replay and Man-in-the-middle (MITM): Can some unauthorized person edit or reuse a release to change what is deployed?
- Audit: Who has done what, and when?
Authentication
The central authentication mechanisms are the AWS IAM permissions for step functions and S3.
By limiting the autoscaling:CreateAutoScalingGroup permission, the Asgard function becomes the only way to deploy ASGs. Then, limiting who can call states:StartExecution for Asgard limits who can deploy.
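For example, deploy access can be granted with a policy statement like the following, where the state machine ARN is a placeholder for your Asgard step function:
{
  "Effect": "Allow",
  "Action": ["states:StartExecution"],
  "Resource": ["arn:aws:states:us-east-1:123456789012:stateMachine:asgard"]
}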
Ensuring that Asgard's Lambda can only access a single S3 bucket further limits who can deploy:
{
"Effect": "Allow",
"Action": [
"s3:GetObject*", "s3:PutObject*",
"s3:List*", "s3:DeleteObject*"
],
"Resource": [
"arn:aws:s3:::#{s3_bucket_name}/*",
"arn:aws:s3:::#{s3_bucket_name}"
]
},
{
"Effect": "Deny",
"Action": ["s3:*"],
"NotResource": [
"arn:aws:s3:::#{s3_bucket_name}/*",
"arn:aws:s3:::#{s3_bucket_name}"
]
},
Who can execute the step function and who can upload to S3 are the two permissions that guard who can deploy.
Authorization
All resources that can be used in an Asgard deploy must opt in using tags or paths. Additionally, service resources require specific tags or paths denoting which project/config/service can use them.
Assets uploaded to S3 are placed under the path /<ProjectName>/<ConfigName>, so limiting who can s3:PutObject to a path can be used to limit which project-configs they can deploy or halt.
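For example, a policy like the following would only let a principal deploy or halt coinbase/deploy-test in development (the bucket name is a placeholder):
{
  "Effect": "Allow",
  "Action": ["s3:PutObject"],
  "Resource": ["arn:aws:s3:::my-asgard-bucket/coinbase/deploy-test/development/*"]
}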
Replay and MITM
For each release, the client generates a release_id and a created_at date, and it also uploads the release to S3.
The step-asg-deployer will reject any request where the created_at date is not recent, or where the release sent to the step function and the release uploaded to S3 don't match. This means that a user who can invoke the step function but not upload to S3 (or vice versa) cannot deploy old or malicious code.
Audit
Working out what happened and when is very useful for debugging and security response. Step Functions make it easy to see the history of all executions in the AWS console and via the API. S3 can log all access to CloudTrail, so collecting from these two sources will show all information about a deploy.
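For example, the execution history can be pulled with the AWS CLI (the ARNs are placeholders):
# list recent Asgard executions, then inspect a single execution's events
aws stepfunctions list-executions --state-machine-arn <asgard-state-machine-arn>
aws stepfunctions get-execution-history --execution-arn <execution-arn>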
Continuing Deployment
There is always more to do:
- Allow Lifecycle Hooks to send to CloudWatch.
- Subnet, AMI, lifecycle and user data overrides per service.
- Check EC2 instance limits and capacity before deploying.
- Slowly scale instances up rather than all at once, e.g. deploy 1 instance, check it is healthy, then deploy the rest.
- Add ELB and target group error rates when checking health.
- Custom auto-scaling policy types.