Automated Multi-Region deployments in AWS: Gotchas
"Gotcha" may be a bit over the top; perhaps "caveats" is the better term. Leveraging StackSets alone introduces some order-of-operations issues, and adding multi-region on top compounds them.
We will discuss these caveats in more depth in other articles, but we wanted to touch on StackSets up front, since they underpin everything we will do.
When StackSets are applied to OUs, automated deployment works like a charm, most of the time. As we laid out in the Intro, we deploy all of our IaC as StackSets into OU targets. We do this to automate deployments and to ensure a consistent deployment across all of the Accounts for an application.
This also enables us to create private tenants for customers that only they can access, with minimal overhead.
Our entire cloud journey is about removing overhead, reducing maintenance, and building more awesome things.
StackSet Execution Order
Below we show an example set of StackSets that we deploy into our AWS Organization using Delegated Admin:
- DSOP Centralized Lambda Code Buckets: Deployed only to the DSOP OU. It creates an S3 bucket that any account in the organization can pull from; Lambda S3 code and Lambda Layers are deployed here.
- DSOP Centralized Kinesis: Deployed only to our Logging OU. It creates a centralized S3 bucket that Kinesis Firehose streams can write to.
- DSOP SSM Lookup: Deployed to our Production OU. This enables us to look up SSM values; it replaced the `resolve` functionality with SSM.
- DSOP ACM Generator: Deployed to our Production OU. It creates the DNS records necessary for our production apps to create and authorize TLS certificates.
- DSOP Dependency Analysis: Deployed to our Production OU. It enables us to analyze how dependent StackSets are created and deleted, ensuring we don't try to create or delete resources out of order.
- DSOP Parameter Lookup: Deployed to our Production OU. It enables us to store configuration data in our Security account and pull it from the Production OU accounts.
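For reference, deploying one of these StackSets to an OU from the delegated admin account uses CloudFormation's service-managed permission model with auto-deployment enabled. A minimal boto3 sketch of what that request looks like; the StackSet name, OU ID, and template URL below are hypothetical, and the helper that builds the request is our illustration, not part of the DSOP tooling:

```python
# Sketch: deploy a service-managed StackSet to an OU from the delegated admin
# account. The name, OU ID, and template URL are hypothetical placeholders.
# import boto3  # needed only for the commented-out calls at the bottom


def stack_set_requests(name, ou_id, template_url, region):
    """Build the kwargs for create_stack_set and create_stack_instances."""
    create_set = {
        "StackSetName": name,
        "TemplateURL": template_url,
        "PermissionModel": "SERVICE_MANAGED",
        # Auto-deploy to new OU accounts; delete stacks when an account leaves
        "AutoDeployment": {"Enabled": True, "RetainStacksOnAccountRemoval": False},
        "CallAs": "DELEGATED_ADMIN",
    }
    create_instances = {
        "StackSetName": name,
        "DeploymentTargets": {"OrganizationalUnitIds": [ou_id]},
        "Regions": [region],
        "CallAs": "DELEGATED_ADMIN",
    }
    return create_set, create_instances


create_set, create_instances = stack_set_requests(
    "dsop-centralized-lambda-code-buckets",
    "ou-abcd-11111111",
    "https://example-bucket.s3.amazonaws.com/lambda-code-buckets.yaml",
    "us-east-1",
)

# cfn = boto3.client("cloudformation")
# cfn.create_stack_set(**create_set)
# cfn.create_stack_instances(**create_instances)
```

The `RetainStacksOnAccountRemoval: False` setting is what triggers the deletion behavior discussed later in this article.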
Next, per application OU (in this case the Weather App in the Weather App OU), we generally deploy the following:
- App Infrastructure StackSet (KMS, S3, VPC, Subnets)
- App Global Tables StackSet
- App StepFunction StackSet
- App Lambda Functions StackSet
- App API Gateway StackSet
We break apart things like Lambda functions and API Gateway because we want to separate compute from infrastructure. Our desire is for the App CodePipeline to deploy as "thin" a template as possible and keep the "blast radius" of changes restricted to as few items as possible.
Part of the benefit is that we can then deploy other types of compute, like Fargate, with loose coupling.
So, here come the limitations. We are only going to touch on the major one here: StackSets execute all at once and in no particular order. We can't set the operations to run in this order (for deployment):
- App Infrastructure StackSet
- App StepFunction StackSet (depends on App Infrastructure StackSet)
- App Lambda Functions StackSet (depends on App Infrastructure StackSet)
- App API Gateway StackSet (depends on App Lambda Functions StackSet and App Infrastructure StackSet)
When a new Account is added to the Weather App OU, all of the StackSets run at the same time. The App StepFunction StackSet, App Lambda Functions StackSet, and App API Gateway StackSet will all fail because the App Infrastructure StackSet is still executing. Thus, the three important StackSets we need fail. Getting them to redeploy is a pain as well; there is no straightforward means to re-apply failed stacks.
But wait, there's more
StackSet execution order also rears its head when you remove StackSets. Let's say your Weather App OU has 2 accounts in it, and you decide to decommission one account and remove it from the OU. If you scope StackSets to that OU for automatic deployment and enforce deletion on removal, the stacks will be removed from the account.
So now you have the problem above in reverse. How do you delete the AWS Lambda functions defined in your Function StackSet before you delete the Infrastructure StackSet that defines the Subnets and VPCs?
Well, again there is no easy way to do this.
Tags to the rescue
So, we thought for a while about how to solve this. We iterated on three different solutions and believe we have one that doesn't hard-code dependencies on core templates.
To solve this problem, we've developed yet another AWS CloudFormation Custom Resource. This one tracks dependencies that CloudFormation Templates may have on one another. We do this by attaching tag data to the StackSet Instances themselves.
Thus, we can indicate that our `lambda` function relies on our `infrastructure` template and that our `api-gateway` template depends on our `lambda` template.
When we start to tie this together, we can enforce the following order.

Deploy these first:
- App Infrastructure StackSet (KMS, S3, VPC, Subnets)
- App Global Tables StackSet

Then, after the infrastructure templates deploy, the dependency checks pass for the following:
- App StepFunction StackSet
- App Lambda Functions StackSet

Finally, after the App Lambda Functions StackSet deploys, we can deploy:
- App API Gateway StackSet
All of this occurs without us having to embed logic to monitor dependencies between StackSets.
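Put another way, the `dsop:stackset:dependson` tags form a small dependency graph, and the order above is just a topological sort over it. A minimal sketch of that idea; the StackSet names mirror the article's examples, but the wave-computation helper is our illustration, not the actual custom resource:

```python
def deployment_waves(stacksets):
    """Group StackSets into deployment waves.

    `stacksets` maps a StackSet name to the set of names it depends on,
    mirroring the dsop:stackset:dependson tag. Each wave depends only on
    StackSets in earlier waves.
    """
    waves, deployed = [], set()
    remaining = dict(stacksets)
    while remaining:
        # Everything whose dependencies are already deployed can go now
        ready = sorted(n for n, deps in remaining.items() if deps <= deployed)
        if not ready:
            raise ValueError(f"circular dependency among: {sorted(remaining)}")
        waves.append(ready)
        deployed.update(ready)
        for n in ready:
            del remaining[n]
    return waves


waves = deployment_waves({
    "weather-infrastructure": set(),
    "weather-global-tables": set(),
    "weather-stepfunction": {"weather-infrastructure"},
    "weather-lambda": {"weather-infrastructure"},
    "weather-api-gateway": {"weather-lambda", "weather-infrastructure"},
})
# First wave: infrastructure and global tables; then stepfunction and lambda;
# finally api-gateway.
```

The custom resource achieves the same effect without a central orchestrator: each stack instance simply waits until its own tagged dependencies exist.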
We mandate that each StackSet applies the following tags:
```yaml
Tags:
  - Key: "dsop:stackset:dependson"
    Value: !Sub "${parStackSetApplicationName}-infrastructure"
  - Key: "dsop:stackset:name"
    Value: !Sub "${parStackSetApplicationName}-lambda"
  - Key: "dsop:stackset:application"
    Value: !Ref parStackSetApplicationName
```
These tags give our templates the context our Custom Resource needs to function. We then add these two resources to our deployed template:
```yaml
# Waits for stacks depending on this stack to be cleaned up
WaitForDependencyCleanup:
  Type: Custom::WaitForDependencyCleanup
  DependsOn:
    - AppLambdaIAMRole
    - WebAppApiAppDotNetCoreFunction
  Version: '1.0'
  Properties:
    ServiceToken: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:dsop-cloudformation-dependency-analysis"
    StackId: !Ref "AWS::StackId"
    Type: Delete

# Waits for stacks depending on this stack to be created
WaitForDependencyCreation:
  Type: Custom::WaitForDependencyCreation
  Version: '1.0'
  Properties:
    ServiceToken: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:dsop-cloudformation-dependency-analysis"
    StackId: !Ref "AWS::StackId"
    Type: Create
    Arn:
      - !Sub "arn:${AWS::Partition}:kms:${parGlobalTableRegion2}:${AWS::AccountId}:alias/App/dynamodb-global"
      - !Sub "arn:${AWS::Partition}:kms:${AWS::Region}:${AWS::AccountId}:alias/App/dynamodb-global"
```
Together, these perform all the dependency analysis we need.
Now, as our StackSets begin to clean up, each is forced to wait for the stacks that depend on it to clean themselves up first.
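The gating decision itself is simple once the tags are in hand. Below is a sketch of the predicate the dependency-analysis Lambda might apply; this is our illustration of the logic, not the actual DSOP implementation, which would also have to poll CloudFormation for live stack states and read the tags via the API:

```python
def can_proceed(event_type, my_name, my_dependencies, deployed):
    """Decide whether a WaitForDependency* custom resource may complete.

    `deployed` maps each currently-deployed stack's dsop:stackset:name tag
    value to its dsop:stackset:dependson set; `my_dependencies` is this
    stack's own dependson set.
    """
    if event_type == "Create":
        # Creation waits until every stack we depend on is already deployed
        return my_dependencies <= set(deployed)
    if event_type == "Delete":
        # Deletion waits until no still-deployed stack depends on us
        return all(my_name not in deps for deps in deployed.values())
    return True  # updates pass straight through


deployed = {
    "weather-infrastructure": set(),
    "weather-lambda": {"weather-infrastructure"},
}
# Infrastructure cannot delete yet: the lambda stack still depends on it.
can_proceed("Delete", "weather-infrastructure", set(), deployed)
# The lambda stack has no dependents left, so it may delete first.
can_proceed("Delete", "weather-lambda", {"weather-infrastructure"}, deployed)
```

In practice the Lambda would return a CloudFormation response only once this predicate passes (or keep the custom resource in progress until it does), which is what serializes otherwise-parallel StackSet operations.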
We do suggest performing some additional cleanup after decommissioning an account. For instance, you may still have S3 buckets with data in them that you don't mark for automated deletion. Some of this could probably be accomplished with a bespoke set of Step Functions, but that could impose some risks as well. Finally, the removed account could have templates that didn't finish deleting, so a bit of manual cleanup must be performed.
Other Gotchas
There are other gotchas out there, which we will touch on in the follow-on articles below: things like DynamoDB Global Tables, S3 bucket replication, Lambda code locations, etc.
Finally, StackSets have a limit of 100 per admin account, so this strategy may run into limits depending on how big your solution stack and application stacks grow. If you have 10 apps with 10 StackSet templates each, that's going to start creating some issues for you.
Shout Out
Big shout out to George for helping me sort out one issue with deploying CloudFormation StackSets in CodePipeline, getting me over a final hurdle! We'll touch on his help in Part 7.
Next Up in series
Next up in this series will be:
Part 1: Intro
Part 2: Gotchas
Part 3: DynamoDB Global Tables
Part 4: AWS Lambda (Pending)
Part 5: S3 Replication (Pending)
Part 6: AWS Fargate (Pending)
Part 7: AWS CodePipeline (Pending)