16 Jul 2020

Yohan Beschi
Developer, Cloud Architect and DevOps Advocate

Keeping your System up-to-date with AWS Systems Manager

Keeping an Operating System (OS) up-to-date is essential for obvious security reasons, but updating an OS blindly is strongly discouraged. An Operating System is part of an infrastructure whose lifecycle is managed by CI/CD pipelines, exactly like the application running on it. The best solution is to have an AMI Factory that builds new AMIs regularly or when required (to include a security update, for example) as part of our CI/CD pipelines. Unfortunately, it is not always possible to terminate an instance and launch a new one with the new AMI. Typically, stateful applications such as databases should be terminated only as a last resort. In that case, the only solution is to log into the server and apply the update. The same principle applies to server and application configuration: there is no reason to launch a new instance and wait 10 minutes just to change a single property. Reverse proxies are one of the best examples, since we can change and reload their configuration without any downtime.

Of course, logging into the server and modifying anything should absolutely not be a manual operation. AWS Systems Manager Run Command can help us to easily automate these operations.

AWS Systems Manager Run Command lets you remotely and securely manage the configuration of your managed instances. A managed instance is any EC2 instance or on-premises machine in your hybrid environment that has been configured for Systems Manager. Run Command enables you to automate common administrative tasks and perform ad hoc configuration changes at scale. You can use Run Command from the AWS console, the AWS Command Line Interface, AWS Tools for Windows PowerShell, or the AWS SDKs. Run Command is offered at no additional cost.

Administrators use Run Command to perform the following types of tasks on their managed instances: install or bootstrap applications, build a deployment pipeline, capture log files when an instance is terminated from an Auto Scaling group, and join instances to a Windows domain, to name a few.

— Systems Manager User Guide

AWS Systems Manager documents

An AWS Systems Manager document (SSM document) defines the actions that Systems Manager performs on our managed instances. Systems Manager includes more than a hundred pre-configured documents that we can use by specifying parameters at runtime. Documents use JSON or YAML, and they include steps and parameters that we specify.

There are multiple types of documents (Command, Automation, Package, etc.). For the Run Command operations we will use Command documents. We could use AWS Systems Manager State Manager or AWS Systems Manager Maintenance Windows as well, but they are less flexible.

Running an SSM Command

To be able to run a command on an EC2 instance, we need:

the SSM agent installed on the EC2 instance

yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
systemctl enable amazon-ssm-agent
systemctl start amazon-ssm-agent

to give permission to the instance to Run Commands received by the SSM agent

  AppRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: ec2-demo-app-role
      Path: /
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM

  AppInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      InstanceProfileName: demo-app-instance-profile
      Path: /
      Roles:
        - !Ref AppRole

As an example we are going to execute the following commands on a remote instance:

pwd
mkdir -pv /tmp/test
touch /tmp/test/hello.txt
ls /tmp/test

To do so, we can use the CLI with the document AWS-RunShellScript (provided by AWS) and Instance IDs (or Tags present on the EC2 instances):

aws ssm send-command --document-name AWS-RunShellScript \
  --instance-ids i-0f0be3540555019f0 \
  --parameters '{"commands":
                  [ "pwd",
                    "mkdir -pv /tmp/test",
                    "touch /tmp/test/hello.txt",
                    "ls /tmp/test"
                  ]
                }'

The output is as follows:

{
    "Command": {
        "CommandId": "3ef85dee-2228-4afc-a714-adca1351c06a",
        "DocumentName": "AWS-RunShellScript",
        "DocumentVersion": "",
        "Comment": "",
        "ExpiresAfter": "2020-06-10T17:06:18.561000+02:00",
        "Parameters": {
            "commands": [
                "pwd",
                "mkdir -pv /tmp/test",
                "touch /tmp/test/hello.txt",
                "ls /tmp/test"
            ]
        },
        "InstanceIds": [
            "i-0f0be3540555019f0"
        ],
        "Targets": [],
        "RequestedDateTime": "2020-06-10T15:06:18.561000+02:00",
        "Status": "Pending",
        "StatusDetails": "Pending",
        "OutputS3BucketName": "",
        "OutputS3KeyPrefix": "",
        "MaxConcurrency": "50",
        "MaxErrors": "0",
        "TargetCount": 1,
        "CompletedCount": 0,
        "ErrorCount": 0,
        "DeliveryTimedOutCount": 0,
        "ServiceRole": "",
        "NotificationConfig": {
            "NotificationArn": "",
            "NotificationEvents": [],
            "NotificationType": ""
        },
        "CloudWatchOutputConfig": {
            "CloudWatchLogGroupName": "",
            "CloudWatchOutputEnabled": false
        },
        "TimeoutSeconds": 3600
    }
}

We can then check the status of the execution:

aws ssm get-command-invocation \
   --command-id 3ef85dee-2228-4afc-a714-adca1351c06a \
   --instance-id i-0f0be3540555019f0

In this example the execution ended in success and we can see the result of the pwd and ls commands in the StandardOutputContent property.

{
    "CommandId": "3ef85dee-2228-4afc-a714-adca1351c06a",
    "InstanceId": "i-0f0be3540555019f0",
    "Comment": "",
    "DocumentName": "AWS-RunShellScript",
    "DocumentVersion": "",
    "PluginName": "aws:runShellScript",
    "ResponseCode": 0,
    "ExecutionStartDateTime": "2020-06-10T13:06:18.999Z",
    "ExecutionElapsedTime": "PT0.009S",
    "ExecutionEndDateTime": "2020-06-10T13:06:18.999Z",
    "Status": "Success",
    "StatusDetails": "Success",
    "StandardOutputContent":
      "/usr/bin
      hello.txt",
    "StandardOutputUrl": "",
    "StandardErrorContent": "",
    "StandardErrorUrl": "",
    "CloudWatchOutputConfig": {
        "CloudWatchLogGroupName": "",
        "CloudWatchOutputEnabled": false
    }
}
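Since send-command returns immediately with a Pending status, a script usually has to poll until the invocation finishes. The sketch below shows one way to do that in Python. It assumes `ssm` is a boto3 SSM client (for example `boto3.client("ssm")`); the function name, the delay and the number of attempts are illustrative choices, not part of the original setup.

```python
import time

# Statuses after which an invocation will not change any more.
TERMINAL_STATUSES = {"Success", "Cancelled", "TimedOut", "Failed"}


def wait_for_invocation(ssm, command_id, instance_id,
                        delay=2, max_attempts=30, sleep=time.sleep):
    """Poll get_command_invocation until a terminal status is reached.

    `ssm` is assumed to be a boto3 SSM client. The final invocation
    (including StandardOutputContent) is returned to the caller.
    """
    for _ in range(max_attempts):
        invocation = ssm.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )
        if invocation["Status"] in TERMINAL_STATUSES:
            return invocation
        sleep(delay)
    raise TimeoutError(
        f"Command {command_id} did not finish on {instance_id}"
    )
```

Right after send-command, the invocation may not be visible yet, so real code would probably also need to tolerate an error on the very first calls.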

With the document AWS-RunShellScript, we can easily execute shell commands (on Linux only). In addition to the commands parameter, we can use workingDirectory to execute the commands in a specific folder and executionTimeout to stop a command that runs far longer than its usual execution time.

For Windows, we have a similar document named AWS-RunPowerShellScript with the same parameters (commands, workingDirectory and executionTimeout).
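As an illustration of how these three parameters fit together, here is a small Python helper that builds the Parameters value expected by send-command. The helper name is hypothetical. The one thing worth noticing is that every parameter value, including the timeout, is passed as a list of strings.

```python
def shell_script_parameters(commands, working_directory=None,
                            execution_timeout=None):
    """Build the Parameters value for AWS-RunShellScript.

    Every SSM parameter value is a list of strings, so the timeout
    (in seconds) is converted to a string as well.
    """
    parameters = {"commands": list(commands)}
    if working_directory is not None:
        parameters["workingDirectory"] = [working_directory]
    if execution_timeout is not None:
        parameters["executionTimeout"] = [str(execution_timeout)]
    return parameters


# Example: run two commands in /tmp/test, giving up after 60 seconds
params = shell_script_parameters(
    ["pwd", "ls"], working_directory="/tmp/test", execution_timeout=60
)
```

The resulting dictionary can be passed as the Parameters argument of the SDK call, or serialized to JSON for the --parameters option of the CLI.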

A custom AWS Systems Manager document

The AWS-RunShellScript document is an easy way to execute simple one-shot commands. For more complex scripts we can either use AWS-RunRemoteScript, which retrieves a script from GitHub or S3 and executes it, or create our own document.

SsmCustomDocument:
  Type: AWS::SSM::Document
  Properties:
    DocumentType: Command
    Content:
      schemaVersion: '2.2'
      description: Run Commands
      parameters: {}
      mainSteps:
        - action: aws:runShellScript
          name: configureServer
          inputs:
            timeoutSeconds: 10
            runCommand:
              - pwd
              - mkdir -pv /tmp/test
              - touch /tmp/test/hello.txt
              - ls /tmp/test

In this example, we create a document with the same bash commands we passed to the commands parameter of AWS-RunShellScript.

It is worth mentioning that using CloudFormation to create SSM documents came with one limitation at the time of writing: we could not give a name to the document (the name of the document created in this example is s-ew1-demo-ssm-documents-SsmCustomDocument-M0BYFGP4T9Z8).

We can then execute the document as before, without the need for extra parameters:

aws ssm send-command \
  --document-name s-ew1-demo-ssm-documents-SsmCustomDocument-M0BYFGP4T9Z8 \
  --instance-ids i-0f0be3540555019f0

The result is the same as before.
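If a predictable document name matters, one possible alternative (not the approach used in this article) is to create the document through the SDK, where the name is a regular argument. The sketch below assumes `ssm` is a boto3 SSM client; the function names are hypothetical, and the document content mirrors the CloudFormation example above.

```python
import json


def build_command_document(commands, description="Run Commands",
                           timeout_seconds=10):
    """Return the content of a schema 2.2 Command document as a dict."""
    return {
        "schemaVersion": "2.2",
        "description": description,
        "parameters": {},
        "mainSteps": [
            {
                "action": "aws:runShellScript",
                "name": "configureServer",
                "inputs": {
                    "timeoutSeconds": timeout_seconds,
                    "runCommand": list(commands),
                },
            }
        ],
    }


def create_named_document(ssm, name, commands):
    """Create the document under an explicit name.

    `ssm` is assumed to be a boto3 SSM client.
    """
    return ssm.create_document(
        Name=name,
        DocumentType="Command",
        DocumentFormat="JSON",
        Content=json.dumps(build_command_document(commands)),
    )
```

The trade-off is that the document is then managed outside of the CloudFormation stack, which may or may not be acceptable.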

AWS Systems Manager and Ansible

To execute an Ansible playbook, AWS provides the document AWS-RunAnsiblePlaybook, but it is limited to playbooks stored in GitHub or S3. And as mentioned in the article Provisioning EC2 instances with Ansible, only a source repository should be used to version Ansible Playbooks and Roles.

But creating a document supporting AWS CodeCommit is quite simple:

SsmAnsiblePlaybookDocument:
  Type: AWS::SSM::Document
  Properties:
    DocumentType: Command
    Content:
      schemaVersion: '2.2'
      description: Run Ansible playbook
      parameters:
        RepositoryName:
          description: The name of repository containing the Ansible playbook
          type: String
        PlaybookPath:
          description: the path to the playbook file in the repository
          type: String
          default: playbook.yml
        HasRequirements:
          description: Do we have to use Ansible Galaxy to retrieve Ansible Roles
          type: String
          default: 'false'
        RequirementsPath:
          description: the path to the requirements file in the repository
          type: String
          default: requirements.yml
      mainSteps:
        - action: aws:runShellScript
          name: configureServer
          inputs:
            timeoutSeconds: 3600
            runCommand:
              - sudo su
              - for i in {1..5}; do sleep $(shuf -i1-30 -n1) && git clone codecommit://git-ansible@{{ RepositoryName }} /root/asbpb && break ; done
              - |-
                if [ "{{ HasRequirements }}" == "true" ]; then
                  ansible-galaxy install -r /root/asbpb/{{ RequirementsPath }}
                fi
              - ansible-playbook /root/asbpb/{{ PlaybookPath }}
              - rm -rf /root/asbpb

We have defined 4 parameters to make the document reusable, and the commands are quite straightforward (see Provisioning EC2 instances with Ansible for more details). Only the second one is a little more complex: it is a workaround for the AWS CodeCommit throttling limit. We attempt the git clone command at most 5 times, waiting a random 1 to 30 seconds before each attempt, and stop at the first success.
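To make the retry logic of the git clone line easier to read, here is the same idea written as a Python sketch. It is only an illustration of the loop, not something the document executes, and the function and parameter names are hypothetical. As in the shell version, the random delay is applied before every attempt, including the first one.

```python
import random
import time


def retry_with_random_delay(operation, attempts=5, min_delay=1,
                            max_delay=30, sleep=time.sleep):
    """Call `operation` until it returns True, at most `attempts` times.

    A random delay between min_delay and max_delay seconds (inclusive)
    is applied before each attempt, which spreads out requests coming
    from many instances at once. Returns True on success, else False.
    """
    for _ in range(attempts):
        sleep(random.randint(min_delay, max_delay))
        if operation():
            return True
    return False
```

The random delay is what helps with throttling: instances that start the command at the same moment no longer hit the repository at the same moment.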

We can then use the AWS CLI with the new document:

aws ssm send-command \
  --document-name s-ew1-demo-ssm-documents-SsmAnsiblePlaybookDocument-PQG12KX6HTVL \
  --instance-ids i-0004f05e59904ab79 \
  --parameters '{"RepositoryName": ["demo-ansible"] }'

For this example, the output StandardOutputContent of the execution is as follows:

Cloning into '/root/asbpb'...
PLAY [localhost] ***************************************************************
TASK [Gathering Facts] *********************************************************
ok: [localhost]
TASK [role-nginx-config : Copy nginx.conf] *************************************
ok: [localhost]
PLAY RECAP *********************************************************************
localhost  : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

And the error output StandardErrorContent is:

remote:
remote: Counting objects: 0
remote: Counting objects: 17
remote: Counting objects: 17
remote: Counting objects: 17, done.
[WARNING]: provided hosts list is empty, only localhost is available.
Note that the implicit localhost does not match 'all'

To avoid the warning we can add the parameter -i localhost, (the comma is required) in the ansible-playbook command:

ansible-playbook -i localhost, /root/asbpb/playbook.yml

Handling errors

By default, the exit code of the last command executed in a script is reported as the exit code for the entire script. This means that the exit code of every other command is ignored, which is usually not what we want.

In order to make the script fail when a command fails we need to add the command below after every command:

if [ $? != 0 ]; then exit 1; fi

But now if there is an error we won’t remove the folder /root/asbpb at the end, which is a problem. If this folder is still present the next time we execute the Document, the command git clone will fail. The easiest solution is to try to remove it at the beginning of the script:

rm -rf /root/asbpb || true

The final script looks like this:

- sudo su
- rm -rf /root/asbpb || true
- for i in {1..5}; do sleep $(shuf -i1-30 -n1) && git clone codecommit://git-ansible@{{ RepositoryName }} /root/asbpb && break ; done
- if [ $? != 0 ]; then exit 1; fi
- |-
  if [ "{{ HasRequirements }}" == "true" ]; then
    ansible-galaxy install -r /root/asbpb/{{ RequirementsPath }}
    if [ $? != 0 ]; then exit 1; fi
  fi
- ansible-playbook -i localhost, /root/asbpb/{{ PlaybookPath }}
- if [ $? != 0 ]; then exit 1; fi
- rm -rf /root/asbpb
- if [ $? != 0 ]; then exit 1; fi
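Adding the check by hand after every command is easy to get wrong. When the list of commands is generated by a script, the check can be interleaved automatically, as in the hypothetical Python helper below. It is a simplification: it appends the check after every entry, including those (such as `rm -rf ... || true`) that can never fail.

```python
# The exit check used in the article, appended after each command
EXIT_CHECK = "if [ $? != 0 ]; then exit 1; fi"


def with_exit_checks(commands):
    """Return a new list where every command is followed by the check."""
    checked = []
    for command in commands:
        checked.append(command)
        checked.append(EXIT_CHECK)
    return checked
```

For a multi-line `if ... fi` entry this still behaves as expected, because the exit status of an `if` statement is the status of the last command it executed, or zero when no branch ran.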

Logging

Command output and errors can be logged to AWS CloudWatch Logs and/or an AWS S3 bucket.

aws ssm send-command \
  --document-name s-ew1-demo-ssm-documents-SsmAnsiblePlaybookDocumentSafe-1IGXKSX0SB1RP \
  --targets "Key=tag:Application,Values=demo" \
  --parameters '{"RepositoryName": ["demo-ansible"] }' \
  --output-s3-region eu-west-1 \
  --output-s3-bucket-name s-ew-demo \
  --cloud-watch-output-config \
      CloudWatchLogGroupName=/aws/ssm/runCommand/ansible-playbook,CloudWatchOutputEnabled=true

AWS CloudWatch LogStreams will have the following patterns:

<CommandId>/<InstanceId>/<DocumentStepName>/stdout for the standard output
<CommandId>/<InstanceId>/<DocumentStepName>/stderr for errors

For example:

e3cb7ea7-f06f-441f-9c68-7ee4c7bc1982/i-0004f05e59904ab79/configureServer/stdout
e3cb7ea7-f06f-441f-9c68-7ee4c7bc1982/i-0004f05e59904ab79/configureServer/stderr

Outputs stored in AWS S3 have keys that are slightly different from the LogStream names:

<CommandId>/<InstanceId>/<SetAction>/<DocumentStepName>/stdout
<CommandId>/<InstanceId>/<SetAction>/<DocumentStepName>/stderr

For example:

e3cb7ea7-f06f-441f-9c68-7ee4c7bc1982/i-0004f05e59904ab79/awsrunShellScript/configureServer/stdout
e3cb7ea7-f06f-441f-9c68-7ee4c7bc1982/i-0004f05e59904ab79/awsrunShellScript/configureServer/stderr
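When fetching logs programmatically, both names can be derived from the command ID, the instance ID and the document step. The Python sketch below simply encodes the two patterns shown above. The function names are hypothetical, and the removal of the colon from the action name is inferred from the example keys rather than from a documented rule.

```python
def log_stream_name(command_id, instance_id, step_name, stream="stdout"):
    """CloudWatch LogStream name for one step of one invocation."""
    return f"{command_id}/{instance_id}/{step_name}/{stream}"


def s3_output_key(command_id, instance_id, action, step_name,
                  stream="stdout"):
    """S3 key for the same output.

    In the examples above the action (aws:runShellScript) appears
    without its colon, so it is stripped here. This is an assumption
    based on those examples.
    """
    action_segment = action.replace(":", "")
    return f"{command_id}/{instance_id}/{action_segment}/{step_name}/{stream}"
```

If an S3 key prefix is configured on the command, it would presumably come in front of these keys; that case is not handled here.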

AWS CloudWatch Events

For automation, the AWS CLI is not the best tool. With AWS CloudWatch Rules we can execute SSM Run Commands triggered by an event or on a schedule.

This can be useful in 2 ways:

executing an Ansible Playbook at regular intervals, for example to keep a configuration up-to-date
executing an Ansible Playbook when an EC2 instance is launched

[Figure: AWS CloudWatch Event 1]

[Figure: AWS CloudWatch Event 2]

Unfortunately, this feature comes with several limitations compared to what we can do with the CLI or the SDKs. It is not possible to define:

an S3 Bucket or a CloudWatch LogGroup for the logs
a maximum number of concurrent executions (useful when we use tags and have hundreds of EC2 instances)
an execution timeout
an SNS Topic triggered after the execution

Lambda

To overcome the limitations of AWS CloudWatch Events with SSM Run Command, we can use an AWS Lambda function triggered by AWS CloudWatch Events:

ExecuteEc2MonitoringPlaybookAtEC2Launch:
  Type: AWS::Events::Rule
  Properties:
    Name: s-demo-ansible-playbook-atlaunch-event-rule
    Description: Execute Ansible Playbook when an ec2 instance changes to RUNNING
    State: ENABLED
    EventPattern:
      source:
        - aws.ec2
      detail-type:
        - EC2 Instance State-change Notification
      detail:
        state:
          - running
    Targets:
      - Id: SsmCmdAnsiblePlaybookLambdaSchedule
        Arn: !GetAtt SsmCmdAnsiblePlaybookLambda.Arn

ExecuteAnsiblePlaybookCron:
  Type: AWS::Events::Rule
  Properties:
    Name: s-demo-ansible-playbook-cron-event-rule
    Description: Execute Ansible Playbook at some frequency
    State: ENABLED
    ScheduleExpression: !Sub cron(${Scheduling})
    Targets:
      - Id: SsmCmdAnsiblePlaybookLambdaSchedule
        Arn: !GetAtt SsmCmdAnsiblePlaybookLambda.Arn

Using Python and Boto3, we can execute an SSM Run Command this way:

import os

import boto3

# Name of the SSM document, provided through the Lambda environment
document_name = os.environ['DOCUMENT_NAME']
ssm = boto3.client('ssm')
ssm.send_command(
  Targets=[
    {
      'Key': 'tag:Application',
      'Values': ['demo']
    }
  ],
  # or
  # InstanceIds=[ instance_id ],
  DocumentName=document_name,
  DocumentVersion='$LATEST',
  Parameters={
    'RepositoryName': [ 'demo-ansible' ]
  },
  TimeoutSeconds=3600,
  MaxConcurrency='10',
  MaxErrors='1',
  CloudWatchOutputConfig={
    'CloudWatchLogGroupName': f'/aws/ssm/runCommand/{ document_name }',
    'CloudWatchOutputEnabled': True
  }
)

To know which event triggered the Lambda function (a new EC2 instance or the schedule), we need to check the event source:

if event['source'] == 'aws.ec2':
  # New EC2
  pass
elif event['source'] == 'aws.events':
  # Triggered by CRON
  pass
else:
  # Unknown event
  pass
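Building on this check, the event source can also decide which instances the command should target. The sketch below is one possible way to do it: a newly launched instance is targeted by its ID, taken from the `instance-id` field of the state-change notification, while a scheduled run targets every instance carrying a tag. The function name and the default tag are assumptions made for the example.

```python
def send_command_targeting(event, tag_key="Application", tag_value="demo"):
    """Return the targeting arguments for ssm.send_command.

    The result is meant to be expanded into the call, for example
    ssm.send_command(DocumentName=..., **send_command_targeting(event)).
    """
    source = event.get("source")
    if source == "aws.ec2":
        # State-change notifications carry the instance ID in the detail
        return {"InstanceIds": [event["detail"]["instance-id"]]}
    if source == "aws.events":
        # Scheduled run: target every instance carrying the tag
        return {"Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}]}
    raise ValueError(f"Unsupported event source: {source!r}")
```

Raising on an unknown source makes a misconfigured rule visible in the Lambda logs instead of silently doing nothing.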

Furthermore, it is important to note that if we install the SSM Agent in the user data, the event "EC2 Instance State-change Notification == RUNNING" triggers the Lambda function too early for AWS Systems Manager to be able to contact the Agent.

We have to wait for the status checks to pass, and sometimes a little longer after that, before invoking ssm.send_command():

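import time

ec2 = boto3.client('ec2')
# Wait for the instance status checks to pass
ec2_waiter = ec2.get_waiter('instance_status_ok')
ec2_waiter.wait(InstanceIds=[instance_id])
# Give the SSM Agent some extra time to register
time.sleep(60)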
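A fixed sleep is simple but somewhat arbitrary. As a possible alternative (not the approach used above), the function could poll Systems Manager until the instance shows up as a managed instance with an Online ping status. The sketch assumes `ssm` is a boto3 SSM client; the function name, delay and number of attempts are illustrative.

```python
import time


def wait_until_managed(ssm, instance_id, delay=10, max_attempts=30,
                       sleep=time.sleep):
    """Return True once the instance is registered with SSM and online.

    `ssm` is assumed to be a boto3 SSM client. Returns False if the
    instance is still not online after max_attempts polls.
    """
    for _ in range(max_attempts):
        response = ssm.describe_instance_information(
            Filters=[{"Key": "InstanceIds", "Values": [instance_id]}]
        )
        for info in response["InstanceInformationList"]:
            if info.get("PingStatus") == "Online":
                return True
        sleep(delay)
    return False
```

Whichever approach is used, the Lambda timeout has to be long enough to cover the wait.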

Conclusion

While AWS Systems Manager Run Command is a little rough around the edges and requires some work to build something useful, it is a wonderful service that can be used without any additional charges.