Using transient SSH keys on GitHub Actions Runners for Azure VM deployment and configuration

This post is about deploying Azure Linux Virtual Machines (VMs) in GitHub Actions Continuous Deployment (CD) workflows, and how to do post-deployment VM configuration when "out of box" Azure VM capabilities aren't available.

Virtual Machine images, agents, and extensions

Let's start with a quick recap of Azure VM images, agents, and extensions.

Azure VMs can be created from Azure Marketplace images, or from your own custom images.

Azure Marketplace Linux images provided by Microsoft have various features and extensions available. These include the pre-installed Linux Agent, which is required for extensions to work. Microsoft has open-sourced the Linux Agent on GitHub.

Extensions are used for various tasks on Azure VMs, including post-deployment configuration and automation. One very commonly-used extension is the Custom Script Extension, which runs your custom script on an Azure VM.
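
As a rough sketch of what that looks like with the Azure CLI (the resource names below are placeholders, and the exact extension version available can vary by image and region, so treat this as illustrative rather than definitive), you might run an inline command on an existing Linux VM like this:

# Minimal sketch, using placeholder resource names: run an inline command on an
# existing Linux VM via the Custom Script Extension for Linux (publisher
# Microsoft.Azure.Extensions, extension name CustomScript).
az vm extension set \
  --resource-group "my_resource_group" \
  --vm-name "my_virtual_machine_name" \
  --publisher Microsoft.Azure.Extensions \
  --name CustomScript \
  --settings '{"commandToExecute": "echo hello-from-custom-script > /tmp/hello.txt"}'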

Ubuntu Azure VM images also provide cloud-init support for post-deployment configuration, as an alternative to some Linux Agent capabilities.

Both the Linux VM Agent and the Custom Script Extension have some constraints, such as OS support (see the above doc links for details).

GitHub Actions

Next, let's review GitHub Actions.

GitHub Actions is GitHub's Continuous Integration/Continuous Deployment (CI/CD) platform. Workflows can run on GitHub-hosted runners, or you can host your own runners.

A very common CD scenario is deploying infrastructure, such as Azure VMs. In order to deploy to Azure, the workflow needs to login to Azure. This can be done with the GitHub Action for Azure Login, available on the GitHub Actions Marketplace. This Action has also been open-sourced on GitHub by Microsoft.

This Action (obviously) requires a credential to log into Azure, so that later steps in the same workflow can operate in Azure in an established authentication context. There are two ways to provide Azure credential information to the Action: a specifically formatted GitHub repository secret, or OpenID Connect (OIDC).

Use of a secret requires configuring a Service Principal in Azure, retrieving its properties as a JSON block, storing that JSON block in a GitHub repository secret, and then configuring the Azure Login Action to read that secret during a workflow. This is documented in detail.
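
As a sketch of that flow (the Service Principal name and subscription ID below are placeholders, and the exact az ad sp create-for-rbac flags have shifted across CLI versions, so treat this as illustrative rather than definitive):

# Sketch only: create a Service Principal with access to a subscription and emit
# its credential information as a JSON block. The name and subscription ID are
# placeholders; --sdk-auth produces the JSON shape the Azure Login Action's
# creds input has historically expected.
az ad sp create-for-rbac \
  --name "github-actions-deployer" \
  --role Contributor \
  --scopes "/subscriptions/00000000-0000-0000-0000-000000000000" \
  --sdk-auth

# Store the JSON output in a GitHub repository secret (for example AZURE_CREDENTIALS)
# and reference that secret from the Azure Login Action step in your workflow.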

The result is that we have a Service Principal which will be used by your workflow and the various Actions that operate in Azure, with its credential information stored in a GitHub repository secret.

This Service Principal will operate on the control plane (deploying and configuring Azure resources), and will also enable data plane access (working inside the resource and its data) through the above-described Linux VM Agent and Custom Script Extension.

There is no additional credential you need to manage in order to run a post-deployment script on a VM, if you have the Linux VM Agent and the Custom Script Extension available.

But... What if we can't use the Linux VM Agent and/or the Custom Script Extension?

Sometimes, the Linux VM Agent or the Custom Script Extension will not be available for you to use. Why not?

  • You create and use custom OS images, instead of Azure Marketplace OS images, and your custom images do not have or allow the Linux VM Agent to be installed
  • You use OS images for an OS or version that is not compatible with the Linux VM Agent or the Custom Script Extension
  • You have a list of allowed Extensions which may exclude the Custom Script Extension, for example due to your organization's security posture
  • Etc.

This type of constraint is not unusual in highly regulated and very security-focused industries such as Finance and Health.

So let's say you need to deploy some Azure Linux VMs, and you need to run some post-deploy configuration on them (e.g. install or configure additional software, etc.). You can't do it via the above mechanisms. Now what?

Run scripts on VMs

Any Linux admin knows you can SSH directly to a VM and run scripts. Say you have a script that does an install, creates files, etc. - basically anything you might want to do on a VM. For example, I might have a script file myscript.sh...

#!/bin/bash
# This is the script to run remotely on VMs
# Let's update/install
sudo apt update -y
sudo apt upgrade -y
sudo apt install nginx -y
# Let's do some file system things
sudo mkdir /usr/patrick_was_here
sudo touch /usr/patrick_was_here/foo.txt
sudo chown -v -R root /usr/patrick_was_here/
sudo chmod 744 /usr/patrick_was_here/
# and so on

How do I get myscript.sh onto the VM and run it there? SSH.

The following is a minimal SSH command which logs into a remote VM as userName at its network address vmFqdn, and streams the local script myscript.sh to the remote shell so that it runs on the VM.

ssh userName@vmFqdn < ./myscript.sh

This requires that you have the private SSH key for userName locally, and that the corresponding public SSH key is on the remote VM. As long as that is true, you can SSH to a remote VM and run a script to configure the VM, without any agents or extensions needed otherwise.

How does the public SSH key even get onto a VM you're deploying? You specify it, along with a corresponding admin username, at VM deployment time. You can do this in the VM ARM template via the ssh configuration element, or via the Azure CLI by specifying the --admin-username and --ssh-key-values arguments to az vm create.
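
For example, a minimal az vm create along those lines might look like this (the resource group, VM name, image alias, username, and key path are all placeholders; the image alias in particular varies by CLI version):

# Sketch: deploy a Linux VM with an admin username and a public SSH key.
# All names and the image alias are placeholders.
az vm create \
  --resource-group "my_resource_group" \
  --name "my_virtual_machine_name" \
  --image "Ubuntu2204" \
  --admin-username "my_deployment_username" \
  --ssh-key-values ~/.ssh/my_deployment_username.pub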

Note that you require an SSH username and private key to log into the VM (data plane access). You're not using the Service Principal I discussed above, which gives you Azure resource (control plane) access. This SSH credential is an entirely separate credential from the Service Principal.

Now do that in a CD pipeline

Let's say we're running a CD pipeline and we deploy an Azure Linux VM, then we need to SSH to it and run some script to configure it. We can just add a step to SSH to the VM and run a script, right?

Yes. However! You need that SSH username and its private key where you are opening the connection. In your dev environment, that's fine; you likely have your SSH keys installed persistently in ~/.ssh/, and everything works. But a CD pipeline will run on a build server, on GitHub called a "runner".

You can use a GitHub-hosted runner, or you can host your own. If you use a GitHub-hosted runner, you get a fresh runner instance with a GitHub-set configuration for each pipeline run. You can't create a GitHub-hosted runner once and then re-use it, with your specific configurations and installs, for later pipeline runs. (You can do this if you host your own runners.)

This means... your SSH keys will not be in the GitHub-hosted runner you get. But when your CD pipeline runs, it will need that private SSH key to connect to your VM and run your script. So - how do we get that private key onto your runner, securely?

The easy way - but it's risky

The easy way to have the private SSH key available to your runner is to have it persisted somewhere, then have your workflow retrieve and use it when the workflow runs.

For example, you could store the private SSH key as a GitHub repository secret.

(Screenshot: storing the private SSH key as a GitHub repository secret for Actions use.)

Then you could pass it as an argument to a VM deployment script -

  - name: Deploy VM
    run: |
      ./scripts/deploy-vm.sh "${{ secrets.MY_PRIVATE_SSH_KEY }}"

or you could set it as an environment variable for use by steps/Actions in the scope of the environment variable -

env:
  MY_PRIVATE_SSH_KEY: "${{ secrets.MY_PRIVATE_SSH_KEY }}"

Here's the risk with this approach. GitHub masks exact secret values in workflow logs, but if you turn on Actions debug logging to troubleshoot, or you add scripting to your workflow which transforms or fragments the private SSH key (like writing it to a key file and printing it, or concatenating it into an ssh command string), you may inadvertently disclose that private SSH key in GitHub logs! These logs can be viewed by anyone with access to your repository, in your Actions history.

If you are re-using this SSH key-pair, perhaps over a period of time and in multiple workflows and across many VMs, there is an obvious and increasing risk of disclosure and resulting compromise.

Is there a safer way to work with SSH keys and eliminate the possibility of private key leakage and resulting compromise risk? Of course there is, or I wouldn't have written this post.

Transient SSH keys

Recall what we're doing. We determined there is a need to use SSH to configure Azure VMs after deploying them, because the Azure VM Agent and Custom Script Extension aren't available for us.

To use SSH, we need a key pair: a public and private key. The public key gets configured on the VM when we deploy it. The private key needs to be in our environment, where we initiate the connection to the Azure VM. This could be our dev environment or a GitHub runner executing a workflow.

But... does it matter which SSH key-pair we use? Do we have to keep re-using the same key pair? No!

All we need is a key pair to connect to the VM after deployment, and run some configuration script in an administrative context. We establish this account on the VM when we deploy it, by specifying the admin username and corresponding public SSH key.

What if we...

  1. Created a new SSH key-pair "just in time" for a deployment username...
  2. Deployed a VM with the just-created public SSH key and deployment username...
  3. SSHed to the VM using that deployment username and the just-created private SSH key...
  4. Ran our configuration script(s) on the VM...
  5. Added a different "real" username and its public SSH key to the VM...
  6. Then removed our deployment user and public SSH key from the VM...
  7. And disconnected from the VM...
  8. And deleted the local SSH key-pair we just created for the one-time post-deployment work.

Ah hah, but wait! What about that different "real" username and its public SSH key added in step 5 above? What's that about?

Here, we're adding a "real" username which will eventually be used to connect to the VM once it's running and in production. This "real" credential is stored and managed elsewhere - all we need is that username and its public SSH key, not its private SSH key.

The only private SSH key we use in our environment, and our GitHub workflow, is a transient one. We create it, use it to get the VM configured, then remove it from the VM and delete it locally. It's gone, and it can't be re-used to log into the VM, and it's not used anywhere else, so it's not a leakage risk!

We don't need the "real" username's private SSH key in our dev environment or the GitHub runner. That's up to whomever winds up using the VM later; all we do is add the "real" user's public SSH key, which is presumably durable (we don't manage it, we just use it) and presents far lower risk if disclosed, since it's the public key.

Is it zero risk, since it's a public SSH key? No - there have been SSH vulnerabilities exploitable with only a public key, and while deriving a private key from a properly generated public key of adequate length is computationally infeasible in practice, weakly generated keys have been broken, so in general it's best to keep both keys secure. But the risk from inadvertent disclosure of a public key is far lower than from a private key.

So how do we use one username and SSH key-pair for deployment only, and set up the VM with a different "real" user and public SSH key only? I'm glad you asked.

How to create an SSH key pair for deployment

Recall above I wrote that you can deploy an Azure VM with a username and public SSH key by specifying it in the VM ARM template ssh configuration element, or via the Azure CLI by specifying the --admin-username and --ssh-key-values arguments to az vm create.

We'll use a deployment-only username and a just-generated public SSH key for this. That gets the deployment user configured on the VM, ready for us to SSH to it using the matching just-generated private SSH key.

How do we generate an SSH key pair? That's standard sys admin stuff. Here's a script fragment I wrote and use. It checks if there's an .ssh directory in the user's home and creates it if not (GitHub runners may not have this directory), then checks if the deployment user SSH key already exists (this is more for a persistent environment like our dev environment, where user files can be expected to linger), then generates a new SSH key-pair named for the deployment username.

#!/bin/bash

# Listed here for convenience, normally would set these to env vars or retrieve from config store
DEPLOYMENT_SSH_KEY_TYPE="rsa"
DEPLOYMENT_SSH_KEY_BITS=4096
DEPLOYMENT_SSH_KEY_PASSPHRASE="" # Use blank passphrase since this is a one-time-use deployment key pair
DEPLOYMENT_SSH_USER_NAME="my_deployment_username"
DEPLOYMENT_SSH_USER_KEY_NAME=$DEPLOYMENT_SSH_USER_NAME

# Check if ~/.ssh folder exists and if not, create it
# This folder may not exist in a GitHub runner context
if [[ ! -d ~/.ssh ]]
then
  mkdir -p ~/.ssh
fi

# Check if deployment user SSH key exists, and if so clean up the existing key files
# This is more for persistent environments but a good cleanup practice either way
if [[ -f ~/.ssh/"$DEPLOYMENT_SSH_USER_KEY_NAME" ]]
then
    rm -f ~/.ssh/"$DEPLOYMENT_SSH_USER_KEY_NAME"*
fi

# Generate new deployment user public and private key pair and write the files here
ssh-keygen -v -q -m "PEM" -f ~/.ssh/"$DEPLOYMENT_SSH_USER_KEY_NAME" \
  -t "$DEPLOYMENT_SSH_KEY_TYPE" -b $DEPLOYMENT_SSH_KEY_BITS \
    -N "$DEPLOYMENT_SSH_KEY_PASSPHRASE" -C "$DEPLOYMENT_SSH_USER_NAME"

Isn't there a risk to setting the SSH private key passphrase to blank? Yes, of course. But we will only ever be using this SSH key-pair once, during one pipeline run, and re-generating it fresh and new each time. So I judged that a blank passphrase isn't a blocking risk.

In fact, let's also be explicit and clean up after ourselves when we're done, with a script we can run in a step near the end of our pipeline. Doing this is a good practice in our local, persistent dev environment too; if these keys are really one-time use, let's not leave them littered around.

#!/bin/bash

DEPLOYMENT_SSH_USER_NAME="my_deployment_username"
DEPLOYMENT_SSH_USER_KEY_NAME=$DEPLOYMENT_SSH_USER_NAME

# Remove the deployment SSH key from the SSH agent, if it was loaded there
eval "$(ssh-agent)"
ssh-add -d ~/.ssh/"$DEPLOYMENT_SSH_USER_KEY_NAME"

# Delete the deployment SSH key files, if any
rm -f ~/.ssh/"$DEPLOYMENT_SSH_USER_KEY_NAME"*

At this point, we've created a local SSH key-pair, and we can use the deployment username and its public SSH key to deploy and configure the VM.
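
For the "configure" part, the SSH step can point explicitly at the just-generated private key file with -i. Here's a minimal sketch, reusing the variables from the key generation script above, with a placeholder VM FQDN and the example myscript.sh from earlier:

# Sketch: run the post-deployment configuration script on the VM over SSH,
# authenticating as the deployment user with the just-generated private key.
# my_vm_fqdn and myscript.sh are placeholders.
ssh -i ~/.ssh/"$DEPLOYMENT_SSH_USER_KEY_NAME" \
  -o StrictHostKeyChecking=accept-new \
  "$DEPLOYMENT_SSH_USER_NAME"@my_vm_fqdn < ./myscript.sh

The StrictHostKeyChecking=accept-new option simply avoids the interactive host key prompt on a fresh runner; adjust it to match your own host key verification policy.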

How to add the "Real" Username and public SSH key

At some point in the configuration, we still need to add the "real" username, for which we don't have a private SSH key; someone else holds and manages that key for their own use after our VM deployment and configuration is complete.

How can we do that? Here are two ways.

The easy way is to use the Azure CLI's az vm user update command. This command will create the specified user account if it doesn't exist yet, and you supply the user's public SSH key (or a password) in the same call.

az vm user update --username "my_real_username" --ssh-key-value "my_real_user_public_ssh_key" --resource-group "my_resource_group" --name "my_virtual_machine_name"

However! The az vm user update command relies on the VMAccessForLinux VM extension - that is, your VM must be able to run that extension, and no policy can be in place that blocks it. In some highly regulated environments, such blocking policies are exactly what you'll find.

So how can we add a user if we can't use az vm user update? Simple - we'll just use some bash script again in our workflow.

This varies by Linux OS: here is some script that works on Ubuntu Linux. If you use a different distro, please check your docs for the corresponding script.

This script creates a user named my_real_username (substitute your own value), creates its new home directory via -d /home/my_real_username -m, and adds the account to the sudo group via -G sudo so it can run commands with sudo.

We can run a script like this while SSHed to the VM as the deployment user account discussed above, to create the "real" user account.

MY_REAL_USERNAME="my_real_username"

sudo useradd -s /bin/bash -d "/home/""$MY_REAL_USERNAME" -m -G sudo "$MY_REAL_USERNAME"

We also need to enable this username to SSH to the VM. That means we need the user's public SSH key. There are various ways to get this onto the VM. You could scp an SSH public key file to the VM, or you could pass a script and the public key value as an argument to an ssh command from your local environment, so it can be processed on the VM.
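
As a quick illustration of the scp option (the key path, file name, and host name here are placeholders), copying the "real" user's public key file to a temporary location on the VM, authenticating as the deployment user, could look like this:

# Sketch: copy the "real" user's public SSH key file to the VM over scp,
# authenticating as the deployment user with the transient private key.
scp -i ~/.ssh/my_deployment_username \
  ./my_real_username.pub \
  my_deployment_username@my_vm_fqdn:/tmp/my_real_username.pub

The rest of this post takes the second approach: passing the key value as an argument to a script run over ssh.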

If you have a local create-user.sh script which takes username and public SSH key arguments, you can send that script and its arguments over SSH to run on the VM like this:

#!/bin/bash

ssh my_deployment_username@my_vm_fqdn "bash -s" -- "my_real_username" "my_public_SSH_key" < ./create-user.sh

What does the script to add a user on the VM actually do? As described above, it creates the "real" user and prepares it to be able to SSH to the VM by adding the public SSH key to that user's authorized_keys and appropriately securing that user's ~/.ssh directory.

(Note that the following script assumes you are passing only the actual public key value - the base64 key body, without the "ssh-rsa" prefix or comment - as the second argument. The script assembles the full ssh-rsa format line from it. Why? Because that format is well-known and easy to script, and passing a single argument which contains spaces through ssh and bash -s is difficult.)

#!/bin/bash

vmUserName=$1
publicKeyInfix=$2
vmUserSshPublicKey="ssh-rsa ""$publicKeyInfix"" ""$vmUserName"

# Add user
sudo useradd -s /bin/bash -d "/home/""$vmUserName" -m -G sudo "$vmUserName"

# Create "/home/""$vmUserName"/.ssh directory
sudo mkdir -p "/home/""$vmUserName"/.ssh

# Write public key file and add to authorized_keys for user
echo "$vmUserSshPublicKey" | sudo tee -a "/home/""$vmUserName""/.ssh/id_""$vmUserName"".pub"
echo "$vmUserSshPublicKey" | sudo tee -a "/home/""$vmUserName""/.ssh/authorized_keys"

# Secure the new user's SSH files and folder
sudo chmod 600 "/home/""$vmUserName"/.ssh/authorized_keys
sudo chmod 644 "/home/""$vmUserName"/.ssh/id_"$vmUserName".pub
sudo chmod 700 "/home/""$vmUserName"/.ssh
sudo chown -R "$vmUserName":"$vmUserName" "/home/""$vmUserName"/.ssh

Now the "real" user will be able to SSH to the VM! And we can finish our deploy work and delete the deployment user, disabling our ability (in dev environment or GitHub runner) to log into the VM again.

A physical building analogy: we were issued a key to a building, we completed our work inside, exited, locked the door behind us, threw away the key, and the lock that key could open has been removed. The actual owner of the building can be confident we can't get back in without their permission.

How to delete the deployment username and SSH key from the Azure VM

The last piece is how to delete the deploy account off the VM. Can an account delete itself? Yes! Here's a script that shows how, which we can invoke over SSH the same way as the user creation script above:

# Delete user
sudo userdel -rf "my_deployment_username"

The -rf options to userdel remove the user's home folder, and force the deletion to happen even if there are running processes in the user's context, as is the case if we are deleting "ourself" while logged in.
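
If that userdel line lives in a small local script - say delete-deployment-user.sh, a hypothetical name - it can be sent over SSH exactly like the user creation script, as the very last thing the deployment user does on the VM:

# Sketch: run the (hypothetical) local delete-deployment-user.sh on the VM as the
# deployment user, removing that same deployment user account and its home folder.
ssh -i ~/.ssh/my_deployment_username \
  my_deployment_username@my_vm_fqdn "bash -s" < ./delete-deployment-user.sh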

Wrapping up

Let's recap. This post showed how to work with Azure Linux VMs under the following conditions:

  • You cannot use the Azure Linux Agent and/or various Azure Linux VM extensions due to security, compliance, custom image, or other constraints
  • You need to run post-deployment VM installation and/or configuration steps
  • You need to avoid your deployment process (dev environment or CD pipelines) being able to access deployed VMs after the deployment process completes
  • You need to avoid your deployment process having access to durable SSH credentials

Under these conditions, this post demonstrated how to deploy Azure Linux VMs using transient SSH credentials for in-VM configuration, and enabling "real" user access to the VMs using only the lower-risk public key from a "real" user's SSH key pair.

As with my other recent posts around IaC, this post was part of a larger piece of work which will be the subject of an upcoming post tying it all together - stay tuned! In the meantime, hope this post was useful. Feel free to get in touch if you need Azure help.