# Custom GitHub Actions runners

The Infrastructure Team manages a pool of self-hosted GitHub Actions runners, meant to be used by whitelisted repositories that need to run tests on platforms not supported by the GitHub-hosted runners. We’re currently running the following machines:

The server configuration for the runners is managed with Ansible (playbook, role), and the source code for the tooling that runs on the servers is in the gha-self-hosted repository.

Please get in touch with the Infrastructure Team if you need to run builds on this pool for your project in the rust-lang organization.

## Maintenance procedures

### Updating the GitHub Actions runner version

Our self-hosted CI runs on a custom fork of the GitHub Actions runner, which improves the security of the setup. The fork needs to be manually rebased every time a new version comes out, though, and that needs to happen relatively quickly to prevent CI from stopping.[^1]

Once a new release of actions/runner is out, clone rust-lang/gha-runner and fetch the new tag pushed to the upstream repository. Then, rebase the changes on top of the latest tag:

```
git rebase --onto ${NEW_TAG} ${OLD_TAG} ${OLD_TAG}-rust${N}
```

For example, if the new tag is `v2.275.0`, the old tag is `v2.274.2`, and there have been two releases of our fork, the command to execute would be:

```
git rebase --onto v2.275.0 v2.274.2 v2.274.2-rust2
```

The last commit to rebase will conflict, as that commit updates the version number and the release notes. Add the `-rust1` suffix to the new version number and remove the description of the changes from the changelog (keeping the “Fork of the GitHub Actions runner used by the Rust Infrastructure Team.” sentence). Once the rebase is complete, force-push the commits to `main`.
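Putting the steps together, the whole update might look like the following sketch (the clone URL and the `upstream` remote setup are illustrative, not prescribed here):

```
# Hypothetical end-to-end sequence, using the tag names from the example above.
git clone git@github.com:rust-lang/gha-runner.git
cd gha-runner
git remote add upstream https://github.com/actions/runner.git
git fetch upstream --tags
git rebase --onto v2.275.0 v2.274.2 v2.274.2-rust2
# ...resolve the conflict in the last commit, bumping the version to v2.275.0-rust1...
git push --force origin HEAD:main
```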

After you force-push the new commits to `main` you’re done! CI will create a tag, build the release, upload it to GitHub Releases, and automatically push a commit to rust-lang/gha-self-hosted bumping the pinned runner version downloaded in the images. Shortly afterwards the servers will pull the latest changes, rebuild the images, and restart idle VMs.

[^1]: The GitHub Actions runner really wants to self-update when a new release is out, but such updates would undo our security mitigations. Because of that, one of the patches in our fork disables self-updates, which means the runner simply stops working until it’s manually updated.

### Changing the instances configuration

The set of instances available on each host is configured through the Ansible configuration located in the simpleinfra repo:

```
ansible/envs/prod/host_vars/{hostname}.yml
```

You’ll be able to add, remove, and resize instances by changing that file and applying the changes:

```
ansible/apply prod gha-self-hosted
```
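For example, a minimal workflow for the `ci-arm-1` host mentioned later in this page (any other host works the same way) might be:

```
# Hypothetical workflow; pick the host_vars file for the host you're changing.
$EDITOR ansible/envs/prod/host_vars/ci-arm-1.yml   # add, remove, or resize instances
ansible/apply prod gha-self-hosted                 # roll the changes out to the host
```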

### Forcing an update of the source code

The server checks for source code updates every 15 minutes, but it’s possible to start such a check in advance. Log into the machine you want to act on and run the following command:

```
sudo systemctl start gha-self-hosted-update
```

If the contents of the `images/` directory changed, an image rebuild will also be started. Each VM will switch to the new image once it finishes processing its current job.
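If you want to watch the update while it runs, you can follow the unit’s logs with `journalctl` (assuming output goes to the standard systemd journal):

```
# Follow the update unit's output; press Ctrl-C to stop following.
sudo journalctl -u gha-self-hosted-update -f
```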

### Forcing a rebuild of the images

The server automatically rebuilds the images every week, but it’s possible to rebuild them in advance. Log into the machine you want to act on and run the following command:

```
sudo systemctl start gha-self-hosted-rebuild-image
```
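Likewise, you can follow a rebuild in the journal; and, assuming the weekly rebuild is driven by a systemd timer (an assumption, not something this page states), `systemctl list-timers` shows when the next automatic run is scheduled:

```
sudo journalctl -u gha-self-hosted-rebuild-image -f   # follow the rebuild's progress
systemctl list-timers   # assumes the weekly rebuild is timer-driven
```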

### Managing the lifecycle of virtual machines

Each virtual machine is assigned a name and its own systemd unit, called `gha-vm-{name}.service`. For example, the `arm-1-1` VM is managed by the `gha-vm-arm-1-1.service` systemd unit. You can stop, start, and restart the virtual machine by stopping, starting, and restarting the systemd unit.
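For example, to restart the `arm-1-1` VM from the example above and verify it came back up:

```
sudo systemctl restart gha-vm-arm-1-1.service   # stop the VM and start it again
systemctl status gha-vm-arm-1-1.service         # confirm the unit is active
```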

Virtual machines are configured to restart after each build finishes.

### Logging into the virtual machines

It’s possible to log into the virtual machines from localhost to debug builds. This should only be used as a last resort. Each VM binds SSH on a custom port of the host (configured in the host’s Ansible configuration) and allows access to the `manage` user (with password `password`). For example, to log into the VM bound to port 2201 you can run:

```
ssh manage@localhost -p 2201
```

Note that the VM image regenerates its own host key every time it boots, so you’ll likely get host key mismatch errors when connecting to a freshly booted VM.
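Since the host key is regenerated on every boot, one way to sidestep those errors is to skip host key verification for these connections (reasonable here, since the traffic never leaves the host):

```
# Connect without recording or checking the VM's ephemeral host key.
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -p 2201 manage@localhost
```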

### Accessing the out-of-band console for Packet servers

In the event that a bare metal server hosted on Packet becomes unreachable but is still marked as online, it’s possible to access the out-of-band console over the serial port to get a root shell.

To access it, retrieve the root password configured on the server with:

```
aws ssm get-parameter --name /prod/ansible/HOSTNAME/root-password --with-decryption --query 'Parameter.Value' --output text
```

For example, to get the root password of `ci-arm-1`, run:

```
aws ssm get-parameter --name /prod/ansible/ci-arm-1/root-password --with-decryption --query 'Parameter.Value' --output text
```

Then, log into the Packet console, navigate to the server’s page, and click the “out-of-band console” button at the top right: the SSH command to use will be shown. Once you run the command you will be asked to log in: use `root` as the username and the password you fetched earlier.

To exit the out-of-band console, press Enter followed by `~.` (SSH’s standard escape sequence for closing a connection).