SPRUSH Training 24.07 Postmortem
Foreword #
SPRUSH conducted its third Attack/Defense CTF, and it was the worst of the three:
- We had an hour delay because of some infrastructure problems described below
- One of the services was broken (and its checker, too)
I consider both mishaps my own mistakes and explore them below.
Infrastructure details #
Virtual machines are provisioned on Yandex.Cloud using Terraform, and Ansible playbooks configure everything on top of them. This includes creating users (and storing their passwords) so players can access their vulnboxes.
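For context, this is roughly what a vulnbox resource looks like - a sketch with placeholder values, not the real configuration:

# main.tf - minimal sketch of the setup described above; the team
# count, sizing and all IDs are placeholders, not the real configuration
terraform {
  required_providers {
    yandex = {
      source = "yandex-cloud/yandex"
    }
  }
}

resource "yandex_compute_instance" "vulnbox" {
  count = 30 # hypothetical team count
  name  = format("team%03d", count.index + 1)

  resources {
    cores  = 2
    memory = 4
  }

  boot_disk {
    initialize_params {
      image_id = "fd80..." # placeholder image ID
    }
  }

  network_interface {
    subnet_id = "e9b0..." # placeholder; no public IP - see the next sections
  }
}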
Vulnboxes Without Passwords #
Three hours before the start of the competition we were still testing our services and infrastructure. Testing ran long, so there was no time to redeploy everything from scratch; only the vulnboxes were recreated. I reran the playbooks to configure them, assembled the packages and sent them to the players.
13:25 - Players report that the passwords do not work. I test them - yep, none of them work. Weird. The fix should be straightforward: just delete the files that store each machine's password so that the create-user playbook runs again… Oh. That's exactly what went wrong.
I have this playbook for user creation:
- name: Check whether we recorded user's password
  ansible.builtin.stat:
    path: "{{ playbook_dir }}/../internal/passwds/{{ inventory_hostname }}"
  delegate_to: localhost
  register: password_record

- name: Install makepasswd
  ansible.builtin.apt:
    name:
      - makepasswd
    update_cache: true
  become: true
  when: not password_record.stat.exists

...
All the other tasks are gated on not password_record.stat.exists, so they are skipped whenever Check whether we recorded user's password finds an existing password file. Since running terraform apply -replace="..." didn't delete internal/passwds/{{ inventory_hostname }}, the whole gated part of the playbook was skipped. Yikes, no users for you.
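The skipped part looks roughly like this - a sketch rather than my exact tasks; the makepasswd invocation and user-module parameters are assumptions:

# Hypothetical continuation of the playbook above; module parameters
# and the makepasswd flags are illustrative assumptions
- name: Generate a password
  ansible.builtin.command: makepasswd --chars 20
  register: generated_password
  when: not password_record.stat.exists

- name: Create the team user
  ansible.builtin.user:
    name: team
    password: "{{ generated_password.stdout | password_hash('sha512') }}"
  become: true
  when: not password_record.stat.exists

- name: Record the password locally
  ansible.builtin.copy:
    content: "{{ generated_password.stdout }}"
    dest: "{{ playbook_dir }}/../internal/passwds/{{ inventory_hostname }}"
  delegate_to: localhost
  when: not password_record.stat.exists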
A reasonable fix for this issue is to define a destroy-time provisioner in each vulnbox and make it delete that machine's password file:
resource "yandex_compute_instance" "vulnbox" {
...
provisioner "local-exec" {
when = destroy
command = "rm ../internal/passwds/${format("team%03d", count.index + 1)}"
}
}
This essentially binds our local file to the lifecycle of Terraform's managed resource. Neat. (Destroy-time provisioners may only reference self, count.index and each.key, so the command above stays within the rules.)
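With that in place, recreating a vulnbox behaves as intended; the index below is illustrative:

# Recreate a single vulnbox from scratch. Terraform runs the destroy
# provisioner first, deleting the stale password record, so the gated
# create-user tasks fire again on the next playbook run.
terraform apply -replace='yandex_compute_instance.vulnbox[4]'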
ControlPersist Moment #
In my infrastructure vulnboxes do not have a public IP address - they go through a NAT instance to access the internet. A few months ago I used to copy the deploy directory with Ansible playbooks to the gateway machine, which has direct access to the vulnboxes, and configure them from there. Overhead. I spent some time and set up a jump host, eliminating that additional level of indirection. Practically every article about setting up a jump host mentions the ControlPersist directive, something like this:
# ~/.ssh/config
Host gateway
    ...
    ControlMaster auto
    ControlPersist 5m
    ControlPath ~/.ssh/ansible-%r@%h:%p
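The jump itself, for completeness, is just a ProxyJump entry along these lines; the vulnbox host pattern is my illustration, not the real inventory:

# ~/.ssh/config (continued) - host pattern is illustrative
Host team*
    ProxyJump gateway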
The three Control* lines make ssh reuse existing connections, which should make concurrent connections faster (I did not measure it, though). The problem occurred when I lost uplink for a while - I was moving between two access points. That was exactly the moment I had to recreate the vulnbox passwords, and… Ansible fails. The vulnboxes are unreachable.
First thought - terraform somehow fucked everything up and the vulnboxes aren't even up. I try to ssh into the gateway. The session hangs. I blame my uni's network and ping the gateway. Pings are there. I ssh -vvv. It hangs. Right at the beginning. “I've seen this somewhere,” I think as I lower the MTU on my WLAN interface, cursing MEPhI's networking guys out. This does not help. I run to another building which has decent connectivity. The problem does not go away.
My mind quickly dissected the situation: I have ping, I can ssh into jury, I cannot ssh into gateway. The only significant difference between jury and gateway is the ControlPersist options. I removed them from the config and - got my connectivity back. In hindsight the hang makes sense: with ControlMaster auto the client tries to reuse the existing control socket, but the master's TCP connection had silently died together with my uplink, so every new session just waited on a dead connection.
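Two quick checks that would have confirmed a stale control master - commands I'd reach for now, not something I ran at the time:

# Ask the control master for its status; with a dead underlying
# connection this hangs or errors out - the tell-tale sign
ssh -O check gateway

# List leftover control sockets matching the ControlPath pattern above
ls -l ~/.ssh/ansible-*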
A possible solution is to add ServerAliveInterval 10 to the existing configuration, but I haven't tested it. A quick hotfix: ssh -O exit gateway, which tells the running master to exit and drops the existing managed sessions.
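The untested variant would look like this; ServerAliveCountMax 3 is my assumption for a sensible probe budget:

# ~/.ssh/config - untested: a master whose connection silently died
# should be torn down after ~30 seconds instead of hanging forever
Host gateway
    ...
    ControlMaster auto
    ControlPersist 5m
    ControlPath ~/.ssh/ansible-%r@%h:%p
    ServerAliveInterval 10
    ServerAliveCountMax 3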
Untested Services #
TBD.