====== Proxmox maintenance ======

This is a runbook for planned maintenance on the Hackeriet Proxmox hosts in [[infra:clusters:klynge001|klynge001]].

It is documentation and procedure, not inventory. Use NetBox for canonical device, IP, cabling, and VM placement data.

===== Current scope =====

Planned hosts:

  * [[infra:hosts:host006|host006]]
  * [[infra:hosts:host007|host007]]

Current goals:

  * Bring host006 and host007 up to date.
  * Review failed services and storage health.
  * Keep the cluster healthy while working one host at a time.
  * Avoid guest-level changes unless needed for recovery.

===== Announcement draft =====

Planned Proxmox maintenance for Hackeriet

I plan to do maintenance on the Proxmox hosts host006 and host007 in the klynge001 cluster within the next few days.

Scope:

  * OS and Proxmox package updates
  * storage and backup health checks
  * failed service review
  * possible host reboots if required

Expected impact:

  * Some VMs and services may be briefly unavailable.
  * I will avoid guest-level changes unless needed for recovery.
  * I will work on one host at a time and check cluster health between steps.

===== DNS and service risk =====

Live DNS checks on 2026-05-23 showed that hackeriet.no has two authoritative nameservers:

  * ns0.hackeriet.no - Hackeriet hosted, resolves to blade at 185.35.202.202 and 2a02:ed06::202
  * ns.hyp.net - external nameserver, resolves to 194.63.248.53 and 2a01:5b40:0:248::53

Both authoritative nameservers served the same SOA serial when checked. DNS resolution should survive a short outage of ns0 because ns.hyp.net is external and in sync. This is DNS redundancy only; do not treat it as service redundancy.

Important service dependencies observed:

  * hackeriet.no and www.hackeriet.no point to blade.
  * wiki.hackeriet.no points to blade.
  * hackeriet.no MX points to blade.
  * ns0.hackeriet.no points to blade.
  * ip.hackeriet.no and nms.hackeriet.no point through ingress.
  * blade is currently documented as a VM on host007.
  * ingress is VM 510 on host006.
  * idp1 was observed on host007.

Maintenance implications:

  * Rebooting host007 can affect blade, the public website, the wiki, the mail target, ns0, and likely the IDP.
  * Rebooting host006 can affect ingress-routed services such as NetBox and LibreNMS.
  * Avoid DNS zone edits during the maintenance window.
  * Keep a local copy of this runbook before starting, because the wiki and NetBox may be affected.

Certificate automation for internal Proxmox hostnames is documented at [[infra:operations:proxmox-acme-dns|Proxmox ACME DNS automation]].

===== Pre-maintenance checks =====

Run on both host006 and host007 before making changes:

  * hostname -f
  * pveversion -v
  * pvecm status
  * systemctl --failed --no-pager
  * pvesm status
  * df -h
  * qm list
  * cat /etc/pve/jobs.cfg
  * apt update
  * apt list --upgradable
  * test -f /var/run/reboot-required && cat /var/run/reboot-required || true

On host006, also check local storage pressure:

  * du -sh /var/lib/vz/dump /var/lib/vz/template/iso /var/log /var/cache/apt /root/proxmox-templates /var/lib/fail2ban

Before rebooting anything, check DNS redundancy:

  * dig @ns0.hackeriet.no SOA hackeriet.no
  * dig @ns.hyp.net SOA hackeriet.no

The SOA serials returned by the two nameservers should match.

===== Maintenance procedure =====

Work on one host at a time. Do not reboot host006 and host007 at the same time.

Suggested order:

  - Start with host006 if the main concern is storage and backup health.
  - Start with host007 if the ingress services hosted on host006 must stay stable at first.

For each host:

  - Confirm cluster state with pvecm status.
  - Confirm storage state with pvesm status and df -h.
  - Review failed units with systemctl --failed --no-pager.
  - Run apt update.
  - Review apt list --upgradable.
  - Apply updates only after reviewing the package set.
  - Reboot only if required or clearly useful.
  - After a reboot, wait for the node to return and confirm cluster health before touching the next host.
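The "confirm cluster health before touching the next host" step can be sketched as a small set of shell helpers. This is a minimal sketch, not an established Hackeriet script: the function names ''is_quorate'', ''no_failed_units'' and ''wait_for_quorum'' are hypothetical, the retry count and delay are arbitrary defaults, and it assumes that pvecm status prints a "Quorate: Yes" line when the cluster has quorum.

```shell
#!/bin/sh
# Hypothetical helpers for the per-host gate check. Only pvecm,
# systemctl, grep and sleep are real commands; the function names
# and default values are assumptions made for illustration.

# Succeeds if the `pvecm status` text on stdin reports quorum.
is_quorate() {
    grep -Eq '^Quorate:[[:space:]]+Yes[[:space:]]*$'
}

# Succeeds if stdin contains no non-blank text. Intended for the
# output of `systemctl --failed --no-legend --plain`, which is
# empty when no unit has failed.
no_failed_units() {
    ! grep -q '[^[:space:]]'
}

# Poll `pvecm status` until it reports quorum or the attempts run out.
# $1 = number of attempts (default 30), $2 = seconds between attempts (default 10).
wait_for_quorum() {
    attempts=${1:-30}
    delay=${2:-10}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if pvecm status 2>/dev/null | is_quorate; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

One possible use, run on the node that was just rebooted so that the result reflects that node's own view of the cluster: ''wait_for_quorum 30 10 && systemctl --failed --no-legend --plain | no_failed_units && echo "ok to continue"''. A passing check does not replace reading the full pvecm status and pvesm status output.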
Suggested update command, after review:

  * apt full-upgrade

Do not change guest VM configuration as part of host maintenance unless needed for recovery.

===== Post-host checks =====

After each host update or reboot:

  * hostname -f
  * pveversion -v
  * pvecm status
  * systemctl --failed --no-pager
  * pvesm status
  * df -h
  * qm list

Check DNS and key service names:

  * dig @ns0.hackeriet.no SOA hackeriet.no
  * dig @ns.hyp.net SOA hackeriet.no
  * dig hackeriet.no A
  * dig wiki.hackeriet.no A
  * dig idp.hackeriet.no A
  * dig ip.hackeriet.no A
  * dig nms.hackeriet.no A

When the relevant host has been touched, check the actual services, not only DNS.

===== Host006 notes =====

host006 has about 1 TB of physical storage, but Proxmox local storage is on the root filesystem. The root filesystem was previously close to full, and local backups under /var/lib/vz/dump were the main pressure point.

Known cleanup/remediation context is documented at [[infra:operations:proxmox-backups|Proxmox backups]].

During maintenance, avoid casual LVM reshaping. It can put VM disks at risk and should only be done with a dedicated maintenance window and a recovery plan.

===== Safety notes =====

  * Do not touch guest VMs unless required for recovery.
  * Do not change DNS during the maintenance window unless DNS itself is the incident.
  * Do not delete backups or ISOs without understanding what they are.
  * Keep notes locally while working; the wiki, NetBox, IDP, and public services may be affected depending on which host is down.
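The SOA comparison used in the pre-maintenance and post-host checks can be scripted so that a mismatch is harder to overlook. The following is a rough sketch under stated assumptions: the function names ''soa_serial'', ''serials_match'' and ''check_soa'' are hypothetical, and it relies on dig +short printing the SOA record as a single line with the serial in the third field.

```shell
#!/bin/sh
# Hypothetical helpers for comparing SOA serials between two
# nameservers. Only dig and awk are real commands; the function
# names are assumptions made for illustration.

# Print the serial (third field) from `dig +short SOA` output on stdin.
# An SOA answer has seven fields: mname rname serial refresh retry expire minimum.
soa_serial() {
    awk 'NF >= 7 { print $3; exit }'
}

# Succeeds only when both serials are non-empty and identical.
serials_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Query both nameservers for the zone's SOA record and compare serials.
# $1 = zone, $2 = first nameserver, $3 = second nameserver.
check_soa() {
    zone=$1
    a=$(dig +short @"$2" SOA "$zone" | soa_serial)
    b=$(dig +short @"$3" SOA "$zone" | soa_serial)
    echo "$2: ${a:-no answer}"
    echo "$3: ${b:-no answer}"
    serials_match "$a" "$b"
}
```

With the two nameservers named in this runbook, the call would be ''check_soa hackeriet.no ns0.hackeriet.no ns.hyp.net''. A non-zero exit status means the serials differ or at least one nameserver did not answer, which is expected for ns0 while host007 is down.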