====== Proxmox maintenance ======
This page documents maintenance for the Hackeriet Proxmox hosts in [[infra:clusters:klynge001|klynge001]]. It is a runbook and maintenance log.
===== Current scope =====
Hosts currently covered by this procedure:
* [[infra:hosts:host006|host006]]
* [[infra:hosts:host007|host007]]
===== Last maintenance: 2026-05-31 =====
Scope: host006 and host007 in the ''klynge001'' Proxmox cluster.
Actions performed:
* Removed obsolete local backup archives on host006 for moved/stopped service VMs.
* Updated the scheduled Proxmox backup job to exclude moved/stopped VMs:
* ''105'' / ''blade''
* ''510'' / ''ingress''
* ''511'' / ''app-01''
* Upgraded host006 and host007 to Proxmox VE 8.4.19.
* Rebooted host006 and host007 one at a time to activate kernel 6.8.12-28-pve.
* Temporarily adjusted expected votes during single-node reboot windows so the remaining node stayed quorate.
* Verified storage, package state, cluster quorum, guest state, and basic guest reachability after reboots.
Final state after maintenance:
* host006: Proxmox VE 8.4.19, kernel 6.8.12-28-pve.
* host007: Proxmox VE 8.4.19, kernel 6.8.12-28-pve.
* Cluster: 2 nodes, expected votes 2, quorate.
* No pending package upgrades were listed on either node.
* host006 root filesystem usage was about 72% after cleanup and updates.
* host007 root filesystem usage was about 19% after updates.
Follow-up actions completed after the maintenance:
* Retired Munin on host006 and host007 by disabling and stopping ''munin-node.service''.
* Cleared stale ''lock: migrate'' locks on moved/stopped VMs.
* Disabled autostart for moved/stopped VMs:
* ''105'' / ''blade''
* ''510'' / ''ingress''
* ''511'' / ''app-01''
* ''601'' / ''idp1''
* Verified that ''systemctl --failed'' was clear on both nodes after Munin retirement.
===== What we learned =====
* The cluster currently operates as a two-node cluster. While one node reboots, the remaining node can temporarily lose quorum unless expected votes is lowered on it for the duration of the reboot.
* Expected votes returned to 2 after both nodes were back and joined.
* ''munin-node.service'' had been failing for months on both host006 and host007. It is now intentionally retired on these hosts.
* Moved/stopped service VMs may still carry stale Proxmox migration locks and ''onboot: 1'' from before migration. Clear stale locks only after confirming there is no active migration task.
* host007 emitted GRUB/LVM warnings about a missing physical volume name ''pv1'' during update-grub, but rebooted successfully on the new kernel. The active LVM metadata still contains an internal ''pv1'' label.
* host006 emitted a GRUB warning that the removable EFI fallback path is not updated automatically. The explicit Proxmox EFI boot entry worked and the host rebooted successfully.
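The stale lock and autostart settings mentioned above can be spotted from ''qm config'' output. The following is a minimal sketch, assuming the usual ''key: value'' layout of that output; the function name ''vm_flags'' is hypothetical and not part of Proxmox.

```shell
# Summarise lock and autostart state from `qm config <vmid>` output on stdin.
# Prints e.g. "lock=migrate onboot=1".
# A missing lock is reported as "none"; a missing onboot as "0" (the Proxmox default).
vm_flags() {
  awk -F': *' '
    $1 == "lock"   { lock = $2 }
    $1 == "onboot" { onboot = $2 }
    END {
      printf "lock=%s onboot=%s\n", (lock == "" ? "none" : lock), (onboot == "" ? "0" : onboot)
    }
  '
}
```

For example, ''qm config 105 | vm_flags'' shows whether VM 105 still carries a lock or autostart. Once it is confirmed that no migration task is active, ''qm unlock 105'' and ''qm set 105 --onboot 0'' are one way to clear them.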
===== Maintenance procedure =====
Work one host at a time. Do not reboot both host006 and host007 at once.
Before making changes on either host:
hostname -f
pveversion -v
uname -r
pvecm status
systemctl --failed --no-pager
pvesm status
df -h -x tmpfs -x devtmpfs
qm list
cat /etc/pve/jobs.cfg
apt update
apt list --upgradable
test -f /var/run/reboot-required && cat /var/run/reboot-required || true
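It can help to keep the output of these checks so it can be compared with the post-host checks afterwards. A minimal sketch, assuming a POSIX shell; the helper name ''capture_state'' and any output file name are hypothetical.

```shell
# Run each argument as a shell command and print its output under a header,
# so a "before" and an "after" capture can be compared with diff.
capture_state() {
  for cmd in "$@"; do
    printf '### %s\n' "$cmd"
    # Keep going on failure: a failing check is itself useful information.
    sh -c "$cmd" 2>&1 || printf '(exit status %s)\n' "$?"
  done
}
```

Usage might look like ''capture_state 'pveversion -v' 'uname -r' 'pvecm status' 'qm list' > before.txt'', followed by the same call into ''after.txt'' and ''diff before.txt after.txt'' once the host is back.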
For host006, also check local storage pressure:
du -sh /var/lib/vz/dump /var/lib/vz/template/iso /var/log /var/cache/apt /root/proxmox-templates /var/lib/fail2ban
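Root filesystem usage can be turned into a simple pass/fail check before updating. This sketch reads ''df -P'' output; the function names and the default limit of 80% are assumptions, not values taken from this runbook.

```shell
# Print the usage percentage (digits only) from `df -P <path>` output on stdin.
usage_pct() {
  awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Succeed when usage is below the limit given as $1.
# The default of 80 is an assumed threshold; pick one that suits the host.
usage_below() {
  limit=${1:-80}
  pct=$(usage_pct)
  [ "$pct" -lt "$limit" ]
}
```

For example, ''df -P / | usage_below 80 || echo "root filesystem is getting full"''.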
Update flow for each host:
- Confirm cluster health with ''pvecm status''.
- Confirm storage health with ''pvesm status'' and ''df -h''.
- Review running guests with ''qm list''.
- Review failed units with ''systemctl --failed --no-pager''.
- Simulate package changes if the update set is large or risky.
- Apply updates only after reviewing the package set.
- Reboot only if required or clearly useful, such as after a kernel update.
- After reboot, wait for the node to return and confirm cluster health before touching the next host.
Suggested commands after review:
apt-get -s full-upgrade
apt full-upgrade
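The simulated run prints one ''Inst'' line per package that would be installed or upgraded, which makes the update set easy to summarise. A minimal sketch; the function name is hypothetical, and the kernel package prefixes ''proxmox-kernel'' and ''pve-kernel'' are assumptions about how kernel packages are named on these hosts.

```shell
# Summarise `apt-get -s full-upgrade` output read on stdin.
# Counts "Inst" lines and lists kernel packages, which usually imply a reboot.
summarize_simulation() {
  awk '
    $1 == "Inst" {
      n++
      if ($2 ~ /^(proxmox-kernel|pve-kernel)/) k = k " " $2
    }
    END { printf "packages=%d kernel=%s\n", n, (k == "" ? "none" : substr(k, 2)) }
  '
}
```

For example, ''apt-get -s full-upgrade | summarize_simulation''. A non-empty kernel list is a hint to plan a reboot window for that host.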
===== Quorum during reboot =====
When one node is rebooted, the remaining node may temporarily lose quorum. If that happens during planned maintenance, set expected votes to 1 on the remaining node:
pvecm expected 1
pvecm status
After the rebooted node rejoins, confirm the cluster has returned to two nodes and expected votes 2:
pvecm status
Do not use this as an incident workaround without understanding which node has the correct cluster state.
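The "two nodes, expected votes 2, quorate" check can be scripted instead of read by eye. This is a sketch that assumes ''pvecm status'' prints ''Quorate:'', ''Expected votes:'' and ''Nodes:'' lines in its usual layout; both function names are hypothetical.

```shell
# Extract quorum facts from `pvecm status` output read on stdin.
# Prints e.g. "quorate=Yes expected=2 nodes=2".
parse_pvecm_status() {
  awk -F': *' '
    { sub(/[ \t\r]+$/, "") }            # drop trailing whitespace
    /^Quorate:/        { q = $2 }
    /^Expected votes:/ { e = $2 }
    /^Nodes:/          { n = $2 }
    END { printf "quorate=%s expected=%s nodes=%s\n", q, e, n }
  '
}

# Succeed only when the cluster is back to its normal two-node state.
cluster_settled() {
  [ "$(parse_pvecm_status)" = "quorate=Yes expected=2 nodes=2" ]
}
```

For example, ''pvecm status | cluster_settled && echo "safe to continue with the next host"''.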
===== Post-host checks =====
After each host update or reboot:
hostname -f
pveversion
uname -r
pvecm status
systemctl --failed --no-pager
pvesm status
df -h -x tmpfs -x devtmpfs
qm list
apt list --upgradable
Check guests affected by the touched host:
qm status <vmid>
qm agent <vmid> ping
qm guest cmd <vmid> network-get-interfaces
qm config <vmid> | sed -n '/^ipconfig/p;/^net/p'
ping -c 2 <guest-ip>
nc -vz -w3 <guest-ip> 22
Interpretation:
* ''qm status <vmid>'' reporting ''running'' means the hypervisor sees the VM running.
* ''qm agent <vmid> ping'' succeeding means the guest OS and QEMU guest agent are responsive.
* ''ping'' succeeding means the basic network path works, if ICMP is allowed.
* ''nc -vz -w3 <guest-ip> 22'' succeeding means SSH is listening, if SSH is expected for that guest.
* Lack of ping or SSH does not always mean the guest is broken. Investigate only failures that contradict expectations for that guest.
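To run the guest checks only against guests that are actually running, the VM IDs can be taken from ''qm list''. A minimal sketch, assuming the usual column order of ''qm list'' (VMID, NAME, STATUS, ...); the function name is hypothetical.

```shell
# Print the VMIDs of running guests from `qm list` output read on stdin.
running_vmids() {
  awk 'NR > 1 && $3 == "running" { print $1 }'
}
```

For example, ''qm list | running_vmids'' yields one VM ID per line, which can then be fed to ''qm agent <vmid> ping'' in a loop. Guests without the QEMU guest agent will fail that ping, which is expected.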
===== Monitoring / LibreNMS =====
As of 2026-06-01, host006 and host007 are monitored in LibreNMS as Proxmox hypervisors. This replaces the old Munin host monitoring for these nodes.
LibreNMS records:
* host006.hackeriet.no: LibreNMS device ID 25.
* host007.hackeriet.no: LibreNMS device ID 26.
* Both devices have OS detected as ''proxmox''.
* Both devices have the LibreNMS ''proxmox'' application enabled with app instance ''klynge001''.
Host-side setup:
* ''snmpd'' is installed, enabled, and running on both hosts.
* SNMP listens only on the management IP of each host:
* host006: ''10.10.50.26:161/udp''
* host007: ''10.10.50.27:161/udp''
* SNMP uses SNMPv3 authPriv with the security name ''librenms_klynge001''.
* Credentials are stored in Hackerpass at ''infrastructure/librenms-klynge001-snmpv3''. Do not put the credential values in the wiki or NetBox.
* The Proxmox LibreNMS agent-local script is installed as ''/usr/local/libexec/librenms-proxmox''.
* SNMP exposes Proxmox VM traffic through:
extend proxmox /usr/bin/sudo /usr/local/libexec/librenms-proxmox
* ''/etc/sudoers.d/librenms-proxmox'' allows the ''Debian-snmp'' user to run only that script via sudo.
Firewall setup:
* The Proxmox cluster firewall allows SNMP only from app-01 / LibreNMS source address ''10.10.50.51'' to the ''klynge001'' IP set.
* The firewall rule is in ''/etc/pve/firewall/cluster.fw'':
IN ACCEPT -source 10.10.50.51/32 -dest +klynge001 -p udp -dport 161 -log nolog # LibreNMS SNMP polling from app-01
LibreNMS setup:
* ''enable_proxmox'' is set to ''true''.
* The ''unix-agent'' poller module is disabled per device for host006 and host007, because this setup uses SNMP extend instead of the LibreNMS unix-agent on port 6556.
* No LibreNMS alert rules existed when this monitoring was added. Host metrics and Proxmox application polling are live, but alert policy still needs to be defined.
Useful verification commands:
# On each Proxmox host
systemctl is-active snmpd
ss -lunp | grep ':161'
sudo -u Debian-snmp sudo /usr/local/libexec/librenms-proxmox
# From the LibreNMS container on app-01
snmpwalk -v3 -l authPriv -u librenms_klynge001 -a SHA -A '<auth-passphrase>' -x AES -X '<priv-passphrase>' host006.hackeriet.no SNMPv2-MIB::sysName.0
snmpget -v3 -l authPriv -u librenms_klynge001 -a SHA -A '<auth-passphrase>' -x AES -X '<priv-passphrase>' -Oqv host006.hackeriet.no .1.3.6.1.4.1.8072.1.3.2.3.1.2.7.112.114.111.120.109.111.120
lnms device:poll -m applications host006.hackeriet.no
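The long numeric OID in the ''snmpget'' command is the ''nsExtendOutputFull'' column of NET-SNMP-EXTEND-MIB, indexed by the extend name: the name length followed by the ASCII code of each character. The sketch below rebuilds it for any extend name; the function name is hypothetical, and it assumes a plain ASCII name.

```shell
# Build the nsExtendOutputFull OID for a given snmpd "extend" name.
# The index is the name length followed by each byte as a decimal number.
extend_output_oid() {
  name=$1
  codes=$(printf '%s' "$name" | od -An -v -tu1 | tr -s ' \n' '..' | sed 's/^\.//; s/\.$//')
  printf '.1.3.6.1.4.1.8072.1.3.2.3.1.2.%s.%s\n' "${#name}" "$codes"
}
```

''extend_output_oid proxmox'' prints the OID used above, ending in ''7.112.114.111.120.109.111.120'' for the seven letters of ''proxmox''.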
===== Follow-up items =====
* Define LibreNMS alert rules and notification routing for klynge001 host monitoring.
* Investigate the host007 GRUB/LVM ''pv1'' warning.
* Investigate the host006 EFI fallback warning.
* Review whether the two-node quorum procedure should be part of the standard klynge001 maintenance runbook.
* Keep NetBox VM placement aligned with the current reality for moved service VMs.
* If Munin package cleanup is desired later, remove it in a separate low-risk cleanup task.
===== Other notes =====
* host006 storage and backup context is documented at [[infra:operations:proxmox-backups|Proxmox backups]].
* Certificate automation for internal Proxmox hostnames is documented at [[infra:operations:proxmox-acme-dns|Proxmox ACME DNS automation]].