Proxmox maintenance

This page documents maintenance for the Hackeriet Proxmox hosts in klynge001. It is a runbook and maintenance log.

Current scope

Hosts currently covered by this procedure: host006 and host007.

Last maintenance: 2026-05-31

Scope: host006 and host007 in the klynge001 Proxmox cluster.

Actions performed:

  • Removed obsolete local backup archives on host006 for moved/stopped service VMs.
  • Updated the scheduled Proxmox backup job to exclude moved/stopped VMs:
    • 105 / blade
    • 510 / ingress
    • 511 / app-01
  • Upgraded host006 and host007 to Proxmox VE 8.4.19.
  • Rebooted host006 and host007 one at a time to activate kernel 6.8.12-28-pve.
  • Temporarily adjusted expected votes during single-node reboot windows so the remaining node stayed quorate.
  • Verified storage, package state, cluster quorum, guest state, and basic guest reachability after reboots.

Final state after maintenance:

  • host006: Proxmox VE 8.4.19, kernel 6.8.12-28-pve.
  • host007: Proxmox VE 8.4.19, kernel 6.8.12-28-pve.
  • Cluster: 2 nodes, expected votes 2, quorate.
  • No pending package upgrades were listed on either node.
  • host006 root filesystem usage was about 72% after cleanup and updates.
  • host007 root filesystem usage was about 19% after updates.

Follow-up actions completed after the maintenance:

  • Retired Munin on host006 and host007 by disabling and stopping munin-node.service.
  • Cleared stale lock: migrate entries from the configuration of moved/stopped VMs.
  • Disabled autostart for moved/stopped VMs:
    • 105 / blade
    • 510 / ingress
    • 511 / app-01
    • 601 / idp1
  • Verified that systemctl --failed reported no failed units on either node after Munin retirement.

What we learned

  • The cluster currently operates as a two-node cluster. While one node is rebooting, the remaining node can temporarily lose quorum unless expected votes is lowered on it for the duration of the reboot.
  • Expected votes returned to 2 after both nodes were back and joined.
  • munin-node.service had been failing for months on both host006 and host007. It is now intentionally retired on these hosts.
  • Moved/stopped service VMs may still carry stale Proxmox migration locks and onboot: 1 from before migration. Clear stale locks only after confirming there is no active migration task.
  • host007 emitted GRUB/LVM warnings about a missing physical volume name pv1 during update-grub, but rebooted successfully on the new kernel. The active LVM metadata still contains an internal pv1 label.
  • host006 emitted a GRUB warning that the removable EFI fallback path is not updated automatically. The explicit Proxmox EFI boot entry worked and the host rebooted successfully.
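
The stale lock and onboot leftovers described above can be spotted before touching anything. The following is a minimal sketch, assuming qm config prints one "key: value" pair per line; the function name vm_flags is hypothetical and not part of Proxmox.

```shell
# vm_flags (hypothetical helper): read `qm config <vmid>` output on stdin and
# print any lock entry and an enabled onboot flag, one per line.
# Assumption: the config is printed as "key: value" lines.
vm_flags() {
    awk -F': *' '
        $1 == "lock"                { print "lock=" $2 }
        $1 == "onboot" && $2 == "1" { print "onboot=1" }
    '
}
```

Typical use would be qm config 105 | vm_flags. If it prints lock=migrate and no migration task is active for that VM, qm unlock 105 clears the lock, and qm set 105 --onboot 0 disables autostart.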

Maintenance procedure

Work one host at a time. Do not reboot both host006 and host007 at once.

Before making changes on either host:

hostname -f
pveversion -v
uname -r
pvecm status
systemctl --failed --no-pager
pvesm status
df -h -x tmpfs -x devtmpfs
qm list
cat /etc/pve/jobs.cfg
apt update
apt list --upgradable
test -f /var/run/reboot-required && cat /var/run/reboot-required || true
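
To compare state before and after maintenance, the output of the read-only checks above can be saved to a file. This is a rough sketch rather than an established part of the procedure; the function name capture_state and the output layout are assumptions.

```shell
# capture_state LABEL DIR CMD... (hypothetical helper): run each command,
# append its output under a "### command" header to DIR/LABEL.txt, and print
# the resulting file name. A failing command is recorded, not treated as fatal.
capture_state() {
    label=$1
    dir=$2
    shift 2
    mkdir -p "$dir"
    out="$dir/$label.txt"
    : > "$out"
    for cmd in "$@"; do
        printf '### %s\n' "$cmd" >> "$out"
        sh -c "$cmd" >> "$out" 2>&1 || printf '(exit status %s)\n' "$?" >> "$out"
    done
    printf '%s\n' "$out"
}
```

For example, capture_state pre /root/maint 'pvecm status' 'pvesm status' 'qm list' before the update and the same call with the label post afterwards, followed by diff /root/maint/pre.txt /root/maint/post.txt. The directory /root/maint is only an illustration.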

For host006, also check local storage pressure:

du -sh /var/lib/vz/dump /var/lib/vz/template/iso /var/log /var/cache/apt /root/proxmox-templates /var/lib/fail2ban
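
Since host006 sat at about 72% root filesystem usage after the last cleanup, it can help to flag filesystems above a chosen threshold. A minimal sketch, assuming POSIX-format df -P output and mount points without spaces; fs_over is a hypothetical name.

```shell
# fs_over LIMIT (hypothetical helper): read `df -P` output on stdin and print
# "mountpoint usage%" for every filesystem at or above LIMIT percent.
# Assumption: mount points contain no whitespace.
fs_over() {
    awk -v limit="$1" '
        NR > 1 {
            use = $5
            sub(/%$/, "", use)
            if (use + 0 >= limit) print $6 " " use "%"
        }
    '
}
```

Usage: df -P -x tmpfs -x devtmpfs | fs_over 70. No output means nothing is at or above the threshold.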

Update flow for each host:

  1. Confirm cluster health with pvecm status.
  2. Confirm storage health with pvesm status and df -h.
  3. Review running guests with qm list.
  4. Review failed units with systemctl --failed --no-pager.
  5. Simulate package changes if the update set is large or risky.
  6. Apply updates only after reviewing the package set.
  7. Reboot only if required or clearly useful, such as after a kernel update.
  8. After reboot, wait for the node to return and confirm cluster health before touching the next host.

Suggested commands after review:

apt-get -s full-upgrade
apt full-upgrade
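
Whether a reboot is needed, and whether it took effect, can be checked with two small helpers. This is a sketch under assumptions: the flag file only exists if a package hook created it, so its absence does not prove that no reboot is needed, and both function names are hypothetical.

```shell
# reboot_pending [FLAG_FILE] (hypothetical helper): succeed if the
# reboot-required flag file exists. Defaults to the path used in the
# pre-checks on this page.
reboot_pending() {
    flag=${1:-/var/run/reboot-required}
    [ -f "$flag" ]
}

# kernel_is EXPECTED [RUNNING] (hypothetical helper): succeed if the running
# kernel release equals EXPECTED. RUNNING defaults to `uname -r`.
kernel_is() {
    expected=$1
    running=${2:-$(uname -r)}
    [ "$running" = "$expected" ]
}
```

After the reboot, kernel_is 6.8.12-28-pve && echo "new kernel active" confirms the kernel named in the last maintenance entry.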

Quorum during reboot

When one node is rebooted, the remaining node may temporarily lose quorum. If that happens during planned maintenance, set expected votes to 1 on the remaining node:

pvecm expected 1
pvecm status

After the rebooted node rejoins, confirm the cluster has returned to two nodes and expected votes 2:

pvecm status

Do not use this as an incident workaround without understanding which node has the correct cluster state.
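
Waiting for the cluster to settle after a reboot can be scripted by parsing pvecm status. A minimal sketch, assuming the usual "Name:   value" layout of that output; pvecm_field and cluster_settled are hypothetical names.

```shell
# pvecm_field NAME (hypothetical helper): read `pvecm status` output on stdin
# and print the value of the first line whose label is exactly NAME.
pvecm_field() {
    awk -v key="$1" -F': *' '
        $1 == key { sub(/[ \t]+$/, "", $2); print $2; exit }
    '
}

# cluster_settled (hypothetical helper): succeed when the status text on
# stdin shows a quorate cluster with 2 nodes and expected votes 2.
cluster_settled() {
    status=$(cat)
    [ "$(printf '%s\n' "$status" | pvecm_field Quorate)" = "Yes" ] &&
    [ "$(printf '%s\n' "$status" | pvecm_field Nodes)" = "2" ] &&
    [ "$(printf '%s\n' "$status" | pvecm_field 'Expected votes')" = "2" ]
}
```

On the remaining node, until pvecm status | cluster_settled; do sleep 10; done blocks until the rebooted node has rejoined and expected votes is back at 2. Only then move on to the next host.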

Post-host checks

After each host update or reboot:

hostname -f
pveversion
uname -r
pvecm status
systemctl --failed --no-pager
pvesm status
df -h -x tmpfs -x devtmpfs
qm list
apt list --upgradable

Check guests affected by the touched host:

qm status <vmid>
qm agent <vmid> ping
qm guest cmd <vmid> network-get-interfaces
qm config <vmid> | sed -n '/^ipconfig/p;/^net/p'
ping -c 2 <ip>
nc -vz -w3 <ip> 22

Interpretation:

  • qm status reporting running means the hypervisor sees the VM as running. It says nothing about the guest OS.
  • A successful qm agent <vmid> ping means the guest OS and the QEMU guest agent are responsive.
  • A successful ping means the basic network path works, provided ICMP is allowed.
  • A successful nc -vz -w3 <ip> 22 means something is listening on TCP port 22, which is SSH if SSH is expected for that guest.
  • Lack of ping or SSH does not always mean the guest is broken. Investigate only failures that contradict expectations for that guest.
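
To run the guest checks above against every running VM on the touched host, the VMIDs can be taken from qm list. A minimal sketch, assuming the default column order of qm list (VMID, NAME, STATUS, ...); running_vmids is a hypothetical name.

```shell
# running_vmids (hypothetical helper): read `qm list` output on stdin and
# print the VMID of every VM whose STATUS column is "running".
# Assumption: STATUS is the third whitespace-separated column.
running_vmids() {
    awk 'NR > 1 && $3 == "running" { print $1 }'
}
```

For example: qm list | running_vmids | while read -r id; do qm agent "$id" ping && echo "$id: agent ok"; done. Guests without the QEMU guest agent will fail this check without being broken.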

Monitoring / LibreNMS

As of 2026-06-01, host006 and host007 are monitored in LibreNMS as Proxmox hypervisors. This replaces the old Munin host monitoring for these nodes.

LibreNMS records:

  • host006.hackeriet.no: LibreNMS device ID 25.
  • host007.hackeriet.no: LibreNMS device ID 26.
  • Both devices have OS detected as proxmox.
  • Both devices have the LibreNMS proxmox application enabled with app instance klynge001.

Host-side setup:

  • snmpd is installed, enabled, and running on both hosts.
  • SNMP listens only on the management IP of each host:
    • host006: 10.10.50.26:161/udp
    • host007: 10.10.50.27:161/udp
  • SNMP uses SNMPv3 authPriv with the security name librenms_klynge001.
  • Credentials are stored in Hackerpass at infrastructure/librenms-klynge001-snmpv3. Do not put the credential values in the wiki or NetBox.
  • The Proxmox LibreNMS agent-local script is installed as /usr/local/libexec/librenms-proxmox.
  • SNMP exposes Proxmox VM traffic through:

extend proxmox /usr/bin/sudo /usr/local/libexec/librenms-proxmox

  • /etc/sudoers.d/librenms-proxmox allows the Debian-snmp user to run only that script via sudo.

Firewall setup:

  • The Proxmox cluster firewall allows SNMP only from app-01 / LibreNMS source address 10.10.50.51 to the klynge001 IP set.
  • The firewall rule is in /etc/pve/firewall/cluster.fw:

IN ACCEPT -source 10.10.50.51/32 -dest +klynge001 -p udp -dport 161 -log nolog # LibreNMS SNMP polling from app-01

LibreNMS setup:

  • enable_proxmox is set to true.
  • The unix-agent poller module is disabled per device for host006 and host007, because this setup uses SNMP extend instead of the LibreNMS unix-agent on port 6556.
  • No LibreNMS alert rules existed when this monitoring was added. Host metrics and Proxmox application polling are live, but alert policy still needs to be defined.

Useful verification commands:

# On each Proxmox host
systemctl is-active snmpd
ss -lunp | grep ':161'
sudo -u Debian-snmp sudo /usr/local/libexec/librenms-proxmox

# From the LibreNMS container on app-01
snmpwalk -v3 -l authPriv -u librenms_klynge001 -a SHA -A '<auth password>' -x AES -X '<privacy password>' host006.hackeriet.no SNMPv2-MIB::sysName.0
snmpget -v3 -l authPriv -u librenms_klynge001 -a SHA -A '<auth password>' -x AES -X '<privacy password>' -Oqv host006.hackeriet.no .1.3.6.1.4.1.8072.1.3.2.3.1.2.7.112.114.111.120.109.111.120
lnms device:poll -m applications host006.hackeriet.no
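
The long numeric OID in the snmpget command is the NET-SNMP-EXTEND-MIB nsExtendOutputFull column indexed by the extend name: the name length followed by the ASCII code of each character. A small sketch that derives it, so the OID can be rebuilt if the extend name changes; extend_oid is a hypothetical helper name.

```shell
# extend_oid NAME (hypothetical helper): print the nsExtendOutputFull OID for
# an snmpd "extend NAME ..." entry. The index is the length of NAME followed
# by the decimal ASCII code of each character.
extend_oid() {
    name=$1
    oid=".1.3.6.1.4.1.8072.1.3.2.3.1.2.${#name}"
    i=1
    while [ "$i" -le "${#name}" ]; do
        c=$(printf '%s' "$name" | cut -c "$i")
        oid="$oid.$(printf '%d' "'$c")"
        i=$((i + 1))
    done
    printf '%s\n' "$oid"
}
```

extend_oid proxmox reproduces the OID used in the snmpget command above. The sketch assumes a plain ASCII extend name.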

Follow-up items

  • Define LibreNMS alert rules and notification routing for klynge001 host monitoring.
  • Investigate the host007 GRUB/LVM pv1 warning.
  • Investigate the host006 EFI fallback warning.
  • Review whether the two-node quorum procedure should be part of the standard klynge001 maintenance runbook.
  • Keep NetBox VM placement aligned with the current reality for moved service VMs.
  • If Munin package cleanup is desired later, remove it in a separate low-risk cleanup task.

Other notes

/srv/hackeriet-wiki/dokuwiki/data/pages/infra/operations/proxmox-maintenance.txt · Last modified: by atluxity_idp.hackeriet.no