Proxmox maintenance
This page documents maintenance for the Hackeriet Proxmox hosts in klynge001. It is a runbook and maintenance log.
Current scope
Last maintenance: 2026-05-31
Scope: host006 and host007 in the klynge001 Proxmox cluster.
Actions performed:
- Removed obsolete local backup archives on host006 for moved/stopped service VMs.
- Updated the scheduled Proxmox backup job to exclude moved/stopped VMs:
  - 105/blade
  - 510/ingress
  - 511/app-01
- Upgraded host006 and host007 to Proxmox VE 8.4.19.
- Rebooted host006 and host007 one at a time to activate kernel 6.8.12-28-pve.
- Temporarily adjusted expected votes during single-node reboot windows so the remaining node stayed quorate.
- Verified storage, package state, cluster quorum, guest state, and basic guest reachability after reboots.
Final state after maintenance:
- host006: Proxmox VE 8.4.19, kernel 6.8.12-28-pve.
- host007: Proxmox VE 8.4.19, kernel 6.8.12-28-pve.
- Cluster: 2 nodes, expected votes 2, quorate.
- No pending package upgrades were listed on either node.
- host006 root filesystem usage was about 72% after cleanup and updates.
- host007 root filesystem usage was about 19% after updates.
Follow-up actions completed after the maintenance:
- Retired Munin on host006 and host007 by disabling and stopping munin-node.service.
- Cleared stale lock: migrate locks on moved/stopped VMs.
- Disabled autostart for moved/stopped VMs:
  - 105/blade
  - 510/ingress
  - 511/app-01
  - 601/idp1
- Verified that systemctl --failed was clear on both nodes after Munin retirement.
What we learned
- The cluster is currently operating as a two-node cluster. During a single-node reboot, the remaining node can temporarily lose quorum unless expected votes is lowered on it for the duration of the reboot.
- Expected votes returned to 2 after both nodes were back and joined.
- munin-node.service had been failing for months on both host006 and host007. It is now intentionally retired on these hosts.
- Moved/stopped service VMs may still carry stale Proxmox migration locks and onboot: 1 from before migration. Clear stale locks only after confirming there is no active migration task.
- host007 emitted GRUB/LVM warnings about a missing physical volume name pv1 during update-grub, but rebooted successfully on the new kernel. The active LVM metadata still contains an internal pv1 label.
- host006 emitted a GRUB warning that the removable EFI fallback path is not updated automatically. The explicit Proxmox EFI boot entry worked and the host rebooted successfully.
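The stale lock and autostart leftovers described above can be spotted from the text that qm config prints. The following is a minimal sketch, not part of the maintenance as performed: check_vm_leftovers is a hypothetical helper name, and it assumes the config contains lines of the form "lock: migrate" and "onboot: 1".

```shell
#!/bin/sh
# Sketch: flag migration leftovers in the text printed by `qm config <vmid>`.
# Reads the config on stdin and prints one line per finding.
# check_vm_leftovers is a hypothetical helper name, not a Proxmox tool.
check_vm_leftovers() {
    awk '
        /^lock:/         { print "stale-lock " $2 }
        /^onboot:[ ]*1$/ { print "autostart-enabled" }
    '
}

# Usage on a Proxmox host:
#   qm config <vmid> | check_vm_leftovers
# Only after confirming that no migration task is active:
#   qm unlock <vmid>
#   qm set <vmid> --onboot 0
```

The helper only reports. Unlocking and disabling autostart stay manual so that an active migration is never interrupted by accident.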
Maintenance procedure
Work one host at a time. Do not reboot both host006 and host007 at once.
Before making changes on either host:
    hostname -f
    pveversion -v
    uname -r
    pvecm status
    systemctl --failed --no-pager
    pvesm status
    df -h -x tmpfs -x devtmpfs
    qm list
    cat /etc/pve/jobs.cfg
    apt update
    apt list --upgradable
    test -f /var/run/reboot-required && cat /var/run/reboot-required || true
For host006, also check local storage pressure:
du -sh /var/lib/vz/dump /var/lib/vz/template/iso /var/log /var/cache/apt /root/proxmox-templates /var/lib/fail2ban
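To turn the storage check into a quick pass/fail signal, the df output can be filtered by usage percentage. This is a sketch under assumptions: over_threshold is a hypothetical helper, the default limit of 80 percent is illustrative rather than a Hackeriet policy, and mount points are assumed to contain no spaces.

```shell
#!/bin/sh
# Sketch: list filesystems at or above a usage limit from `df -P` output.
# over_threshold and its default limit of 80 are illustrative assumptions.
over_threshold() {
    limit="${1:-80}"
    awk -v limit="$limit" '
        NR > 1 {                      # skip the df header line
            use = $5
            sub(/%$/, "", use)        # "72%" -> "72"
            if (use + 0 >= limit) print $6 " " use "%"
        }
    '
}

# Usage on a Proxmox host:
#   df -P -x tmpfs -x devtmpfs | over_threshold 70
```

The -P flag keeps each filesystem on one line, which makes the column positions predictable.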
Update flow for each host:
- Confirm cluster health with pvecm status.
- Confirm storage health with pvesm status and df -h.
- Review running guests with qm list.
- Review failed units with systemctl --failed --no-pager.
- Simulate package changes if the update set is large or risky.
- Apply updates only after reviewing the package set.
- Reboot only if required or clearly useful, such as after a kernel update.
- After reboot, wait for the node to return and confirm cluster health before touching the next host.
Suggested commands after review:
    apt-get -s full-upgrade
    apt full-upgrade
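When the simulated package set is long, a short summary can help decide whether a reboot is likely. The sketch below assumes the usual apt-get -s output, where planned installs start with "Inst" and removals with "Remv". The name summarize_simulation is hypothetical, as is the assumption that kernel packages start with pve-kernel or proxmox-kernel.

```shell
#!/bin/sh
# Sketch: summarize `apt-get -s full-upgrade` output read from stdin.
# summarize_simulation is a hypothetical helper; the kernel package name
# prefixes (pve-kernel, proxmox-kernel) are an assumption.
summarize_simulation() {
    awk '
        $1 == "Inst" {
            inst++
            if ($2 ~ /^(pve|proxmox)-kernel/) kernel = 1
        }
        $1 == "Remv" { remv++ }
        END {
            printf "install=%d remove=%d kernel=%s\n",
                   inst, remv, (kernel ? "yes" : "no")
        }
    '
}

# Usage on a Proxmox host:
#   apt-get -s full-upgrade | summarize_simulation
```

A non-zero remove count or kernel=yes is a prompt to read the full simulation output before applying updates, not a replacement for reviewing it.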
Quorum during reboot
When one node is rebooted, the remaining node may temporarily lose quorum. If that happens during planned maintenance, set expected votes to 1 on the remaining node:
    pvecm expected 1
    pvecm status
After the rebooted node rejoins, confirm the cluster has returned to two nodes and expected votes 2:
pvecm status
Do not use this as an incident workaround without understanding which node has the correct cluster state.
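For repeated checks during a reboot window, the relevant fields can be pulled out of pvecm status. This is a sketch that assumes the output contains "Quorate:", "Expected votes:" and "Total votes:" lines. The name quorum_summary is hypothetical.

```shell
#!/bin/sh
# Sketch: extract quorum fields from `pvecm status` output read from stdin.
# quorum_summary is a hypothetical helper name, not a Proxmox tool.
quorum_summary() {
    awk -F: '
        /^Quorate:/        { gsub(/[ \t]/, "", $2); q = $2 }
        /^Expected votes:/ { gsub(/[ \t]/, "", $2); e = $2 }
        /^Total votes:/    { gsub(/[ \t]/, "", $2); t = $2 }
        END { printf "quorate=%s expected=%s total=%s\n", q, e, t }
    '
}

# Usage on a Proxmox host:
#   pvecm status | quorum_summary
# After both nodes have rejoined, the expected result for this cluster is
# quorate=Yes expected=2 total=2.
```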
Post-host checks
After each host update or reboot:
    hostname -f
    pveversion
    uname -r
    pvecm status
    systemctl --failed --no-pager
    pvesm status
    df -h -x tmpfs -x devtmpfs
    qm list
    apt list --upgradable
Check guests affected by the touched host:
    qm status <vmid>
    qm agent <vmid> ping
    qm guest cmd <vmid> network-get-interfaces
    qm config <vmid> | sed -n '/^ipconfig/p;/^net/p'
    ping -c 2 <ip>
    nc -vz -w3 <ip> 22
Interpretation:
- qm status running means the hypervisor sees the VM running.
- qm agent <vmid> ping means the guest OS and QEMU guest agent are responsive.
- ping means basic network path works, if ICMP is allowed.
- nc -vz -w3 <ip> 22 means SSH is listening, if SSH is expected for that guest.
- Lack of ping or SSH does not always mean the guest is broken. Investigate only failures that contradict expectations for that guest.
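The interpretation rules above can be written down as a small decision function. This is only a sketch: interpret_guest and its result labels are hypothetical, and it assumes every check is expected to pass for the guest in question, which does not hold for guests that block ICMP or do not run SSH.

```shell
#!/bin/sh
# Sketch: combine the four guest checks into one label.
# interpret_guest STATUS AGENT_RC PING_RC SSH_RC
#   STATUS   second field of `qm status <vmid>`, for example "running"
#   *_RC     exit codes of qm agent ping, ping, and nc
# The function name and the labels are hypothetical.
interpret_guest() {
    status="$1"; agent_rc="$2"; ping_rc="$3"; ssh_rc="$4"
    if [ "$status" != "running" ]; then
        echo "not-running"
    elif [ "$agent_rc" -eq 0 ] && [ "$ping_rc" -eq 0 ] && [ "$ssh_rc" -eq 0 ]; then
        echo "healthy"
    elif [ "$agent_rc" -eq 0 ]; then
        # Guest OS answers through the agent, so look at network or firewall.
        echo "guest-os-up-check-network"
    else
        echo "running-unverified"
    fi
}

# Usage on a Proxmox host:
#   status=$(qm status <vmid> | awk '{print $2}')
#   qm agent <vmid> ping;            agent_rc=$?
#   ping -c 2 <ip> >/dev/null;       ping_rc=$?
#   nc -vz -w3 <ip> 22;              ssh_rc=$?
#   interpret_guest "$status" "$agent_rc" "$ping_rc" "$ssh_rc"
```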
Monitoring / LibreNMS
As of 2026-06-01, host006 and host007 are monitored in LibreNMS as Proxmox hypervisors. This replaces the old Munin host monitoring for these nodes.
LibreNMS records:
- host006.hackeriet.no: LibreNMS device ID 25.
- host007.hackeriet.no: LibreNMS device ID 26.
- Both devices have OS detected as proxmox.
- Both devices have the LibreNMS proxmox application enabled with app instance klynge001.
Host-side setup:
- snmpd is installed, enabled, and running on both hosts.
- SNMP listens only on the management IP of each host:
  - host006: 10.10.50.26:161/udp
  - host007: 10.10.50.27:161/udp
- SNMP uses SNMPv3 authPriv with the security name librenms_klynge001.
- Credentials are stored in Hackerpass at infrastructure/librenms-klynge001-snmpv3. Do not put the credential values in the wiki or NetBox.
- The Proxmox LibreNMS agent-local script is installed as /usr/local/libexec/librenms-proxmox.
- SNMP exposes Proxmox VM traffic through:
    extend proxmox /usr/bin/sudo /usr/local/libexec/librenms-proxmox
- /etc/sudoers.d/librenms-proxmox allows the Debian-snmp user to run only that script via sudo.
Firewall setup:
- The Proxmox cluster firewall allows SNMP only from app-01 / LibreNMS source address 10.10.50.51 to the klynge001 IP set.
- The firewall rule is in /etc/pve/firewall/cluster.fw:
    IN ACCEPT -source 10.10.50.51/32 -dest +klynge001 -p udp -dport 161 -log nolog # LibreNMS SNMP polling from app-01
LibreNMS setup:
- enable_proxmox is set to true.
- The unix-agent poller module is disabled per device for host006 and host007, because this setup uses SNMP extend instead of the LibreNMS unix-agent on port 6556.
- No LibreNMS alert rules existed when this monitoring was added. Host metrics and Proxmox application polling are live, but alert policy still needs to be defined.
Useful verification commands:
    # On each Proxmox host
    systemctl is-active snmpd
    ss -lunp | grep ':161'
    sudo -u Debian-snmp sudo /usr/local/libexec/librenms-proxmox

    # From the LibreNMS container on app-01
    snmpwalk -v3 -l authPriv -u librenms_klynge001 -a SHA -A '<auth password>' -x AES -X '<privacy password>' host006.hackeriet.no SNMPv2-MIB::sysName.0
    snmpget -v3 -l authPriv -u librenms_klynge001 -a SHA -A '<auth password>' -x AES -X '<privacy password>' -Oqv host006.hackeriet.no .1.3.6.1.4.1.8072.1.3.2.3.1.2.7.112.114.111.120.109.111.120
    lnms device:poll -m applications host006.hackeriet.no
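The long numeric OID in the snmpget command is the Net-SNMP nsExtendOutputFull column followed by the extend name encoded as its length and the ASCII code of each character. For the name proxmox that gives 7, then 112.114.111.120.109.111.120. The sketch below rebuilds the OID from a name. The function name extend_oid is hypothetical, and it assumes an ASCII-only extend name.

```shell
#!/bin/sh
# Sketch: build the nsExtendOutputFull OID for an snmpd "extend" name.
# extend_oid is a hypothetical helper; it assumes an ASCII-only name.
extend_oid() {
    name="$1"
    # Base OID of nsExtendOutputFull, then the length of the name.
    oid=".1.3.6.1.4.1.8072.1.3.2.3.1.2.${#name}"
    rest="$name"
    while [ -n "$rest" ]; do
        remainder="${rest#?}"           # everything after the first character
        ch="${rest%"$remainder"}"       # the first character
        oid="$oid.$(printf '%d' "'$ch")" # append its ASCII code
        rest="$remainder"
    done
    printf '%s\n' "$oid"
}

# Usage:
#   extend_oid proxmox
```

For proxmox this reproduces the OID used in the snmpget command above, which is a quick way to confirm that a poll targets the intended extend entry.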
Follow-up items
- Define LibreNMS alert rules and notification routing for klynge001 host monitoring.
- Investigate the host007 GRUB/LVM pv1 warning.
- Investigate the host006 EFI fallback warning.
- Review whether the two-node quorum procedure should be part of the standard klynge001 maintenance runbook.
- Keep NetBox VM placement aligned with the current reality for moved service VMs.
- If Munin package cleanup is desired later, remove it in a separate low-risk cleanup task.
Other notes
- Host006 storage and backup context is documented at Proxmox backups.
- Certificate automation for internal Proxmox hostnames is documented at Proxmox ACME DNS automation.