====== Proxmox maintenance ======

This page documents maintenance for the Hackeriet Proxmox hosts in [[infra:clusters:klynge001|klynge001]]. It is a runbook and maintenance log.

===== Current scope =====

Hosts currently covered by this procedure:

  * [[infra:hosts:host006|host006]]
  * [[infra:hosts:host007|host007]]

===== Last maintenance: 2026-05-31 =====

Scope: host006 and host007 in the ''klynge001'' Proxmox cluster.

Actions performed:

  * Removed obsolete local backup archives on host006 for moved/stopped service VMs.
  * Updated the scheduled Proxmox backup job to exclude moved/stopped VMs:
    * ''105'' / ''blade''
    * ''510'' / ''ingress''
    * ''511'' / ''app-01''
  * Upgraded host006 and host007 to Proxmox VE 8.4.19.
  * Rebooted host006 and host007 one at a time to activate kernel 6.8.12-28-pve.
  * Temporarily adjusted expected votes during single-node reboot windows so the remaining node stayed quorate.
  * Verified storage, package state, cluster quorum, guest state, and basic guest reachability after reboots.

Final state after maintenance:

  * host006: Proxmox VE 8.4.19, kernel 6.8.12-28-pve.
  * host007: Proxmox VE 8.4.19, kernel 6.8.12-28-pve.
  * Cluster: 2 nodes, expected votes 2, quorate.
  * No pending package upgrades were listed on either node.
  * host006 root filesystem usage was about 72% after cleanup and updates.
  * host007 root filesystem usage was about 19% after updates.

Follow-up actions completed after the maintenance:

  * Retired Munin on host006 and host007 by disabling and stopping ''munin-node.service''.
  * Cleared stale ''lock: migrate'' locks on moved/stopped VMs.
  * Disabled autostart for moved/stopped VMs:
    * ''105'' / ''blade''
    * ''510'' / ''ingress''
    * ''511'' / ''app-01''
    * ''601'' / ''idp1''
  * Verified that ''systemctl --failed'' was clear on both nodes after Munin retirement.

===== What we learned =====

  * The cluster is currently operating as a two-node cluster.
  * During a single-node reboot, the remaining node can temporarily lose quorum unless expected votes is adjusted while the other node is rebooting.
  * Expected votes returned to 2 after both nodes were back and joined.
  * ''munin-node.service'' had been failing for months on both host006 and host007. It is now intentionally retired on these hosts.
  * Moved/stopped service VMs may still carry stale Proxmox migration locks and ''onboot: 1'' from before migration. Clear stale locks only after confirming there is no active migration task.
  * host007 emitted GRUB/LVM warnings about a missing physical volume name ''pv1'' during ''update-grub'', but rebooted successfully on the new kernel. The active LVM metadata still contains an internal ''pv1'' label.
  * host006 emitted a GRUB warning that the removable EFI fallback path is not updated automatically. The explicit Proxmox EFI boot entry worked and the host rebooted successfully.

===== Maintenance procedure =====

Work one host at a time. Do not reboot both host006 and host007 at once.

Before making changes on either host:

  hostname -f
  pveversion -v
  uname -r
  pvecm status
  systemctl --failed --no-pager
  pvesm status
  df -h -x tmpfs -x devtmpfs
  qm list
  cat /etc/pve/jobs.cfg
  apt update
  apt list --upgradable
  test -f /var/run/reboot-required && cat /var/run/reboot-required || true

For host006, also check local storage pressure:

  du -sh /var/lib/vz/dump /var/lib/vz/template/iso /var/log /var/cache/apt /root/proxmox-templates /var/lib/fail2ban

Update flow for each host:

  - Confirm cluster health with ''pvecm status''.
  - Confirm storage health with ''pvesm status'' and ''df -h''.
  - Review running guests with ''qm list''.
  - Review failed units with ''systemctl --failed --no-pager''.
  - Simulate package changes if the update set is large or risky.
  - Apply updates only after reviewing the package set.
  - Reboot only if required or clearly useful, such as after a kernel update.
  - After reboot, wait for the node to return and confirm cluster health before touching the next host.

Suggested commands after review:

  apt-get -s full-upgrade
  apt full-upgrade

===== Quorum during reboot =====

When one node is rebooted, the remaining node may temporarily lose quorum. If that happens during planned maintenance, set expected votes to 1 on the remaining node:

  pvecm expected 1
  pvecm status

After the rebooted node rejoins, confirm the cluster has returned to two nodes and expected votes 2:

  pvecm status

Do not use this as an incident workaround without understanding which node has the correct cluster state.

===== Post-host checks =====

After each host update or reboot:

  hostname -f
  pveversion
  uname -r
  pvecm status
  systemctl --failed --no-pager
  pvesm status
  df -h -x tmpfs -x devtmpfs
  qm list
  apt list --upgradable

Check guests affected by the touched host, substituting the guest's VM ID and IP address:

  qm status <vmid>
  qm agent <vmid> ping
  qm guest cmd <vmid> network-get-interfaces
  qm config <vmid> | sed -n '/^ipconfig/p;/^net/p'
  ping -c 2 <guest-ip>
  nc -vz -w3 <guest-ip> 22

Interpretation:

  * ''qm status <vmid>'' reporting ''running'' means the hypervisor sees the VM running.
  * A successful ''qm agent <vmid> ping'' means the guest OS and QEMU guest agent are responsive.
  * A successful ''ping'' means the basic network path works, if ICMP is allowed.
  * A successful ''nc -vz -w3 <guest-ip> 22'' means SSH is listening, if SSH is expected for that guest.
  * Lack of ping or SSH does not always mean the guest is broken. Investigate only failures that contradict expectations for that guest.

===== Monitoring / LibreNMS =====

As of 2026-06-01, host006 and host007 are monitored in LibreNMS as Proxmox hypervisors. This replaces the old Munin host monitoring for these nodes.

LibreNMS records:

  * host006.hackeriet.no: LibreNMS device ID 25.
  * host007.hackeriet.no: LibreNMS device ID 26.
  * Both devices have OS detected as ''proxmox''.
  * Both devices have the LibreNMS ''proxmox'' application enabled with app instance ''klynge001''.

Host-side setup:

  * ''snmpd'' is installed, enabled, and running on both hosts.
  * SNMP listens only on the management IP of each host:
    * host006: ''10.10.50.26:161/udp''
    * host007: ''10.10.50.27:161/udp''
  * SNMP uses SNMPv3 authPriv with the security name ''librenms_klynge001''.
  * Credentials are stored in Hackerpass at ''infrastructure/librenms-klynge001-snmpv3''. Do not put the credential values in the wiki or NetBox.
  * The Proxmox LibreNMS agent-local script is installed as ''/usr/local/libexec/librenms-proxmox''.
  * SNMP exposes Proxmox VM traffic through this ''extend'' directive:

  extend proxmox /usr/bin/sudo /usr/local/libexec/librenms-proxmox

  * ''/etc/sudoers.d/librenms-proxmox'' allows the ''Debian-snmp'' user to run only that script via sudo.

Firewall setup:

  * The Proxmox cluster firewall allows SNMP only from app-01 / LibreNMS source address ''10.10.50.51'' to the ''klynge001'' IP set.
  * The firewall rule is in ''/etc/pve/firewall/cluster.fw'':

  IN ACCEPT -source 10.10.50.51/32 -dest +klynge001 -p udp -dport 161 -log nolog # LibreNMS SNMP polling from app-01

LibreNMS setup:

  * ''enable_proxmox'' is set to ''true''.
  * The ''unix-agent'' poller module is disabled per device for host006 and host007, because this setup uses SNMP extend instead of the LibreNMS unix-agent on port 6556.
  * No LibreNMS alert rules existed when this monitoring was added. Host metrics and Proxmox application polling are live, but alert policy still needs to be defined.
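
The ''extend proxmox'' entry is reachable over SNMP under the ''nsExtendOutputFull'' column of NET-SNMP-EXTEND-MIB, indexed by the extend name encoded as its length followed by the ASCII code of each character. A minimal sketch of that encoding follows. The helper name ''extend_oid'' is hypothetical and is not part of the host setup.

<code bash>
# Build the numeric nsExtendOutputFull OID for a net-snmp "extend" name.
# The table index is the name length followed by each character's ASCII code.
# extend_oid is an illustrative helper, not something installed on the hosts.
extend_oid() {
    name=$1
    # Base OID of NET-SNMP-EXTEND-MIB::nsExtendOutputFull, then the name length
    oid=".1.3.6.1.4.1.8072.1.3.2.3.1.2.${#name}"
    rest=$name
    while [ -n "$rest" ]; do
        char=${rest%"${rest#?}"}   # first character of the remaining string
        rest=${rest#?}             # drop that character
        # printf with a leading single quote yields the character's numeric code
        oid="$oid.$(printf '%d' "'$char")"
    done
    printf '%s\n' "$oid"
}

extend_oid proxmox
</code>

For the name ''proxmox'' this prints ''.1.3.6.1.4.1.8072.1.3.2.3.1.2.7.112.114.111.120.109.111.120'', which matches the numeric OID queried with ''snmpget'' in the verification commands.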

Useful verification commands:

  # On each Proxmox host
  systemctl is-active snmpd
  ss -lunp | grep ':161'
  sudo -u Debian-snmp sudo /usr/local/libexec/librenms-proxmox

  # From the LibreNMS container on app-01
  # Take the passphrases from Hackerpass (infrastructure/librenms-klynge001-snmpv3)
  snmpwalk -v3 -l authPriv -u librenms_klynge001 -a SHA -A '<auth-passphrase>' -x AES -X '<priv-passphrase>' host006.hackeriet.no SNMPv2-MIB::sysName.0
  snmpget -v3 -l authPriv -u librenms_klynge001 -a SHA -A '<auth-passphrase>' -x AES -X '<priv-passphrase>' -Oqv host006.hackeriet.no .1.3.6.1.4.1.8072.1.3.2.3.1.2.7.112.114.111.120.109.111.120
  lnms device:poll -m applications host006.hackeriet.no

===== Follow-up items =====

  * Define LibreNMS alert rules and notification routing for klynge001 host monitoring.
  * Investigate the host007 GRUB/LVM ''pv1'' warning.
  * Investigate the host006 EFI fallback warning.
  * Review whether the two-node quorum procedure should be part of the standard klynge001 maintenance runbook.
  * Keep NetBox VM placement aligned with the current reality for moved service VMs.
  * If Munin package cleanup is desired later, remove it in a separate low-risk cleanup task.

===== Other notes =====

  * Host006 storage and backup context is documented at [[infra:operations:proxmox-backups|Proxmox backups]].
  * Certificate automation for internal Proxmox hostnames is documented at [[infra:operations:proxmox-acme-dns|Proxmox ACME DNS automation]].
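
For the quorum checks used throughout the maintenance procedure, a small helper can turn ''pvecm status'' output into a pass/fail result. This is a sketch only: the function name ''check_quorum'' is hypothetical, and it assumes the output contains ''Quorate:'' and ''Expected votes:'' lines, which should be confirmed against the actual output on host006 and host007 before relying on it.

<code bash>
# Read `pvecm status` output on stdin and succeed only if the cluster is
# quorate and "Expected votes" equals the wanted value (default 2).
# check_quorum is an illustrative helper, not an existing Proxmox command.
check_quorum() {
    want=${1:-2}
    awk -v want="$want" '
        /^Quorate:/        { quorate = $2 }    # assumed line: "Quorate: Yes"
        /^Expected votes:/ { expected = $3 }   # assumed line: "Expected votes: 2"
        END {
            printf "quorate=%s expected=%s\n", quorate, expected
            # Exit 0 only when both conditions hold, 1 otherwise
            exit !(quorate == "Yes" && expected == want)
        }'
}

# Intended use on a cluster node (not executed here):
#   pvecm status | check_quorum 2
</code>

A non-zero exit status means the cluster is not quorate or expected votes differs from the wanted value. During a single-node reboot window with expected votes lowered, the wanted value would be 1 instead of 2.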