====== Proxmox maintenance ======

This page documents maintenance for the Hackeriet Proxmox hosts in [[infra:clusters:klynge001|klynge001]]. It is a runbook and maintenance log.

===== Current scope =====

Hosts currently covered by this procedure:

  * [[infra:hosts:host006|host006]]
  * [[infra:hosts:host007|host007]]


===== Last maintenance: 2026-05-31 =====

Scope: host006 and host007 in the ''klynge001'' Proxmox cluster.

Actions performed:

  * Removed obsolete local backup archives on host006 for moved/stopped service VMs.
  * Updated the scheduled Proxmox backup job to exclude moved/stopped VMs:
    * ''105'' / ''blade''
    * ''510'' / ''ingress''
    * ''511'' / ''app-01''
  * Upgraded host006 and host007 to Proxmox VE 8.4.19.
  * Rebooted host006 and host007 one at a time to activate kernel 6.8.12-28-pve.
  * Temporarily adjusted expected votes during single-node reboot windows so the remaining node stayed quorate.
  * Verified storage, package state, cluster quorum, guest state, and basic guest reachability after reboots.

Final state after maintenance:

  * host006: Proxmox VE 8.4.19, kernel 6.8.12-28-pve.
  * host007: Proxmox VE 8.4.19, kernel 6.8.12-28-pve.
  * Cluster: 2 nodes, expected votes 2, quorate.
  * No pending package upgrades were listed on either node.
  * host006 root filesystem usage was about 72% after cleanup and updates.
  * host007 root filesystem usage was about 19% after updates.

Follow-up actions completed after the maintenance:

  * Retired Munin on host006 and host007 by disabling and stopping ''munin-node.service''.
  * Cleared stale ''lock: migrate'' locks on moved/stopped VMs.
  * Disabled autostart for moved/stopped VMs:
    * ''105'' / ''blade''
    * ''510'' / ''ingress''
    * ''511'' / ''app-01''
    * ''601'' / ''idp1''
  * Verified that ''systemctl --failed'' was clear on both nodes after Munin retirement.

===== What we learned =====

  * The cluster is currently operating as a two-node cluster. During a single-node reboot, the remaining node can temporarily lose quorum unless expected votes is lowered while the other node is down.
  * Expected votes returned to 2 after both nodes were back and joined.
  * ''munin-node.service'' had been failing for months on both host006 and host007. It is now intentionally retired on these hosts.
  * Moved/stopped service VMs may still carry stale Proxmox migration locks and ''onboot: 1'' from before migration. Clear stale locks only after confirming there is no active migration task.
  * host007 emitted GRUB/LVM warnings about a missing physical volume named ''pv1'' during ''update-grub'', but rebooted successfully on the new kernel. The active LVM metadata still contains an internal ''pv1'' label.
  * host006 emitted a GRUB warning that the removable EFI fallback path is not updated automatically. The explicit Proxmox EFI boot entry worked and the host rebooted successfully.
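
The stale lock and autostart settings noted above can be spotted from ''qm config'' output before changing anything. The following is a minimal sketch; ''vm_flags'' is a hypothetical helper name, not a Proxmox command, and it only reports findings.

<code bash>
#!/usr/bin/env bash
# Sketch: flag a leftover lock or autostart setting in `qm config` output.
# vm_flags is a hypothetical helper for illustration, not part of Proxmox.
vm_flags() {
    # Reads `qm config <vmid>` output on stdin, prints one line per finding.
    awk -F': *' '
        $1 == "lock"              { print "lock=" $2 }
        $1 == "onboot" && $2 == 1 { print "onboot=1" }
    '
}
</code>

Example use: ''qm config 105 | vm_flags''. This page does not record which commands were used to clear the settings; the usual ''qm'' subcommands would be ''qm unlock <vmid>'' and ''qm set <vmid> --onboot 0'', run only after confirming that no migration task is active.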

===== Maintenance procedure =====

Work one host at a time. Do not reboot both host006 and host007 at once.

Before making changes on either host:

<code>
hostname -f
pveversion -v
uname -r
pvecm status
systemctl --failed --no-pager
pvesm status
df -h -x tmpfs -x devtmpfs
qm list
cat /etc/pve/jobs.cfg
apt update
apt list --upgradable
test -f /var/run/reboot-required && cat /var/run/reboot-required || true
</code>

For host006, also check local storage pressure:

<code>
du -sh /var/lib/vz/dump /var/lib/vz/template/iso /var/log /var/cache/apt /root/proxmox-templates /var/lib/fail2ban
</code>
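
To turn the storage check into a pass/fail signal, a small threshold test on the root filesystem can help. This is a sketch only: ''fs_usage_pct'', ''check_fs'' and the 80% default limit are assumptions for illustration, not an agreed policy for host006.

<code bash>
#!/usr/bin/env bash
# Sketch: warn when a filesystem is at or above a usage threshold.
# Function names and the 80% default are illustrative assumptions.
fs_usage_pct() {
    # Prints the used percentage (integer) of the filesystem holding $1.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

check_fs() {
    local path=$1 limit=${2:-80} pct
    pct=$(fs_usage_pct "$path")
    if [ "$pct" -ge "$limit" ]; then
        echo "WARN $path at ${pct}% (limit ${limit}%)"
        return 1
    fi
    echo "OK $path at ${pct}%"
}
</code>

Example use: ''check_fs / 80'' before applying updates, and again after cleanup.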

Update flow for each host:

  - Confirm cluster health with ''pvecm status''.
  - Confirm storage health with ''pvesm status'' and ''df -h''.
  - Review running guests with ''qm list''.
  - Review failed units with ''systemctl --failed --no-pager''.
  - Simulate package changes if the update set is large or risky.
  - Apply updates only after reviewing the package set.
  - Reboot only if required or clearly useful, such as after a kernel update.
  - After reboot, wait for the node to return and confirm cluster health before touching the next host.

Suggested commands after review:

<code>
apt-get -s full-upgrade
apt full-upgrade
</code>
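
When the simulated run is long, a short summary makes the review step easier. The sketch below counts the ''Inst'' and ''Remv'' lines that ''apt-get -s'' prints and notes whether a kernel package is in the set. ''summarize_sim'' is a hypothetical helper, and the kernel match assumes package names starting with ''proxmox-kernel'' or ''pve-kernel''.

<code bash>
#!/usr/bin/env bash
# Sketch: summarize `apt-get -s full-upgrade` output read from stdin.
# summarize_sim is a hypothetical helper; the kernel name prefixes are an assumption.
summarize_sim() {
    awk '
        $1 == "Inst" { inst++; if ($2 ~ /^(proxmox-kernel|pve-kernel)/) kernel = 1 }
        $1 == "Remv" { remv++; print "REMOVE " $2 }
        END {
            printf "install/upgrade=%d remove=%d kernel=%s\n",
                   inst, remv, (kernel ? "yes" : "no")
        }
    '
}
</code>

Example use: ''apt-get -s full-upgrade | summarize_sim''. Any ''REMOVE'' line deserves a closer look before running the real upgrade, and ''kernel=yes'' suggests a reboot will be needed.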

===== Quorum during reboot =====

When one node is rebooted, the remaining node may temporarily lose quorum: with two nodes and expected votes 2, corosync by default needs both votes for a majority. If that happens during planned maintenance, set expected votes to 1 on the remaining node:

<code>
pvecm expected 1
pvecm status
</code>

After the rebooted node rejoins, confirm the cluster has returned to two nodes and expected votes 2:

<code>
pvecm status
</code>

Do not use this as an incident workaround without understanding which node has the correct cluster state.
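
The wait for the rebooted node can be scripted by polling ''pvecm status''. The sketch below is one possible approach. ''quorum_summary'' and ''wait_for_cluster'' are hypothetical helpers, and the parsing assumes the ''Nodes:'', ''Quorate:'' and ''Expected votes:'' lines that ''pvecm status'' normally prints.

<code bash>
#!/usr/bin/env bash
# Sketch: wait until the cluster is back to 2 nodes, expected votes 2, quorate.
# Helper names and the default retry values are illustrative assumptions.
quorum_summary() {
    # Reads `pvecm status` output on stdin and prints a one-line summary.
    awk -F': *' '
        { sub(/[ \t]+$/, "") }                  # drop trailing whitespace
        $1 == "Nodes"          { nodes = $2 }
        $1 == "Quorate"        { quorate = $2 }
        $1 == "Expected votes" { expected = $2 }
        END { printf "nodes=%s expected=%s quorate=%s\n", nodes, expected, quorate }
    '
}

wait_for_cluster() {
    # $1 = number of attempts, $2 = seconds between attempts.
    local tries=${1:-30} delay=${2:-10}
    while [ "$tries" -gt 0 ]; do
        if [ "$(pvecm status 2>/dev/null | quorum_summary)" = "nodes=2 expected=2 quorate=Yes" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}
</code>

Example use on the remaining node: ''wait_for_cluster 30 10 && echo "cluster healthy"''. A non-zero exit means the expected state was not reached in time and the next host should not be touched yet.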

===== Post-host checks =====

After each host update or reboot:

<code>
hostname -f
pveversion
uname -r
pvecm status
systemctl --failed --no-pager
pvesm status
df -h -x tmpfs -x devtmpfs
qm list
apt list --upgradable
</code>

Check guests affected by the touched host:

<code>
qm status <vmid>
qm agent <vmid> ping
qm guest cmd <vmid> network-get-interfaces
qm config <vmid> | sed -n '/^ipconfig/p;/^net/p'
ping -c 2 <ip>
nc -vz -w3 <ip> 22
</code>

Interpretation:

  * ''qm status <vmid>'' reporting ''status: running'' means the hypervisor sees the VM running.
  * ''qm agent <vmid> ping'' succeeding means the guest OS and QEMU guest agent are responsive.
  * A ''ping'' reply means the basic network path works, if ICMP is allowed.
  * ''nc -vz -w3 <ip> 22'' succeeding means SSH is listening, if SSH is expected for that guest.
  * Lack of ping or SSH does not always mean the guest is broken. Investigate only failures that contradict expectations for that guest.
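
The layered checks above can be combined into one helper per guest. This is a sketch under assumptions: ''check_guest'' is a hypothetical name, and only a stopped VM is treated as a failure, since a missing agent or ping reply may be normal for a given guest.

<code bash>
#!/usr/bin/env bash
# Sketch: layered guest check. Usage: check_guest <vmid> [ip]
# check_guest is a hypothetical helper; only "not running" returns non-zero.
check_guest() {
    local vmid=$1 ip=${2:-}
    if ! qm status "$vmid" | grep -q 'status: running'; then
        echo "$vmid: not running"
        return 1
    fi
    if qm agent "$vmid" ping >/dev/null 2>&1; then
        echo "$vmid: running, agent ok"
    else
        echo "$vmid: running, agent not responding"
    fi
    if [ -n "$ip" ]; then
        if ping -c 2 -W 2 "$ip" >/dev/null 2>&1; then
            echo "$vmid: ping ok"
        else
            echo "$vmid: no ping reply"
        fi
    fi
}
</code>

Example use: ''check_guest <vmid> <ip>'' for each guest on the touched host. The informational lines still need to be compared against what is expected for that guest.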

===== Monitoring / LibreNMS =====

As of 2026-06-01, host006 and host007 are monitored in LibreNMS as Proxmox hypervisors. This replaces the old Munin host monitoring for these nodes.

LibreNMS records:

  * host006.hackeriet.no: LibreNMS device ID 25.
  * host007.hackeriet.no: LibreNMS device ID 26.
  * Both devices have OS detected as ''proxmox''.
  * Both devices have the LibreNMS ''proxmox'' application enabled with app instance ''klynge001''.

Host-side setup:

  * ''snmpd'' is installed, enabled, and running on both hosts.
  * SNMP listens only on the management IP of each host:
    * host006: ''10.10.50.26:161/udp''
    * host007: ''10.10.50.27:161/udp''
  * SNMP uses SNMPv3 authPriv with the security name ''librenms_klynge001''.
  * Credentials are stored in Hackerpass at ''infrastructure/librenms-klynge001-snmpv3''. Do not put the credential values in the wiki or NetBox.
  * The Proxmox LibreNMS agent-local script is installed as ''/usr/local/libexec/librenms-proxmox''.
  * SNMP exposes Proxmox VM traffic through:

<code>
extend proxmox /usr/bin/sudo /usr/local/libexec/librenms-proxmox
</code>

  * ''/etc/sudoers.d/librenms-proxmox'' allows the ''Debian-snmp'' user to run only that script via sudo.

Firewall setup:

  * The Proxmox cluster firewall allows SNMP only from app-01 / LibreNMS source address ''10.10.50.51'' to the ''klynge001'' IP set.
  * The firewall rule is in ''/etc/pve/firewall/cluster.fw'':

<code>
IN ACCEPT -source 10.10.50.51/32 -dest +klynge001 -p udp -dport 161 -log nolog # LibreNMS SNMP polling from app-01
</code>

LibreNMS setup:

  * ''enable_proxmox'' is set to ''true''.
  * The ''unix-agent'' poller module is disabled per device for host006 and host007, because this setup uses SNMP extend instead of the LibreNMS unix-agent on port 6556.
  * No LibreNMS alert rules existed when this monitoring was added. Host metrics and Proxmox application polling are live, but alert policy still needs to be defined.

Useful verification commands:

<code>
# On each Proxmox host
systemctl is-active snmpd
ss -lunp | grep ':161'
sudo -u Debian-snmp sudo /usr/local/libexec/librenms-proxmox

# From the LibreNMS container on app-01
snmpwalk -v3 -l authPriv -u librenms_klynge001 -a SHA -A '<auth password>' -x AES -X '<privacy password>' host006.hackeriet.no SNMPv2-MIB::sysName.0
snmpget -v3 -l authPriv -u librenms_klynge001 -a SHA -A '<auth password>' -x AES -X '<privacy password>' -Oqv host006.hackeriet.no .1.3.6.1.4.1.8072.1.3.2.3.1.2.7.112.114.111.120.109.111.120
lnms device:poll -m applications host006.hackeriet.no
</code>
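
The long numeric OID in the ''snmpget'' example is ''nsExtendOutputFull'' from NET-SNMP-EXTEND-MIB, indexed by the extend name encoded as its length followed by the ASCII code of each character. A small sketch of that encoding follows; ''extend_oid'' is a hypothetical helper, useful if another ''extend'' entry is ever added.

<code bash>
#!/usr/bin/env bash
# Sketch: build the nsExtendOutputFull OID for an snmpd "extend" name.
# extend_oid is a hypothetical helper for illustration.
extend_oid() {
    local name=$1 i
    # Base OID of nsExtendOutputFull, then the name length as first index part.
    local oid=".1.3.6.1.4.1.8072.1.3.2.3.1.2.${#name}"
    for ((i = 0; i < ${#name}; i++)); do
        # printf with a leading quote yields the character's numeric code.
        oid+=".$(printf '%d' "'${name:i:1}")"
    done
    echo "$oid"
}
</code>

''extend_oid proxmox'' prints ''.1.3.6.1.4.1.8072.1.3.2.3.1.2.7.112.114.111.120.109.111.120'', the OID used in the verification command above.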

===== Follow-up items =====

  * Define LibreNMS alert rules and notification routing for klynge001 host monitoring.
  * Investigate the host007 GRUB/LVM ''pv1'' warning.
  * Investigate the host006 EFI fallback warning.
  * Review whether the two-node quorum procedure should be part of the standard klynge001 maintenance runbook.
  * Keep NetBox VM placement aligned with the current reality for moved service VMs.
  * If Munin package cleanup is desired later, remove it in a separate low-risk cleanup task.

===== Other notes =====

  * host006 storage and backup context is documented at [[infra:operations:proxmox-backups|Proxmox backups]].
  * Certificate automation for internal Proxmox hostnames is documented at [[infra:operations:proxmox-acme-dns|Proxmox ACME DNS automation]].