
Proxmox maintenance

This is a runbook for planned maintenance on the Hackeriet Proxmox hosts in klynge001. It is documentation and procedure, not inventory. Use NetBox for canonical device, IP, cabling, and VM placement data.

Current scope

Planned hosts:

  • host006
  • host007

Current goals:

  • Bring host006 and host007 up to date.
  • Review failed services and storage health.
  • Keep the cluster healthy while working one host at a time.
  • Avoid guest-level changes unless needed for recovery.

Announcement draft

Planned Proxmox maintenance for Hackeriet

I plan to do maintenance on the Proxmox hosts host006 and host007 in the klynge001 cluster within the next few days.

Scope:

  • OS and Proxmox package updates
  • storage and backup health checks
  • failed service review
  • possible host reboots if required

Expected impact:

  • Some VMs and services may be briefly unavailable.
  • I will avoid guest-level changes unless needed for recovery.
  • I will work on one host at a time and check cluster health between steps.

DNS and service risk

Live DNS checks on 2026-05-23 showed that hackeriet.no has two authoritative nameservers:

  • ns0.hackeriet.no - Hackeriet hosted, resolves to blade at 185.35.202.202 and 2a02:ed06::202
  • ns.hyp.net - external nameserver, resolves to 194.63.248.53 and 2a01:5b40:0:248::53

Both authoritative nameservers served the same SOA serial when checked. DNS resolution should survive a short outage of ns0 because ns.hyp.net is external and in sync. This covers name resolution only: the services behind those names still run on a single VM each, so do not treat it as service redundancy.

Important service dependencies observed:

  • hackeriet.no and www.hackeriet.no point to blade.
  • wiki.hackeriet.no points to blade.
  • hackeriet.no MX points to blade.
  • ns0.hackeriet.no points to blade.
  • ip.hackeriet.no and nms.hackeriet.no point through ingress.
  • blade is currently documented as a VM on host007.
  • ingress is VM 510 on host006.
  • idp1 was observed on host007.

Maintenance implications:

  • Rebooting host007 can affect blade, public web, wiki, mail target, ns0, and likely IDP.
  • Rebooting host006 can affect ingress-routed services such as NetBox and LibreNMS.
  • Avoid DNS zone edits during the maintenance window.
  • Keep this runbook available locally before starting, because wiki and NetBox may be affected.
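
As a quick local reminder, the implications above can be kept as a small lookup. This is only a sketch: the function name affected_by is made up for illustration, and the mapping is copied from the observations in this section, so it goes stale as soon as a VM moves. NetBox remains the canonical source for placement.

```shell
#!/bin/sh
# affected_by HOST: print what a reboot of HOST is expected to affect.
# Hypothetical helper; the mapping mirrors the notes above, not NetBox.
affected_by() {
    case "$1" in
        host006) echo "ingress (VM 510): ip, nms, NetBox, LibreNMS" ;;
        host007) echo "blade: web, wiki, mail (MX), ns0; likely idp1" ;;
        *) echo "unknown host: $1" >&2; return 1 ;;
    esac
}
```

Calling affected_by host007 before a reboot prints the list to paste into the announcement or the local notes.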

Pre-maintenance checks

Run on both host006 and host007 before making changes:

  • hostname -f
  • pveversion -v
  • pvecm status
  • systemctl --failed --no-pager
  • pvesm status
  • df -h
  • qm list
  • cat /etc/pve/jobs.cfg
  • apt update
  • apt list --upgradable
  • test -f /var/run/reboot-required && cat /var/run/reboot-required || true
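
Because the wiki may be unreachable during the window, it can help to capture the output of these checks in a local file. The following is a minimal sketch, not an established Hackeriet script: the function name run_checks and the log file naming are assumptions.

```shell
#!/bin/sh
# run_checks LOGFILE CMD...
# Run each command string, append its output to LOGFILE, and keep going
# when a command fails so one broken check does not hide the rest.
run_checks() {
    log=$1
    shift
    for cmd in "$@"; do
        printf '### %s\n' "$cmd" >>"$log"
        sh -c "$cmd" >>"$log" 2>&1 || printf '(exit status %s)\n' "$?" >>"$log"
    done
}

# Example call on a Proxmox host (not executed here; file name is an assumption):
#   run_checks "$HOME/pre-$(hostname -s)-$(date +%Y%m%d-%H%M).log" \
#       'hostname -f' 'pveversion -v' 'pvecm status' \
#       'systemctl --failed --no-pager' 'pvesm status' 'df -h' 'qm list'
```

Running the same call again after the update gives a second file that can be compared against the first with diff.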

On host006, also check local storage pressure:

  • du -sh /var/lib/vz/dump /var/lib/vz/template/iso /var/log /var/cache/apt /root/proxmox-templates /var/lib/fail2ban
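
To turn the storage check into a single number, the use percentage of the root filesystem can be read from df -P, which has a fixed column layout. A small sketch follows; the function name usage_pct and the 90 percent threshold in the usage comment are assumptions, not agreed limits.

```shell
#!/bin/sh
# usage_pct [MOUNTPOINT]: read `df -P` output on stdin and print the use
# percentage (without the % sign) of the filesystem mounted at MOUNTPOINT.
# Defaults to /. Mount points containing spaces are not handled.
usage_pct() {
    awk -v mp="${1:-/}" 'NR > 1 && $6 == mp { sub(/%/, "", $5); print $5 }'
}

# Example on the host (not executed here; threshold is an assumption):
#   pct=$(df -P / | usage_pct /)
#   [ "$pct" -ge 90 ] && echo "root filesystem at ${pct}%, clean up before upgrading"
```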

Before rebooting anything, check DNS redundancy:

  • dig @ns0.hackeriet.no SOA hackeriet.no
  • dig @ns.hyp.net SOA hackeriet.no

The SOA serial should match.
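
The comparison can be scripted so that a mismatch is hard to overlook. This sketch relies on the serial being the third field of a dig +short SOA answer; the function names soa_serial and serials_match are made up for illustration.

```shell
#!/bin/sh
# soa_serial: read `dig +short ... SOA` output on stdin and print the serial,
# which is the third field (mname, rname, serial, refresh, retry, expire, minimum).
soa_serial() {
    awk 'NF >= 7 { print $3; exit }'
}

# serials_match A B: succeed only when both serials are non-empty and equal,
# so an unreachable nameserver counts as a mismatch.
serials_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example on a machine with dig (not executed here):
#   a=$(dig +short @ns0.hackeriet.no hackeriet.no SOA | soa_serial)
#   b=$(dig +short @ns.hyp.net hackeriet.no SOA | soa_serial)
#   serials_match "$a" "$b" || echo "SOA serial mismatch: '$a' vs '$b'"
```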

Maintenance procedure

Work one host at a time. Do not reboot both host006 and host007 at once.

Suggested order:

  • Start with host006 if the main concern is storage and backup health.
  • Start with host007 if the ingress services hosted on host006 must stay stable during the first round.

For each host:

  1. Confirm cluster state with pvecm status.
  2. Confirm storage state with pvesm status and df -h.
  3. Review failed units with systemctl --failed --no-pager.
  4. Run apt update.
  5. Review apt list --upgradable.
  6. Apply updates only after reviewing the package set.
  7. Reboot only if required (for example after a kernel update, or when /var/run/reboot-required exists) or clearly useful.
  8. After reboot, wait for the node to return and confirm cluster health before touching the next host.
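
Step 8 can be sketched as a small retry loop that polls for quorum. The sketch assumes that pvecm status prints a line beginning with "Quorate:" followed by Yes or No; the function names and the retry counts in the usage comment are hypothetical. Quorum alone does not show that every node has rejoined, so still read the full pvecm status output before moving on.

```shell
#!/bin/sh
# is_quorate: read `pvecm status` output on stdin; succeed if it reports quorum.
# Assumes a line of the form "Quorate:          Yes".
is_quorate() {
    grep -Eq '^Quorate:[[:space:]]+Yes'
}

# wait_until TRIES DELAY CMD...: retry CMD until it succeeds or TRIES run out,
# sleeping DELAY seconds between attempts.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}

# Example on a cluster node (not executed here; 30 tries x 10 s is an assumption):
#   cluster_quorate() { pvecm status | is_quorate; }
#   wait_until 30 10 cluster_quorate || echo "no quorum after 5 minutes, stop here"
```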

Suggested update commands, after review:

  • apt full-upgrade
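
When reviewing the package set, it helps to know how many packages are pending and whether a kernel is among them, since a kernel update only takes effect after a reboot. The sketch below parses apt list --upgradable output. The function names are made up, and the kernel package prefixes (proxmox-kernel, pve-kernel, linux-image) are an assumption that should be checked against what the hosts actually have installed.

```shell
#!/bin/sh
# count_upgradable: read `apt list --upgradable` output on stdin and print
# the number of package lines (0 when nothing is pending).
count_upgradable() {
    grep -c 'upgradable from:' || true
}

# kernel_updates: print the names of pending packages that look like kernels.
# The prefixes are an assumption; adjust them to the installed kernel series.
kernel_updates() {
    awk -F/ '/upgradable from:/ && $1 ~ /^(proxmox-kernel|pve-kernel|linux-image)/ { print $1 }'
}

# Example on the host (not executed here):
#   apt list --upgradable 2>/dev/null | count_upgradable
#   apt list --upgradable 2>/dev/null | kernel_updates
```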

Do not change guest VM configuration as part of host maintenance unless needed for recovery.

Post-host checks

After each host update or reboot:

  • hostname -f
  • pveversion -v
  • pvecm status
  • systemctl --failed --no-pager
  • pvesm status
  • df -h
  • qm list

Check DNS and key service names:

  • dig @ns0.hackeriet.no SOA hackeriet.no
  • dig @ns.hyp.net SOA hackeriet.no
  • dig hackeriet.no A
  • dig wiki.hackeriet.no A
  • dig idp.hackeriet.no A
  • dig ip.hackeriet.no A
  • dig nms.hackeriet.no A

Check actual services, not only DNS, when the relevant host has been touched.
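
One way to check a web service rather than its DNS record is to look at the HTTP status code it returns. This is a minimal sketch using curl. The helper names are hypothetical, and the https URLs in the usage comment are assumptions about how the services are exposed. Mail, DNS on ns0, and IDP logins need their own checks.

```shell
#!/bin/sh
# http_ok CODE: succeed for 2xx and 3xx status codes, fail for anything else,
# including 000, which curl reports when no HTTP response arrived.
http_ok() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# check_url URL: request URL with a 10 second limit and report the status code.
check_url() {
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1") || true
    if http_ok "$code"; then
        echo "ok   $code $1"
    else
        echo "FAIL $code $1"
        return 1
    fi
}

# Example after touching host007 (not executed here; URLs are assumptions):
#   check_url https://hackeriet.no/
#   check_url https://wiki.hackeriet.no/
```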

Host006 notes

host006 has about 1 TB physical storage, but Proxmox local storage is on the root filesystem. The root filesystem was previously close to full, and local backups under /var/lib/vz/dump were the main pressure point.

Known cleanup and remediation context is documented on the Proxmox backups page.

During maintenance, avoid casual LVM reshaping. It can put VM disks at risk and should only be done with a maintenance window and recovery plan.

Safety notes

  • Do not touch guest VMs unless required for recovery.
  • Do not change DNS during the maintenance window unless DNS itself is the incident.
  • Do not delete backups or ISOs without understanding what they are.
  • Keep notes locally while working; wiki, NetBox, IDP, and public services may be affected depending on which host is down.