This is an old revision of the document!


Proxmox backups

This page documents the Proxmox backup context discovered while investigating host006. It is an operational orientation page, not a complete backup policy.

Scope

This page is about Proxmox guest backups on the Hackeriet Proxmox cluster. Do not confuse this with unrelated service-specific backups such as forum.hausmania.org backups.

Current cluster context

The Proxmox cluster documented so far is klynge001. The nodes covered on this page are host006 and host007.

A weekly vzdump job was observed in the cluster configuration while documenting host006 and host007. The observed job used:

  • Schedule: Sunday 03:00
  • Mode: snapshot
  • Storage: local
  • Compression: zstd
  • Retention: keep last 1
  • Failure mail: backupmail@hackeriet.no

Treat this as observed state, not as a reviewed backup policy.
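
To compare the observed settings with what a host actually has configured, the job definition can be read programmatically. The following is a minimal Python sketch, assuming the job is stored in /etc/pve/jobs.cfg in the usual Proxmox section layout (a "type: id" header followed by indented "key value" lines). The sample stanza, the job ID backup-example, and the key names are illustrative reconstructions of the values listed above, not a copy of the real file.

```python
def parse_jobs_cfg(text):
    """Parse 'type: id' stanzas followed by indented 'key value' lines."""
    jobs = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        if not raw[0].isspace():
            # Header line, for example "vzdump: backup-example"
            job_type, _, job_id = raw.partition(":")
            current = {"type": job_type.strip()}
            jobs[job_id.strip()] = current
        elif current is not None:
            # Indented property line: first word is the key, the rest is the value
            key, _, value = raw.strip().partition(" ")
            current[key] = value.strip()
    return jobs


# Hypothetical stanza built from the observed values; key names are assumptions.
SAMPLE = """\
vzdump: backup-example
    schedule sun 03:00
    mode snapshot
    storage local
    compress zstd
    prune-backups keep-last=1
    mailto backupmail@hackeriet.no
"""

job = parse_jobs_cfg(SAMPLE)["backup-example"]
print(job["schedule"], job["storage"], job["prune-backups"])
```

On a real node, the same function could be fed the contents of the jobs file instead of SAMPLE. Check the key names against the real file before relying on them.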

Host006 storage finding

host006 has enough physical storage, but the layout makes local backups fragile:

  • Physical disk observed: about 1 TB NVMe.
  • LVM volume group observed: about 953G, with 0G free.
  • Root filesystem observed: about 94G usable, about 94% used, about 6.2G free.
  • Proxmox local storage is on the root filesystem.
  • /var/lib/vz/dump is not a separate large filesystem on host006; it resolves to the crowded root filesystem.
  • local-lvm is the large thin pool for VM disks, not file-based backup dumps.
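
Whether the dump directory is a separate filesystem or part of the root filesystem can be checked by comparing device IDs. The following is a minimal Python sketch using only the standard library; the function names are hypothetical, and the percentage it reports can differ slightly from df because df accounts for reserved blocks differently.

```python
import os
import shutil


def same_filesystem(path_a, path_b):
    """True when both paths are on the same mounted filesystem (same device ID)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def describe_usage(path):
    """Return total and free space in GiB and the used percentage for path."""
    usage = shutil.disk_usage(path)
    used_pct = 100 * usage.used / usage.total if usage.total else 0.0
    return {
        "total_gib": usage.total / 2**30,
        "free_gib": usage.free / 2**30,
        "used_pct": used_pct,
    }


if __name__ == "__main__":
    dump_dir = "/var/lib/vz/dump"
    if os.path.isdir(dump_dir):
        shared = same_filesystem("/", dump_dir)
        print(f"{dump_dir} shares the root filesystem: {shared}")
        print(describe_usage(dump_dir))
    else:
        print(f"{dump_dir} not found; is this a Proxmox host?")
```

Going by the findings above, the first line should report True on host006 and False on host007.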

Several recent vzdump failures on host006 had errors like:

  • vma_queue_write: write error - Broken pipe

Disk pressure on host006 local storage is the first suspect for these failures. Investigate storage before changing guests.
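
To count how often this error appears in a saved task log, the text can be scanned for the error string. The following is a minimal Python sketch; the function name is hypothetical, and the sample text contains one placeholder line plus the error string quoted above, not a real log excerpt.

```python
import re

# Matches the error quoted on this page and captures the reason after the dash.
WRITE_ERROR = re.compile(r"vma_queue_write: write error - (?P<reason>.+)")


def find_write_errors(log_text):
    """Return (line_number, reason) for each vma_queue_write write error."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        match = WRITE_ERROR.search(line)
        if match:
            hits.append((number, match.group("reason").strip()))
    return hits


sample = "\n".join([
    "placeholder line standing in for earlier task output",
    "vma_queue_write: write error - Broken pipe",
])
print(find_write_errors(sample))
```

A match only shows that the write side of the backup pipeline failed. It does not prove the cause, so it should be read together with the free-space figures for the same time window.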

Contrast: host007

host007 has the same cluster backup job, but a healthier storage layout:

  • Root filesystem observed: about 18% used.
  • /var/lib/vz/dump observed as a separate filesystem, about 503G total and about 331G free.
  • Proxmox local storage observed around 17% used.
  • local-lvm observed around 27% used.

This makes the disk-pressure failure mode seen on host006 much less likely on host007.

Temporary diagnostics status

Temporary additional logging was installed on host006 on 2026-05-23 to capture host-level storage and Proxmox context around backup failures. After the storage layout issue was confirmed, the temporary services, timers, script, and logrotate file were removed the same day.

Removed files:

  • /usr/local/sbin/hackeriet-vzdump-watch
  • /etc/systemd/system/hackeriet-vzdump-watch.service
  • /etc/systemd/system/hackeriet-vzdump-watch.timer
  • /etc/systemd/system/hackeriet-vzdump-prebackup.service
  • /etc/systemd/system/hackeriet-vzdump-prebackup.timer
  • /etc/logrotate.d/hackeriet-vzdump-watch

The collected log was left in place for reference:

  • /var/log/hackeriet/vzdump-watch.log
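
To confirm that the cleanup is complete, the paths above can be checked for leftovers. The following is a minimal Python sketch; the function name is hypothetical, and the script only reports and never deletes.

```python
import os

# Paths listed above as removed on host006.
REMOVED = [
    "/usr/local/sbin/hackeriet-vzdump-watch",
    "/etc/systemd/system/hackeriet-vzdump-watch.service",
    "/etc/systemd/system/hackeriet-vzdump-watch.timer",
    "/etc/systemd/system/hackeriet-vzdump-prebackup.service",
    "/etc/systemd/system/hackeriet-vzdump-prebackup.timer",
    "/etc/logrotate.d/hackeriet-vzdump-watch",
]
# Log file that was intentionally left in place.
KEPT_LOG = "/var/log/hackeriet/vzdump-watch.log"


def still_present(paths):
    """Return the paths that still exist, including dangling symlinks."""
    return [p for p in paths if os.path.lexists(p)]


if __name__ == "__main__":
    leftovers = still_present(REMOVED)
    print("leftover diagnostics files:", leftovers or "none")
    print("reference log present:", os.path.exists(KEPT_LOG))
```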

Remediation options

Possible ways to reduce recurrence risk:

  • Short term: review and remove obsolete files from /var/lib/vz/dump on host006.
  • Better medium term: add dedicated backup storage for host006, either mounted at /var/lib/vz/dump or added as a new Proxmox storage target.
  • Longer term: use Proxmox Backup Server for clearer retention and deduplicated backups.
  • Avoid casual in-place LVM reshaping; it can put VM disks at risk and should only be done with a maintenance window and recovery plan.
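
For the short-term option, a read-only listing of the dump directory, largest files first, gives a starting point for the review. The following is a minimal Python sketch; the function name is hypothetical, and nothing is deleted or modified.

```python
from datetime import datetime, timezone
from pathlib import Path


def list_dump_files(dump_dir):
    """Return (size_bytes, mtime_utc, name) for regular files, largest first."""
    entries = []
    for path in Path(dump_dir).iterdir():
        if path.is_file():
            st = path.stat()
            mtime = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
            entries.append((st.st_size, mtime, path.name))
    return sorted(entries, reverse=True)


if __name__ == "__main__":
    dump_dir = Path("/var/lib/vz/dump")
    if dump_dir.is_dir():
        for size, mtime, name in list_dump_files(dump_dir):
            print(f"{size / 2**30:8.2f} GiB  {mtime:%Y-%m-%d}  {name}")
    else:
        print(f"{dump_dir} not found; is this a Proxmox host?")
```

The listing says nothing about whether a file is obsolete. That decision still needs the backup job configuration and the guest inventory.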

First checks

On the relevant Proxmox host:

  • df -h
  • du -sh /var/lib/vz/dump
  • pvesm status
  • systemctl --failed

In Proxmox, check the backup job configuration and recent task logs before deleting files or changing retention.
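
The host-side commands can be collected in one pass so that their output can be attached to an incident note. The following is a minimal Python sketch; the function name and the timeout value are assumptions, and commands that are not installed are skipped rather than treated as errors.

```python
import shutil
import subprocess

# The first checks listed above, as argument vectors.
CHECKS = [
    ["df", "-h"],
    ["du", "-sh", "/var/lib/vz/dump"],
    ["pvesm", "status"],
    ["systemctl", "--failed"],
]


def run_checks(commands, timeout=30):
    """Run each command and return a mapping of command line to combined output."""
    results = {}
    for argv in commands:
        label = " ".join(argv)
        if shutil.which(argv[0]) is None:
            results[label] = "skipped: command not found"
            continue
        try:
            done = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            results[label] = "timed out"
            continue
        results[label] = done.stdout + done.stderr
    return results


if __name__ == "__main__":
    for label, output in run_checks(CHECKS).items():
        print(f"### {label}\n{output}")
```

All four commands are read-only, so running the script does not change host or guest state.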

Safety notes

  • Do not delete backups or ISOs during an incident without understanding what they are.
  • Do not change guest VM state unless the incident requires it.
  • Do not mix up Proxmox guest backups with application-specific backup systems.
/srv/hackeriet-wiki/dokuwiki/data/attic/infra/operations/proxmox-backups.1779542902.txt.gz · Last modified: by atluxity_idp.hackeriet.no