02-02-26-postmortem
Postmortem: Server Downtime on February 2
Date of Incident: February 2, 2026
Duration: ~2 hours
Impact: Services hosted on the affected machine were unavailable during the incident window.
Summary
On February 2, the primary server experienced approximately two hours of downtime following a hardware and configuration change. The incident was caused by installing a new HBA (host bus adapter) card and then enabling PCI passthrough for that device. The change triggered a delayed kernel panic that appeared only after the VM had fully loaded and established an active internet connection, resulting in repeated crashes and service unavailability.
Timeline
- T0: Installation of new HBA card completed successfully.
- T0 + X: PCI passthrough enabled for the HBA device and assigned to the VM.
- T0 + Y: VM booted successfully with no immediate errors.
- T0 + Y + Z: Once the VM completed boot and established an internet connection, a delayed kernel panic occurred, causing the VM to crash.
- Over the following ~2 hours: Multiple attempts were made to diagnose and stabilize the VM. The issue was initially difficult to reproduce consistently because the panic was delayed and conditional.
- Resolution: PCI passthrough for the HBA card was disabled and the configuration was rolled back, restoring system stability and service availability.
Root Cause
The root cause of the outage was an incompatibility or instability introduced by enabling PCI passthrough on the newly installed HBA card. This configuration resulted in a delayed kernel panic that:
- Did not occur immediately at boot
- Only triggered once the VM was fully loaded
- Only manifested when the VM was connected to the internet
This behavior significantly complicated diagnosis, as the system appeared stable during initial boot and testing phases.
Contributing Factors
- The kernel panic was delayed rather than immediate, obscuring the direct link to the configuration change.
- The issue only occurred under specific runtime conditions (VM fully loaded and network-connected).
- Limited prior testing of PCI passthrough behavior with this specific HBA model and kernel/host configuration.
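Because the panic appeared only once the VM was fully loaded and network-connected, a test that merely confirms a clean boot cannot clear a passthrough change. A minimal sketch of a soak-test verdict under those conditions follows. The `SoakObservation` fields, the `soak_verdict` name, and the 60-minute default are illustrative assumptions, not values taken from this incident.

```python
from dataclasses import dataclass


@dataclass
class SoakObservation:
    """One periodic observation of the test VM (hypothetical fields)."""
    minutes_since_boot: float
    boot_complete: bool
    network_up: bool
    kernel_fault_seen: bool


def soak_verdict(observations, min_loaded_minutes=60.0):
    """Return 'fail', 'pass', or 'inconclusive' for a passthrough soak test.

    A run only passes if the VM was observed fully booted AND
    network-connected, without faults, across at least
    min_loaded_minutes (an assumed threshold).
    """
    # Any fault at any point fails the run outright.
    if any(o.kernel_fault_seen for o in observations):
        return "fail"

    # Only observations taken under the triggering conditions count.
    loaded = [o for o in observations if o.boot_complete and o.network_up]
    if not loaded:
        return "inconclusive"

    times = [o.minutes_since_boot for o in loaded]
    covered = max(times) - min(times)
    return "pass" if covered >= min_loaded_minutes else "inconclusive"
```

The point of the three-way result is that "booted cleanly" maps to inconclusive rather than pass, which is the state the system was in when it appeared stable during initial testing.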
Resolution and Recovery
The issue was resolved by disabling PCI passthrough for the HBA card and reverting the VM to a stable configuration. Once the passthrough configuration was removed, the kernel panic no longer occurred and normal operation resumed.
Preventative Actions / Lessons Learned
- Perform staged testing of hardware passthrough changes, including sustained runtime and network-connected scenarios, before applying them to production systems.
- Schedule hardware and low-level virtualization changes during defined maintenance windows.
- Capture and review kernel logs immediately following configuration changes to detect delayed or non-obvious failures.
- Document compatibility and known issues for passthrough devices used in the environment.
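The log-review action above could be partly automated. The sketch below scans captured kernel log text for a few common Linux fault markers. It assumes the log has already been saved as text (for example from `dmesg`, `journalctl -k`, or a serial console capture), and the pattern list is an illustrative starting point rather than a complete set. The name `find_kernel_faults` is hypothetical.

```python
import re

# A small, non-exhaustive set of strings the Linux kernel commonly
# prints on serious faults. Extend this for the environment in use.
FAULT_PATTERNS = [
    re.compile(r"Kernel panic - not syncing"),
    re.compile(r"\bOops:"),
    re.compile(r"\bBUG:"),
    re.compile(r"general protection fault"),
]


def find_kernel_faults(log_lines):
    """Return (line_number, text) for every line matching a fault pattern.

    log_lines is any iterable of strings; line numbers start at 1.
    """
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if any(pattern.search(line) for pattern in FAULT_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Running this periodically against a log captured during a sustained, network-connected test run would surface a delayed panic even when the initial boot looked clean.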
Status: Resolved
Follow-up Required: Yes – further testing of HBA passthrough in a non-production environment before deployment.