# 02-02-26-postmortem

# Postmortem: Server Downtime on February 2

**Date of Incident:** February 2, 2026
**Duration:** ~2 hours
**Impact:** Service hosted on the affected machine was unavailable during the incident window.

---

## Summary
On February 2, the primary server was down for approximately two hours following a hardware and configuration change. A new HBA (Host Bus Adapter) card was installed, and PCI passthrough was then enabled for that device. The change triggered a delayed kernel panic that appeared only after the VM had fully booted and established an active internet connection, resulting in repeated crashes and service unavailability.

---

## Timeline
- **T0:** Installation of new HBA card completed successfully.
- **T0 + X:** PCI passthrough enabled for the HBA device, which was assigned to the VM.
- **T0 + Y:** VM booted successfully with no immediate errors.
- **T0 + Y + Z:** After the VM finished booting and established an internet connection, a delayed kernel panic occurred and the VM crashed.
- **Following ~2 hours:** Multiple attempts were made to diagnose and stabilize the VM. The issue was initially difficult to reproduce consistently because the panic was delayed and conditional.
- **Resolution:** PCI passthrough for the HBA card was disabled and the configuration was rolled back, restoring system stability and service availability.

---

## Root Cause
The outage was caused by an incompatibility or instability introduced by enabling PCI passthrough for the newly installed HBA card. This configuration resulted in a delayed kernel panic that:

- Did not occur immediately at boot
- Was triggered only once the VM was fully loaded
- Manifested only when the VM was connected to the internet

This behavior significantly complicated diagnosis, because the system appeared stable during initial boot and testing.

---
## Contributing Factors
- The kernel panic was delayed rather than immediate, obscuring the direct link to the configuration change.
- The issue only occurred under specific runtime conditions (VM fully loaded and network-connected).
- Limited prior testing of PCI passthrough behavior with this specific HBA model and kernel/host configuration.
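The factors above suggest that a boot-time check alone would not have caught this failure. As a minimal sketch (not the procedure used during the incident), the Python below treats a VM as stable only after it has stayed reachable over the network for a sustained window. The function names, the TCP probe, the example host, and the 30-minute default are all assumptions for illustration.

```python
import socket
import time
from typing import Callable


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def soak_check(
    probe: Callable[[], bool],
    duration_s: float = 1800.0,  # assumed soak window, not a value from the incident
    interval_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe repeatedly for duration_s and fail on the first missed probe."""
    deadline = clock() + duration_s
    while clock() < deadline:
        if not probe():
            return False
        sleep(interval_s)
    return True


# Example with a hypothetical host name and port:
# stable = soak_check(lambda: tcp_reachable("vm.example.internal", 22))
```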
---

## Resolution and Recovery

The issue was resolved by disabling PCI passthrough for the HBA card and reverting the VM to a stable configuration. Once the passthrough configuration was removed, the kernel panic no longer occurred and normal operation resumed.

---
## Preventative Actions / Lessons Learned
- Perform staged testing of hardware passthrough changes, including sustained runtime and network-connected scenarios, before applying them to production systems.
- Schedule hardware and low-level virtualization changes during defined maintenance windows.
- Capture and review kernel logs immediately following configuration changes to detect delayed or non-obvious failures.
- Document compatibility and known issues for passthrough devices used in the environment.
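As one possible way to act on the log-review item above, the sketch below scans kernel log text for lines that often accompany a panic or a passthrough fault. The pattern list is an assumption and is not taken from this incident's logs; it will also flag benign `vfio` messages, so the output is a list of lines to review, not a verdict. The input could be saved output from `dmesg` or `journalctl -k`.

```python
import re
from typing import Iterable, List

# Assumed markers; adjust to what the host and guest kernels actually emit.
SUSPECT_PATTERNS = [
    r"Kernel panic",
    r"\bOops\b",
    r"\bBUG:",
    r"DMAR:.*fault",
    r"vfio",
]
_SUSPECT_RE = re.compile("|".join(SUSPECT_PATTERNS), re.IGNORECASE)


def find_suspect_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that match any suspect pattern, in input order."""
    return [line.rstrip("\n") for line in lines if _SUSPECT_RE.search(line)]


# Example with a hypothetical file name:
# with open("kernel.log") as f:
#     for hit in find_suspect_lines(f):
#         print(hit)
```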
---

**Status:** Resolved
**Follow-up Required:** Yes – further testing of HBA passthrough in a non-production environment before deployment.