Update README.md

kneesox 2026-02-02 15:14:35 +00:00
parent dd8b41a920
commit 7b50fc5353


@ -1,2 +1,64 @@
# 02-02-26-postmortem
# Postmortem: Server Downtime on February 2
**Date of Incident:** February 2, 2026
**Duration:** ~2 hours
**Impact:** The service hosted on the affected machine was unavailable throughout the incident window.
---
## Summary
On February 2, the primary server experienced approximately two hours of downtime following a hardware and configuration change. The incident was caused by the installation of a new HBA (Host Bus Adapter) card and the subsequent activation of PCI passthrough for that device. This change triggered a delayed kernel panic that only manifested after the VM had fully loaded and established an active internet connection, resulting in repeated crashes and service unavailability.
---
## Timeline
- **T0:** Installation of new HBA card completed successfully.
- **T0 + X:** PCI passthrough enabled for the HBA device and assigned to the VM.
- **T0 + Y:** VM booted successfully with no immediate errors.
- **T0 + Y + Z:** Once the VM completed boot and established an internet connection, a delayed kernel panic occurred, causing the VM to crash.
- **Next ~2 hours:** Multiple attempts were made to diagnose and stabilize the VM. The issue was initially difficult to reproduce consistently because the panic was both delayed and conditional.
- **Resolution:** PCI passthrough for the HBA card was disabled and the configuration was rolled back, restoring system stability and service availability.
---
## Root Cause
The root cause of the outage was an incompatibility or instability introduced by enabling PCI passthrough on the newly installed HBA card. This configuration resulted in a delayed kernel panic that:
- Did not occur immediately at boot
- Only triggered once the VM was fully loaded
- Only manifested when the VM was connected to the internet
This behavior significantly complicated diagnosis, as the system appeared stable during initial boot and testing phases.
---
## Contributing Factors
- The kernel panic was delayed rather than immediate, obscuring the direct link to the configuration change.
- The issue only occurred under specific runtime conditions (VM fully loaded and network-connected).
- PCI passthrough had received limited prior testing with this specific HBA model and kernel/host configuration.
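
Because the panic appeared only after the VM was fully loaded and network-connected, a boot-time check alone would not have caught it. A staged test therefore needs a pass criterion that covers sustained, connected runtime. The following is a minimal sketch of such a criterion in Python, not tooling used during the incident: the `Sample` type, its fields, and the `soak_passed` function are hypothetical names, and how the samples are collected (for example, by polling the VM on a timer) is left as an assumption. A call such as `soak_passed(samples, min_connected_s=3600)` would require one hour of uninterrupted connected runtime; the one-hour figure is illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Sample:
    """One health observation of the test VM (hypothetical schema)."""
    t: float           # seconds since the VM was started
    vm_alive: bool     # False once a crash or panic has been observed
    network_up: bool   # True while the VM has an active connection


def soak_passed(samples: Iterable[Sample], min_connected_s: float) -> bool:
    """Pass only if the VM never crashed AND stayed network-connected
    for at least min_connected_s seconds without interruption."""
    reached = False
    connected_since: Optional[float] = None
    for s in samples:
        if not s.vm_alive:
            # Any crash fails the run, even after the threshold was reached.
            return False
        if not s.network_up:
            # A dropped connection resets the continuous-runtime clock.
            connected_since = None
            continue
        if connected_since is None:
            connected_since = s.t
        if s.t - connected_since >= min_connected_s:
            reached = True
    return reached
```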
---
## Resolution and Recovery
The issue was resolved by disabling PCI passthrough for the HBA card and reverting the VM to a stable configuration. Once the passthrough configuration was removed, the kernel panic no longer occurred and normal operation resumed.
---
## Preventative Actions / Lessons Learned
- Perform staged testing of hardware passthrough changes, including sustained runtime and network-connected scenarios, before applying them to production systems.
- Schedule hardware and low-level virtualization changes during defined maintenance windows.
- Capture and review kernel logs immediately following configuration changes to detect delayed or non-obvious failures.
- Document compatibility and known issues for passthrough devices used in the environment.
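
The log-review step above can be partly automated. The sketch below is one possible approach, assuming the kernel log has already been saved as text (for example from `dmesg`, the system journal, or a serial console). The function name `find_kernel_failures` and the pattern list are illustrative assumptions: the patterns cover a few common Linux failure messages and are not a complete set, nor the signatures seen in this incident. Passing the lines of a saved log returns the line numbers and text of any matches, which can then be reviewed by hand.

```python
import re
from typing import Iterable, List, Tuple

# Illustrative signatures only; extend for the kernel and hardware in use.
FAILURE_PATTERNS = re.compile(
    r"Kernel panic - not syncing"
    r"|\bOops\b"
    r"|\bBUG: "
    r"|general protection fault"
    r"|DMAR: .*fault"            # Intel IOMMU fault reports
    r"|AMD-Vi: .*IO_PAGE_FAULT"  # AMD IOMMU fault reports
)


def find_kernel_failures(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, line text) for each suspicious log line."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if FAILURE_PATTERNS.search(line)
    ]
```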
---
**Status:** Resolved
**Follow-up Required:** Yes. HBA passthrough needs further testing in a non-production environment before deployment.