diff --git a/README.md b/README.md
index 8172279..4a44142 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,64 @@
 # 02-02-26-postmortem
+# Postmortem: Server Downtime on February 2
+
+**Date of Incident:** February 2, 2026
+**Duration:** ~2 hours
+**Impact:** Service hosted on the affected machine was unavailable during the incident window.
+
+---
+
+## Summary
+
+On February 2, the primary server experienced approximately two hours of downtime following a hardware and configuration change. The incident was caused by the installation of a new HBA (Host Bus Adapter) card and the subsequent activation of PCI passthrough for that device. This change triggered a delayed kernel panic that only manifested after the VM had fully loaded and established an active internet connection, resulting in repeated crashes and service unavailability.
+
+---
+
+## Timeline
+
+- **T0:** Installation of new HBA card completed successfully.
+- **T0 + X:** PCI passthrough enabled for the HBA device and assigned to the VM.
+- **T0 + Y:** VM booted successfully with no immediate errors.
+- **T0 + Y + Z:** Once the VM completed boot and established an internet connection, a delayed kernel panic occurred, causing the VM to crash.
+- **Following ~2 hours:** Multiple attempts to diagnose and stabilize the VM; issue was initially difficult to reproduce consistently due to the delayed and conditional nature of the panic.
+- **Resolution:** PCI passthrough for the HBA card was disabled and the configuration was rolled back, restoring system stability and service availability.
+
+---
+
+## Root Cause
+
+The root cause of the outage was an incompatibility or instability introduced by enabling PCI passthrough on the newly installed HBA card. This configuration resulted in a delayed kernel panic that:
+
+- Did not occur immediately at boot
+- Only triggered once the VM was fully loaded
+- Only manifested when the VM was connected to the internet
+
+This behavior significantly complicated diagnosis, as the system appeared stable during initial boot and testing phases.
+
+---
+
+## Contributing Factors
+
+- The kernel panic was delayed rather than immediate, obscuring the direct link to the configuration change.
+- The issue only occurred under specific runtime conditions (VM fully loaded and network-connected).
+- Limited prior testing of PCI passthrough behavior with this specific HBA model and kernel/host configuration.
+
+---
+
+## Resolution and Recovery
+
+The issue was resolved by disabling PCI passthrough for the HBA card and reverting the VM to a stable configuration. Once the passthrough configuration was removed, the kernel panic no longer occurred and normal operation resumed.
+
+---
+
+## Preventative Actions / Lessons Learned
+
+- Perform staged testing of hardware passthrough changes, including sustained runtime and network-connected scenarios, before applying them to production systems.
+- Schedule hardware and low-level virtualization changes during defined maintenance windows.
+- Capture and review kernel logs immediately following configuration changes to detect delayed or non-obvious failures.
+- Document compatibility and known issues for passthrough devices used in the environment.
+
+---
+
+**Status:** Resolved
+**Follow-up Required:** Yes – further testing of HBA passthrough in a non-production environment before deployment.
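
One way to act on the lesson about capturing and reviewing kernel logs after a configuration change is a small script that flags panic-related lines in saved log output. The following is only a minimal sketch, not the tooling used during this incident: the function name `find_panic_lines` and the pattern list are assumptions chosen for illustration, and the patterns cover only a few markers that Linux kernels commonly print around a panic or oops.

```python
import re
from typing import Iterable, List

# Assumed marker list (illustrative, not exhaustive): strings that Linux
# kernels commonly emit around a panic or oops.
PANIC_PATTERNS = [
    r"Kernel panic - not syncing",
    r"\bOops\b",
    r"BUG: ",
    r"Call Trace:",
]
_PANIC_RE = re.compile("|".join(PANIC_PATTERNS))


def find_panic_lines(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines (trailing newline stripped) that match a panic marker."""
    return [line.rstrip("\n") for line in log_lines if _PANIC_RE.search(line)]


if __name__ == "__main__":
    import sys

    # Read kernel log text from standard input and print any suspicious lines.
    for hit in find_panic_lines(sys.stdin):
        print(hit)
```

Assuming the script is saved under a hypothetical name such as `scan_kernel_log.py`, kernel log text can be piped into it, for example from `dmesg` or, on a systemd host, `journalctl -k`. Because the panic in this incident appeared only after the VM was fully loaded and network-connected, such a check would need to run again after a sustained period of runtime, not just once after boot.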