Update README.md
parent dd8b41a920
commit 7b50fc5353
1 changed file with 62 additions and 0 deletions
# 02-02-26-postmortem

# Postmortem: Server Downtime on February 2

**Date of Incident:** February 2, 2026
**Duration:** ~2 hours
**Impact:** The service hosted on the affected machine was unavailable during the incident window.

---
## Summary

On February 2, the primary server experienced approximately two hours of downtime following a hardware and configuration change. The incident was caused by installing a new HBA (Host Bus Adapter) card and then enabling PCI passthrough for that device. The change triggered a delayed kernel panic that manifested only after the VM had fully loaded and established an active internet connection, resulting in repeated crashes and service unavailability.

---
## Timeline

- **T0:** Installation of the new HBA card completed successfully.
- **T0 + X:** PCI passthrough enabled for the HBA device, and the device assigned to the VM.
- **T0 + Y:** VM booted successfully with no immediate errors.
- **T0 + Y + Z:** Once the VM completed boot and established an internet connection, a delayed kernel panic occurred, causing the VM to crash.
- **Following ~2 hours:** Multiple attempts to diagnose and stabilize the VM. The issue was initially difficult to reproduce consistently because the panic was delayed and conditional.
- **Resolution:** PCI passthrough for the HBA card was disabled and the configuration was rolled back, restoring system stability and service availability.
---

## Root Cause

The root cause of the outage was an incompatibility or instability introduced by enabling PCI passthrough on the newly installed HBA card. This configuration resulted in a delayed kernel panic that:

- Did not occur immediately at boot
- Triggered only once the VM was fully loaded
- Manifested only when the VM was connected to the internet

This behavior significantly complicated diagnosis, because the system appeared stable during initial boot and testing.

---
## Contributing Factors

- The kernel panic was delayed rather than immediate, obscuring the direct link to the configuration change.
- The issue occurred only under specific runtime conditions (VM fully loaded and network-connected).
- PCI passthrough had received limited prior testing with this specific HBA model and kernel/host configuration.
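
The first factor above, the gap between boot and panic, can be measured after the fact from a captured kernel log. The following is a minimal sketch, assuming a Linux guest whose log lines carry the usual `[seconds.microseconds]` uptime prefix and whose panic line contains the standard `Kernel panic - not syncing` text. The function name `panic_uptimes` and the sample log lines are illustrative and are not taken from the incident.

```python
import re
from typing import Iterable, List

# Assumption: dmesg-style lines such as
# "[  812.004211] Kernel panic - not syncing: ..."
_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")
_PANIC_MARKER = "Kernel panic - not syncing"


def panic_uptimes(lines: Iterable[str]) -> List[float]:
    """Return the uptime, in seconds since boot, of every panic line found."""
    uptimes = []
    for line in lines:
        match = _LINE.match(line.strip())
        if match and _PANIC_MARKER in match.group(2):
            uptimes.append(float(match.group(1)))
    return uptimes


# Hypothetical sample log, made up for illustration only.
sample_log = [
    "[    0.000000] sample: boot started",
    "[   14.310522] sample: network link is up",
    "[  812.004211] Kernel panic - not syncing: Fatal exception",
]

# A large uptime here would indicate a delayed panic rather than a boot-time one.
for uptime in panic_uptimes(sample_log):
    print(f"panic at {uptime:.1f} s after boot")
```
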
---

## Resolution and Recovery

The issue was resolved by disabling PCI passthrough for the HBA card and reverting the VM to a stable configuration. Once the passthrough configuration was removed, the kernel panic no longer occurred and normal operation resumed.

---
## Preventative Actions / Lessons Learned

- Perform staged testing of hardware passthrough changes, including sustained runtime and network-connected scenarios, before applying them to production systems.
- Schedule hardware and low-level virtualization changes during defined maintenance windows.
- Capture and review kernel logs after configuration changes, both immediately and after sustained runtime, to detect delayed or non-obvious failures.
- Document compatibility and known issues for passthrough devices used in the environment.
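
The first action above, testing under sustained runtime, could be automated along the following lines. This is a minimal sketch, not the procedure used in this environment: `soak_test`, its parameters, and the `check` callable are hypothetical names, and what `check` actually probes (for example, whether a test VM still answers over the network) is left to the operator.

```python
import time
from typing import Callable, Tuple


def soak_test(
    check: Callable[[], bool],
    duration_s: float,
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, float]:
    """Call check() every interval_s seconds until duration_s has elapsed.

    Returns (True, elapsed) if every check passed, or (False, elapsed)
    at the first failing check. clock and sleep are injectable so the
    loop can be exercised without real waiting.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if elapsed >= duration_s:
            return True, elapsed
        if not check():
            return False, elapsed
        sleep(interval_s)


# Example call (vm_is_reachable is a hypothetical probe supplied by the operator):
# passed, elapsed = soak_test(vm_is_reachable, duration_s=2 * 3600, interval_s=30)
```
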
---

**Status:** Resolved
**Follow-up Required:** Yes. HBA passthrough needs further testing in a non-production environment before it is deployed again.