# Postmortem: Server Downtime on February 2

**Date of Incident:** February 2, 2026

**Duration:** ~2 hours

**Impact:** Service hosted on the affected machine was unavailable during the incident window.

---

## Summary

On February 2, the primary server experienced approximately two hours of downtime following a hardware and configuration change. The incident was caused by the installation of a new HBA (Host Bus Adapter) card and the subsequent activation of PCI passthrough for that device. This change triggered a delayed kernel panic that only manifested after the VM had fully loaded and established an active internet connection, resulting in repeated crashes and service unavailability.

---

## Timeline

- **T0:** Installation of new HBA card completed successfully.
- **T0 + X:** PCI passthrough enabled for the HBA device and assigned to the VM.
- **T0 + Y:** VM booted successfully with no immediate errors.
- **T0 + Y + Z:** Once the VM completed boot and established an internet connection, a delayed kernel panic occurred, causing the VM to crash.
- **Following ~2 hours:** Multiple attempts to diagnose and stabilize the VM; issue was initially difficult to reproduce consistently due to the delayed and conditional nature of the panic.
- **Resolution:** PCI passthrough for the HBA card was disabled and the configuration was rolled back, restoring system stability and service availability.

---

## Root Cause

The root cause of the outage was an incompatibility or instability introduced by enabling PCI passthrough on the newly installed HBA card. This configuration resulted in a delayed kernel panic that:

- Did not occur immediately at boot
- Only triggered once the VM was fully loaded
- Only manifested when the VM was connected to the internet

This behavior significantly complicated diagnosis, as the system appeared stable during initial boot and testing phases.

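
Because the panic only appeared after full boot and with the network up, a "VM booted cleanly" check would not have caught it. The following is a minimal sketch of a soak-test gate that only passes once the VM has survived a sustained period with networking active. The `Observation` shape, the `soak_passed` name, and the 30-minute default are illustrative assumptions, not part of the existing tooling.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Observation:
    """One health sample from the VM under test (hypothetical shape)."""
    seconds_since_boot: float
    network_up: bool
    vm_alive: bool


def soak_passed(samples: Iterable[Observation],
                min_networked_seconds: float = 1800.0) -> bool:
    """Return True only if the VM stayed alive for at least
    min_networked_seconds after the first sample with the network up.

    Samples are assumed to be ordered by time. Any sample where the VM
    is not alive fails the test, and a run that never brings the
    network up can never pass. The default threshold is an assumption.
    """
    first_networked = None
    for sample in samples:
        if not sample.vm_alive:
            return False  # crash observed: fail immediately
        if sample.network_up and first_networked is None:
            first_networked = sample.seconds_since_boot
        if (first_networked is not None and
                sample.seconds_since_boot - first_networked >= min_networked_seconds):
            return True  # survived long enough with networking active
    return False  # not enough networked runtime was observed
```

The design choice here is that a boot-only run is treated as inconclusive rather than as a pass, which matches the failure pattern described above.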

---

## Contributing Factors

- The kernel panic was delayed rather than immediate, obscuring the direct link to the configuration change.
- The issue only occurred under specific runtime conditions (VM fully loaded and network-connected).
- Limited prior testing of PCI passthrough behavior with this specific HBA model and kernel/host configuration.

---

## Resolution and Recovery

The issue was resolved by disabling PCI passthrough for the HBA card and reverting the VM to a stable configuration. Once the passthrough configuration was removed, the kernel panic no longer occurred and normal operation resumed.

---

## Preventative Actions / Lessons Learned

- Perform staged testing of hardware passthrough changes, including sustained runtime and network-connected scenarios, before applying them to production systems.
- Schedule hardware and low-level virtualization changes during defined maintenance windows.
- Capture and review kernel logs immediately following configuration changes to detect delayed or non-obvious failures.
- Document compatibility and known issues for passthrough devices used in the environment.

---

**Status:** Resolved

**Follow-up Required:** Yes – further testing of HBA passthrough in a non-production environment before deployment.
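
As a starting point for the follow-up testing, the lesson about reviewing kernel logs after a configuration change could be partly automated. The sketch below scans captured kernel log text for a few common failure signatures. The pattern list is an assumption chosen for illustration and was not derived from the logs of this incident; it would need tuning against real output from the affected host and VM.

```python
import re
from typing import List, Tuple

# Illustrative signatures only (an assumption, not an exhaustive list).
SUSPECT_PATTERNS = [
    re.compile(r"Kernel panic - not syncing"),
    re.compile(r"\bOops\b"),
    re.compile(r"\bBUG:"),
    re.compile(r"general protection fault"),
    re.compile(r"DMAR:.*fault", re.IGNORECASE),  # possible IOMMU fault report
]


def find_suspect_lines(log_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for log lines matching any pattern.

    Line numbers are 1-based and refer to positions in log_text.
    """
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SUSPECT_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

The function takes plain text, so it can be fed saved output from `dmesg` or `journalctl -k` collected at intervals after a change, rather than only once at boot.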