Preventing Outages: The Case for Dual-Partition Systems and Atomic Updates
The recent CrowdStrike outage serves as a much-needed wake-up call. It reminds me of a clip from Netflix’s Space Force featuring John Malkovich.
To avoid similar issues in the future, Microsoft could consider implementing a solution akin to the dual-partition system with versioned kernels used on Google’s Chromebooks. This setup, also known as A/B partitioning, keeps two partitions, each holding its own copy of the operating system. When an update is applied, it is written in the background to the inactive partition while the system keeps running from the active one. Once the update is complete, the system can reboot into the new version. This method minimizes downtime, because the update process doesn’t interrupt the user. It also provides a safety net: if the new version fails to boot or otherwise misbehaves, the system can revert to the previous partition, ensuring stability.
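To make the A/B flow above concrete, here is a minimal sketch in Python. It is a toy model rather than how ChromeOS or any real updater is implemented: the `ABDevice` class, its method names, and the `health_check` callback are all hypothetical and exist only to illustrate the stage, reboot, and fall-back sequence.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Slot:
    """One OS partition in the toy model (hypothetical)."""
    version: str


class ABDevice:
    """Toy model of a device with two OS slots, A and B (hypothetical)."""

    def __init__(self, version: str) -> None:
        self.slots: Dict[str, Slot] = {"A": Slot(version), "B": Slot(version)}
        self.active = "A"

    @property
    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def stage_update(self, new_version: str) -> None:
        # Write the new version to the inactive slot only; the running
        # system in the active slot is never modified.
        self.slots[self.inactive].version = new_version

    def reboot_into_update(self, health_check: Callable[[str], bool]) -> str:
        # Try the freshly updated slot. If it fails the health check,
        # the previously active slot stays active (the rollback case).
        candidate = self.inactive
        if health_check(self.slots[candidate].version):
            self.active = candidate
        return self.slots[self.active].version


if __name__ == "__main__":
    def is_healthy(version: str) -> bool:
        # Stand-in for "did the new version boot and pass its checks?"
        return "broken" not in version

    device = ABDevice("1.0")
    device.stage_update("1.1-broken")
    print(device.reboot_into_update(is_healthy))  # 1.0 (update rejected)
    device.stage_update("1.1")
    print(device.reboot_into_update(is_healthy))  # 1.1 (update accepted)
```

The point of the design is that a bad update only ever lands in the slot the system is not running from, so a known-good copy is always available to boot.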
Additionally, adopting a system similar to OSTree could further enhance reliability. OSTree is a technology used in some Linux-based operating systems for managing bootable, immutable, versioned filesystems. It facilitates atomic updates, ensuring consistency and reducing the risk of incomplete updates. While Microsoft Windows doesn’t currently use OSTree, implementing a similar concept could provide the same benefits, allowing for seamless updates and rollback capabilities.
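The idea of an atomic update can be illustrated with a small Python sketch. This is not OSTree’s actual implementation or API. It only shows the general pattern of writing each version into its own directory and then switching a `current` pointer in a single rename. The function names and directory layout are hypothetical, and the sketch assumes a POSIX filesystem, where renaming over an existing symlink is atomic.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict


def deploy(root: Path, version: str, files: Dict[str, str]) -> Path:
    """Write a complete new tree into its own versioned directory."""
    tree = root / "deployments" / version
    tree.mkdir(parents=True)
    for name, content in files.items():
        (tree / name).write_text(content)
    return tree


def switch_current(root: Path, tree: Path) -> None:
    """Repoint root/current at tree with a single atomic rename."""
    tmp_link = root / "current.tmp"
    if tmp_link.is_symlink():
        tmp_link.unlink()  # leftover from an interrupted earlier switch
    os.symlink(tree, tmp_link)
    # On POSIX, rename replaces the old link in one step, so readers see
    # either the old tree or the new one, never a half-written mix.
    os.replace(tmp_link, root / "current")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        v1 = deploy(root, "v1", {"os-release": "1"})
        switch_current(root, v1)
        v2 = deploy(root, "v2", {"os-release": "2"})
        switch_current(root, v2)   # update
        print((root / "current" / "os-release").read_text())  # 2
        switch_current(root, v1)   # rollback is just another switch
        print((root / "current" / "os-release").read_text())  # 1
```

Because old trees are left in place, rolling back is the same cheap operation as updating: point `current` at the previous directory.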