As a result of a prior OCP initiative to develop an Enhanced EDAC driver that collects fine-grained memory corrected error logs, we can now observe errors and build error profiles in ways that were not feasible in the past. This is helping us make data-driven decisions to manage such faults more effectively in our cloud infrastructure.
This presentation consists of two parts. First, we plan to share the results of a large-scale study of memory corrected errors over the past nine months. Then we will share our proposal to extend a similar approach to other types of hardware corrected errors, e.g., PCIe, cache, and inter-CPU link errors. This is in line with OCP's OS-first methodology for hardware corrected error reporting, which is intended to eliminate dependency on SMI-based methods. We will also discuss various mitigation options we are considering for managing the impact of such memory corrected errors.