Memory Corrected Error profiling via Linux EDAC Driver within large-scale cloud infrastructure

Nov 18, 2021
26:06

As a result of a prior OCP initiative to develop an Enhanced EDAC driver that collects fine-grained memory corrected error logs, we can now observe and build error profiles that were not feasible in the past. This is helping us make data-driven decisions to manage such faults more effectively in our cloud infrastructure. This presentation consists of two parts: first, we share the results of a large-scale study of memory corrected errors over the past nine months; then we share our proposal to extend a similar approach to other types of hardware corrected errors, e.g., PCIe, cache, and inter-CPU link errors. This is in line with OCP's OS-first methodology for hardware corrected error reporting, which aims to eliminate the dependency on SMI-based methods. We will also discuss the various mitigation options we are considering to manage the impact of such memory corrected errors.

