Memory Corrected Error profiling via Linux EDAC Driver within large scale cloud infrastructure

Name: Memory Corrected Error profiling via Linux EDAC Driver within large scale cloud infrastructure
Uploaded: Nov 18, 2021
Duration: 1566 s

Open Compute Project

As a result of a prior OCP initiative to develop an Enhanced EDAC driver that collects fine-grained memory corrected error logs, we can now observe and build error profiles that were not feasible in the past. This is helping us make data-driven decisions to manage such faults more effectively in our cloud infrastructure. This presentation consists of two parts. First, we share the results of a large-scale study of memory corrected errors over the past nine months. Then we share our proposal to extend a similar approach to other types of hardware corrected errors, e.g., PCIe, cache, and inter-CPU link errors. This is in line with OCP's OS-first methodology for hardware corrected error reporting, which eliminates the dependency on SMI-based methods. We will also discuss various mitigation options we are considering to manage the impact of such memory corrected errors.
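
For readers unfamiliar with EDAC, the sketch below shows one way to read per-controller corrected error counters from the standard upstream Linux EDAC sysfs tree. It is not the Enhanced EDAC driver described in the talk, and it yields only aggregate counts, not fine-grained logs. The function name `read_ce_counts` and its `edac_root` parameter are hypothetical names chosen for this illustration; the default path assumes a system with an EDAC memory-controller driver loaded.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: corrected_error_count} from EDAC sysfs.

    Hypothetical helper for illustration; assumes the upstream EDAC
    layout where each memory controller directory (mc0, mc1, ...)
    exposes a `ce_count` file.
    """
    counts = {}
    root = Path(edac_root)
    # Each memory controller appears as mc0, mc1, ... under the EDAC root.
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        ce_file = mc_dir / "ce_count"
        try:
            counts[mc_dir.name] = int(ce_file.read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counter is missing or unreadable.
            continue
    return counts


if __name__ == "__main__":
    for name, count in read_ce_counts().items():
        print(f"{name}: {count} corrected errors")
```

Sampling these counters periodically and recording the deltas would give a coarse error-rate profile per controller; per-DIMM or per-address detail would need a finer-grained source such as the one the presentation discusses.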
