[RFC] Kernel Support of Memory Error Detection.

From: Jiaqi Yan <jiaqiyan@google.com>
To: naoya.horiguchi@nec.com, tony.luck@intel.com,
	dave.hansen@linux.intel.com,  david@redhat.com
Cc: erdemaktas@google.com, pgonda@google.com, rientjes@google.com,
	 duenwen@google.com, Vilas.Sridharan@amd.com,
	mike.malvestuto@intel.com,  gthelen@google.com,
	linux-mm@kvack.org, jiaqiyan@google.com,  jthoughton@google.com
Subject: [RFC] Kernel Support of Memory Error Detection.
Date: Thu,  3 Nov 2022 15:50:29 +0000
Message-ID: <20221103155029.2451105-1-jiaqiyan@google.com>

This RFC is a follow-up to [1]. We’d like to first revisit the problem
statement, then explain the motivation for kernel support of memory
error detection. We attempt to answer two key questions raised about the
initial memory-scanning based solution: what memory to scan and how the
scanner should be designed. Unlike what [1] originally proposed,
we think a kernel-driven design similar to khugepaged/kcompactd would
work better than the userspace-driven design.

Problem Statement
=================
The ever-increasing size and cost of DRAM have brought memory subsystem
reliability to the forefront of large fleet owners’ concerns. Memory
errors are among the top hardware failures that cause server and
workload crashes. Simply deploying extra-reliable DRAM hardware to a
large-scale computing fleet adds significant cost, e.g., 10% extra cost
on DRAM can amount to hundreds of millions of dollars.

Reactive memory poison recovery (MPR), e.g., recovering from MCEs raised
during an execution context (the kernel mechanisms are MCE handler +
CONFIG_MEMORY_FAILURE + SIGBUS to the user space process), has been found
effective in keeping systems resilient to memory errors. However,
reactive memory poison recovery has several major drawbacks:
- It requires software systems that access poisoned memory to
  be specifically designed and implemented to recover from memory errors.
  Uncorrectable (UC) errors are random and may occur outside of the
  enlightened address spaces or execution contexts. The added error
  recovery capability comes at the cost of added complexity, and it is
  often impossible to enlighten 3rd party software.
- In a virtualized environment, the injected MCEs introduce the same
  challenge to the guest.
- It only covers MCEs raised by CPU accesses, but the scope of the memory
  error problem is far beyond that. For example, PCIe devices (e.g. NIC
  and GPU) accessing poisoned memory cause host crashes on certain
  machine configs.

We want to upstream a patch set that proactively scans the memory DIMMs
at a configurable rate to detect UC memory errors, and attempts to
recover from the detected memory errors. We call it proactive MPR, which
provides three benefits to tackle the memory error problem:
- Proactively scanning memory DIMMs reduces the chance of a correctable
  error becoming uncorrectable.
- Once detected, pages with UC errors caught in unallocated memory are
  isolated and prevented from being allocated to an application or the OS.
- The probability of software/hardware products encountering memory
  errors is reduced, as they are only exposed to memory errors developed
  over a window of T, where T stands for the period of scrubbing the
  entire memory space. Any memory errors that occurred more than T ago
  should already have been detected and resulted in custom recovery
  actions. For example, in a cloud environment VMs can be live migrated
  to another healthy host.
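
To make the window T concrete, here is a back-of-the-envelope sketch. It
assumes the scanner sustains a constant average rate over the whole
scannable memory; the function name and the example numbers are
illustrative only, not measurements or proposed defaults.

```python
GIB = 1 << 30
MIB = 1 << 20


def scrub_period_seconds(total_bytes: int, scan_bytes_per_sec: float) -> float:
    """Return T, the time to scan total_bytes once at a constant rate.

    Assumption: the rate is an average that already accounts for any
    sleeping or throttling done by the scanner.
    """
    if scan_bytes_per_sec <= 0:
        raise ValueError("scan rate must be positive")
    return total_bytes / scan_bytes_per_sec


# Hypothetical host: 1 TiB of scannable memory, scanned at 100 MiB/s.
t = scrub_period_seconds(1024 * GIB, 100 * MIB)
hours = t / 3600  # roughly 2.9 hours per full pass
```

Under these assumptions, halving the scan rate doubles T, so the scan
rate directly trades CPU and memory bandwidth against the exposure
window.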

Some CPU vendors [2, 3] provide a hardware patrol scrubber (HPS) to
prevent the buildup of memory errors. In comparison, a software memory
error detector (SW) has the following pros and cons:
- SW supports adaptive scanning, i.e. it can speed up or slow down
  scanning, turn scanning on or off, and yield its own CPU cycles and
  memory bandwidth. All of these can happen on-the-fly based on the
  system workload status or the administrator’s choice. HPS doesn’t have
  this flexibility. Its patrol speed is usually only configurable at
  boot time, and it is not able to consider system state. (Note: HPS is
  a memory controller feature and usually doesn’t consume CPU time.)
- SW can expose controls to scan by memory types, while HPS always scans
  full system memory. For example, an administrator can use SW to only
  scan hugetlb memory on the system.
- SW can scan memory at a finer granularity, for example, with a
  different scan rate per node, or with scanning entirely disabled on
  some nodes. HPS, however, currently only supports per-host scanning.
- SW can make scan statistics (e.g. X bytes have been scanned in the
  last Y seconds and Z memory errors were found) easily visible to
  datacenter administrators, who can schedule maintenance (e.g. migrating
  running jobs before repairing DIMMs) accordingly.
- SW’s functionality is consistent across hardware platforms. HPS’s
  functionality varies from vendor to vendor. For example, some vendors
  support shorter scrubbing periods than others, and some vendors may not
  support memory scrubbing at all.
- HPS usually doesn’t consume CPU cores but does consume memory
  controller cycles and memory bandwidth. SW consumes both CPU cycles
  and memory bandwidth, but that cost is only incurred if administrators
  opt into the scanning after weighing the costs and benefits.
- As CPU cores are not consumed by HPS, there won’t be any cache impact.
  SW can utilize prefetchnta (for x86) [4] and equivalent hints for other
  architectures [5] to minimize cache impact (in case of prefetchnta,
  completely avoiding L1/L2 cache impact).

Solution Proposals
==================

What to Scan
============
The initial RFC proposed to scan the **entire system memory**, which
raised the question of what memory is scannable (i.e. memory accessible
from kernel direct mapping). We attempt to address this question by
breaking down the memory types as follows:
- Static memory types: memory that stays either scannable or unscannable.
  Well defined examples are hugetlb vs regular memory, node-local memory
  vs far memory (e.g. CXL or PMEM). While most static memory types are
  scannable, administrators could disable scanning far memory to avoid
  interfering with the promotion and demotion logic in memory tiering
  solutions. (The implementation will allow administrators to disable
  scanning on scannable memory.)
- Memory types related to virtualization, including ballooned-away memory
  and unaccepted memory. Not all balloon implementations are compatible
  with memory scanning (i.e. reading memory mapped into the direct map) in
  the guest. For example, with the virtio-mem devices [6] in the
  hypervisor, reading unplugged memory can cause undefined behavior. The
  same applies to unaccepted memory in confidential VMs [7]. Since memory
  error detection on the host side already benefits its guests
  transparently (i.e., spending no extra guest CPU cycles), there is very
  limited benefit for a guest to scan memory by itself. We recommend
  disabling memory error detection inside guests.
- Dynamic memory types: memory that becomes unscannable or scannable
  dynamically. One significant example is guest private memory backing
  confidential VMs. At the software level, guest private memory pages
  become unscannable as they will soon be unmapped from the kernel direct
  mapping [8]. Scanning guest private memory pages is still possible via
  IO remapping, at a foreseeable performance cost, with proper page fault
  handling (skipping the page) if all means of mapping fail. At the
  hardware level, the memory access implementation done by hardware
  vendors today puts memory integrity checking prior to memory ownership
  checking, which means memory errors are still surfaced to the OS while
  scanning. For the scanning scheme to keep working in the future, we
  need the hardware vendors to keep providing similar error detection
  behavior in their confidential VM hardware. We believe this is a
  reasonable ask, as their hardware patrol scrubbers also adopt the same
  scanning scheme and therefore rely on the same promise. If vendors
  break the promise, we can switch to whatever new scheme the patrol
  scrubbers use.

How Scanning is Designed
========================
We can support kernel memory error detection in two styles, depending on
whether the kernel itself or a userspace application drives the detection.

In the first style, the kernel itself can create a kthread on each NUMA
node for scanning node-local memory (i.e. IORESOURCE_SYSTEM_RAM). These
scanning kthreads are scheduled in a way similar to how khugepaged or
kcompactd works. Basic configurations of the ever-schedulable background
kthreads can be exposed to userspace via sysfs, for example, sleeping
for X milliseconds after scanning Y raw memory pages. Scanning statistics
can also be made visible to userspace via sysfs, for example, the number
of pages actually scanned and the number of memory errors found.
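
The "sleep X milliseconds after scanning Y pages" knob can be modeled
roughly as below. This is only a userspace sketch of the control flow:
scan_node, ScanStats, the page_has_error callback and
max_scan_rate_bytes_per_sec are hypothetical names for illustration, not
proposed kernel or sysfs interfaces, and a real scanner would be a
kthread reading pages through the direct map.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ScanStats:
    pages_scanned: int = 0   # pages actually read
    errors_found: int = 0    # pages on which an error was detected
    sleeps: int = 0          # number of throttling sleeps taken


def scan_node(pfns: Iterable[int],
              page_has_error: Callable[[int], bool],
              pages_per_batch: int,
              sleep_ms: int,
              sleep: Callable[[int], None] = lambda ms: None) -> ScanStats:
    """Scan pages in order, sleeping sleep_ms after every pages_per_batch."""
    stats = ScanStats()
    for pfn in pfns:
        if page_has_error(pfn):       # stand-in for the guarded memory read
            stats.errors_found += 1
        stats.pages_scanned += 1
        if stats.pages_scanned % pages_per_batch == 0:
            sleep(sleep_ms)           # yield CPU and memory bandwidth
            stats.sleeps += 1
    return stats


def max_scan_rate_bytes_per_sec(pages_per_batch: int, sleep_ms: int,
                                page_size: int = 4096) -> float:
    """Upper bound on the scan rate: ignores time spent reading pages."""
    return pages_per_batch * page_size * 1000 / sleep_ms


# Example: 10 pages, page 3 is bad, sleep 20 ms after every 4 pages.
stats = scan_node(range(10), lambda pfn: pfn == 3, 4, 20)
```

With 4 KiB pages, a batch of 256 pages and a 10 ms sleep bound the scan
rate at 100 MiB/s per scanning thread in this model; the actual rate is
lower because reading the pages also takes time.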

On the other hand, memory error detection can be driven by root userspace
applications with sufficient support from the kernel. For example,
a process can scrutinize physical memory under its own virtual memory
space on demand. The support needed from the kernel is the most basic
operations: specifically designed memory read access (e.g. avoiding CPU
errata, minimizing cache pollution, and avoiding leaking the memory
content, etc.), and machine check exception handling plus memory failure
handling [9] when a memory error is detected.

The pros and cons of in-kernel background scanning are:
- It is a simple and independent component that scans system memory
  constantly and regularly, which improves the machine fleet’s memory
  health (e.g., for hyperscalers, cloud providers, etc.).
- The rest of the OS (both kernel and applications) can benefit from it
  without explicit modifications.
- The efficiency of this approach is easily configurable via the scan
  rate.
- It cannot offer an on-the-spot guarantee. There is no good way to
  prioritize certain chunks of memory.
- The implementation of this approach needs to deal with the question of
  whether a memory page is scannable.

The pros and cons of the application-driven approach are:
- An application can scan a specific chunk of memory on the spot, and is
  able to prioritize scanning on some memory regions or memory types.
- A memory error detection agent needs to be designed to proactively,
  constantly and regularly scan the entire memory.
- A kernel API needs to be designed to give userspace enough power to
  scan physical memory. For example, memory regions requested by
  multiple applications may overlap. Should the kernel API support
  combined scanning?
- The application is exposed to the question of whether memory is
  scannable, and needs to deal with the complexity of ensuring memory
  stays scannable during the scanning process.
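
On the overlapping-requests question, one possible form of "combined
scanning" is to coalesce the requested physical ranges before scanning,
so that no page is read twice. The sketch below only illustrates that
idea; merge_scan_requests is a hypothetical helper, ranges are assumed
to be half-open [start, end) physical addresses, and nothing here is a
proposed kernel API.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [start, end) physical address range


def merge_scan_requests(requests: List[Range]) -> List[Range]:
    """Coalesce overlapping or adjacent ranges into a sorted minimal list."""
    merged: List[Range] = []
    for start, end in sorted(requests):
        if start >= end:
            continue                      # ignore empty or inverted ranges
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Two applications request overlapping regions; a third is disjoint.
combined = merge_scan_requests([
    (0x2000, 0x5000),
    (0x1000, 0x3000),
    (0x8000, 0x9000),
])
# combined == [(0x1000, 0x5000), (0x8000, 0x9000)]
```

Even with such coalescing, the API would still have to decide how to
report results back to each requester, which this sketch does not
address.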

We prefer the in-kernel background approach for its simplicity, but are
open to all opinions from the upstream community.

[1] https://lore.kernel.org/linux-mm/20220425163451.3818838-1-juew@google.com
[2] https://developer.amd.com/wordpress/media/2012/10/325591.pdf
[3] https://community.intel.com/t5/Server-Products/Uncorrectable-Memory-Error-amp-Patrol-Scrub/td-p/545123
[4] https://www.amd.com/system/files/TechDocs/24594.pdf, page 285
[5] https://developer.arm.com/documentation/den0024/a/The-A64-instruction-set/Memory-access-instructions/Non-temporal-load-and-store-pair
[6] https://lore.kernel.org/kvm/20200311171422.10484-1-david@redhat.com
[7] https://lore.kernel.org/linux-mm/20220718172159.4vwjzrfthelovcty@black.fi.intel.com/t/
[8] https://lore.kernel.org/linux-mm/20220706082016.2603916-1-chao.p.peng@linux.intel.com
[9] https://www.kernel.org/doc/Documentation/vm/hwpoison.rst

-- 
2.38.1.273.g43a17bfeac-goog