From: David Rientjes <rientjes@google.com>
To: Jiaqi Yan <jiaqiyan@google.com>,
	"Ghannam, Yazen" <Yazen.Ghannam@amd.com>
Cc: naoya.horiguchi@nec.com, tony.luck@intel.com,
	dave.hansen@linux.intel.com,  david@redhat.com,
	erdemaktas@google.com, pgonda@google.com,  duenwen@google.com,
	Vilas.Sridharan@amd.com, mike.malvestuto@intel.com,
	 gthelen@google.com, linux-mm@kvack.org, jthoughton@google.com
Subject: Re: [RFC] Kernel Support of Memory Error Detection.
Date: Tue, 29 Nov 2022 21:31:15 -0800 (PST)
Message-ID: <6bb93638-5702-076c-b72a-f33b39f35842@google.com>
In-Reply-To: <20221103155029.2451105-1-jiaqiyan@google.com>

On Thu, 3 Nov 2022, Jiaqi Yan wrote:

> This RFC is a followup for [1]. We’d like to first revisit the problem
> statement, then explain the motivation for kernel support of memory
> error detection. We attempt to answer two key questions raised in the
> initial memory-scanning based solution: what memory to scan and how the
> scanner should be designed. Different from what [1] originally proposed,
> we think a kernel-driven design similar to khugepaged/kcompactd would
> work better than the userspace-driven design.
> 

Lots of great discussion in this thread; thanks, Jiaqi, for a very detailed 
overview of what we are trying to address and the multiple options that 
we can consider.

I think this thread has been a very useful starting point for us to 
discuss what should comprise the first patchset.  I haven't seen any 
objections to enlightening the kernel for this support, but any additional 
feedback would indeed be useful.

Let me suggest a possible way forward: if we can agree on a kernel-driven 
approach whose design allows it to be extended for future use cases, then 
it should be possible to introduce something generally useful that can be 
built upon later if needed.

I can think of a couple of future use cases that may arise and that would 
impact the minimal design you intend to introduce: (1) the ability to 
configure a hardware patrol scrubber, where the platform allows it, as a 
substitute for driving the scanning by a kthread, and (2) the ability to 
scan only certain types of memory rather than all system memory.

Imagining the simplest possible design, I assume we could introduce a
/sys/devices/system/node/nodeN/mcescan/* directory for each NUMA node on
the system.  As a foundation, this can include only a "stat" file which
provides the interface to the memory poison subsystem and describes
detected errors and their resolution (this would be a good starting point).
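
Just to make that concrete, the stat file could read something like the
following (the field names here are purely illustrative, nothing is
settled):

  $ cat /sys/devices/system/node/node0/mcescan/stat
  pages_scanned 1048576
  errors_detected 2
  errors_recovered 2
  errors_unrecovered 0

The exact set of fields would of course be hashed out during review.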

Building on that, and using your reference to khugepaged, we can add 
pages_to_scan and scan_sleep_millisecs files.  This will allow us to 
control scanning on demotion nodes differently.  We'd want the kthread to 
be NUMA aware for the memory it is scanning, so these files would simply 
control when each thread wakes up and how much memory it scans before 
going back to sleep.  The default would be disabled, so no kthreads are 
forked.
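
Roughly what I have in mind for the per-node kthread, modeled on the
khugepaged loop (only a sketch: the mcescan_* helper and the tunables
named below don't exist anywhere, they are placeholders for whatever we
end up calling them):

  #include <linux/kthread.h>
  #include <linux/sched.h>
  #include <linux/jiffies.h>

  /* Sketch only: mcescan_scan_next_page(), pages_to_scan and
   * scan_sleep_millisecs are hypothetical. */
  static int mcescan_kthread(void *data)
  {
          int nid = (long)data;

          while (!kthread_should_stop()) {
                  unsigned long scanned = 0;

                  while (scanned < READ_ONCE(pages_to_scan) &&
                         !kthread_should_stop()) {
                          /*
                           * Read the next node-local page with a
                           * non-temporal access; detected errors go
                           * through the MCE / memory_failure() path.
                           */
                          scanned += mcescan_scan_next_page(nid);
                  }

                  schedule_timeout_interruptible(msecs_to_jiffies(
                                  READ_ONCE(scan_sleep_millisecs)));
          }
          return 0;
  }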

If this needs to be extended later for a hardware patrol scrubber, we'd 
ask CPU vendors to make it configurable on a per-socket basis, to be used 
only with an ACPI capability that would put it under the control of the 
kernel in place of the kthread (so there is a single source of truth for 
the scan configuration).  If that is not possible, we'd decouple the 
software and hardware approaches and configure the HPS through the ACPI 
subsystem independently.

Subsequently, if there is a need to scan only certain types of memory per 
NUMA node, we could introduce a "type" file later under the mcescan 
directory.  The idea would be to specify a bitmask that selects which 
memory types to include in the scan: bits for things such as buddy pages, 
pcp pages, hugetlb pages, etc.
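
Something along these lines, with the bit names invented purely for
illustration:

  /* Hypothetical bits for the per-node mcescan "type" bitmask. */
  #define MCESCAN_TYPE_BUDDY    (1UL << 0)  /* free buddy pages */
  #define MCESCAN_TYPE_PCP      (1UL << 1)  /* per-cpu page lists */
  #define MCESCAN_TYPE_HUGETLB  (1UL << 2)  /* hugetlb pages */

An admin could then, for example, do
"echo 0x5 > /sys/devices/system/node/node0/mcescan/type" to limit the
scan on that node to buddy and hugetlb pages.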

 [ And if userspace, perhaps non-root, wanted to trigger a scan of its own 
   virtual memory, for example, another future extension could allow you 
   to explicitly trigger a scan of the calling process, but this would be 
   done in process context, not by the kthreads. ]

If this is deemed acceptable, the minimal viable patchset would:

 - introduce the per-node mcescan directories

 - introduce a "stat" file that would describe the state of memory errors
   on each NUMA node and their disposition

 - introduce a per-node kthread driven by pages_to_scan and
   scan_sleep_millisecs to do software controlled memory scanning

All future possible use cases could be extended using this later if the 
demand arises.
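
To make the workflow concrete, the admin-facing side of that minimal
patchset could look roughly like this (paths and values are illustrative,
and I'm assuming a non-zero pages_to_scan is what wakes the kthread up):

  # scan 4096 pages on node 0, then sleep for 10 seconds, and repeat
  echo 4096 > /sys/devices/system/node/node0/mcescan/pages_to_scan
  echo 10000 > /sys/devices/system/node/node0/mcescan/scan_sleep_millisecs

  # later, check what the scanner has found and how it was handled
  cat /sys/devices/system/node/node0/mcescan/stat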

Thoughts?  It would be very useful to agree on a path forward since I 
think this would be generally useful for the kernel.

> Problem Statement
> =================
> The ever-increasing DRAM size and cost have brought memory subsystem
> reliability to the forefront of large fleet owners’ concerns. Memory
> errors are among the top hardware failures that cause server and
> workload crashes. Simply deploying extra-reliable DRAM hardware across a
> large-scale computing fleet adds significant cost, e.g., 10% extra cost
> on DRAM can amount to hundreds of millions of dollars.
> 
> Reactive memory poison recovery (MPR), e.g., recovering from MCEs raised
> during an execution context (the kernel mechanisms are MCE handler +
> CONFIG_MEMORY_FAILURE + SIGBUS to the user space process), has been found
> effective in keeping systems resilient from memory errors. However,
> reactive memory poison recovery has several major drawbacks:
> - It requires software systems that access poisoned memory to be
>   specifically designed and implemented to recover from memory errors.
>   Uncorrectable (UC) errors are random and may happen outside of the
>   enlightened address spaces or execution contexts. The added error
>   recovery capability comes at the cost of added complexity, and it is
>   often impossible to enlighten 3rd-party software.
> - In a virtualized environment, the injected MCEs introduce the same
>   challenge to the guest.
> - It only covers MCEs raised by CPU accesses, but the scope of the memory
>   error problem is far beyond that. For example, PCIe devices (e.g. a NIC
>   or GPU) accessing poisoned memory can cause host crashes on certain
>   machine configurations.
> 
> We want to upstream a patch set that proactively scans the memory DIMMs
> at a configurable rate to detect UC memory errors, and attempts to
> recover the detected memory errors. We call it proactive MPR, which
> provides three benefits to tackle the memory error problem:
> - Proactively scanning memory DIMMs reduces the chance of a correctable
>   error becoming uncorrectable.
> - Once detected, UC errors caught in unallocated memory pages are
>   isolated and prevented from being allocated to an application or the OS.
> - The probability of software/hardware products encountering memory
>   errors is reduced, as they are only exposed to memory errors developed
>   over a window of T, where T stands for the period of scrubbing the
>   entire memory space. Any memory errors that occurred more than T ago
>   should have resulted in custom recovery actions. For example, in a cloud
>   environment VMs can be live migrated to another healthy host.
> 
> Some CPU vendors [2, 3] provide a hardware patrol scrubber (HPS) to
> prevent the build-up of memory errors. In comparison, a software memory
> error detector (SW) has the following pros and cons:
> - SW supports adaptive scanning, i.e. it can speed scanning up or down,
>   turn scanning on or off, and yield its own CPU cycles and memory
>   bandwidth. All of this can happen on the fly based on the system
>   workload or the administrator’s choice. HPS doesn’t have this
>   flexibility: its patrol speed is usually only configurable at boot
>   time, and it is not able to take system state into account. (Note: HPS
>   is a memory controller feature and usually doesn’t consume CPU time.)
> - SW can expose controls to scan by memory types, while HPS always scans
>   full system memory. For example, an administrator can use SW to only
>   scan hugetlb memory on the system.
> - SW can scan memory at a finer granularity, for example with a different
>   scan rate per node, or with scanning disabled entirely on some nodes.
>   HPS, however, currently only supports per-host scanning.
> - SW can make scan statistics (e.g. X bytes have been scanned over the
>   last Y seconds and Z memory errors were found) easily visible to
>   datacenter administrators, who can schedule maintenance (e.g. migrating
>   running jobs before repairing DIMMs) accordingly.
> - SW’s functionality is consistent across hardware platforms. HPS’s
>   functionality varies from vendor to vendor. For example, some vendors
>   support shorter scrubbing periods than others, and some vendors may not
>   support memory scrubbing at all.
> - HPS usually doesn’t consume CPU cores but does consume memory
>   controller cycles and memory bandwidth. SW consumes both CPU cycles
>   and memory bandwidth, but this is only a problem if administrators opt
>   into the scanning after weighing the costs and benefits.
> - As CPU cores are not consumed by HPS, there won’t be any cache impact.
>   SW can utilize prefetchnta (for x86) [4] and equivalent hints for other
>   architectures [5] to minimize cache impact (in case of prefetchnta,
>   completely avoiding L1/L2 cache impact).
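
As an aside on the prefetchnta point above: a userspace illustration of
such a non-temporal read could look like the sketch below (kernel code
would use inline asm rather than the SSE intrinsic, and scan_range_nta()
is a name invented purely for this example):

  #include <stddef.h>
  #include <stdint.h>
  #include <xmmintrin.h>  /* _mm_prefetch() */

  /*
   * Touch every cache line of a buffer, using the NTA prefetch hint to
   * keep the cache pollution caused by the scan itself as low as
   * possible.  The demand load is what actually surfaces poison through
   * the MCE / memory_failure() path.
   */
  static void scan_range_nta(const volatile uint8_t *buf, size_t len)
  {
          for (size_t off = 0; off < len; off += 64) {
                  _mm_prefetch((const char *)(buf + off), _MM_HINT_NTA);
                  (void)buf[off];
          }
  }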
> 
> Solution Proposals
> ==================
> 
> What to Scan
> ============
> The initial RFC proposed to scan the **entire system memory**, which
> raised the question of what memory is scannable (i.e. memory accessible
> from kernel direct mapping). We attempt to address this question by
> breaking down the memory types as follows:
> - Static memory types: memory that stays either scannable or unscannable.
>   Well-defined examples are hugetlb vs regular memory, and node-local
>   memory vs far memory (e.g. CXL or PMEM). While most static memory types
>   are scannable, administrators could disable scanning of far memory to
>   avoid interfering with the promotion and demotion logic in memory
>   tiering solutions. (The implementation will allow administrators to
>   disable scanning on scannable memory.)
> - Memory types related to virtualization, including ballooned-away memory
>   and unaccepted memory. Not all balloon implementations are compatible
>   with memory scanning (i.e. reading memory mapped into the direct map)
>   in the guest. For example, with virtio-mem devices [6] in the
>   hypervisor, reading unplugged memory can cause undefined behavior. The
>   same applies to unaccepted memory in confidential VMs [7]. Since memory
>   error detection on the host side already benefits its guests
>   transparently (i.e., without spending extra guest CPU cycles), there is
>   very limited benefit for a guest to scan memory by itself. We recommend
>   disabling memory error detection within the virtualization world.
> - Dynamic memory types: memory that becomes unscannable or scannable
>   dynamically. One significant example is guest private memory backing
>   confidential VMs. At the software level, guest private memory pages
>   become unscannable as they will soon be unmapped from the kernel direct
>   mapping [8]. Scanning guest private memory pages is still possible via
>   IO remapping, with a foreseeable performance cost and proper page fault
>   handling (skip the page) if all means of mapping fail. At the hardware
>   level, the memory access implementation done by hardware vendors today
>   puts memory integrity checking before memory ownership checking, which
>   means memory errors are still surfaced to the OS while scanning. For
>   the scanning scheme to keep working in the future, we need hardware
>   vendors to keep providing similar error detection behavior in their
>   confidential VM hardware. We believe this is a reasonable ask, as their
>   hardware patrol scrubbers also adopt the same scanning scheme and
>   therefore rely on the same promise. Otherwise, we can switch to
>   whatever new scheme the patrol scrubbers use if they break that
>   promise.
> 
> How Scanning is Designed
> ========================
> We can support kernel memory error detection in two styles, depending on
> whether the kernel itself or a userspace application drives the detection.
> 
> In the first style, the kernel itself can create a kthread on each NUMA
> node for scanning node-local memory (i.e. IORESOURCE_SYSTEM_RAM). These
> scanning kthreads are scheduled in a way similar to how khugepaged or
> kcompactd work. Basic configurations of the ever-schedulable background
> kthreads can be exposed to userspace via sysfs, for example, sleeping
> for X milliseconds after scanning Y raw memory pages. Scanning statistics
> can also be visible to userspace via sysfs, for example, number of pages
> actually scanned and number of memory errors found.
> 
> On the other hand, memory error detection can be driven by root userspace
> applications with sufficient support from the kernel. For example,
> a process could scrutinize the physical memory backing its own virtual
> address space on demand. The support needed from the kernel consists of
> the most basic operations: specifically designed memory read accesses
> (e.g. avoiding CPU errata, minimizing cache pollution, and avoiding
> leaking memory contents), plus machine check exception handling and
> memory failure handling [9] when a memory error is detected.
> 
> The pros and cons of in-kernel background scanning are:
> - A simple and independent component for scanning system memory constantly
>   and regularly, which improves the machine fleet’s memory health (e.g.,
>   for hyperscalers, cloud providers, etc).
> - The rest of the OS (both kernel and application) can benefit from it
>   without explicit modifications.
> - The efficiency of this approach is easily configurable by scan rate.
> - It cannot offer an on-the-spot guarantee. There is no good way to
>   prioritize certain chunks of memory.
> - The implementation of this approach needs to deal with the question of
>   whether a memory page is scannable.
> 
> The pros and cons of application driven approach are:
> - An application can scan a specific chunk of memory on the spot, and is
>   able to prioritize scanning on some memory regions or memory types.
> - A memory error detection agent needs to be designed to proactively,
>   constantly and regularly scan the entire memory.
> - A kernel API needs to be designed to give userspace enough power to
>   scan physical memory. For example, memory regions requested by
>   multiple applications may overlap. Should the kernel API support
>   combined scanning?
> - The application is exposed to the question of whether memory is
>   scannable, and needs to deal with the complexity of ensuring memory
>   stays scannable during the scanning process.
> 
> We prefer the in-kernel background approach for its simplicity, but are
> open to all opinions from the upstream community.
> 
> [1] https://lore.kernel.org/linux-mm/20220425163451.3818838-1-juew@google.com
> [2] https://developer.amd.com/wordpress/media/2012/10/325591.pdf
> [3] https://community.intel.com/t5/Server-Products/Uncorrectable-Memory-Error-amp-Patrol-Scrub/td-p/545123
> [4] https://www.amd.com/system/files/TechDocs/24594.pdf, page 285
> [5] https://developer.arm.com/documentation/den0024/a/The-A64-instruction-set/Memory-access-instructions/Non-temporal-load-and-store-pair
> [6] https://lore.kernel.org/kvm/20200311171422.10484-1-david@redhat.com
> [7] https://lore.kernel.org/linux-mm/20220718172159.4vwjzrfthelovcty@black.fi.intel.com/t/
> [8] https://lore.kernel.org/linux-mm/20220706082016.2603916-1-chao.p.peng@linux.intel.com
> [9] https://www.kernel.org/doc/Documentation/vm/hwpoison.rst
> 
> -- 
> 2.38.1.273.g43a17bfeac-goog
> 
> 
