* [RFC] Kernel Support of Memory Error Detection.
@ 2022-11-03 15:50 Jiaqi Yan
  2022-11-03 16:27 ` Luck, Tony
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Jiaqi Yan @ 2022-11-03 15:50 UTC (permalink / raw)
  To: naoya.horiguchi, tony.luck, dave.hansen, david
  Cc: erdemaktas, pgonda, rientjes, duenwen, Vilas.Sridharan,
	mike.malvestuto, gthelen, linux-mm, jiaqiyan, jthoughton

This RFC is a followup for [1]. We’d like to first revisit the problem
statement, then explain the motivation for kernel support of memory
error detection. We attempt to answer two key questions raised in the
initial memory-scanning based solution: what memory to scan and how the
scanner should be designed. Different from what [1] originally proposed,
we think a kernel-driven design similar to khugepaged/kcompactd would
work better than the userspace-driven design.

Problem Statement
=================
The ever-increasing DRAM size and cost have brought memory subsystem
reliability to the forefront of large fleet owners’ concerns. Memory
errors are one of the top hardware failures that cause server and
workload crashes. Simply deploying extra-reliable DRAM hardware to a
large-scale computing fleet adds significant cost, e.g., a 10% extra cost
on DRAM can amount to hundreds of millions of dollars.

Reactive memory poison recovery (MPR), e.g., recovering from MCEs raised
during an execution context (the kernel mechanisms are MCE handler +
CONFIG_MEMORY_FAILURE + SIGBUS to the user space process), has been found
effective in keeping systems resilient to memory errors. However,
reactive memory poison recovery has several major drawbacks:
- It requires software systems that access poisoned memory to
  be specifically designed and implemented to recover from memory errors.
  Uncorrectable (UC) errors are random and may happen outside of the
  enlightened address spaces or execution contexts. The added error
  recovery capability comes at the cost of added complexity, and it is
  often impossible to enlighten 3rd-party software.
- In a virtualized environment, the injected MCEs introduce the same
  challenge to the guest.
- It only covers MCEs raised by CPU accesses, but the scope of the memory
  error problem is far beyond that. For example, on certain machine
  configurations, PCIe devices (e.g. NICs and GPUs) accessing poisoned
  memory cause host crashes.

We want to upstream a patch set that proactively scans the memory DIMMs
at a configurable rate to detect UC memory errors, and attempts to
recover the detected memory errors. We call it proactive MPR, which
provides three benefits to tackle the memory error problem:
- Proactively scanning memory DIMMs reduces the chance of a correctable
  error becoming uncorrectable.
- Once detected, UC errors caught in unallocated memory pages are
  isolated and prevented from being allocated to an application or the OS.
- The probability of software/hardware products encountering memory
  errors is reduced, as they are only exposed to memory errors developed
  over a window of T, where T stands for the period of scrubbing the
  entire memory space. Any memory errors that occurred more than T ago
  should have resulted in custom recovery actions. For example, in a cloud
  environment VMs can be live migrated to another healthy host.

Some CPU vendors [2, 3] provide a hardware patrol scrubber (HPS) to
prevent the build-up of memory errors. In comparison, a software memory
error detector (SW) has the following pros and cons:
- SW supports adaptive scanning, i.e. it can speed up or slow down
  scanning, turn scanning on or off, and yield its own CPU cycles and
  memory bandwidth. All of these can happen on the fly based on the
  system workload status or the administrator’s choice. HPS doesn’t have
  this flexibility. Its patrol speed is usually only configurable at boot
  time, and it is not able to consider system state. (Note: HPS is a
  memory controller feature and usually doesn’t consume CPU time.)
- SW can expose controls to scan by memory types, while HPS always scans
  full system memory. For example, an administrator can use SW to only
  scan hugetlb memory on the system.
- SW can scan memory at a finer granularity, for example, having a
  different scan rate per node, or disabling scanning entirely on some
  nodes. HPS, however, currently only supports per-host scanning.
- SW can make scan statistics (e.g. X bytes have been scanned over the
  last Y seconds and Z memory errors have been found) easily visible to
  datacenter administrators, who can schedule maintenance (e.g. migrating
  running jobs before repairing DIMMs) accordingly.
- SW’s functionality is consistent across hardware platforms. HPS’s
  functionality varies from vendor to vendor. For example, some vendors
  support shorter scrubbing periods than others, and some vendors may not
  support memory scrubbing at all.
- HPS usually doesn’t consume CPU cores but does consume memory
  controller cycles and memory bandwidth. SW consumes both CPU cycles
  and memory bandwidth, but this is only a problem if administrators opt
  into scanning after weighing the costs and benefits.
- As CPU cores are not consumed by HPS, there won’t be any cache impact.
  SW can utilize prefetchnta (for x86) [4] and equivalent hints for other
  architectures [5] to minimize cache impact (in case of prefetchnta,
  completely avoiding L1/L2 cache impact).
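
To illustrate the cache-impact point above, here is a minimal sketch
(not from an actual patch set) of how a scanner could touch one
direct-mapped page with prefetchnta; the function name and the 64-byte
cache-line stride are assumptions:

#include <linux/mm.h>

/*
 * Illustrative sketch: read every cache line of a direct-mapped page,
 * issuing prefetchnta first so the scan minimizes L1/L2 pollution.
 * A machine check raised by the load goes through the normal MCE path.
 */
static void scan_page_nontemporal(const void *kaddr)
{
	const volatile char *p = kaddr;
	size_t off;

	for (off = 0; off < PAGE_SIZE; off += 64) {
		asm volatile("prefetchnta (%0)" : : "r"(p + off));
		(void)p[off];	/* the access that actually detects poison */
	}
}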

Solution Proposals
==================

What to Scan
============
The initial RFC proposed to scan the **entire system memory**, which
raised the question of what memory is scannable (i.e. memory accessible
from kernel direct mapping). We attempt to address this question by
breaking down the memory types as follows:
- Static memory types: memory that stays either scannable or unscannable.
  Well-defined examples are hugetlb vs regular memory, and node-local
  memory vs far memory (e.g. CXL or PMEM). While most static memory types
  are scannable, administrators could disable scanning of far memory to
  avoid interfering with the promotion and demotion logic in memory
  tiering solutions. (The implementation will allow administrators to
  disable scanning on scannable memory.)
- Memory types related to virtualization, including ballooned-away memory
  and unaccepted memory. Not all balloon implementations are compatible
  with memory scanning (i.e. reading memory mapped into the direct map)
  in the guest. For example, with virtio-mem devices [6] in the
  hypervisor, reading unplugged memory can cause undefined behavior. The
  same applies to unaccepted memory in confidential VMs [7]. Since memory
  error detection on the host side already benefits its guests
  transparently (i.e., spending no extra guest CPU cycles), there is very
  limited benefit for a guest to scan memory by itself. We recommend
  disabling memory error detection within the virtualization world.
- Dynamic memory types: memory that turns unscannable or scannable
  dynamically. One significant example is guest private memory backing
  confidential VMs. At the software level, guest private memory pages
  become unscannable as they will soon be unmapped from the kernel direct
  mapping [8]. Scanning guest private memory pages is still possible via
  IO remapping, with a foreseeable performance cost and proper page fault
  handling (skip the page) if all means of mapping fail. At the hardware
  level, the memory access implementation done by hardware vendors today
  puts memory integrity checking before memory ownership checking, which
  means memory errors are still surfaced to the OS while scanning. For
  the scanning scheme to keep working in the future, we need the hardware
  vendors to keep providing similar error detection behavior in their
  confidential VM hardware. We believe this is a reasonable ask, since
  their hardware patrol scrubbers adopt the same scanning scheme and
  therefore rely on the same guarantee. Otherwise, we can switch to
  whatever new scheme the patrol scrubbers use if that guarantee is ever
  dropped.
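
As an illustration of the scannability question above, the per-pfn
gating could look roughly like the sketch below. The function name and
the exact set of checks are assumptions; a real patch set would also
honor per-node/per-type administrator settings and the dynamic cases
discussed above:

#include <linux/mm.h>
#include <linux/page-flags.h>

/* Sketch: can this pfn be read through the kernel direct mapping now? */
static bool pfn_is_scannable(unsigned long pfn)
{
	struct page *page;

	if (!pfn_valid(pfn))
		return false;

	page = pfn_to_page(pfn);

	/* Ballooned-away or unaccepted memory (e.g. virtio-mem) is offline. */
	if (PageOffline(page))
		return false;

	/* Pages already known to be poisoned need no re-scan. */
	if (PageHWPoison(page))
		return false;

	return true;
}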

How Scanning is Designed
========================
We can support kernel memory error detection in two styles, depending on
whether the kernel itself or a userspace application drives the detection.

In the first style, the kernel itself can create a kthread on each NUMA
node for scanning node-local memory (i.e. IORESOURCE_SYSTEM_RAM). These
scanning kthreads are scheduled in a way similar to how khugepaged or
kcompactd works. Basic configuration of the ever-schedulable background
kthreads can be exposed to userspace via sysfs, for example, sleeping
for X milliseconds after scanning Y raw memory pages. Scanning statistics
can also be made visible to userspace via sysfs, for example, the number
of pages actually scanned and the number of memory errors found.
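
A rough sketch of such a per-node kthread's main loop is below, reusing
the helpers sketched earlier. The tunable names (scan_pages,
scan_sleep_millisecs, pages_scanned) are assumptions modeled on
khugepaged's sysfs knobs, not a proposed ABI:

#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/mm.h>

static unsigned int scan_pages = 512;		/* sysfs tunable */
static unsigned int scan_sleep_millisecs = 100;	/* sysfs tunable */
static unsigned long pages_scanned;		/* sysfs statistic */

/* One instance per NUMA node, walking that node's system RAM. */
static int kmemscand(void *arg)
{
	int nid = (long)arg;
	unsigned long pfn = node_start_pfn(nid);

	while (!kthread_should_stop()) {
		unsigned int i;

		for (i = 0; i < READ_ONCE(scan_pages); i++, pfn++) {
			if (pfn >= node_end_pfn(nid))
				pfn = node_start_pfn(nid);	/* wrap around */
			if (!pfn_is_scannable(pfn))
				continue;
			scan_page_nontemporal(page_to_virt(pfn_to_page(pfn)));
			pages_scanned++;
		}
		msleep_interruptible(READ_ONCE(scan_sleep_millisecs));
	}
	return 0;
}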

On the other hand, memory error detection can be driven by root userspace
applications with sufficient support from the kernel. For example,
a process can scrutinize the physical memory under its own virtual memory
space on demand. The support from the kernel consists of the most basic
operations: a specifically designed memory read access (e.g. avoiding CPU
errata, minimizing cache pollution, and not leaking memory contents), plus
machine check exception handling and memory failure handling [9] when a
memory error is detected.
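
Purely as a thought experiment, a userspace-driven interface might look
like the sketch below. The device node /dev/memscan, the request
structure and the MEMSCAN_IOC_SCAN ioctl are hypothetical; nothing like
this exists today:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/ioctl.h>

struct memscan_request {		/* hypothetical ABI */
	unsigned long start;		/* page-aligned virtual address */
	unsigned long length;		/* bytes to scan */
	unsigned long errors;		/* out: number of UC errors found */
};
#define MEMSCAN_IOC_SCAN _IOWR('M', 1, struct memscan_request)

int main(void)
{
	size_t len = 2UL << 20;
	int fd = open("/dev/memscan", O_RDWR);	/* hypothetical device */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct memscan_request req = {
		.start  = (unsigned long)buf,
		.length = len,
	};

	if (fd < 0 || buf == MAP_FAILED)
		return 1;
	if (ioctl(fd, MEMSCAN_IOC_SCAN, &req) == 0)
		printf("scanned %lu bytes, found %lu UC errors\n",
		       req.length, req.errors);
	close(fd);
	return 0;
}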

The pros and cons of in-kernel background scanning are:
- A simple and independent component for scanning system memory constantly
  and regularly, which improves the machine fleet’s memory health (e.g.,
  for hyperscalers, cloud providers, etc).
- The rest of the OS (both kernel and application) can benefit from it
  without explicit modifications.
- The efficiency of this approach is easily configurable by scan rate.
- It cannot offer an on-the-spot guarantee. There is no good way to
  prioritize certain chunks of memory.
- The implementation of this approach needs to deal with the question of
  whether a memory page is scannable.

The pros and cons of application driven approach are:
- An application can scan a specific chunk of memory on the spot, and is
  able to prioritize scanning on some memory regions or memory types.
- A memory error detection agent needs to be designed to proactively,
  constantly and regularly scan the entire memory.
- A kernel API needs to be designed to give userspace enough power to
  scan physical memory. For example, memory regions requested by
  multiple applications may overlap; should the kernel API support
  combined scanning?
- The application is exposed to the question of whether memory is
  scannable, and needs to deal with the complexity of ensuring memory
  stays scannable during the scanning process.

We prefer the in-kernel background approach for its simplicity, but are
open to all opinions from the upstream community.

[1] https://lore.kernel.org/linux-mm/20220425163451.3818838-1-juew@google.com
[2] https://developer.amd.com/wordpress/media/2012/10/325591.pdf
[3] https://community.intel.com/t5/Server-Products/Uncorrectable-Memory-Error-amp-Patrol-Scrub/td-p/545123
[4] https://www.amd.com/system/files/TechDocs/24594.pdf, page 285
[5] https://developer.arm.com/documentation/den0024/a/The-A64-instruction-set/Memory-access-instructions/Non-temporal-load-and-store-pair
[6] https://lore.kernel.org/kvm/20200311171422.10484-1-david@redhat.com
[7] https://lore.kernel.org/linux-mm/20220718172159.4vwjzrfthelovcty@black.fi.intel.com/t/
[8] https://lore.kernel.org/linux-mm/20220706082016.2603916-1-chao.p.peng@linux.intel.com
[9] https://www.kernel.org/doc/Documentation/vm/hwpoison.rst

-- 
2.38.1.273.g43a17bfeac-goog



^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC] Kernel Support of Memory Error Detection.
  2022-11-03 15:50 [RFC] Kernel Support of Memory Error Detection Jiaqi Yan
@ 2022-11-03 16:27 ` Luck, Tony
  2022-11-03 16:40   ` Nadav Amit
  2022-11-07 16:59 ` Sridharan, Vilas
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Luck, Tony @ 2022-11-03 16:27 UTC (permalink / raw)
  To: Jiaqi Yan, naoya.horiguchi, dave.hansen, david
  Cc: Aktas, Erdem, pgonda, rientjes, Hsiao, Duen-wen, Vilas.Sridharan,
	Malvestuto, Mike, gthelen, linux-mm, jthoughton

>- HPS usually doesn’t consume CPU cores but does consume memory
>  controller cycles and memory bandwidth. SW consumes both CPU cycles
>  and memory bandwidth, but is only a problem if administrators opt into
>  the scanning after weighing the cost benefit.

Maybe there is a middle ground on platforms that support some s/w programmable
DMA engine that can detect memory errors in a way that doesn't signal a
fatal system error. Your s/w scanner can direct that DMA engine to read from
the regions of memory that you want to scan, at a frequency that is compatible
with your system load requirements and risk assessments.

If your idea gets traction, maybe structure the code so that it can either use
a CPU core to scan a block of memory, or pass requests to a platform driver that
can use a DMA engine to perform the scan.

-Tony



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Kernel Support of Memory Error Detection.
  2022-11-03 16:27 ` Luck, Tony
@ 2022-11-03 16:40   ` Nadav Amit
  2022-11-08  2:24     ` Jiaqi Yan
  0 siblings, 1 reply; 20+ messages in thread
From: Nadav Amit @ 2022-11-03 16:40 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Jiaqi Yan, naoya.horiguchi, dave.hansen, David Hildenbrand,
	Aktas, Erdem, pgonda, rientjes, Hsiao, Duen-wen, Vilas.Sridharan,
	Malvestuto, Mike, gthelen, linux-mm, jthoughton

On Nov 3, 2022, at 9:27 AM, Luck, Tony <tony.luck@intel.com> wrote:

>> - HPS usually doesn’t consume CPU cores but does consume memory
>> controller cycles and memory bandwidth. SW consumes both CPU cycles
>> and memory bandwidth, but is only a problem if administrators opt into
>> the scanning after weighing the cost benefit.
> 
> Maybe there is a middle ground on platforms that support some s/w programmable
> DMA engine that can detect memory errors in a way that doesn't signal a
> fatal system error. Your s/w scanner can direct that DMA engine to read from
> the regions of memory that you want to scan, at a frequency that is compatible
> with your system load requirements and risk assessments.
> 
> If your idea gets traction, maybe structure the code so that it can either use
> a CPU core scan a block of memory, or pass requests to a platform driver that can
> use a DMA engine to perform the scan.

That’s exactly what I was about to write. :)

Quickassist can be perfect for that. The IOMMU can be programmed to make the
memory uncachable.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC] Kernel Support of Memory Error Detection.
  2022-11-03 15:50 [RFC] Kernel Support of Memory Error Detection Jiaqi Yan
  2022-11-03 16:27 ` Luck, Tony
@ 2022-11-07 16:59 ` Sridharan, Vilas
  2022-11-09  5:29 ` HORIGUCHI NAOYA(堀口 直也)
  2022-11-30  5:31 ` David Rientjes
  3 siblings, 0 replies; 20+ messages in thread
From: Sridharan, Vilas @ 2022-11-07 16:59 UTC (permalink / raw)
  To: Jiaqi Yan, naoya.horiguchi, tony.luck, dave.hansen, david
  Cc: erdemaktas, pgonda, rientjes, duenwen, mike.malvestuto, gthelen,
	linux-mm, jthoughton, Ghannam, Yazen

+Yazen from AMD



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Kernel Support of Memory Error Detection.
  2022-11-03 16:40   ` Nadav Amit
@ 2022-11-08  2:24     ` Jiaqi Yan
  2022-11-08 16:17       ` Luck, Tony
  0 siblings, 1 reply; 20+ messages in thread
From: Jiaqi Yan @ 2022-11-08  2:24 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Luck, Tony, naoya.horiguchi, dave.hansen, David Hildenbrand,
	Aktas, Erdem, pgonda, rientjes, Hsiao, Duen-wen, Vilas.Sridharan,
	Malvestuto, Mike, gthelen, linux-mm, jthoughton

On Thu, Nov 3, 2022 at 9:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> On Nov 3, 2022, at 9:27 AM, Luck, Tony <tony.luck@intel.com> wrote:
>
> >> - HPS usually doesn’t consume CPU cores but does consume memory
> >> controller cycles and memory bandwidth. SW consumes both CPU cycles
> >> and memory bandwidth, but is only a problem if administrators opt into
> >> the scanning after weighing the cost benefit.
> >
> > Maybe there is a middle ground on platforms that support some s/w programmable
> > DMA engine that can detect memory errors in a way that doesn't signal a
> > fatal system error. Your s/w scanner can direct that DMA engine to read from
> > the regions of memory that you want to scan, at a frequency that is compatible
> > with your system load requirements and risk assessments.
> >
> > If your idea gets traction, maybe structure the code so that it can either use
> > a CPU core scan a block of memory, or pass requests to a platform driver that can
> > use a DMA engine to perform the scan.
>
> That’s exactly what I was about the write. :)
>
> Quickassist can be perfect for that. The IOMMU can be programmed to make the
> memory uncachable.
>

Agreed, the kernel code will abstract away the part that does the
actual memory scanning behind an internal "API",
so that we can plug in different scanners, e.g. CPU, DMA device.

If it is feasible in the future for hardware vendors to make the patrol
scrubber programmable, we can even direct the scanning to the patrol
scrubber.
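
A minimal sketch of that internal abstraction, with all names being
assumptions rather than an actual in-tree interface:

#include <linux/types.h>

struct memscan_ops {
	const char *name;
	/* Scan [start, start + len); return 0, or -EIO if poison was found. */
	int (*scan_range)(phys_addr_t start, size_t len);
};

/* Default backend: walk the range with the CPU (e.g. a prefetchnta loop). */
static int cpu_scan_range(phys_addr_t start, size_t len)
{
	return 0;	/* body elided in this sketch */
}

static const struct memscan_ops cpu_scan_ops = {
	.name		= "cpu",
	.scan_range	= cpu_scan_range,
};

static const struct memscan_ops *active_ops = &cpu_scan_ops;

/* A platform driver wrapping a DMA engine or patrol scrubber could call this. */
void memscan_set_backend(const struct memscan_ops *ops)
{
	active_ops = ops ? ops : &cpu_scan_ops;
}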


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC] Kernel Support of Memory Error Detection.
  2022-11-08  2:24     ` Jiaqi Yan
@ 2022-11-08 16:17       ` Luck, Tony
  2022-11-09  5:04         ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 20+ messages in thread
From: Luck, Tony @ 2022-11-08 16:17 UTC (permalink / raw)
  To: Jiaqi Yan, Nadav Amit
  Cc: naoya.horiguchi, dave.hansen, David Hildenbrand, Aktas, Erdem,
	pgonda, rientjes, Hsiao, Duen-wen, Vilas.Sridharan, Malvestuto,
	Mike, gthelen, linux-mm, jthoughton

> If it is feasible in future that hardware vendors can make patrol
> scrubber programmable, we can even direct the scanning to patrol
> scrubber.

There was an attempt to create an ACPI interface for this. I don't know if it made
it into the standard. I didn't do anything with it for Linux because the interface was
quite complex.

From a h/w perspective it might always be complex. Consecutive system physical
addresses are generally interleaved across multiple memory controllers, channels,
DIMMs and ranks, while patrol scrubbing may be done by each memory controller
at the channel level.

So a simple request to scan a few megabytes of system physical address would
require address translation to figure out the channel addresses on each of the
memory controllers and programming each to scan the pieces they contribute to
the target range.

-Tony

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Kernel Support of Memory Error Detection.
  2022-11-08 16:17       ` Luck, Tony
@ 2022-11-09  5:04         ` HORIGUCHI NAOYA(堀口 直也)
  2022-11-10 20:23           ` Jiaqi Yan
  2022-11-18  1:19           ` Jiaqi Yan
  0 siblings, 2 replies; 20+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-11-09  5:04 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Jiaqi Yan, Nadav Amit, dave.hansen, David Hildenbrand, Aktas,
	Erdem, pgonda, rientjes, Hsiao, Duen-wen, Vilas.Sridharan,
	Malvestuto, Mike, gthelen, linux-mm, jthoughton

On Tue, Nov 08, 2022 at 04:17:06PM +0000, Luck, Tony wrote:
> > If it is feasible in future that hardware vendors can make patrol
> > scrubber programmable, we can even direct the scanning to patrol
> > scrubber.
> 
> There was an attempt to create an ACPI interface for this. I don't know if it made
> it into the standard.

I briefly checked the latest ACPI spec, and it seems that some interfaces
to control (h/w based) patrol scrubbing are defined.

https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#acpi-ras-feature-table-rasf

> I didn't do anything with it for Linux because the interface was
> quite complex.
> 
> From a h/w perspective it might always be complex. Consecutive system physical
> addresses are generally interleaved across multiple memory controllers, channels,
> DIMMs and ranks. While patrol scrubbing may be done by each memory controller
> at the channel level.
> 
> So a simple request to scan a few megabytes of system physical address would
> require address translation to figure out the channel addresses on each of the
> memory controllers and programming each to scan the pieces they contribute to
> the target range.

I expect that the physical address visible to the kernel is transparently
translated to the real address, i.e., which DIMM on which channel.

- Naoya Horiguchi

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Kernel Support of Memory Error Detection.
  2022-11-03 15:50 [RFC] Kernel Support of Memory Error Detection Jiaqi Yan
  2022-11-03 16:27 ` Luck, Tony
  2022-11-07 16:59 ` Sridharan, Vilas
@ 2022-11-09  5:29 ` HORIGUCHI NAOYA(堀口 直也)
  2022-11-09 16:15   ` Luck, Tony
  2022-11-10 20:23   ` Jiaqi Yan
  2022-11-30  5:31 ` David Rientjes
  3 siblings, 2 replies; 20+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-11-09  5:29 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: tony.luck, dave.hansen, david, erdemaktas, pgonda, rientjes,
	duenwen, Vilas.Sridharan, mike.malvestuto, gthelen, linux-mm,
	jthoughton, Ghannam, Yazen

On Thu, Nov 03, 2022 at 03:50:29PM +0000, Jiaqi Yan wrote:
> This RFC is a followup for [1]. We’d like to first revisit the problem
> statement, then explain the motivation for kernel support of memory
> error detection. We attempt to answer two key questions raised in the
> initial memory-scanning based solution: what memory to scan and how the
> scanner should be designed. Different from what [1] originally proposed,
> we think a kernel-driven design similar to khugepaged/kcompactd would
> work better than the userspace-driven design.
> 
> Problem Statement
> =================
> The ever increasing DRAM size and cost has brought the memory subsystem
> reliability to the forefront of large fleet owners’ concern. Memory
> errors are one of the top hardware failures that cause server and
> workload crashes. Simply deploying extra-reliable DRAM hardware to a
> large-scale computing fleet adds significant cost, e.g., 10% extra cost
> on DRAM can amount to hundreds of millions of dollars.
> 
> Reactive memory poison recovery (MPR), e.g., recovering from MCEs raised
> during an execution context (the kernel mechanisms are MCE handler +
> CONFIG_MEMORY_FAILURE + SIGBUS to the user space process), has been found
> effective in keeping systems resilient from memory errors. However,
> reactive memory poison recovery has several major drawbacks:
> - It requires software systems that access poisoned memory to
>   be specifically designed and implemented to recover from memory errors.
>   Uncorrectable (UC) errors are random, which may happen outside of the
>   enlightened address spaces or execution contexts. The added error
>   recovery capability comes at the cost of added complexity and often
>   impossible to enlighten in 3rd party software.
> - In a virtualized environment, the injected MCEs introduce the same
>   challenge to the guest.
> - It only covers MCEs raised by CPU accesses, but the scope of memory
>   error issue is far beyond that. For example, PCIe devices (e.g. NIC and
>   GPU) accessing poisoned memory cause host crashes when
>   on certain machine configs.
> 
> We want to upstream a patch set that proactively scans the memory DIMMs
> at a configurable rate to detect UC memory errors, and attempts to
> recover the detected memory errors. We call it proactive MPR, which
> provides three benefits to tackle the memory error problem:
> - Proactively scanning memory DIMMs reduces the chance of a correctable
>   error becoming uncorrectable.
> - Once detected, UC errors caught in unallocated memory pages are
>   isolated and prevented from being allocated to an application or the OS.
> - The probability of software/hardware products encountering memory
>   errors is reduced, as they are only exposed to memory errors developed
>   over a window of T, where T stands for the period of scrubbing the
>   entire memory space. Any memory errors that occurred more than T ago
>   should have resulted in custom recovery actions. For example, in a cloud
>   environment VMs can be live migrated to another healthy host.
> 
> Some CPU vendors [2, 3] provide hardware patrol scrubber (HPS) to
> prevent the build up of memory errors. In comparison software memory
> error detector (SW) has pros and cons:
> - SW supports adaptive scanning, i.e. speeds up/down scanning, turns
>   on/off scanning, and yields its own CPU cycles and memory bandwidth.
>   All of these can happen on-the-fly based on the system workload status
>   or administrator’s choice. HPS doesn’t have all these flexibilities.
>   Its patrol speed is usually only configurable at boot time, and it is
>   not able to consider system state. (Note: HPS is a memory controller
>   feature and usually doesn’t consume CPU time).
> - SW can expose controls to scan by memory types, while HPS always scans
>   full system memory. For example, an administrator can use SW to only
>   scan hugetlb memory on the system.
> - SW can scan memory at a finer granularity, for example, having different
>   scan rate per node, or entirely disabled on some node. HPS, however,
>   currently only supports per host scanning.
> - SW can make scan statistics (e.g. X bytes has been scanned for the
>   last Y seconds and Z memory errors are found) easily visible to
>   datacenter administrators, who can schedule maintenance (e.g. migrating
>   running jobs before repairing DIMMs) accordingly.

I think that exposing memory error info in the system to userspace is
useful independent of the new scanner.

> - SW’s functionality is consistent across hardware platforms. HPS’s
>   functionality varies from vendor to vendor. For example, some vendors
>   support shorter scrubbing periods than others, and some vendors may not
>   support memory scrubbing at all.
> - HPS usually doesn’t consume CPU cores but does consume memory
>   controller cycles and memory bandwidth. SW consumes both CPU cycles
>   and memory bandwidth, but is only a problem if administrators opt into
>   the scanning after weighing the cost benefit.
> - As CPU cores are not consumed by HPS, there won’t be any cache impact.
>   SW can utilize prefetchnta (for x86) [4] and equivalent hints for other
>   architectures [5] to minimize cache impact (in case of prefetchnta,
>   completely avoiding L1/L2 cache impact).
> 
> Solution Proposals
> ==================
> 
> What to Scan
> ============
> The initial RFC proposed to scan the **entire system memory**, which
> raised the question of what memory is scannable (i.e. memory accessible
> from kernel direct mapping). We attempt to address this question by
> breaking down the memory types as follows:
> - Static memory types: memory that either stays scannable or unscannable.
>   Well defined examples are hugetlb vs regular memory, node-local memory
>   vs far memory (e.g. CXL or PMEM). While most static memory types are
>   scannable, administrators could disable scanning far memory to avoid
>   messing with the promotion and demotion logic in memory tiring
>   solutions. (The implementation will allow administrators to disable
>   scanning on scannable memory).

I think that another viewpoint of how we prioritize memory types to scan
is kernel vs userspace memory. The current hwpoison mechanism does little
to recover from errors in kernel pages (slab, reserved), so there seems
little benefit to detecting such errors proactively and beforehand. If
the resources for scanning are limited, the user might think of focusing
on scanning userspace memory.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC] Kernel Support of Memory Error Detection.
  2022-11-09  5:29 ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-11-09 16:15   ` Luck, Tony
  2022-11-10 20:25     ` Jiaqi Yan
  2022-11-10 20:23   ` Jiaqi Yan
  1 sibling, 1 reply; 20+ messages in thread
From: Luck, Tony @ 2022-11-09 16:15 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也), Jiaqi Yan
  Cc: dave.hansen, david, Aktas, Erdem, pgonda, rientjes, Hsiao,
	Duen-wen, Vilas.Sridharan, Malvestuto, Mike, gthelen, linux-mm,
	jthoughton, Ghannam, Yazen

> I think that another viewpoint of how we prioritize memory type to scan
> is kernel vs userspace memory. Current hwpoison mechanism does little to
> recover from errors in kernel pages (slab, reserved), so there seesm
> little benefit to detect such errors proactively and beforehand.  If the
> resource for scanning is limited, the user might think of focusing on
> scanning userspace memory.

Page cache is (in many use cases) a large user of kernel memory, and there
would be options for recovery if errors were pre-emptively found: clean page ->
re-read from storage, modified page -> mark in some way to force EIO for read()
and fail(?) mmap().

-Tony

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Kernel Support of Memory Error Detection.
  2022-11-09  5:04         ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-11-10 20:23           ` Jiaqi Yan
  2022-11-18  1:19           ` Jiaqi Yan
  1 sibling, 0 replies; 20+ messages in thread
From: Jiaqi Yan @ 2022-11-10 20:23 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Luck, Tony, Nadav Amit, dave.hansen, David Hildenbrand, Aktas,
	Erdem, pgonda, rientjes, Hsiao, Duen-wen, Vilas.Sridharan,
	Malvestuto, Mike, gthelen, linux-mm, jthoughton

On Tue, Nov 8, 2022 at 9:04 PM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Tue, Nov 08, 2022 at 04:17:06PM +0000, Luck, Tony wrote:
> > > If it is feasible in future that hardware vendors can make patrol
> > > scrubber programmable, we can even direct the scanning to patrol
> > > scrubber.
> >
> > There was an attempt to create an ACPI interface for this. I don't know if it made
> > it into the standard.
>
> I briefly checked the latest ACPI spec, and it seems that some interfaces
> to control (h/w based) patrol scrubbing are defined.
>
> https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#acpi-ras-feature-table-rasf

Thanks for the link!
Once the "how to scan" part reaches consensus, we will make sure the
concrete API/implementation is compatible with, and able to direct
scanning to, the patrol scrubber.

>
> > I didn't do anything with it for Linux because the interface was
> > quite complex.
> >
> > From a h/w perspective it might always be complex. Consecutive system physical
> > addresses are generally interleaved across multiple memory controllers, channels,
> > DIMMs and ranks. While patrol scrubbing may be done by each memory controller
> > at the channel level.
> >
> > So a simple request to scan a few megabytes of system physical address would
> > require address translation to figure out the channel addresses on each of the
> > memory controllers and programming each to scan the pieces they contribute to
> > the target range.
>
> I expect that the physical address visible to the kernel is transparently
> translated to the real address in which DIMM in which channel.
>
> - Naoya Horiguchi


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Kernel Support of Memory Error Detection.
  2022-11-09  5:29 ` HORIGUCHI NAOYA(堀口 直也)
  2022-11-09 16:15   ` Luck, Tony
@ 2022-11-10 20:23   ` Jiaqi Yan
  1 sibling, 0 replies; 20+ messages in thread
From: Jiaqi Yan @ 2022-11-10 20:23 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: tony.luck, dave.hansen, david, erdemaktas, pgonda, rientjes,
	duenwen, Vilas.Sridharan, mike.malvestuto, gthelen, linux-mm,
	jthoughton, Ghannam, Yazen

On Tue, Nov 8, 2022 at 9:29 PM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Thu, Nov 03, 2022 at 03:50:29PM +0000, Jiaqi Yan wrote:
> > This RFC is a followup for [1]. We’d like to first revisit the problem
> > statement, then explain the motivation for kernel support of memory
> > error detection. We attempt to answer two key questions raised in the
> > initial memory-scanning based solution: what memory to scan and how the
> > scanner should be designed. Different from what [1] originally proposed,
> > we think a kernel-driven design similar to khugepaged/kcompactd would
> > work better than the userspace-driven design.
> >
> > Problem Statement
> > =================
> > The ever increasing DRAM size and cost has brought the memory subsystem
> > reliability to the forefront of large fleet owners’ concern. Memory
> > errors are one of the top hardware failures that cause server and
> > workload crashes. Simply deploying extra-reliable DRAM hardware to a
> > large-scale computing fleet adds significant cost, e.g., 10% extra cost
> > on DRAM can amount to hundreds of millions of dollars.
> >
> > Reactive memory poison recovery (MPR), e.g., recovering from MCEs raised
> > during an execution context (the kernel mechanisms are MCE handler +
> > CONFIG_MEMORY_FAILURE + SIGBUS to the user space process), has been found
> > effective in keeping systems resilient from memory errors. However,
> > reactive memory poison recovery has several major drawbacks:
> > - It requires software systems that access poisoned memory to
> >   be specifically designed and implemented to recover from memory errors.
> >   Uncorrectable (UC) errors are random, which may happen outside of the
> >   enlightened address spaces or execution contexts. The added error
> >   recovery capability comes at the cost of added complexity and often
> >   impossible to enlighten in 3rd party software.
> > - In a virtualized environment, the injected MCEs introduce the same
> >   challenge to the guest.
> > - It only covers MCEs raised by CPU accesses, but the scope of memory
> >   error issue is far beyond that. For example, PCIe devices (e.g. NIC and
> >   GPU) accessing poisoned memory cause host crashes when
> >   on certain machine configs.
> >
> > We want to upstream a patch set that proactively scans the memory DIMMs
> > at a configurable rate to detect UC memory errors, and attempts to
> > recover the detected memory errors. We call it proactive MPR, which
> > provides three benefits to tackle the memory error problem:
> > - Proactively scanning memory DIMMs reduces the chance of a correctable
> >   error becoming uncorrectable.
> > - Once detected, UC errors caught in unallocated memory pages are
> >   isolated and prevented from being allocated to an application or the OS.
> > - The probability of software/hardware products encountering memory
> >   errors is reduced, as they are only exposed to memory errors developed
> >   over a window of T, where T stands for the period of scrubbing the
> >   entire memory space. Any memory errors that occurred more than T ago
> >   should have resulted in custom recovery actions. For example, in a cloud
> >   environment VMs can be live migrated to another healthy host.
> >
> > Some CPU vendors [2, 3] provide hardware patrol scrubber (HPS) to
> > prevent the build up of memory errors. In comparison software memory
> > error detector (SW) has pros and cons:
> > - SW supports adaptive scanning, i.e. speeds up/down scanning, turns
> >   on/off scanning, and yields its own CPU cycles and memory bandwidth.
> >   All of these can happen on-the-fly based on the system workload status
> >   or administrator’s choice. HPS doesn’t have all these flexibilities.
> >   Its patrol speed is usually only configurable at boot time, and it is
> >   not able to consider system state. (Note: HPS is a memory controller
> >   feature and usually doesn’t consume CPU time).
> > - SW can expose controls to scan by memory types, while HPS always scans
> >   full system memory. For example, an administrator can use SW to only
> >   scan hugetlb memory on the system.
> > - SW can scan memory at a finer granularity, for example, having different
> >   scan rate per node, or entirely disabled on some node. HPS, however,
> >   currently only supports per host scanning.
> > - SW can make scan statistics (e.g. X bytes has been scanned for the
> >   last Y seconds and Z memory errors are found) easily visible to
> >   datacenter administrators, who can schedule maintenance (e.g. migrating
> >   running jobs before repairing DIMMs) accordingly.
>
> I think that exposing memory error info in the system to usespace is
> useful independent of the new scanner.

Agreed. The error info exposure interface is independently useful.
If we had the interface today, it would probably only have data when an
access to a memory error happens and is recovered.
When the scanner is running on a machine and detecting memory errors,
the interface becomes more meaningful because it has more data to expose.
>
> > - SW’s functionality is consistent across hardware platforms. HPS’s
> >   functionality varies from vendor to vendor. For example, some vendors
> >   support shorter scrubbing periods than others, and some vendors may not
> >   support memory scrubbing at all.
> > - HPS usually doesn’t consume CPU cores but does consume memory
> >   controller cycles and memory bandwidth. SW consumes both CPU cycles
> >   and memory bandwidth, but is only a problem if administrators opt into
> >   the scanning after weighing the cost benefit.
> > - As CPU cores are not consumed by HPS, there won’t be any cache impact.
> >   SW can utilize prefetchnta (for x86) [4] and equivalent hints for other
> >   architectures [5] to minimize cache impact (in case of prefetchnta,
> >   completely avoiding L1/L2 cache impact).
> >
> > Solution Proposals
> > ==================
> >
> > What to Scan
> > ============
> > The initial RFC proposed to scan the **entire system memory**, which
> > raised the question of what memory is scannable (i.e. memory accessible
> > from kernel direct mapping). We attempt to address this question by
> > breaking down the memory types as follows:
> > - Static memory types: memory that either stays scannable or unscannable.
> >   Well defined examples are hugetlb vs regular memory, node-local memory
> >   vs far memory (e.g. CXL or PMEM). While most static memory types are
> >   scannable, administrators could disable scanning far memory to avoid
> >   messing with the promotion and demotion logic in memory tiring
> >   solutions. (The implementation will allow administrators to disable
> >   scanning on scannable memory).
>
> I think that another viewpoint of how we prioritize memory type to scan
> is kernel vs userspace memory. Current hwpoison mechanism does little to
> recover from errors in kernel pages (slab, reserved), so there seesm
> little benefit to detect such errors proactively and beforehand.  If the
> resource for scanning is limited, the user might think of focusing on
> scanning userspace memory.

I definitely agree that scanning userspace is important, but I want to
argue that scanning kernel memory is also necessary.
Memory error found in userspace => (almost) never causes a panic.
Memory error found in kernel space:
- For allocated pages => little recovery possible, no better than not
scanning.
- For free pages => taken off the buddy allocator to prevent future
usage, better than not scanning.
(The scanner is going to access the memory without reading its content,
and properly fix up kernel accesses to memory errors using EXTABLE.)
Overall, scanning kernel memory proactively improves things.
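
For reference, a sketch of the kind of fixed-up access mentioned above,
built on the existing copy_mc_to_kernel() helper (which carries extable
fixups on x86); the function name and 64-byte chunking are assumptions:

#include <linux/mm.h>
#include <linux/uaccess.h>

/* Sketch: probe one page with a machine-check-safe copy. */
static bool page_has_uc_error(struct page *page)
{
	char sink[64];
	const char *src = page_to_virt(page);
	size_t off;

	for (off = 0; off < PAGE_SIZE; off += sizeof(sink)) {
		/* Nonzero return means the copy faulted on poisoned memory. */
		if (copy_mc_to_kernel(sink, src + off, sizeof(sink)))
			return true;
	}
	return false;
}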

>
> Thanks,
> Naoya Horiguchi


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Kernel Support of Memory Error Detection.
  2022-11-09 16:15   ` Luck, Tony
@ 2022-11-10 20:25     ` Jiaqi Yan
  0 siblings, 0 replies; 20+ messages in thread
From: Jiaqi Yan @ 2022-11-10 20:25 UTC (permalink / raw)
  To: Luck, Tony, HORIGUCHI NAOYA(堀口 直也)
  Cc: dave.hansen, david, Aktas, Erdem, pgonda, rientjes, Hsiao,
	Duen-wen, Vilas.Sridharan, Malvestuto, Mike, gthelen, linux-mm,
	jthoughton, Ghannam, Yazen, Sean Christopherson

On Wed, Nov 9, 2022 at 8:16 AM Luck, Tony <tony.luck@intel.com> wrote:
>
> > I think that another viewpoint of how we prioritize memory type to scan
> > is kernel vs userspace memory. Current hwpoison mechanism does little to
> > recover from errors in kernel pages (slab, reserved), so there seesm
> > little benefit to detect such errors proactively and beforehand.  If the
> > resource for scanning is limited, the user might think of focusing on
> > scanning userspace memory.
>
> Page cache is (in some many use cases) a large user of kernel memory, and there
> would be options for recovery if errors were pre-emptively found: clean page ->
> re-read from storage, modified page -> mark in some way to force EIO for read()
> and fail(?) mmap().
>
> -Tony

Adding the page cache into the discussion, I would like to separate the
memory scanner from mm's recovery mechanism.

We want to build an agnostic in-kernel scanner that safely detects
memory errors in physical memory
(e.g. for Intel x86, all usable physical pages in e820), ideally without
the need to know the "memory type" (owned by user vs kernel? free vs
allocated? page cache dirty vs clean? owned by virtualization guest vs
host).
After the scanner detects that a PFN has a memory error, it reports to
the memory-failure module, which classifies the type of the memory page
and takes recovery actions accordingly.
(For example, page cache pages will be handled by me_pagecache_dirty/clean;
I believe that's basically what Tony described.)
So the proactive scanner should always improve the kernel's memory
reliability by recovering more error pages, and recovering proactively
(rather than waiting for someone's access).
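
To make that hand-off concrete, a sketch follows; the scanner-side name
is an assumption, while memory_failure() is the existing entry point in
mm/memory-failure.c:

#include <linux/mm.h>
#include <linux/printk.h>

/* Sketch: report a pfn that the scanner found to be poisoned. */
static void memscan_report_poison(unsigned long pfn)
{
	pr_warn("memscan: uncorrectable memory error at pfn %#lx\n", pfn);
	/*
	 * memory_failure() classifies the page (free, LRU, page cache
	 * clean/dirty, hugetlb, ...) and takes the recovery action.
	 */
	memory_failure(pfn, 0);
}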

That being said, prioritizing scanning of a certain type of memory is
then hard (if not impossible),
because, to keep things simple, the in-kernel background thread design
treats all memory as the same type: physical memory.

The alternative is to assume there is a caller that drives the scanner.
This caller can be either in userspace or kernel space (our RFC chooses
userspace). Then the caller can prioritize or only scan a certain type of
memory, but the caller has to secure the memory regions before passing
them to the scanner.

The "How to Scan" section in RFC has more details. Please do share
your opinion/preference for the two designs.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Kernel Support of Memory Error Detection.
  2022-11-09  5:04         ` HORIGUCHI NAOYA(堀口 直也)
  2022-11-10 20:23           ` Jiaqi Yan
@ 2022-11-18  1:19           ` Jiaqi Yan
  2022-11-18 14:38             ` Sridharan, Vilas
  1 sibling, 1 reply; 20+ messages in thread
From: Jiaqi Yan @ 2022-11-18  1:19 UTC (permalink / raw)
  To: Vilas.Sridharan@amd.com, Malvestuto, Mike
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Nadav Amit, David Hildenbrand, Aktas, Erdem, pgonda, rientjes,
	Hsiao, Duen-wen, gthelen, linux-mm, jthoughton, dave.hansen,
	Luck, Tony

On Tue, Nov 8, 2022 at 9:04 PM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Tue, Nov 08, 2022 at 04:17:06PM +0000, Luck, Tony wrote:
> > > If it is feasible in future that hardware vendors can make patrol
> > > scrubber programmable, we can even direct the scanning to patrol
> > > scrubber.
> >
> > There was an attempt to create an ACPI interface for this. I don't know if it made
> > it into the standard.
>
> I briefly checked the latest ACPI spec, and it seems that some interfaces
> to control (h/w based) patrol scrubbing are defined.
>
> https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#acpi-ras-feature-table-rasf

A followup question to the Intel and AMD RAS folks (Mike and Vilas): what
is your position on the ACPI interface to control the hw patrol scrubber,
and further to make it programmable by the kernel? Is this something you
are willing to consider?

>
> > I didn't do anything with it for Linux because the interface was
> > quite complex.
> >
> > From a h/w perspective it might always be complex. Consecutive system physical
> > addresses are generally interleaved across multiple memory controllers, channels,
> > DIMMs and ranks. While patrol scrubbing may be done by each memory controller
> > at the channel level.
> >
> > So a simple request to scan a few megabytes of system physical address would
> > require address translation to figure out the channel addresses on each of the
> > memory controllers and programming each to scan the pieces they contribute to
> > the target range.
>
> I expect that the physical address visible to the kernel is transparently
> translated to the real address in which DIMM in which channel.
>
> - Naoya Horiguchi


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC] Kernel Support of Memory Error Detection.
  2022-11-18  1:19           ` Jiaqi Yan
@ 2022-11-18 14:38             ` Sridharan, Vilas
  2022-11-18 17:10               ` Luck, Tony
  0 siblings, 1 reply; 20+ messages in thread
From: Sridharan, Vilas @ 2022-11-18 14:38 UTC (permalink / raw)
  To: Jiaqi Yan, Malvestuto, Mike, Ghannam, Yazen
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Nadav Amit, David Hildenbrand, Aktas, Erdem, pgonda, rientjes,
	Hsiao, Duen-wen, gthelen, linux-mm, jthoughton, dave.hansen,
	Luck, Tony

Please include Yazen from AMD on this discussion.

Making the patrol scrubber accessible to the OS would very likely not work without other changes. It is possible (even likely) that other entities in the system are manipulating the patrol scrubber, and there's no way to resolve any conflicts or race conditions.

So, if this was exposed to ACPI, it would need to be exposed through a capability and that capability would only be supported if the processors added support for OS-dedicated patrol scrubber hardware, or if a specific product could guarantee no other entities are using the patrol scrubber.

     -Vilas

-----Original Message-----
From: Jiaqi Yan <jiaqiyan@google.com> 
Sent: Thursday, November 17, 2022 8:20 PM
To: Sridharan, Vilas <Vilas.Sridharan@amd.com>; Malvestuto, Mike <mike.malvestuto@intel.com>
Cc: HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com>; Nadav Amit <nadav.amit@gmail.com>; David Hildenbrand <david@redhat.com>; Aktas, Erdem <erdemaktas@google.com>; pgonda@google.com; rientjes@google.com; Hsiao, Duen-wen <duenwen@google.com>; gthelen@google.com; linux-mm@kvack.org; jthoughton@google.com; dave.hansen@linux.intel.com; Luck, Tony <tony.luck@intel.com>
Subject: Re: [RFC] Kernel Support of Memory Error Detection.

On Tue, Nov 8, 2022 at 9:04 PM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Tue, Nov 08, 2022 at 04:17:06PM +0000, Luck, Tony wrote:
> > > If it is feasible in future that hardware vendors can make patrol 
> > > scrubber programmable, we can even direct the scanning to patrol 
> > > scrubber.
> >
> > There was an attempt to create an ACPI interface for this. I don't 
> > know if it made it into the standard.
>
> I briefly checked the latest ACPI spec, and it seems that some 
> interfaces to control (h/w based) patrol scrubbing are defined.
>
> https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#acpi-ras-feature-table-rasf

A followup question to the Intel and AMD RAS folks (Mike and Vilas): what is your position on the ACPI interface for controlling the hw patrol scrubber, and on further making it programmable by the kernel? Is this something you are willing to consider?

>
> > I didn't do anything with it for Linux because the interface was 
> > quite complex.
> >
> > From a h/w perspective it might always be complex. Consecutive 
> > system physical addresses are generally interleaved across multiple 
> > memory controllers, channels, DIMMs and ranks. While patrol 
> > scrubbing may be done by each memory controller at the channel level.
> >
> > So a simple request to scan a few megabytes of system physical 
> > address would require address translation to figure out the channel 
> > addresses on each of the memory controllers and programming each to 
> > scan the pieces they contribute to the target range.
>
> I expect that the physical address visible to the kernel is 
> transparently translated to the real address in which DIMM in which channel.
>
> - Naoya Horiguchi


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC] Kernel Support of Memory Error Detection.
  2022-11-18 14:38             ` Sridharan, Vilas
@ 2022-11-18 17:10               ` Luck, Tony
  0 siblings, 0 replies; 20+ messages in thread
From: Luck, Tony @ 2022-11-18 17:10 UTC (permalink / raw)
  To: Sridharan, Vilas, Jiaqi Yan, Malvestuto, Mike, Ghannam, Yazen
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Nadav Amit, David Hildenbrand, Aktas, Erdem, pgonda, rientjes,
	Hsiao, Duen-wen, gthelen, linux-mm, jthoughton, dave.hansen

Last time somebody asked me if Linux was using ACPI RASF, I said "no". So it
is possible that current Intel platforms no longer support RASF in the BIOS.

Memory topologies have become more complex since RASF was created (I think
it dates back to ACPI version 4, or maybe 5). So I wonder whether it would be able
to handle a system where some part of memory was connected via CXL.

-Tony



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Kernel Support of Memory Error Detection.
  2022-11-03 15:50 [RFC] Kernel Support of Memory Error Detection Jiaqi Yan
                   ` (2 preceding siblings ...)
  2022-11-09  5:29 ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-11-30  5:31 ` David Rientjes
  2022-12-13  9:27   ` HORIGUCHI NAOYA(堀口 直也)
  3 siblings, 1 reply; 20+ messages in thread
From: David Rientjes @ 2022-11-30  5:31 UTC (permalink / raw)
  To: Jiaqi Yan, Ghannam, Yazen
  Cc: naoya.horiguchi, tony.luck, dave.hansen, david, erdemaktas,
	pgonda, duenwen, Vilas.Sridharan, mike.malvestuto, gthelen,
	linux-mm, jthoughton

On Thu, 3 Nov 2022, Jiaqi Yan wrote:

> This RFC is a followup for [1]. We’d like to first revisit the problem
> statement, then explain the motivation for kernel support of memory
> error detection. We attempt to answer two key questions raised in the
> initial memory-scanning based solution: what memory to scan and how the
> scanner should be designed. Different from what [1] originally proposed,
> we think a kernel-driven design similar to khugepaged/kcompactd would
> work better than the userspace-driven design.
> 

Lots of great discussion in this thread; thanks, Jiaqi, for a very detailed 
overview of what we are trying to address and the multiple options that 
we can consider.

I think this thread has been a very useful starting point for us to 
discuss what should comprise the first patchset.  I haven't seen any 
objections to enlightening the kernel for this support, but any additional 
feedback would indeed be useful.

Let me suggest a possible way forward: if we can agree on a kernel-driven 
approach and its design allows it to be extended for future use cases, 
then it should be possible to introduce something generally useful that 
can then be built upon later if needed.

I can think of a couple of future use cases that may arise and that will 
impact the minimal design that you intend to introduce: (1) the ability to 
configure a hardware patrol scrubber depending on the platform, if 
possible, as a substitute for driving the scanning by a kthread, and (2) 
the ability to scan different types of memory rather than all system 
memory.

Imagining the simplest possible design, I assume we could introduce a
/sys/devices/system/node/nodeN/mcescan/* for each NUMA node on the system.  
As a foundation, this can include only a "stat" file which provides the 
interface to the memory poison subsystem that describes detected errors 
and their resolution (this would be a good starting point).

Building on that, and using your reference to khugepaged, we can add 
pages_to_scan and scan_sleep_millisecs files.  This will allow us to 
control scanning on demotion nodes differently.  We'd want the kthread to 
be NUMA aware for the memory it is scanning, so this would simply control 
when each thread wakes up and how much memory it scans before going to 
sleep.  Defaults would be disabled, so no kthreads are forked.
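
To make the khugepaged analogy concrete, a per-node scan loop could look
roughly like the sketch below.  Everything here is hypothetical naming:
"kmcescand", the mcescan_ctl knobs, and the mcescan_*() helpers are
placeholders, and mcescan_check_page() stands in for whatever mcsafe-read
primitive gets settled on later in this thread.

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/delay.h>
#include <linux/memory_hotplug.h>
#include <linux/mm.h>

/* Per-node knobs, mirrored 1:1 by the proposed sysfs files. */
struct mcescan_ctl {
        unsigned int pages_to_scan;     /* pages scanned per wake-up */
        unsigned int sleep_millisecs;   /* sleep between batches */
        int nid;                        /* NUMA node this thread covers */
};

/* Placeholders for helpers discussed elsewhere in the thread. */
unsigned long mcescan_next_pfn(struct mcescan_ctl *ctl);
bool mcescan_page_scannable(struct page *page);
void mcescan_check_page(struct page *page);

static int kmcescand(void *data)
{
        struct mcescan_ctl *ctl = data;

        while (!kthread_should_stop()) {
                unsigned int scanned = 0;

                while (scanned < READ_ONCE(ctl->pages_to_scan) &&
                       !kthread_should_stop()) {
                        /* Scan cursor over this node's system RAM. */
                        unsigned long pfn = mcescan_next_pfn(ctl);
                        struct page *page = pfn_to_online_page(pfn);

                        /* Skip holes and anything deemed unscannable. */
                        if (page && mcescan_page_scannable(page))
                                mcescan_check_page(page); /* may end in memory_failure() */
                        scanned++;
                        cond_resched();
                }
                msleep_interruptible(READ_ONCE(ctl->sleep_millisecs));
        }
        return 0;
}

As with khugepaged, the thread would only be created once scanning is
enabled through the sysfs files, matching the disabled-by-default behavior
above.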

If this needs to be extended later for a hardware patrol scrubber, we'd 
make this a request to CPU vendors to make it configurable on a per-socket 
basis and used only with an ACPI capability that would put it under the 
control of the kernel in place of the kthread (there would be a single 
source of truth for the scan configuration).  If this is not possible, 
we'd decouple the software and hardware approach and configure the HPS 
through the ACPI subsystem independently.

Subsequently, if there is a need to only scan certain types of memory per 
NUMA node, we could introduce a "type" file later under the mcescan 
directory.  Idea would be to specify a bitmask to include certain memory 
types into the scan.  Bits for things such as buddy pages, pcp pages, 
hugetlb pages, etc.
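
Purely to illustrate the bitmask idea, the "type" file could map to
something like the following (the bit assignments are made up here):

/* Hypothetical memory-type bits accepted by the per-node "type" file. */
enum mcescan_scan_type {
        MCESCAN_TYPE_BUDDY      = 1U << 0,      /* free pages in the buddy allocator */
        MCESCAN_TYPE_PCP        = 1U << 1,      /* per-cpu page lists */
        MCESCAN_TYPE_HUGETLB    = 1U << 2,      /* hugetlb pool pages */
        /* further bits as new memory types become interesting */
};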

 [ And if userspace, perhaps non-root, wanted to trigger a scan of its own 
   virtual memory, for example, another future extension could allow you 
   to explicitly trigger a scan of the calling process, but this would be 
   done in process context, not by the kthreads. ]

If this is deemed acceptable, the minimal viable patchset would:

 - introduce the per-node mcescan directories

 - introduce a "stat" file that would describe the state of memory errors
   on each NUMA node and their disposition

 - introduce a per-node kthread driven by pages_to_scan and
   scan_sleep_millisecs to do software controlled memory scanning

All future possible use cases could be extended using this later if the 
demand arises.

Thoughts?  It would be very useful to agree on a path forward since I 
think this would be generally useful for the kernel.

> Problem Statement
> =================
> The ever increasing DRAM size and cost has brought the memory subsystem
> reliability to the forefront of large fleet owners’ concern. Memory
> errors are one of the top hardware failures that cause server and
> workload crashes. Simply deploying extra-reliable DRAM hardware to a
> large-scale computing fleet adds significant cost, e.g., 10% extra cost
> on DRAM can amount to hundreds of millions of dollars.
> 
> Reactive memory poison recovery (MPR), e.g., recovering from MCEs raised
> during an execution context (the kernel mechanisms are MCE handler +
> CONFIG_MEMORY_FAILURE + SIGBUS to the user space process), has been found
> effective in keeping systems resilient from memory errors. However,
> reactive memory poison recovery has several major drawbacks:
> - It requires software systems that access poisoned memory to
>   be specifically designed and implemented to recover from memory errors.
>   Uncorrectable (UC) errors are random, which may happen outside of the
>   enlightened address spaces or execution contexts. The added error
>   recovery capability comes at the cost of added complexity and often
>   impossible to enlighten in 3rd party software.
> - In a virtualized environment, the injected MCEs introduce the same
>   challenge to the guest.
> - It only covers MCEs raised by CPU accesses, but the scope of memory
>   error issue is far beyond that. For example, PCIe devices (e.g. NIC and
>   GPU) accessing poisoned memory cause host crashes when
>   on certain machine configs.
> 
> We want to upstream a patch set that proactively scans the memory DIMMs
> at a configurable rate to detect UC memory errors, and attempts to
> recover the detected memory errors. We call it proactive MPR, which
> provides three benefits to tackle the memory error problem:
> - Proactively scanning memory DIMMs reduces the chance of a correctable
>   error becoming uncorrectable.
> - Once detected, UC errors caught in unallocated memory pages are
>   isolated and prevented from being allocated to an application or the OS.
> - The probability of software/hardware products encountering memory
>   errors is reduced, as they are only exposed to memory errors developed
>   over a window of T, where T stands for the period of scrubbing the
>   entire memory space. Any memory errors that occurred more than T ago
>   should have resulted in custom recovery actions. For example, in a cloud
>   environment VMs can be live migrated to another healthy host.
> 
> Some CPU vendors [2, 3] provide hardware patrol scrubber (HPS) to
> prevent the build up of memory errors. In comparison software memory
> error detector (SW) has pros and cons:
> - SW supports adaptive scanning, i.e. speeds up/down scanning, turns
>   on/off scanning, and yields its own CPU cycles and memory bandwidth.
>   All of these can happen on-the-fly based on the system workload status
>   or administrator’s choice. HPS doesn’t have all these flexibilities.
>   Its patrol speed is usually only configurable at boot time, and it is
>   not able to consider system state. (Note: HPS is a memory controller
>   feature and usually doesn’t consume CPU time).
> - SW can expose controls to scan by memory types, while HPS always scans
>   full system memory. For example, an administrator can use SW to only
>   scan hugetlb memory on the system.
> - SW can scan memory at a finer granularity, for example, having different
>   scan rate per node, or entirely disabled on some node. HPS, however,
>   currently only supports per host scanning.
> - SW can make scan statistics (e.g. X bytes has been scanned for the
>   last Y seconds and Z memory errors are found) easily visible to
>   datacenter administrators, who can schedule maintenance (e.g. migrating
>   running jobs before repairing DIMMs) accordingly.
> - SW’s functionality is consistent across hardware platforms. HPS’s
>   functionality varies from vendor to vendor. For example, some vendors
>   support shorter scrubbing periods than others, and some vendors may not
>   support memory scrubbing at all.
> - HPS usually doesn’t consume CPU cores but does consume memory
>   controller cycles and memory bandwidth. SW consumes both CPU cycles
>   and memory bandwidth, but this is only a concern if administrators opt into
>   the scanning after weighing the cost and benefit.
> - As CPU cores are not consumed by HPS, there won’t be any cache impact.
>   SW can utilize prefetchnta (for x86) [4] and equivalent hints for other
>   architectures [5] to minimize cache impact (in case of prefetchnta,
>   completely avoiding L1/L2 cache impact).
> 
> Solution Proposals
> ==================
> 
> What to Scan
> ============
> The initial RFC proposed to scan the **entire system memory**, which
> raised the question of what memory is scannable (i.e. memory accessible
> from kernel direct mapping). We attempt to address this question by
> breaking down the memory types as follows:
> - Static memory types: memory that either stays scannable or unscannable.
>   Well defined examples are hugetlb vs regular memory, node-local memory
>   vs far memory (e.g. CXL or PMEM). While most static memory types are
>   scannable, administrators could disable scanning far memory to avoid
>   messing with the promotion and demotion logic in memory tiering
>   solutions. (The implementation will allow administrators to disable
>   scanning on scannable memory).
> - Memory type related to virtualization, including ballooned-away memory
>   and unaccepted memory. Not all balloon implementations are compatible
>   with memory scanning (i.e. reading memory mapped into the direct map) in
>   guest. For example, with the virtio-mem devices [6] in the hypervisor,
>   reading unplugged memory can cause undefined behavior. The same applies
>   to unaccepted memory in confidential VMs [7]. Since memory error
>   detection on the host side already benefits its guests transparently,
>   (i.e., spending no extra guest CPU cycle), there is very limited benefit
>   for a guest to scan memory by itself. We recommend disabling the memory
>   error detection within the virtualization world.
> - Dynamic memory type: memory that turns into unscannable or scannable
>   dynamically. One significant example is guest private memory backing
>   confidential VM. At the software level, guest private memory pages
>   become unscannable as they will soon be unmapped from kernel direct
>   mapping [8]. Scanning guest private memory pages is still possible by
>   IO remapping with foreseen performance sacrifice and proper page fault
>   handling (skip the page) if all means of mapping fail. At the hardware
>   level, the memory access implementation done by hardware vendors today
>   puts memory integrity checking prior to memory ownership checking,
>   which means memory errors are still surfaced to the OS while scanning.
>   For the scanning scheme to work for the future, we need the hardware
>   vendors to keep providing similar error detection behavior in their
>   confidential VM hardware. We believe this is a reasonable ask to them
>   as their hardware patrol scrubbers also adopt the same scanning scheme
>   and therefore rely on the same promise themselves. Otherwise we can
>   switch to whatever new scheme the patrol scrubbers use if that
>   promise is ever broken.
> 
> How Scanning is Designed
> ====================
> We can support kernel memory error detection in two styles: whether kernel
> itself or userspace application drives the detection.
> 
> In the first style, the kernel itself can create a kthread on each NUMA
> node for scanning node-local memory (i.e. IORESOURCE_SYSTEM_RAM). These
> scanning kthreads are scheduled in the way similar to how khugepaged or
> kcompactd works. Basic configurations of the ever-schedulable background
> kthreads can be exposed to userspace via sysfs, for example, sleeping
> for X milliseconds after scanning Y raw memory pages. Scanning statistics
> can also be visible to userspace via sysfs, for example, number of pages
> actually scanned and number of memory errors found.
> 
> On the other hand, memory error detection can be driven by root userspace
> applications with sufficient support from the kernel. For example,
> a process can scrutinize physical memory under its own virtual memory
>   space on demand. The support from the kernel consists of the basic operations
>   of a specifically designed memory read access (e.g. avoiding CPU errata,
>   minimizing cache pollution, and not leaking the memory content), and
> machine check exception handling plus memory failure handling [9] when
> memory error is detected.
> 
> The pros and cons of in-kernel background scanning are:
> - A simple and independent component for scanning system memory constantly
>   and regularly, which improves the machine fleet’s memory health (e.g.,
>   for hyperscalers, cloud providers, etc).
> - The rest of the OS (both kernel and application) can benefit from it
>   without explicit modifications.
> - The efficiency of this approach is easily configurable by scan rate.
> - It cannot offer an on-the-spot guarantee. There is no good way to
>   prioritize certain chunks of memory.
> - The implementation of this approach needs to deal with the question of
>   whether a memory page is scannable.
> 
> The pros and cons of application driven approach are:
> - An application can scan a specific chunk of memory on the spot, and is
>   able to prioritize scanning on some memory regions or memory types.
> - A memory error detection agent needs to be designed to proactively,
>   constantly and regularly scan the entire memory.
> - A kernel API needs to be designed to provide userspace enough power of
>   scanning physical memory. For example, memory regions requested by
>   multiple applications may overlap. Should the kernel API support
>   combined scanning?
>   - The application is exposed to the question of whether memory is scannable, and
>   needs to deal with the complexity of ensuring memory stays scannable
>   during the scanning process.
> 
> We prefer the in-kernel background approach for its simplicity, but open
> to all opinions from the upstream community.
> 
> [1] https://lore.kernel.org/linux-mm/20220425163451.3818838-1-juew@google.com
> [2] https://developer.amd.com/wordpress/media/2012/10/325591.pdf
> [3] https://community.intel.com/t5/Server-Products/Uncorrectable-Memory-Error-amp-Patrol-Scrub/td-p/545123
> [4] https://www.amd.com/system/files/TechDocs/24594.pdf, page 285
> [5] https://developer.arm.com/documentation/den0024/a/The-A64-instruction-set/Memory-access-instructions/Non-temporal-load-and-store-pair
> [6] https://lore.kernel.org/kvm/20200311171422.10484-1-david@redhat.com
> [7] https://lore.kernel.org/linux-mm/20220718172159.4vwjzrfthelovcty@black.fi.intel.com/t/
> [8] https://lore.kernel.org/linux-mm/20220706082016.2603916-1-chao.p.peng@linux.intel.com
> [9] https://www.kernel.org/doc/Documentation/vm/hwpoison.rst
> 
> -- 
> 2.38.1.273.g43a17bfeac-goog
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Kernel Support of Memory Error Detection.
  2022-11-30  5:31 ` David Rientjes
@ 2022-12-13  9:27   ` HORIGUCHI NAOYA(堀口 直也)
  2022-12-13 18:09     ` Luck, Tony
  0 siblings, 1 reply; 20+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-12-13  9:27 UTC (permalink / raw)
  To: David Rientjes
  Cc: Jiaqi Yan, Ghannam, Yazen, tony.luck, dave.hansen, david,
	erdemaktas, pgonda, duenwen, Vilas.Sridharan, mike.malvestuto,
	gthelen, linux-mm, jthoughton

On Tue, Nov 29, 2022 at 09:31:15PM -0800, David Rientjes wrote:
> On Thu, 3 Nov 2022, Jiaqi Yan wrote:
> 
> > This RFC is a followup for [1]. We’d like to first revisit the problem
> > statement, then explain the motivation for kernel support of memory
> > error detection. We attempt to answer two key questions raised in the
> > initial memory-scanning based solution: what memory to scan and how the
> > scanner should be designed. Different from what [1] originally proposed,
> > we think a kernel-driven design similar to khugepaged/kcompactd would
> > work better than the userspace-driven design.
> > 
> 
> Lots of great discussion in this thread; thanks, Jiaqi, for a very detailed 
> overview of what we are trying to address and the multiple options that 
> we can consider.
> 
> I think this thread has been a very useful starting point for us to 
> discuss what should comprise the first patchset.  I haven't seen any 
> objections to enlightening the kernel for this support, but any additional 
> feedback would indeed be useful.
> 
> Let me suggest a possible way forward: if we can agree on a kernel-driven 
> approach and its design allows it to be extended for future use cases, 
> then it should be possible to introduce something generally useful that 
> can then be built upon later if needed.
> 
> I can think of a couple of future use cases that may arise and that will 
> impact the minimal design that you intend to introduce: (1) the ability to 
> configure a hardware patrol scrubber depending on the platform, if 
> possible, as a substitute for driving the scanning by a kthread, and (2) 
> the ability to scan different types of memory rather than all system 
> memory.
> 
> Imagining the simplest possible design, I assume we could introduce a
> /sys/devices/system/node/nodeN/mcescan/* for each NUMA node on the system.  
> As a foundation, this can include only a "stat" file which provides the 
> interface to the memory poison subsystem that describes detected errors 
> and their resolution (this would be a good starting point).
> 
> Building on that, and using your reference to khugepaged, we can add 
> pages_to_scan and scan_sleep_millisecs files.  This will allow us to 
> control scanning on demotion nodes differently.  We'd want the kthread to 
> be NUMA aware for the memory it is scanning, so this would simply control 
> when each thread wakes up and how much memory it scans before going to 
> sleep.  Defaults would be disabled, so no kthreads are forked.
> 
> If this needs to be extended later for a hardware patrol scrubber, we'd 
> make this a request to CPU vendors to make it configurable on a per-socket 
> basis and used only with an ACPI capability that would put it under the 
> control of the kernel in place of the kthread (there would be a single 
> source of truth for the scan configuration).  If this is not possible, 
> we'd decouple the software and hardware approach and configure the HPS 
> through the ACPI subsystem independently.
> 
> Subsequently, if there is a need to only scan certain types of memory per 
> NUMA node, we could introduce a "type" file later under the mcescan 
> directory.  Idea would be to specify a bitmask to include certain memory 
> types into the scan.  Bits for things such as buddy pages, pcp pages, 
> hugetlb pages, etc.
> 
>  [ And if userspace, perhaps non-root, wanted to trigger a scan of its own 
>    virtual memory, for example, another future extension could allow you 
>    to explicitly trigger a scan of the calling process, but this would be 
>    done in process context, not by the kthreads. ]
> 
> If this is deemed acceptable, the minimal viable patchset would:
> 
>  - introduce the per-node mcescan directories
> 
>  - introduce a "stat" file that would describe the state of memory errors
>    on each NUMA node and their disposition
> 
>  - introduce a per-node kthread driven by pages_to_scan and
>    scan_sleep_millisecs to do software controlled memory scanning
> 
> All future possible use cases could be extended using this later if the 
> demand arises.
> 
> Thoughts?  It would be very useful to agree on a path forward since I 
> think this would be generally useful for the kernel.

Thank you for the ideas, the above looks to me simple enough to start with.
I think that one point not mentioned yet is how the in-kernel scanner finds
a broken page before the page is marked by PG_hwpoison.  Some mechanism
similar to mcsafe-memcpy could be used, but maybe memcpy is not necessary
because we just want to check the healthiness of pages.  So a core routine
like mcsafe-read would be introduced in the first patchset (or we already
have it)?
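
One possibly-reusable building block is copy_mc_to_kernel(), which already
exists for the pmem/dax path on architectures that implement
ARCH_HAS_COPY_MC, although it does a real copy (which, as noted above, we
may not need) and does nothing to limit cache pollution.  As a rough
illustration only -- the helper name and the caller-supplied scratch
buffer are made up here:

#include <linux/mm.h>
#include <linux/uaccess.h>

/*
 * Probe one direct-mapped page for poison by copying it into a
 * throwaway buffer with the machine-check safe copy routine.
 * A short copy means a poisoned cacheline was hit.
 */
static bool mcescan_page_has_poison(struct page *page, void *scratch)
{
        unsigned long not_copied;

        not_copied = copy_mc_to_kernel(scratch, page_address(page), PAGE_SIZE);
        return not_copied != 0;
}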

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC] Kernel Support of Memory Error Detection.
  2022-12-13  9:27   ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-12-13 18:09     ` Luck, Tony
  2022-12-13 19:03       ` Jiaqi Yan
  0 siblings, 1 reply; 20+ messages in thread
From: Luck, Tony @ 2022-12-13 18:09 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也),
	David Rientjes
  Cc: Jiaqi Yan, Ghannam, Yazen, dave.hansen, david, Aktas, Erdem,
	pgonda, Hsiao, Duen-wen, Vilas.Sridharan, Malvestuto, Mike,
	gthelen, linux-mm, jthoughton

> I think that one point not mentioned yet is how the in-kernel scanner finds
> a broken page before the page is marked by PG_hwpoison.  Some mechanism
> similar to mcsafe-memcpy could be used, but maybe memcpy is not necessary
> because we just want to check the healthiness of pages.  So a core routine
> like mcsafe-read would be introduced in the first patchset (or we already
> have it)?

I don’t think that there is an existing routine to do the mcsafe-read. But it should
be easy enough to write one.  If an architecture supports a way to do this without
evicting other data from caches, that would be a bonus. X86 has a non-temporal
read that could be interesting ... but I'm not sure that it would detect poison
synchronously. I could be wrong, but I expect that you won’t see a machine check,
but you should see the memory controller log a UCNA error reported by a CMCI.

-Tony

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Kernel Support of Memory Error Detection.
  2022-12-13 18:09     ` Luck, Tony
@ 2022-12-13 19:03       ` Jiaqi Yan
  2022-12-14 14:45         ` Yazen Ghannam
  0 siblings, 1 reply; 20+ messages in thread
From: Jiaqi Yan @ 2022-12-13 19:03 UTC (permalink / raw)
  To: Luck, Tony, Ghannam, Yazen,
	HORIGUCHI NAOYA(堀口 直也),
	Vilas.Sridharan@amd.com
  Cc: David Rientjes, dave.hansen, david, Aktas, Erdem, pgonda, Hsiao,
	Duen-wen, Malvestuto, Mike, gthelen, linux-mm, jthoughton

On Tue, Dec 13, 2022 at 10:10 AM Luck, Tony <tony.luck@intel.com> wrote:
>
> > I think that one point not mentioned yet is how the in-kernel scanner finds
> > a broken page before the page is marked by PG_hwpoison.  Some mechanism
> > similar to mcsafe-memcpy could be used, but maybe memcpy is not necessary
> > because we just want to check the healthiness of pages.  So a core routine
> > like mcsafe-read would be introduced in the first patchset (or we already
> > have it)?
>
> I don’t think that there is an existing routine to do the mcsafe-read. But it should
> be easy enough to write one.  If an architecture supports a way to do this without
> evicting other data from caches, that would be a bonus. X86 has a non-temporal
> read that could be interesting ... but I'm not sure that it would detect poison
> synchronously. I could be wrong, but I expect that you won’t see a machine check,
> but you should see the memory controller log a UCNA error reported by a CMCI.
>
> -Tony

To Naoya: yes, we will introduce a new scanning routine. It "touches"
a page cacheline by cacheline to detect memory errors. This "touch"
is essentially an ANDQ of the loaded cacheline with 0, to avoid
leaking user data into the register.

To Tony: thanks. I think you are referring to PREFETCHNTA before ANDQ?
(which we are using in our scanning routine to minimize cache
pollution.) We tested the attached scanning draft on Intel Skylake +
Cascadelake + Icelake CPUs, and the ANDQ instruction does raise an MC
synchronously when an injected memory error is encountered.

To Yazen and Vilas: We haven't tested on any AMD hardware. Do you have
any thoughts on PREFETCHNTA + MC?

/**
 * Detecting memory errors within a range of memory.
 *
 * Input:
 * rdi: starting address of the range.
 * rsi: exclusive ending address of the range.
 *
 * Output:
 * eax: X86_TRAP_MC if encounter poisoned memory,
 *         X86_TRAP_PF if direct kernel mapping is not established,
 *         0 if success (assume this routine never hits X86_TRAP_DE).
 */
ENTRY(kmcescand_safe_read)
  /* Zero %rax. */
  xor %rax, %rax
1:
  /* Prevent LLC pollution with non-temporal prefetch hint. */
  prefetchnta (%rdi)
2:
  /**
   * This andq with constant rax=0 prevents leaking memory
   * content (especially userspace memory content like credentials)
   * into register.
   */
  andq (%rdi), %rax
  /**
   * X86-64 CPUs read memory cacheline by cacheline (64 bytes),
   * so no need to explicitly do andq 64 bits by 64 bit;
   * instead increase directly to the next 64 byte memory address.
   */
  add $64, %rdi
  cmp %rdi, %rsi
  jne 1b
3:
  ret
  /**
   * The exception handler ex_handler_fault fills eax with
   * the exception vector (e.g. #MC or #PF).
   */
  _ASM_EXTABLE_FAULT(2b, 3b)
ENDPROC(kmcescand_safe_read)
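
For context, the C-side caller would be roughly as follows. The prototype
is just a guess at how the routine above would be declared; X86_TRAP_MC
comes from asm/trapnr.h, and when poison is hit the #MC handler is
expected to have already queued memory_failure() for the pfn, so the
caller mainly accounts the event:

#include <linux/mm.h>
#include <asm/trapnr.h>

int kmcescand_safe_read(void *start, void *end);

static bool kmcescand_scan_page(unsigned long pfn)
{
        void *va = page_address(pfn_to_page(pfn));      /* assumes direct map */

        return kmcescand_safe_read(va, va + PAGE_SIZE) == X86_TRAP_MC;
}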


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Kernel Support of Memory Error Detection.
  2022-12-13 19:03       ` Jiaqi Yan
@ 2022-12-14 14:45         ` Yazen Ghannam
  0 siblings, 0 replies; 20+ messages in thread
From: Yazen Ghannam @ 2022-12-14 14:45 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Luck, Tony, HORIGUCHI NAOYA(堀口 直也),
	Vilas.Sridharan@amd.com, David Rientjes, dave.hansen, david,
	Aktas, Erdem, pgonda, Hsiao, Duen-wen, Malvestuto, Mike, gthelen,
	linux-mm, jthoughton

On Tue, Dec 13, 2022 at 11:03:52AM -0800, Jiaqi Yan wrote:
> On Tue, Dec 13, 2022 at 10:10 AM Luck, Tony <tony.luck@intel.com> wrote:
> >
> > > I think that one point not mentioned yet is how the in-kernel scanner finds
> > > a broken page before the page is marked by PG_hwpoison.  Some mechanism
> > > similar to mcsafe-memcpy could be used, but maybe memcpy is not necessary
> > > because we just want to check the healthiness of pages.  So a core routine
> > > like mcsafe-read would be introduced in the first patchset (or we already
> > > have it)?
> >
> > I don’t think that there is an existing routine to do the mcsafe-read. But it should
> > be easy enough to write one.  If an architecture supports a way to do this without
> > evicting other data from caches, that would be a bonus. X86 has a non-temporal
> > read that could be interesting ... but I'm not sure that it would detect poison
> > synchronously. I could be wrong, but I expect that you won’t see a machine check,
> > but you should see the memory controller log a UCNA error reported by a CMCI.
> >
> > -Tony
> 
> To Naoya: yes, we will introduce a new scanning routine. It "touches"
> a page cacheline by cacheline to detect memory errors. This "touch"
> is essentially an ANDQ of the loaded cacheline with 0, to avoid
> leaking user data into the register.
> 
> To Tony: thanks. I think you are referring to PREFETCHNTA before ANDQ?
> (which we are using in our scanning routine to minimize cache
> pollution.) We tested the attached scanning draft on Intel Skylake +
> Cascadelake + Icelake CPUs, and the ANDQ instruction does raise an MC
> synchronously when an injected memory error is encountered.
> 
> To Yazen and Vilas: We haven't tested on any AMD hardware. Do you have
> any thoughts on PREFETCHNTA + MC?
>

Hi Jiaqi,

I'm not sure of the behavior. I think it'll require some experimentation.
The AMD APM has the following statement in the "PREFETCHlevel" description:

  "The operation of this instruction is implementation-dependent."

So it may be the case that the behavior changes between products. Maybe
this procedure should be opt-in and only apply to products that are
verified to work?
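
On the x86 side, that opt-in could be as simple as a match table checked
before the scanner is allowed to start.  Illustrative only -- the entries
below just guess at the Intel server parts Jiaqi reported testing, and AMD
entries would be added once the behavior is verified:

#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>

/*
 * Products verified to raise #MC synchronously for the
 * prefetchnta + andq sequence.  Placeholder list.
 */
static const struct x86_cpu_id mcescan_verified_cpus[] = {
        X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, NULL),    /* Cascade Lake shares this model */
        X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, NULL),
        {}
};

static bool mcescan_cpu_supported(void)
{
        return x86_match_cpu(mcescan_verified_cpus) != NULL;
}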

Thanks,
Yazen


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-12-14 14:45 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-03 15:50 [RFC] Kernel Support of Memory Error Detection Jiaqi Yan
2022-11-03 16:27 ` Luck, Tony
2022-11-03 16:40   ` Nadav Amit
2022-11-08  2:24     ` Jiaqi Yan
2022-11-08 16:17       ` Luck, Tony
2022-11-09  5:04         ` HORIGUCHI NAOYA(堀口 直也)
2022-11-10 20:23           ` Jiaqi Yan
2022-11-18  1:19           ` Jiaqi Yan
2022-11-18 14:38             ` Sridharan, Vilas
2022-11-18 17:10               ` Luck, Tony
2022-11-07 16:59 ` Sridharan, Vilas
2022-11-09  5:29 ` HORIGUCHI NAOYA(堀口 直也)
2022-11-09 16:15   ` Luck, Tony
2022-11-10 20:25     ` Jiaqi Yan
2022-11-10 20:23   ` Jiaqi Yan
2022-11-30  5:31 ` David Rientjes
2022-12-13  9:27   ` HORIGUCHI NAOYA(堀口 直也)
2022-12-13 18:09     ` Luck, Tony
2022-12-13 19:03       ` Jiaqi Yan
2022-12-14 14:45         ` Yazen Ghannam

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.