* [RFC] Expose a memory poison detector ioctl to user space.
@ 2022-04-25 16:34 Jue Wang
  2022-04-26 15:40 ` Dave Hansen
  0 siblings, 1 reply; 26+ messages in thread
From: Jue Wang @ 2022-04-25 16:34 UTC (permalink / raw)
  To: Naoya Horiguchi, Tony Luck, Dave Hansen
  Cc: Jiaqi Yan, Greg Thelen, Mina Almasry, linux-mm, Jue Wang

The ever-increasing size and cost of server DRAM have brought memory
subsystem reliability to the forefront of large fleet owners’ concerns.
Memory-error-caused server and workload crashes are ranked #1 among all
hardware failures by a large margin. Deploying extra-reliable DRAM adds
significant cost at fleet scale, e.g., 10% extra cost on DRAM can amount
to hundreds of millions of dollars in spending.

“Reactive” memory poison recovery [3], i.e., recovering from MCEs raised
during an execution context (the kernel mechanisms are MCE handler +
CONFIG_MEMORY_FAILURE + SIGBUS to the user space process), has been found
effective in keeping systems resilient to memory errors. However, it has
several major drawbacks:

1. It requires software systems that access poisoned memory to be
specifically designed and implemented to recover from memory errors:
    . Uncorrectable errors are random, which may happen outside of the
      enlightened address spaces or execution contexts.
    . The added error recovery capability comes at a cost of added
      complexity, and it is often not possible to enlighten 3rd party
      software.
2. In a virtualized environment, the injected MCEs introduce the same
challenge to the guest.
3. Because of random execution contexts, CPU errata that are vulnerable to
speculative execution, split cache line accesses, hyperthread buddy
scheduling, etc. (e.g., [1]) can often turn a recoverable UC error into an
unrecoverable MCE and a system crash.
4. Aside from CPU accesses, NIC or other PCIe devices accessing poisoned
memory cause host crashes in the production system regularly.
5. In a multi-tenant environment, “reactive” poison recovery is less
effective for the largest workloads than for smaller ones, because
smaller workloads have a much higher chance of being saved cleanly as a
victim’s neighbor rather than being the victim itself.

The goal is to minimize the probability that any software / hardware
component actually gets a chance to consume an error. A possible solution
is to “proactively” look for and detect memory errors before
consumption. Here we assume system software is enlightened to drain the
affected host and migrate the running jobs off to another healthy host as
soon as an error is detected, and is able to recover from, contain, and
emulate the errors that surface during the migration process.

The main benefits (free memory ratios come from a large production fleet):
1. Memory errors in free memory (~50%) can be completely contained without
impacting software / hardware systems later on.
2. Inside a VM guest, memory errors in free memory (~50%) can be
completely contained with the UCNA injection via CMCI capability being
added to KVM in [2].
3. Memory errors on allocated pages need to be detected without
impacting the execution or performance of the page owners. For instance,
in the cloud world, the majority of the host memory is pre-allocated as a
guest memory pool and memory errors can emerge well after the guest memory
pool allocation.
4. Early detection and ensured containment (e.g., unmapping and
PG_HWPOISON) can effectively prevent most if not all of the crashes due to
CPU errata ([3], section 3.5.2 - 3.5.5) that “reactive” poison recovery
cannot avoid. These crashes represent >40% of all host crashes in a
production fleet.

The hardware patrol scrubber [4] was evaluated, but its performance
(i.e., the latency between error emergence and detection ranges from hours
to days) and its error loss rate due to downgrading (or otherwise system
instability due to SRAO MCE broadcast and overflow) do not meet the
requirements (e.g., ~30 min from emergence to consumption, based on
simulation).

A possible solution is to have a special-purpose poison detector
userspace agent that proactively looks for memory errors by invoking an
ioctl specifically implemented to avoid the CPU errata, minimize
performance interference (cache pollution, etc.) and avoid leaking
memory content into registers. The detector agent runs with minimal,
configurable resource consumption (e.g., 0.1 core / socket, <0.5% membw
consumption, etc.) and pauses itself when the host system is under heavy
load (e.g., CPU>90% or membw>75%).
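
As a rough illustration of the pause policy described above, the decision
could be sketched as follows. The struct and function names are
hypothetical and not part of the proposal; the thresholds simply mirror
the example values given in the text, and how utilization is measured is
left out entirely.

```c
#include <stdbool.h>

/* Hypothetical policy knobs; utilization values are fractions in [0, 1]. */
struct detector_policy {
	double cpu_pause_threshold;   /* e.g. 0.90: pause above 90% CPU */
	double membw_pause_threshold; /* e.g. 0.75: pause above 75% membw */
};

/* Returns true when the detector agent should pause scanning. */
static bool detector_should_pause(const struct detector_policy *p,
				  double cpu_util, double membw_util)
{
	return cpu_util > p->cpu_pause_threshold ||
	       membw_util > p->membw_pause_threshold;
}
```

The agent would call this before each scan step and sleep while it
returns true, which keeps the backoff policy entirely in user space.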

The kernel ioctl may take the following form. A potential point of
discussion is whether Unmapping guest Private Memory (UPM) will require
zapping the kernel direct map or not. This ioctl can be compiled out if it
is incompatible with other use cases like UPM.

/* Could stop and return after the 1st poison is detected */
#define MCESCAN_IOCTL_SCAN 0

struct SysramRegion {
  /* input */
  uint64_t first_byte;   /* first page-aligned physical address to scan */
  uint64_t length;       /* page-aligned length of memory region to scan */
  /* output */
  uint32_t poisoned;     /* 1 - a poisoned page is found, 0 - otherwise */
  uint32_t poisoned_pfn; /* PFN of the 1st detected poisoned page */
};
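
For illustration only, a caller might fill in the region like this before
issuing the ioctl. The helper name, SCAN_PAGE_SIZE and the 4 KiB page size
are assumptions made for this sketch; the device node the ioctl would be
issued on is not named in this RFC, so the call itself is not shown.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SCAN_PAGE_SIZE 4096ULL /* assumption: 4 KiB base pages */

struct SysramRegion {
	/* input */
	uint64_t first_byte;   /* first page-aligned physical address to scan */
	uint64_t length;       /* page-aligned length of memory region to scan */
	/* output */
	uint32_t poisoned;     /* 1 - a poisoned page is found, 0 - otherwise */
	uint32_t poisoned_pfn; /* PFN of the 1st detected poisoned page */
};

/*
 * Hypothetical helper: validate the inputs and clear the output fields.
 * Returns 0 on success, -EINVAL for an empty, unaligned or wrapping range.
 */
static int sysram_region_init(struct SysramRegion *r,
			      uint64_t first_byte, uint64_t length)
{
	if (length == 0 || first_byte % SCAN_PAGE_SIZE || length % SCAN_PAGE_SIZE)
		return -EINVAL;
	if (first_byte + length < first_byte) /* range wraps around */
		return -EINVAL;
	memset(r, 0, sizeof(*r));
	r->first_byte = first_byte;
	r->length = length;
	return 0;
}
```

Note that with 4 KiB pages a 32-bit poisoned_pfn can only describe
physical addresses below 16 TiB, which may be worth considering if the
field widths are revisited.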

1. https://lore.kernel.org/lkml/164529415398.16921.8042682039148828519.tip-bot2@tip-bot2/
2. https://lore.kernel.org/kvm/20220412223134.1736547-1-juew@google.com/
3. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/reduce-server-crash-rate-tencent-paper.pdf
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog




* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-25 16:34 [RFC] Expose a memory poison detector ioctl to user space Jue Wang
@ 2022-04-26 15:40 ` Dave Hansen
  2022-04-26 17:57   ` Jue Wang
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2022-04-26 15:40 UTC (permalink / raw)
  To: Jue Wang, Naoya Horiguchi, Tony Luck, Dave Hansen
  Cc: Jiaqi Yan, Greg Thelen, Mina Almasry, linux-mm

From your description, you have me mostly convinced that this is
something that needs to get fixed.  The hardware patrol scrubber(s)
address the same basic problem, but don't seem to be flexible to your
specific needs.

But, have hardware vendors been receptive at all to making the patrol
scrubbers more tunable?

On 4/25/22 09:34, Jue Wang wrote:
> /* Could stop and return after the 1st poison is detected */
> #define MCESCAN_IOCTL_SCAN 0
> 
> struct SysramRegion {
>   /* input */
>   uint64_t first_byte;   /* first page-aligned physical address to scan */
>   uint64_t length;       /* page-aligned length of memory region to scan */
>   /* output */
>   uint32_t poisoned;     /* 1 - a poisoned page is found, 0 - otherwise */
>   uint32_t poisoned_pfn; /* PFN of the 1st detected poisoned page */
> }

So, the ioctl() caller has to know the physical address layout of the
system?

While this is a good start at a conversation, I think you might want to
back up a bit.  You alluded to a few requirements that you have, like:

 * Adjustable detector resource use based on system utilization
 * Adjustable scan rate to ensure issues are found at a deterministic
   rate
 * Detector must be able to find errors in allocated, in-use memory

What about SEV-SNP or TDX private memory?  It might be unmapped *and*
limited in how it can be accessed.  For instance, TDX hosts can't
practically read guest memory.  SEV-SNP hosts have special page mapping
requirements; the host can't create arbitrary mappings with arbitrary
mapping sizes.  What would this ioctl() do if asked to scan a TDX guest
private page?

Is doing it from userspace a strict requirement?

Would the detector just read memory?

Are there any other physical addresses which are RAM but should not have
the detector used on them?




* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-26 15:40 ` Dave Hansen
@ 2022-04-26 17:57   ` Jue Wang
  2022-04-26 18:02     ` Jue Wang
  2022-04-26 18:20     ` Dave Hansen
  0 siblings, 2 replies; 26+ messages in thread
From: Jue Wang @ 2022-04-26 17:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan, Greg Thelen,
	Mina Almasry, linux-mm, Sean Christopherson

Hi Dave,

Thanks for the reply, some comments inline.

On Tue, Apr 26, 2022 at 8:40 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> From your description, you have me mostly convinced that this is
> something that needs to get fixed.  The hardware patrol scrubber(s)
> address the same basic problem, but don't seem to be flexible to your
> specific needs.
>
> But, have hardware vendors been receptive at all to making the patrol
> scrubbers more tunable?

We have discussed the use case in detail with Intel. There are
improvements in progress to address some of the issues, like the
signaling to avoid broadcast MCEs. But fundamentally, the needed
throughput is not quite compatible with the patrol scrubber's design
purpose and architecture.

It's unclear in which hardware generation this need may get
addressed. So for now, we are looking at software-assisted approaches
that make use of the _whole_ CPU.
>
> On 4/25/22 09:34, Jue Wang wrote:
> > /* Could stop and return after the 1st poison is detected */
> > #define MCESCAN_IOCTL_SCAN 0
> >
> > struct SysramRegion {
> >   /* input */
> >   uint64_t first_byte;   /* first page-aligned physical address to scan */
> >   uint64_t length;       /* page-aligned length of memory region to scan */
> >   /* output */
> >   uint32_t poisoned;     /* 1 - a poisoned page is found, 0 - otherwise */
> >   uint32_t poisoned_pfn; /* PFN of the 1st detected poisoned page */
> > }
>
> So, the ioctl() caller has to know the physical address layout of the
> system?

This info is already exposed by the kernel via /proc/iomem and
/proc/zoneinfo.
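
To make the /proc/iomem point concrete, a user space agent could pick out
top-level "System RAM" entries roughly as below. This is a minimal sketch:
the function name is hypothetical, and it assumes the usual
"start-end : name" line format with inclusive hex addresses.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one /proc/iomem line. Returns true and fills *start / *end
 * (inclusive) only for a top-level "System RAM" entry.
 */
static bool parse_system_ram_line(const char *line,
				  uint64_t *start, uint64_t *end)
{
	char name[64];

	/* Nested resources (e.g. "Kernel code") are indented; skip them. */
	if (line[0] == ' ')
		return false;
	if (sscanf(line, "%" SCNx64 "-%" SCNx64 " : %63[^\n]",
		   start, end, name) != 3)
		return false;
	return strcmp(name, "System RAM") == 0;
}
```

Reading real addresses from /proc/iomem generally requires privilege;
unprivileged readers see the ranges zeroed out.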

>
> While this is a good start at a conversation, I think you might want to
> back up a bit.  You alluded to a few requirements that you have, like:
>
>  * Adjustable detector resource use based on system utilization
>  * Adjustable scan rate to ensure issues are found at a deterministic
>    rate
>  * Detector must be able to find errors in allocated, in-use memory
>
> What about SEV-SNP or TDX private memory?  It might be unmapped *and*
> limited in how it can be accessed.  For instance, TDX hosts can't
> practically read guest memory.  SEV-SNP hosts have special page mapping
> requirements; the host can't create arbitrary mappings with arbitrary
> mapping sizes.  What would this ioctl() do if asked to scan a TDX guest
> private page?
>

Thanks for raising the UPM case for SEV-SNP / TDX private memory. This
is an area where we would like more feedback and input from experts.

Is reading private memory via kernel's direct mapping benign for
SEV-SNP and TDX? If true, could this be a way to let SEV-SNP and TDX
use cases benefit from this work while the user space / hypervisor
mapping is still removed?

Otherwise this feature should be defined as mutually exclusive with
incompatible features. Even in that case, I believe SEV-SNP or TDX may
still benefit from _reactive_ memory poison recovery if the MCE
handling and CONFIG_MEMORY_FAILURE still function the same for an #MC
raised by an uncorrectable error.


> Is doing it from userspace a strict requirement?
>
> Would the detector just read memory?
>
> Are there any other physical addresses which are RAM but should not have
> the detector used on them?
>



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-26 17:57   ` Jue Wang
@ 2022-04-26 18:02     ` Jue Wang
  2022-04-26 18:21       ` Dave Hansen
  2022-04-26 18:20     ` Dave Hansen
  1 sibling, 1 reply; 26+ messages in thread
From: Jue Wang @ 2022-04-26 18:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan, Greg Thelen,
	Mina Almasry, linux-mm, Sean Christopherson

On Tue, Apr 26, 2022 at 10:57 AM Jue Wang <juew@google.com> wrote:
>
> Hi Dave,
>
> Thanks for the reply, some comments inline.
>
> On Tue, Apr 26, 2022 at 8:40 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > From your description, you have me mostly convinced that this is
> > something that needs to get fixed.  The hardware patrol scrubber(s)
> > address the same basic problem, but don't seem to be flexible to your
> > specific needs.
> >
> > But, have hardware vendors been receptive at all to making the patrol
> > scrubbers more tunable?
>
> We have discussed the use case in detail with Intel. There are
> improvements in progress to address some of the issues like the
> signaling to avoid broadcasted MCEs. But fundamentally, the needed
> throughput is not quite compatible with the patrol scrubber's design
> purpose and arch.
>
> It's unclear at what generation of hardware this need may get
> addressed. Thus now, we look at software assisted approaches making
> use of the _whole_ CPU.
> >
> > On 4/25/22 09:34, Jue Wang wrote:
> > > /* Could stop and return after the 1st poison is detected */
> > > #define MCESCAN_IOCTL_SCAN 0
> > >
> > > struct SysramRegion {
> > >   /* input */
> > >   uint64_t first_byte;   /* first page-aligned physical address to scan */
> > >   uint64_t length;       /* page-aligned length of memory region to scan */
> > >   /* output */
> > >   uint32_t poisoned;     /* 1 - a poisoned page is found, 0 - otherwise */
> > >   uint32_t poisoned_pfn; /* PFN of the 1st detected poisoned page */
> > > }
> >
> > So, the ioctl() caller has to know the physical address layout of the
> > system?
>
> This info is available from /proc/iomem and /proc/zoneinfo already
> supported / exposed by the kernel.
>
> >
> > While this is a good start at a conversation, I think you might want to
> > back up a bit.  You alluded to a few requirements that you have, like:
> >
> >  * Adjustable detector resource use based on system utilization
> >  * Adjustable scan rate to ensure issues are found at a deterministic
> >    rate
> >  * Detector must be able to find errors in allocated, in-use memory
> >
> > What about SEV-SNP or TDX private memory?  It might be unmapped *and*
> > limited in how it can be accessed.  For instance, TDX hosts can't
> > practically read guest memory.  SEV-SNP hosts have special page mapping
> > requirements; the host can't create arbitrary mappings with arbitrary
> > mapping sizes.  What would this ioctl() do if asked to scan a TDX guest
> > private page?
> >
>
> Thanks for raising the UPM case for SEV-SNP / TDX private memory. This
> is what we like to get more feedback and more experts' weigh-ins.
>
> Is reading private memory via kernel's direct mapping benign for
> SEV-SNP and TDX? If true, could this be a way to let SEV-SNP and TDX
> use cases benefit from this work while the user space / hypervisor
> mapping is still removed?
>
> Otherwise this feature should be defined as mutually exclusive with
> incompatible features. Even in that case, I believe SEV-SNP or TDX may
> still benefit from _reactive_ memory poison recovery if the MCE
> handling and CONFIG_MEMORY_FAILURE still function the same on
> uncorrectable error raised #MC.
>
>
> > Is doing it from userspace a strict requirement?
Not necessarily an absolute requirement.

We just found there are lots of policy and integration elements in
user space that cannot be avoided: what to scan, how fast to scan,
when to back off given the host's anticipated workload or special
customer requests, etc., and what to do with the detected errors in
terms of monitoring, telemetry, machine repair automation, scheduling
systems, etc.

> >
> > Would the detector just read memory?
Yes, a read transaction is sufficient to signal #MC on uncorrectable cache lines.
> >
> > Are there any other physical addresses which are RAM but should not have
> > the detector used on them?

In theory, if some physical address ranges are never / very rarely
accessed, they can be exempted.

> >



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-26 17:57   ` Jue Wang
  2022-04-26 18:02     ` Jue Wang
@ 2022-04-26 18:20     ` Dave Hansen
  2022-04-26 19:23       ` Jue Wang
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2022-04-26 18:20 UTC (permalink / raw)
  To: Jue Wang
  Cc: Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan, Greg Thelen,
	Mina Almasry, linux-mm, Sean Christopherson

On 4/26/22 10:57, Jue Wang wrote:
> On Tue, Apr 26, 2022 at 8:40 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> From your description, you have me mostly convinced that this is
>> something that needs to get fixed.  The hardware patrol scrubber(s)
>> address the same basic problem, but don't seem to be flexible to your
>> specific needs.
>>
>> But, have hardware vendors been receptive at all to making the patrol
>> scrubbers more tunable?
> 
> We have discussed the use case in detail with Intel. There are
> improvements in progress to address some of the issues like the
> signaling to avoid broadcasted MCEs. But fundamentally, the needed
> throughput is not quite compatible with the patrol scrubber's design
> purpose and arch.

This would be great material to cover in the changelog in some more
detail.

>> On 4/25/22 09:34, Jue Wang wrote:
>>> /* Could stop and return after the 1st poison is detected */
>>> #define MCESCAN_IOCTL_SCAN 0
>>>
>>> struct SysramRegion {
>>>   /* input */
>>>   uint64_t first_byte;   /* first page-aligned physical address to scan */
>>>   uint64_t length;       /* page-aligned length of memory region to scan */
>>>   /* output */
>>>   uint32_t poisoned;     /* 1 - a poisoned page is found, 0 - otherwise */
>>>   uint32_t poisoned_pfn; /* PFN of the 1st detected poisoned page */
>>> }
>>
>> So, the ioctl() caller has to know the physical address layout of the
>> system?
> 
> This info is available from /proc/iomem and /proc/zoneinfo already
> supported / exposed by the kernel.

I don't think they are good enough.

Think of a TDX guest.  It can't touch "unaccepted" memory.  But, that
information is not present in /proc/iomem.  In a TDX host (not upstream
yet), it can't touch any guest memory.  That's also not in /proc/iomem.

What if you're in a normal (non-TDX) guest and some of the physical
address space has been ballooned away?

What does the kernel do when userspace asks it to poke a non-"System
RAM" address?

>> While this is a good start at a conversation, I think you might want to
>> back up a bit.  You alluded to a few requirements that you have, like:
>>
>>  * Adjustable detector resource use based on system utilization
>>  * Adjustable scan rate to ensure issues are found at a deterministic
>>    rate
>>  * Detector must be able to find errors in allocated, in-use memory
>>
>> What about SEV-SNP or TDX private memory?  It might be unmapped *and*
>> limited in how it can be accessed.  For instance, TDX hosts can't
>> practically read guest memory.  SEV-SNP hosts have special page mapping
>> requirements; the host can't create arbitrary mappings with arbitrary
>> mapping sizes.  What would this ioctl() do if asked to scan a TDX guest
>> private page?
> 
> Thanks for raising the UPM case for SEV-SNP / TDX private memory. This
> is what we like to get more feedback and more experts' weigh-ins.
> 
> Is reading private memory via kernel's direct mapping benign for
> SEV-SNP and TDX? 

No.  It causes machine checks for TDX.

For SEV-SNP, I think reads of private memory read ciphertext.  I'm not
sure how benign it is or if it has any cache coherency implications.

> Otherwise this feature should be defined as mutually exclusive with
> incompatible features.

Just as an exercise, I'd suggest going and asking some of your
colleagues about this.  Surely, you're asking for this functionality
because Google wants to use it, and use it *widely*.  What would your
colleagues think if this wasn't available  at all on systems that use or
might use TDX?

For upstream, making features mutually exclusive is a deal breaker
unless it's absolutely necessary.

> Even in that case, I believe SEV-SNP or TDX may still benefit from
> _reactive_ memory poison recovery if the MCE handling and
> CONFIG_MEMORY_FAILURE still function the same on uncorrectable error
> raised #MC.

If I remember right, the blast radius for machine checks on systems
using TDX is substantially bigger than without TDX.  I think there are
quite a few more cases that are non-recoverable, like poison detected in
TDX metadata.  TDX systems have a *stronger* requirement to proactively
find issues than non-TDX systems.



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-26 18:02     ` Jue Wang
@ 2022-04-26 18:21       ` Dave Hansen
  2022-04-26 19:25         ` Jue Wang
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2022-04-26 18:21 UTC (permalink / raw)
  To: Jue Wang
  Cc: Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan, Greg Thelen,
	Mina Almasry, linux-mm, Sean Christopherson

On 4/26/22 11:02, Jue Wang wrote:
>>> Are there any other physical addresses which are RAM but should not have
>>> the detector used on them?
> In theory, if some physical address range are never / very rarely
> accessed, they can be exempted.

How would userspace know to exempt them?




* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-26 18:20     ` Dave Hansen
@ 2022-04-26 19:23       ` Jue Wang
  2022-04-26 19:39         ` Dave Hansen
  0 siblings, 1 reply; 26+ messages in thread
From: Jue Wang @ 2022-04-26 19:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan, Greg Thelen,
	Mina Almasry, linux-mm, Sean Christopherson

On Tue, Apr 26, 2022 at 11:18 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/26/22 10:57, Jue Wang wrote:
> > On Tue, Apr 26, 2022 at 8:40 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >> From your description, you have me mostly convinced that this is
> >> something that needs to get fixed.  The hardware patrol scrubber(s)
> >> address the same basic problem, but don't seem to be flexible to your
> >> specific needs.
> >>
> >> But, have hardware vendors been receptive at all to making the patrol
> >> scrubbers more tunable?
> >
> > We have discussed the use case in detail with Intel. There are
> > improvements in progress to address some of the issues like the
> > signaling to avoid broadcasted MCEs. But fundamentally, the needed
> > throughput is not quite compatible with the patrol scrubber's design
> > purpose and arch.
>
> This would be great material to cover in the changelog in some more
> detail.
>
> >> On 4/25/22 09:34, Jue Wang wrote:
> >>> /* Could stop and return after the 1st poison is detected */
> >>> #define MCESCAN_IOCTL_SCAN 0
> >>>
> >>> struct SysramRegion {
> >>>   /* input */
> >>>   uint64_t first_byte;   /* first page-aligned physical address to scan */
> >>>   uint64_t length;       /* page-aligned length of memory region to scan */
> >>>   /* output */
> >>>   uint32_t poisoned;     /* 1 - a poisoned page is found, 0 - otherwise */
> >>>   uint32_t poisoned_pfn; /* PFN of the 1st detected poisoned page */
> >>> }
> >>
> >> So, the ioctl() caller has to know the physical address layout of the
> >> system?
> >
> > This info is available from /proc/iomem and /proc/zoneinfo already
> > supported / exposed by the kernel.
>
> I don't think they are good enough.
>
> Think of a TDX guest.  It can't touch "unaccepted" memory.  But, that
> information is not present in /proc/iomem.  In a TDX host (not upstream
> yet), it can't touch any guest memory.  That's also not in /proc/iomem.

We will follow up on these topics wrt the interactions with TDX/SEV-SNP.

>
> What if you're in a normal (non-TDX) guest and some of the physical
> address space has been ballooned away?

Accessing memory that has been ballooned away will cause extra EPT
violations and have the memory faulted in on the host side, which is
transparent to the guest.
>
> What does the kernel do when userspace asks it to poke a non-"System
> RAM" address?

I expect the kernel should reject the request with -EINVAL.
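
A minimal sketch of that check, written as plain C rather than actual
kernel code: the ram_range type and the function name are hypothetical,
and the sketch assumes a request must fall entirely within a single
System RAM range to be accepted.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one System RAM range (end is inclusive). */
struct ram_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Returns 0 if [first_byte, first_byte + length) lies entirely inside
 * one of the given ranges, -EINVAL otherwise.
 */
static int validate_scan_range(const struct ram_range *ranges, size_t n,
			       uint64_t first_byte, uint64_t length)
{
	uint64_t last;
	size_t i;

	/* Reject empty requests and requests that wrap the address space. */
	if (length == 0 || first_byte + length - 1 < first_byte)
		return -EINVAL;
	last = first_byte + length - 1;

	for (i = 0; i < n; i++)
		if (first_byte >= ranges[i].start && last <= ranges[i].end)
			return 0;
	return -EINVAL;
}
```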
>
> >> While this is a good start at a conversation, I think you might want to
> >> back up a bit.  You alluded to a few requirements that you have, like:
> >>
> >>  * Adjustable detector resource use based on system utilization
> >>  * Adjustable scan rate to ensure issues are found at a deterministic
> >>    rate
> >>  * Detector must be able to find errors in allocated, in-use memory
> >>
> >> What about SEV-SNP or TDX private memory?  It might be unmapped *and*
> >> limited in how it can be accessed.  For instance, TDX hosts can't
> >> practically read guest memory.  SEV-SNP hosts have special page mapping
> >> requirements; the host can't create arbitrary mappings with arbitrary
> >> mapping sizes.  What would this ioctl() do if asked to scan a TDX guest
> >> private page?
> >
> > Thanks for raising the UPM case for SEV-SNP / TDX private memory. This
> > is what we like to get more feedback and more experts' weigh-ins.
> >
> > Is reading private memory via kernel's direct mapping benign for
> > SEV-SNP and TDX?
>
> No.  It causes machine checks for TDX.
>
> For SEV-SNP, I think reads of private memory read ciphertext.  I'm not
> sure how benign it is or if it has any cache coherency implications.
>
> > Otherwise this feature should be defined as mutually exclusive with
> > incompatible features.
>
> Just as an exercise, I'd suggest going and asking some of your
> colleagues about this.  Surely, you're asking for this functionality
> because Google wants to use it, and use it *widely*.  What would your
> colleagues think if this wasn't available  at all on systems that use or
> might use TDX?
>
> For upstream, making features mutually exclusive is a deal breaker
> unless it's absolutely necessary.

Ack, we will follow up within Google.

Just curious, what would be the recommendations from Intel's perspective
to make proactive poison detection work on TDX / SEV-SNP?

>
> > Even in that case, I believe SEV-SNP or TDX may still benefit from
> > _reactive_ memory poison recovery if the MCE handling and
> > CONFIG_MEMORY_FAILURE still function the same on uncorrectable error
> > raised #MC.
>
> If I remember right, the blast radius for machine checks on systems
> using TDX is substantially bigger than without TDX.  I think there are
> quite a few more cases that are non-recoverable, like poison detected in
> TDX metadata.  TDX systems have a *stronger* requirement to proactively
> find issues than non-TDX systems.



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-26 18:21       ` Dave Hansen
@ 2022-04-26 19:25         ` Jue Wang
  2022-04-26 19:52           ` Luck, Tony
  0 siblings, 1 reply; 26+ messages in thread
From: Jue Wang @ 2022-04-26 19:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan, Greg Thelen,
	Mina Almasry, linux-mm, Sean Christopherson

On Tue, Apr 26, 2022 at 11:18 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/26/22 11:02, Jue Wang wrote:
> >>> Are there any other physical addresses which are RAM but should not have
> >>> the detector used on them?
> > In theory, if some physical address range are never / very rarely
> > accessed, they can be exempted.
>
> How would userspace know to exempt them?

User space won't know. If the kernel has this knowledge, I suppose an
appropriate error code can be returned to inform user space that this
address region should be exempted from future scanning?
>



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-26 19:23       ` Jue Wang
@ 2022-04-26 19:39         ` Dave Hansen
  2022-04-26 19:50           ` Jue Wang
                             ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Dave Hansen @ 2022-04-26 19:39 UTC (permalink / raw)
  To: Jue Wang
  Cc: Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan, Greg Thelen,
	Mina Almasry, linux-mm, Sean Christopherson

On 4/26/22 12:23, Jue Wang wrote:
> On Tue, Apr 26, 2022 at 11:18 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> What if you're in a normal (non-TDX) guest and some of the physical
>> address space has been ballooned away?
> 
> Accessing to memory that gets ballooned away will cause extra EPT
> violations and have the memory faulted in on the host side, which is
> transparent to the guest.

Yeah, but it completely subverts the whole purpose of ballooning.  In
other words, this is for all intents and purposes also mutually
exclusive with ballooning.

>> What does the kernel do when userspace asks it to poke a non-"System
>> RAM" address?
> 
> I expect the kernel should reject the request with -EINVAL.

Right.  Only the kernel has the knowledge of what can actually _be_
scanned.  So, why even bother exposing physical addresses to userspace?
 Why is exposing the actual physical address any better than exposing a
cookie?

> Just curious, what could be recommendations from Intel's perspective
> to make proactively poison detection work on TDX / SEV-SNP?

I shouldn't speak for Intel as a whole, but I'll give you my personal
perspective.

Right now, hosts can't scan TDX private memory, period.  If you wanted
to do scanning, the guest has to do it or you have to kill the guest and
make the memory non-private.

Going forward, guest memory scanning could be accomplished by allowing
the VMM to migrate guest pages.  Let's say you want to scan page "A",
you could move A->B and B->A.  That would certainly touch the page.
This would need to be implemented in the TDX module.

Or, the TDX module could have a special call to just touch the page.

It would probably also need more work in the TDX module to be able to
handle machine checks.  I don't think the handling in there is very
robust today.

It could also be implemented with some new VMM-side ISA which promises
to touch the physical memory, but doesn't return any data, ignores the
"TD bit" and doesn't do any integrity checking.



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-26 19:39         ` Dave Hansen
@ 2022-04-26 19:50           ` Jue Wang
  2022-04-28 16:15           ` Erdem Aktas
  2022-05-02 17:19           ` David Hildenbrand
  2 siblings, 0 replies; 26+ messages in thread
From: Jue Wang @ 2022-04-26 19:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan, Greg Thelen,
	Mina Almasry, linux-mm, Sean Christopherson

On Tue, Apr 26, 2022 at 12:36 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/26/22 12:23, Jue Wang wrote:
> > On Tue, Apr 26, 2022 at 11:18 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >> What if you're in a normal (non-TDX) guest and some of the physical
> >> address space has been ballooned away?
> >
> > Accessing to memory that gets ballooned away will cause extra EPT
> > violations and have the memory faulted in on the host side, which is
> > transparent to the guest.
>
> Yeah, but it completely subverts the whole purpose of ballooning.  In
> other words, this is for all intents and purposes also mutually
> exclusive with ballooning.

True. We haven't thought too much about the TDX/SEV-SNP use cases.

For "normal" guests, the operating model is that it's sufficient for
the scanning to happen just on the host side, with the ability to
inject the detected errors into the guest in "real" time, without
interrupting execution, via the UCNA injection being added to KVM.
>
> >> What does the kernel do when userspace asks it to poke a non-"System
> >> RAM" address?
> >
> > I expect the kernel should reject the request with -EINVAL.
>
> Right.  Only the kernel has the knowledge of what can actually _be_
> scanned.  So, why even bother exposing physical addresses to userspace?
>  Why is exposing the actual physical address any better than exposing a
> cookie?

Then the API needs some redesign. Naively, I can see it working this way:

init_as_start_page(&cookie);

while(cookie != END_SCAN) {
   ret = ioctl(, SCAN_SIZE, numa_id, &cookie /*input_output*/ );
   /* Handle errors, poison detected etc. */
   ... ...
}
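
To make the shape of that loop concrete, here is a minimal user-space
sketch that only simulates it. Everything in it is hypothetical:
scan_step() stands in for the not-yet-designed ioctl, and SCAN_END,
struct scan_result and the fake chunk numbering are names made up for
illustration. No such kernel interface exists today.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel: handed back once every scannable range
 * has been visited. */
#define SCAN_END UINT64_MAX

/* Hypothetical summary of a completed scan. */
struct scan_result {
	uint64_t chunks_scanned;  /* scan steps performed */
	uint64_t poison_found;    /* steps that reported poison */
};

/*
 * Stand-in for the proposed ioctl. The cookie is opaque to the caller;
 * in this simulation it is just an index into `total_chunks` chunks,
 * and chunk number `bad_chunk` is pretended to be poisoned.
 * Returns 1 if poison was detected in this step, 0 otherwise.
 */
int scan_step(uint64_t *cookie, uint64_t total_chunks, uint64_t bad_chunk)
{
	assert(cookie && *cookie != SCAN_END);

	uint64_t cur = *cookie;

	/* Advance the cookie; the caller never interprets its value. */
	*cookie = (cur + 1 >= total_chunks) ? SCAN_END : cur + 1;
	return cur == bad_chunk;
}

/* Drive the scan to completion, the way the loop above would. */
struct scan_result scan_all(uint64_t total_chunks, uint64_t bad_chunk)
{
	struct scan_result res = { 0, 0 };
	uint64_t cookie = 0;  /* plays the role of init_as_start_page() */

	if (total_chunks == 0)
		return res;

	while (cookie != SCAN_END) {
		int ret = scan_step(&cookie, total_chunks, bad_chunk);

		res.chunks_scanned++;
		if (ret == 1)
			res.poison_found++;  /* handle detected poison here */
	}
	return res;
}
```

The point of the sketch is that user space only ever passes the cookie
back unchanged, so the kernel stays free to decide what is scannable
and in which order.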
>
> > Just curious, what could be recommendations from Intel's perspective
> > to make proactively poison detection work on TDX / SEV-SNP?
>
> I shouldn't speak for Intel as a whole, but I'll give you my personal
> perspective.
>
> Right now, hosts can't scan TDX private memory, period.  If you wanted
> to do scanning, the guest has to do it or you have to kill the guest and
> make the memory non-private.
>
> Going forward, guest memory scanning could be accomplished by allowing
> the VMM to migrate guest pages.  Let's say you want to scan page "A",
> you could move A->B and B->A.  That would certainly touch the page.
> This would need to be implemented in the TDX module.
>
> Or, the TDX module could have a special call to just touch the page.
>
> It would probably also need more work in the TDX module to be able to
> handle machine checks.  I don't think the handling in there is very
> robust today.
>
> It could also be implemented with some new VMM-side ISA which promises
> to touch the physical memory, but doesn't return any data, ignores the
> "TD bit" and doesn't do any integrity checking.

Thanks for the suggestions, Dave.

We will follow up offline with Google colleagues and look to bring
this up to some RAS discussion venue with Intel.



* RE: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-26 19:25         ` Jue Wang
@ 2022-04-26 19:52           ` Luck, Tony
  2022-04-26 20:06             ` Jue Wang
  0 siblings, 1 reply; 26+ messages in thread
From: Luck, Tony @ 2022-04-26 19:52 UTC (permalink / raw)
  To: Jue Wang, Hansen, Dave
  Cc: Naoya Horiguchi, Dave Hansen, Jiaqi Yan, Greg Thelen,
	Mina Almasry, linux-mm, Sean Christopherson

One thing that would be relatively easy to do would be to pre-allocate and pre-scan memory at guest creation:

1) Request to set up a guest with X GB memory
2) Allocate X GB
3) Zero it
4) Scan for poison
5) Map memory to guest and run the guest

This should work with TDX (because you scan while the host still has control of/access to the pages).

But this has issues if you have long-lived guests, or if you want to overcommit memory and so don't
really give a guest all the physical memory that it asks for.
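
Steps 2 to 4 can be sketched in user space as below. This is only an
illustration of the ordering, under the assumption that "scan" means
touching every cache line once. A real implementation would live in the
kernel or VMM and would learn about poison through the machine-check
path, not a return value. prescan_buffer() and CACHE_LINE are names
invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 64  /* assumed line size; hardware-dependent */

/*
 * Allocate `size` bytes, zero them, then read one byte per cache line
 * so that every line is touched once. Returns the buffer (the caller
 * frees it) or NULL on allocation failure. `*lines_touched` reports
 * how many lines were read.
 */
uint8_t *prescan_buffer(size_t size, size_t *lines_touched)
{
	assert(lines_touched);
	*lines_touched = 0;

	uint8_t *buf = malloc(size);      /* step 2: allocate */
	if (!buf)
		return NULL;

	memset(buf, 0, size);             /* step 3: zero */

	/* step 4: volatile reads keep the compiler from eliding the loads */
	const volatile uint8_t *p = buf;
	uint8_t sink = 0;

	for (size_t off = 0; off < size; off += CACHE_LINE) {
		sink |= p[off];
		(*lines_touched)++;
	}
	(void)sink;

	return buf;                       /* step 5 would map it to the guest */
}
```

Note that reads issued right after the memset() will mostly be served
from cache, so a real scanner would also have to make sure the access
actually reaches DRAM.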

-Tony


* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-26 19:52           ` Luck, Tony
@ 2022-04-26 20:06             ` Jue Wang
  0 siblings, 0 replies; 26+ messages in thread
From: Jue Wang @ 2022-04-26 20:06 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Hansen, Dave, Naoya Horiguchi, Dave Hansen, Jiaqi Yan,
	Greg Thelen, Mina Almasry, linux-mm, Sean Christopherson

On Tue, Apr 26, 2022 at 12:52 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> One thing that would be relatively easy to do would be pre-allocate and pre-scan memory at guest creation:
>
> 1) Request to set up a guest with X GB memory
> 2) Allocate X GB
> 3) Zero it
> 4) Scan for poison
> 5) Map memory to guest and run the guest
>
> Should work with TDX (because you scan while host still has control/access to the pages).
>
> But this has issues if you have long-lived guests. Or want to overcommit memory so don't
> really give a guest all the physical memory that it asks for.

Thanks Tony.

I agree this could be a starting point to get TDX / SEV-SNP guest
memory scanned. It may still be much better than not scanning it at all.

We still need to follow up on a long-term solution for long-running guests.

>
> -Tony



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-26 19:39         ` Dave Hansen
  2022-04-26 19:50           ` Jue Wang
@ 2022-04-28 16:15           ` Erdem Aktas
  2022-04-28 16:34             ` Dave Hansen
  2022-05-02 17:19           ` David Hildenbrand
  2 siblings, 1 reply; 26+ messages in thread
From: Erdem Aktas @ 2022-04-28 16:15 UTC (permalink / raw)
  To: dave.hansen
  Cc: almasrymina, dave.hansen, gthelen, jiaqiyan, juew, linux-mm,
	naoya.horiguchi, seanjc, tony.luck

> On 4/26/22 12:23, Jue Wang wrote:
> > On Tue, Apr 26, 2022 at 11:18 AM Dave Hansen <dave.hansen@intel.com> wrote:
> I shouldn't speak for Intel as a whole, but I'll give you my personal
> perspective.
>
> Right now, hosts can't scan TDX private memory, period.  If you wanted
> to do scanning, the guest has to do it or you have to kill the guest and
> make the memory non-private.

Actually, afaiu, the host can read TDX private memory. This should NOT generate
#MC due to integrity/TD ownership but return a fixed value of "0"s. I do not
know if this will also trigger #MCs due to memory errors.

>
> Going forward, guest memory scanning could be accomplished by allowing
> the VMM to migrate guest pages.  Let's say you want to scan page "A",
> you could move A->B and B->A.  That would certainly touch the page.
> This would need to be implemented in the TDX module.

TDH.MEM.PAGE.RELOCATE should be able to migrate guest pages, but I am not sure
this would be feasible; it depends on how often we would need to relocate pages.



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-28 16:15           ` Erdem Aktas
@ 2022-04-28 16:34             ` Dave Hansen
  2022-04-29 19:46               ` Jue Wang
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2022-04-28 16:34 UTC (permalink / raw)
  To: Erdem Aktas
  Cc: almasrymina, dave.hansen, gthelen, jiaqiyan, juew, linux-mm,
	naoya.horiguchi, seanjc, tony.luck

On 4/28/22 09:15, Erdem Aktas wrote:
>> On 4/26/22 12:23, Jue Wang wrote:
>>> On Tue, Apr 26, 2022 at 11:18 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> I shouldn't speak for Intel as a whole, but I'll give you my personal
>> perspective.
>>
>> Right now, hosts can't scan TDX private memory, period.  If you wanted
>> to do scanning, the guest has to do it or you have to kill the guest and
>> make the memory non-private.
> 
> Actually, afaiu, the host can read tdx private memory. This should NOT generate
> #MC due to integrity/TD ownership but return a fixed value of "0"s. I do not 
> know if this will also trigger #MCs due to memory errors.

I think you're right, at least in the normal case where the access is
performed with the TME KeyID.  "An introductory overview of the Intel
TDX technology"[1] says:

> The TD-bit associated with the line in memory seeks to
> detect software or devices attempting to read memory
> encrypted with private KeyID, using a shared KeyID, to reveal
> the ciphertext. On such accesses, the MKTME returns a fixed
> pattern to prevent ciphertext analysis.

I guess, in practice, the read would need to go all the way out to the
memory controller to get the TD-bit.  But, it's definitely not
well-defined in the spec.

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-28 16:34             ` Dave Hansen
@ 2022-04-29 19:46               ` Jue Wang
  2022-04-29 21:10                 ` Dave Hansen
  2022-05-02 15:30                 ` Dave Hansen
  0 siblings, 2 replies; 26+ messages in thread
From: Jue Wang @ 2022-04-29 19:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Erdem Aktas, almasrymina, dave.hansen, gthelen, jiaqiyan,
	linux-mm, naoya.horiguchi, seanjc, tony.luck

Thanks Erdem and Dave, as a summary:

1. Reading via the direct mapping from TDX guest private memory should
not generate #MC due to integrity/TD ownership, and should still result
in the expected memory error poisoning behavior (to be confirmed).

2. Reading via direct mapping from SEV-SNP guest private memory does
not generate #MC or #PF.

Per seanjc@google.com:
TDX doesn't support #MC exception injection, but IRQ "injection" via
posted interrupts is supported. Accesses to machine check MSRs will
#VE, i.e. can be emulated by KVM, so CMCI should work fine for TDX
guests.

Proactively scanning for memory errors should benefit TDX guests by
preventing potential host shutdowns.

It seems the current proposed design can cover TDX & SEV-SNP if the
direct mapping to guest private memory is preserved?

Thanks,
-Jue

On Thu, Apr 28, 2022 at 9:34 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/28/22 09:15, Erdem Aktas wrote:
> >> On 4/26/22 12:23, Jue Wang wrote:
> >>> On Tue, Apr 26, 2022 at 11:18 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >> I shouldn't speak for Intel as a whole, but I'll give you my personal
> >> perspective.
> >>
> >> Right now, hosts can't scan TDX private memory, period.  If you wanted
> >> to do scanning, the guest has to do it or you have to kill the guest and
> >> make the memory non-private.
> >
> > Actually, afaiu, the host can read tdx private memory. This should NOT generate
> > #MC due to integrity/TD ownership but return a fixed value of "0"s. I do not
> > know if this will also trigger #MCs due to memory errors.
>
> I think you're right, at least in the normal case where the access is
> performed with the TME KeyID.  "An introductory overview of the Intel
> TDX technology"[1] says:
>
> > The TD-bit associated with the line in memory seeks to
> > detect software or devices attempting to read memory
> > encrypted with private KeyID, using a shared KeyID, to reveal
> > the ciphertext. On such accesses, the MKTME returns a fixed
> > pattern to prevent ciphertext analysis.
>
> I guess, in practice, the read would need to go all the way out to the
> memory controller to get the TD-bit.  But, it's definitely not
> well-defined in the spec.
>
> 1.
> https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-29 19:46               ` Jue Wang
@ 2022-04-29 21:10                 ` Dave Hansen
  2022-04-29 21:32                   ` Jue Wang
  2022-05-02 15:30                 ` Dave Hansen
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2022-04-29 21:10 UTC (permalink / raw)
  To: Jue Wang
  Cc: Erdem Aktas, almasrymina, dave.hansen, gthelen, jiaqiyan,
	linux-mm, naoya.horiguchi, seanjc, tony.luck

On 4/29/22 12:46, Jue Wang wrote:
> Per seanjc@google.com:
> TDX doesn't support #MC exception injection, but IRQ "injection" via
> posted interrupts is supported. Accesses to machine check MSRs will
> #VE, i.e. can be emulated by KVM, so CMCI should work fine for TDX
> guests.
> 
> Proactively scanning for memory error should benefit TDX guests
> preventing potential host shutdowns.

It also needs to know to avoid unaccepted memory in TDX guests, at *least*.

> It seems the current proposed design can cover TDX & SEV-SNP if the
> direct mapping to guest private memory is preserved?

I wouldn't go that far.  The unaccepted TDX guest memory thing is just
the obvious one at the moment.  There are a whole ton of other guest
ballooning mechanisms out there and I'm not sure that all of them are
happy to let you touch ballooned-away memory.

But, the bigger issue is that those cases had not even been considered.
It means that there is a *LOT* of homework needed to seek out and cover
all the other weird cases.

I also think the proposed ABI -- exposing physical addresses to
userspace as a part of the design -- is an utter non-starter.



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-29 21:10                 ` Dave Hansen
@ 2022-04-29 21:32                   ` Jue Wang
  2022-04-29 21:44                     ` Jue Wang
  2022-04-29 22:29                     ` Dave Hansen
  0 siblings, 2 replies; 26+ messages in thread
From: Jue Wang @ 2022-04-29 21:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Erdem Aktas, almasrymina, dave.hansen, gthelen, jiaqiyan,
	linux-mm, naoya.horiguchi, seanjc, tony.luck

On Fri, Apr 29, 2022 at 2:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/29/22 12:46, Jue Wang wrote:
> > Per seanjc@google.com:
> > TDX doesn't support #MC exception injection, but IRQ "injection" via
> > posted interrupts is supported. Accesses to machine check MSRs will
> > #VE, i.e. can be emulated by KVM, so CMCI should work fine for TDX
> > guests.
> >
> > Proactively scanning for memory error should benefit TDX guests
> > preventing potential host shutdowns.
>
> It also need to know to avoid unaccepted memory in TDX guests at *least*.
>
> > It seems the current proposed design can cover TDX & SEV-SNP if the
> > direct mapping to guest private memory is preserved?
>
> I wouldn't go that far.  The unaccepted TDX guest memory thing is just
> the obvious one at the moment.  There are a whole ton of other guest
> ballooning mechanisms out there and I'm not sure that all of them are
> happy to let you touch ballooned-away memory.

This type of scanning is intended to be run on the host side. That
should avoid concerns around guest ballooning or any effects on
host-side reclaim that is based on the guest's working set.

I don't know why a guest would want to spend its CPU cycles and pollute
its caches etc. to run this scanner, anyway. This should be a benefit
provided by the cloud platform transparently to the guest.


>
> But, the bigger issue is that those cases had not even been considered.
>  It means that there is a *LOT* of homework needed to seek out and cover
> all the other weird cases.
>
> I also think the proposed ABI -- exposing physical addresses to
> userspace as a part of the design -- is an utter non-starter.

This can be addressed with a different design.



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-29 21:32                   ` Jue Wang
@ 2022-04-29 21:44                     ` Jue Wang
  2022-04-29 22:29                     ` Dave Hansen
  1 sibling, 0 replies; 26+ messages in thread
From: Jue Wang @ 2022-04-29 21:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Erdem Aktas, almasrymina, dave.hansen, gthelen, jiaqiyan,
	linux-mm, naoya.horiguchi, seanjc, tony.luck

On Fri, Apr 29, 2022 at 2:32 PM Jue Wang <juew@google.com> wrote:
>
> On Fri, Apr 29, 2022 at 2:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 4/29/22 12:46, Jue Wang wrote:
> > > Per seanjc@google.com:
> > > TDX doesn't support #MC exception injection, but IRQ "injection" via
> > > posted interrupts is supported. Accesses to machine check MSRs will
> > > #VE, i.e. can be emulated by KVM, so CMCI should work fine for TDX
> > > guests.
> > >
> > > Proactively scanning for memory error should benefit TDX guests
> > > preventing potential host shutdowns.
> >
> > It also need to know to avoid unaccepted memory in TDX guests at *least*.
> >
> > > It seems the current proposed design can cover TDX & SEV-SNP if the
> > > direct mapping to guest private memory is preserved?
> >
> > I wouldn't go that far.  The unaccepted TDX guest memory thing is just
> > the obvious one at the moment.  There are a whole ton of other guest
> > ballooning mechanisms out there and I'm not sure that all of them are
> > happy to let you touch ballooned-away memory.
>
> This type of scanning is intended to be run on the host side. That
> should avoid concerns around the guest ballooning or any effects to
> the host side reclaim that's based on the guest's working set.
>
> I don't know why a guest wants to spend its CPU cycles and pollute its
> caches etc to run this scanner, anyway. This should be a benefit
> provide by the cloud platform transparently to the guest.

A guest scanning its own memory does not provide the coverage that
host-wide scanning can: it neither prevents fatal system crashes nor
covers memory that affects this guest but is not accessible to the
guest.

>
>
> >
> > But, the bigger issue is that those cases had not even been considered.
> >  It means that there is a *LOT* of homework needed to seek out and cover
> > all the other weird cases.
> >
> > I also think the proposed ABI -- exposing physical addresses to
> > userspace as a part of the design -- is an utter non-starter.
>
> This can be addressed with a different design.



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-29 21:32                   ` Jue Wang
  2022-04-29 21:44                     ` Jue Wang
@ 2022-04-29 22:29                     ` Dave Hansen
  2022-04-29 22:53                       ` Jue Wang
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2022-04-29 22:29 UTC (permalink / raw)
  To: Jue Wang
  Cc: Erdem Aktas, almasrymina, dave.hansen, gthelen, jiaqiyan,
	linux-mm, naoya.horiguchi, seanjc, tony.luck

On 4/29/22 14:32, Jue Wang wrote:
> On Fri, Apr 29, 2022 at 2:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> I wouldn't go that far.  The unaccepted TDX guest memory thing is just
>> the obvious one at the moment.  There are a whole ton of other guest
>> ballooning mechanisms out there and I'm not sure that all of them are
>> happy to let you touch ballooned-away memory.
> 
> This type of scanning is intended to be run on the host side. That
> should avoid concerns around the guest ballooning or any effects to
> the host side reclaim that's based on the guest's working set.

Hint: Talk is cheap.  Just saying how it is intended doesn't avoid
concerns.

Saying how it is intended, then backing up that intent with code and
deliberate design that matches that intent would be nice.

> I don't know why a guest wants to spend its CPU cycles and pollute its
> caches etc to run this scanner, anyway. This should be a benefit
> provide by the cloud platform transparently to the guest.

"This should only be used by and made available by cloud providers!" ...
says the cloud provider. ;)

Also, who said anything about polluting the caches?  Aren't there lots
of reasons for a memory poison detector to intentionally not use the
caches?  First, you really *do* always want to go to memory.  That's
kinda the point.  If this code hits the caches, it's kinda pointless.

Second, you want this code to have a low profile.  Not polluting the
caches seems like a good way to have a low profile.

>> But, the bigger issue is that those cases had not even been considered.
>>  It means that there is a *LOT* of homework needed to seek out and cover
>> all the other weird cases.
>>
>> I also think the proposed ABI -- exposing physical addresses to
>> userspace as a part of the design -- is an utter non-starter.
> 
> This can be addressed with a different design.




* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-29 22:29                     ` Dave Hansen
@ 2022-04-29 22:53                       ` Jue Wang
  0 siblings, 0 replies; 26+ messages in thread
From: Jue Wang @ 2022-04-29 22:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Erdem Aktas, almasrymina, dave.hansen, gthelen, jiaqiyan,
	linux-mm, naoya.horiguchi, seanjc, tony.luck

On Fri, Apr 29, 2022 at 3:29 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/29/22 14:32, Jue Wang wrote:
> > On Fri, Apr 29, 2022 at 2:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >> I wouldn't go that far.  The unaccepted TDX guest memory thing is just
> >> the obvious one at the moment.  There are a whole ton of other guest
> >> ballooning mechanisms out there and I'm not sure that all of them are
> >> happy to let you touch ballooned-away memory.
> >
> > This type of scanning is intended to be run on the host side. That
> > should avoid concerns around the guest ballooning or any effects to
> > the host side reclaim that's based on the guest's working set.
>
> Hint: Talk is cheap.  Just saying how it is intended doesn't avoid
> concerns.
>
> Saying how it is intended, then backing up that intent with code and
> deliberate design that matches that intent would be nice.
>
> > I don't know why a guest wants to spend its CPU cycles and pollute its
> > caches etc to run this scanner, anyway. This should be a benefit
> > provide by the cloud platform transparently to the guest.
>
> "This should only be used by and made available by cloud providers!" ...
> says the cloud provider. ;)

This is a much better way to put it.

How to express in the design that some kernel component is "best used
by and made available by cloud providers" is what I would like to get
some feedback on. :-)

>
> Also, who said anything about polluting the caches?  Aren't there lots
> of reasons for a memory poison detector to intentionally not use the
> caches?  First, you really *do* always want to go to memory.  That's
> kinda the point.  If this code hits the caches, it's kinda pointless.
>
> Second, you want this code to have a low profile.  Not polluting the
> caches seems like a good way to have a low profile.
>

We have been experimenting with a non-temporal prefetch hint
(prefetchnta), which worked as intended based on perf measurements. The
pollution of the LLC is minimal but non-zero.
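
For illustration, the access pattern looks roughly like the sketch
below. It is not the scanner referred to above. It uses the GCC/Clang
builtin __builtin_prefetch() with locality 0, which those compilers
typically lower to prefetchnta on x86; touch_range_nta() and LINE_SIZE
are names made up for the example. A prefetch is only a hint, so the
sketch follows it with a real load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64  /* assumed cache-line size */

/*
 * Walk [buf, buf + len) one cache line at a time. Each line gets a
 * non-temporal prefetch hint (rw = 0: read, locality = 0: no temporal
 * locality) followed by a volatile load so the access cannot be
 * optimized away. Returns the number of lines visited.
 */
size_t touch_range_nta(const void *buf, size_t len)
{
	assert(buf || len == 0);

	const uint8_t *base = buf;
	const volatile uint8_t *p = base;
	size_t lines = 0;

	for (size_t off = 0; off < len; off += LINE_SIZE) {
		__builtin_prefetch(base + off, 0, 0);
		(void)p[off];  /* volatile read: forces the load */
		lines++;
	}
	return lines;
}
```

Whether the load is then served from DRAM or from a cache still depends
on the state of the line, which this sketch does not control.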

This is definitely an area we want to keep iterating on, and we would love to hear feedback.

Thanks,
-Jue



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-29 19:46               ` Jue Wang
  2022-04-29 21:10                 ` Dave Hansen
@ 2022-05-02 15:30                 ` Dave Hansen
  1 sibling, 0 replies; 26+ messages in thread
From: Dave Hansen @ 2022-05-02 15:30 UTC (permalink / raw)
  To: Jue Wang
  Cc: Erdem Aktas, almasrymina, dave.hansen, gthelen, jiaqiyan,
	linux-mm, naoya.horiguchi, seanjc, tony.luck

On 4/29/22 12:46, Jue Wang wrote:
> 1. Reading via direct mapping from guest private memory should not
> generate #MC and it should result in expected memory error poisoning
> behavior (to be confirmed).
> 
> 2. Reading via direct mapping from SEV-SNP guest private memory does
> not generate #MC or #PF.

There are two different things you need to look at here:

1. What is the *implementation* today?
2. What is the architecture to which the hardware vendors will commit?

Let's say that, today, a TDX host accessing guest-private memory doesn't
trigger the error handling that you want.  Then, this scheme simply
won't work on today's TDX hardware.  You can only hope for better
hardware in the future.

Now, consider if you get lucky: Today, a TDX host accessing
guest-private memory *DOES* trigger the error handling that you want.
That's great, but it doesn't mean that the behavior will stick.  Intel
might change it _tomorrow_ without telling anyone because folks believe
it to be software-invisible.

Either way, you need to extract promises from hardware vendors if you
want to depend on this scheme.



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-04-26 19:39         ` Dave Hansen
  2022-04-26 19:50           ` Jue Wang
  2022-04-28 16:15           ` Erdem Aktas
@ 2022-05-02 17:19           ` David Hildenbrand
  2022-05-02 17:30             ` Jue Wang
  2 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2022-05-02 17:19 UTC (permalink / raw)
  To: Dave Hansen, Jue Wang
  Cc: Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan, Greg Thelen,
	Mina Almasry, linux-mm, Sean Christopherson

On 26.04.22 21:39, Dave Hansen wrote:
> On 4/26/22 12:23, Jue Wang wrote:
>> On Tue, Apr 26, 2022 at 11:18 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>> What if you're in a normal (non-TDX) guest and some of the physical
>>> address space has been ballooned away?
>>
>> Accessing to memory that gets ballooned away will cause extra EPT
>> violations and have the memory faulted in on the host side, which is
>> transparent to the guest.
> 
> Yeah, but it completely subverts the whole purpose of ballooning.  In
> other words, this is for all intents and purposes also mutually
> exclusive with ballooning.

Some balloon (or balloon-like) implementations don't support reading
memory that's mapped into the direct map. For example, with newer
virtio-mem devices in the hypervisor, reading unplugged memory can
result in undefined behavior (in the worst case, you'll get your VM zapped).

Reading random physical memory ranges without further checks is a very
bad idea. There are more corner cases that we exclude, e.g., when
reading /proc/kcore.

Take a look at the read_kcore() KCORE_RAM case, where we exclude, e.g.,
reading PageOffline(), is_page_hwpoison() and !pfn_is_ram() pages.
Unaccepted memory might be another case we want to exclude there in the
future.


I assume something like what you imagine could be implemented in user
space right now, in an unsafe way, just by relying on /proc/iomem and
/proc/kcore. So you might want something similar, but obviously without
exporting page content to user space, and requiring root permissions.
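
As a sketch of the first half of that idea, the function below parses
one line in /proc/iomem format and reports whether it is a top-level
"System RAM" range. parse_system_ram() and struct ram_range are names
invented here. It only identifies candidate ranges and does none of the
per-page checks mentioned above (PageOffline(), hwpoison, pfn_is_ram()),
which is exactly why a pure user-space scanner would be unsafe.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct ram_range {
	uint64_t start;  /* first byte of the range */
	uint64_t end;    /* last byte of the range (inclusive) */
};

/*
 * Parse one line in /proc/iomem format, e.g.
 *   "00100000-bffdffff : System RAM"
 * Returns 1 and fills *out if the line is a top-level System RAM
 * entry, 0 otherwise. Nested (indented) entries are skipped.
 */
int parse_system_ram(const char *line, struct ram_range *out)
{
	unsigned long long start, end;
	char name[64];

	assert(line && out);

	if (line[0] == ' ')
		return 0;  /* child resource, e.g. "Kernel code" */
	if (sscanf(line, "%llx-%llx : %63[^\n]", &start, &end, name) != 3)
		return 0;
	if (strcmp(name, "System RAM") != 0)
		return 0;

	out->start = start;
	out->end = end;
	return 1;
}
```

A caller would feed it each line read from /proc/iomem with fgets().
On current kernels an unprivileged reader sees all addresses as zero,
so this only yields usable ranges when run as root.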

-- 
Thanks,

David / dhildenb




* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-05-02 17:19           ` David Hildenbrand
@ 2022-05-02 17:30             ` Jue Wang
  2022-05-02 17:33               ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Jue Wang @ 2022-05-02 17:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan,
	Greg Thelen, Mina Almasry, linux-mm, Sean Christopherson

On Mon, May 2, 2022 at 10:19 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 26.04.22 21:39, Dave Hansen wrote:
> > On 4/26/22 12:23, Jue Wang wrote:
> >> On Tue, Apr 26, 2022 at 11:18 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >>> What if you're in a normal (non-TDX) guest and some of the physical
> >>> address space has been ballooned away?
> >>
> >> Accessing to memory that gets ballooned away will cause extra EPT
> >> violations and have the memory faulted in on the host side, which is
> >> transparent to the guest.
> >
> > Yeah, but it completely subverts the whole purpose of ballooning.  In
> > other words, this is for all intents and purposes also mutually
> > exclusive with ballooning.
>
> Some balloon (or balloon-like) implementations don't support reading
> memory that's mapped into the direct map. For example, with never
> virtio-mem devices in the hypervisor, reading unplugged memory can
> result in undefined behavior (in the worst case, you'll get your VM zapped).
>
> Reading random physical memory ranges without further checks is a very
> bad idea. There are more corner cases, that we e.g., exclude when
> reading /proc/kcore.
>
> Take a look at read_kcore() KCORE_RAM case, where we e.g., exclude
> reading PageOffline(), is_page_hwpoison() and !pfn_is_ram(). Unaccepted
> memory might be another case we want to exclude there in the future.
>
>
> I assume something as you imagine could be implemented in user space
> just by relying on /proc/iomem and /proc/kcore right now in an unsafe
> way. So you might want something similar, however, obviously without
> exporting page content to user space and requiring root permissions.

Thanks.

Are the following cases benign if the scan only happens on the host side?

. virtio-mem - unplugged memory
. Unaccepted memory


>
> --
> Thanks,
>
> David / dhildenb
>



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-05-02 17:30             ` Jue Wang
@ 2022-05-02 17:33               ` David Hildenbrand
  2022-05-02 17:36                 ` Jue Wang
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2022-05-02 17:33 UTC (permalink / raw)
  To: Jue Wang
  Cc: Dave Hansen, Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan,
	Greg Thelen, Mina Almasry, linux-mm, Sean Christopherson

On 02.05.22 19:30, Jue Wang wrote:
> On Mon, May 2, 2022 at 10:19 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 26.04.22 21:39, Dave Hansen wrote:
>>> On 4/26/22 12:23, Jue Wang wrote:
>>>> On Tue, Apr 26, 2022 at 11:18 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>>>> What if you're in a normal (non-TDX) guest and some of the physical
>>>>> address space has been ballooned away?
>>>>
>>>> Accessing to memory that gets ballooned away will cause extra EPT
>>>> violations and have the memory faulted in on the host side, which is
>>>> transparent to the guest.
>>>
>>> Yeah, but it completely subverts the whole purpose of ballooning.  In
>>> other words, this is for all intents and purposes also mutually
>>> exclusive with ballooning.
>>
>> Some balloon (or balloon-like) implementations don't support reading
>> memory that's mapped into the direct map. For example, with never
>> virtio-mem devices in the hypervisor, reading unplugged memory can
>> result in undefined behavior (in the worst case, you'll get your VM zapped).
>>
>> Reading random physical memory ranges without further checks is a very
>> bad idea. There are more corner cases, that we e.g., exclude when
>> reading /proc/kcore.
>>
>> Take a look at read_kcore() KCORE_RAM case, where we e.g., exclude
>> reading PageOffline(), is_page_hwpoison() and !pfn_is_ram(). Unaccepted
>> memory might be another case we want to exclude there in the future.
>>
>>
>> I assume something as you imagine could be implemented in user space
>> just by relying on /proc/iomem and /proc/kcore right now in an unsafe
>> way. So you might want something similar, however, obviously without
>> exporting page content to user space and requiring root permissions.
> 
> Thanks.
> 
> Are the following cases benign if the scan only happens on the host side?
> 
> . virtio-mem - unplugged memory
> . Unaccepted memory

No, only in virtualized worlds.

I assume GART memory that implements the pfn_is_ram() callback is around
on physical machines.


-- 
Thanks,

David / dhildenb




* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-05-02 17:33               ` David Hildenbrand
@ 2022-05-02 17:36                 ` Jue Wang
  2022-05-02 17:38                   ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Jue Wang @ 2022-05-02 17:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan,
	Greg Thelen, Mina Almasry, linux-mm, Sean Christopherson

On Mon, May 2, 2022 at 10:33 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.05.22 19:30, Jue Wang wrote:
> > On Mon, May 2, 2022 at 10:19 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 26.04.22 21:39, Dave Hansen wrote:
> >>> On 4/26/22 12:23, Jue Wang wrote:
> >>>> On Tue, Apr 26, 2022 at 11:18 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >>>>> What if you're in a normal (non-TDX) guest and some of the physical
> >>>>> address space has been ballooned away?
> >>>>
> >>>> Accessing to memory that gets ballooned away will cause extra EPT
> >>>> violations and have the memory faulted in on the host side, which is
> >>>> transparent to the guest.
> >>>
> >>> Yeah, but it completely subverts the whole purpose of ballooning.  In
> >>> other words, this is for all intents and purposes also mutually
> >>> exclusive with ballooning.
> >>
> >> Some balloon (or balloon-like) implementations don't support reading
> >> memory that's mapped into the direct map. For example, with newer
> >> virtio-mem devices in the hypervisor, reading unplugged memory can
> >> result in undefined behavior (in the worst case, you'll get your VM zapped).
> >>
> >> Reading random physical memory ranges without further checks is a very
> >> bad idea. There are more corner cases, that we e.g., exclude when
> >> reading /proc/kcore.
> >>
> >> Take a look at read_kcore() KCORE_RAM case, where we e.g., exclude
> >> reading PageOffline(), is_page_hwpoison() and !pfn_is_ram(). Unaccepted
> >> memory might be another case we want to exclude there in the future.
> >>
> >>
> >> I assume something as you imagine could be implemented in user space
> >> just by relying on /proc/iomem and /proc/kcore right now in an unsafe
> >> way. So you might want something similar, however, obviously without
> >> exporting page content to user space and requiring root permissions.
> >
> > Thanks.
> >
> > Are the following cases benign if the scan only happens on the host side?
> >
> > . virtio-mem - unplugged memory
> > . Unaccepted memory
>
> No, only in virtualized worlds.
>
> I assume GART memory that implements the pfn_is_ram() callback is around
> on physical machines.

I think the host E820 map provides an accurate view of which address ranges
are RAM, doesn't it?



* Re: [RFC] Expose a memory poison detector ioctl to user space.
  2022-05-02 17:36                 ` Jue Wang
@ 2022-05-02 17:38                   ` David Hildenbrand
  0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2022-05-02 17:38 UTC (permalink / raw)
  To: Jue Wang
  Cc: Dave Hansen, Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan,
	Greg Thelen, Mina Almasry, linux-mm, Sean Christopherson

On 02.05.22 19:36, Jue Wang wrote:
> On Mon, May 2, 2022 at 10:33 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 02.05.22 19:30, Jue Wang wrote:
>>> On Mon, May 2, 2022 at 10:19 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 26.04.22 21:39, Dave Hansen wrote:
>>>>> On 4/26/22 12:23, Jue Wang wrote:
>>>>>> On Tue, Apr 26, 2022 at 11:18 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>>>>>> What if you're in a normal (non-TDX) guest and some of the physical
>>>>>>> address space has been ballooned away?
>>>>>>
>>>>>> Accessing memory that gets ballooned away will cause extra EPT
>>>>>> violations and have the memory faulted in on the host side, which is
>>>>>> transparent to the guest.
>>>>>
>>>>> Yeah, but it completely subverts the whole purpose of ballooning.  In
>>>>> other words, this is for all intents and purposes also mutually
>>>>> exclusive with ballooning.
>>>>
>>>> Some balloon (or balloon-like) implementations don't support reading
>>>> memory that's mapped into the direct map. For example, with newer
>>>> virtio-mem devices in the hypervisor, reading unplugged memory can
>>>> result in undefined behavior (in the worst case, you'll get your VM zapped).
>>>>
>>>> Reading random physical memory ranges without further checks is a very
>>>> bad idea. There are more corner cases, that we e.g., exclude when
>>>> reading /proc/kcore.
>>>>
>>>> Take a look at read_kcore() KCORE_RAM case, where we e.g., exclude
>>>> reading PageOffline(), is_page_hwpoison() and !pfn_is_ram(). Unaccepted
>>>> memory might be another case we want to exclude there in the future.
>>>>
>>>>
>>>> I assume something as you imagine could be implemented in user space
>>>> just by relying on /proc/iomem and /proc/kcore right now in an unsafe
>>>> way. So you might want something similar, however, obviously without
>>>> exporting page content to user space and requiring root permissions.
>>>
>>> Thanks.
>>>
>>> Are the following cases benign if the scan only happens on the host side?
>>>
>>> . virtio-mem - unplugged memory
>>> . Unaccepted memory
>>
>> No, only in virtualized worlds.
>>
>> I assume GART memory that implements the pfn_is_ram() callback is around
>> on physical machines.
> 
> I think the host E820 map provides an accurate view of which address ranges
> are RAM, doesn't it?

On most physical machines, maybe, to some degree. It doesn't hold for
physically hot(un)plugged memory, and I remember GART memory being special.
I have no idea how that is exposed in e820.
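
For comparison with the E820 view, a minimal sketch that lists what the
running kernel currently labels "System RAM" in /proc/iomem. The function name
system_ram_ranges() is made up for illustration. It only looks at top-level
entries with exactly that label, so nested or differently named RAM resources
are ignored, and it says nothing about GART or about which pages inside a
range are safe to read.

```python
import re

# Matches top-level /proc/iomem lines such as
# "00100000-7fffffff : System RAM". Indented lines are children of a
# range (e.g. "Kernel code") and deliberately do not match.
_IOMEM_LINE = re.compile(r"^([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.+)$")


def system_ram_ranges(iomem_text):
    """Return inclusive (start, end) physical ranges labelled 'System RAM'."""
    ranges = []
    for line in iomem_text.splitlines():
        m = _IOMEM_LINE.match(line)
        if m and m.group(3).strip() == "System RAM":
            ranges.append((int(m.group(1), 16), int(m.group(2), 16)))
    return ranges
```

Feeding it the contents of /proc/iomem read as root gives the ranges;
unprivileged readers see all addresses as zero, so the output is only
meaningful with sufficient privileges.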

-- 
Thanks,

David / dhildenb




Thread overview: 26+ messages
2022-04-25 16:34 [RFC] Expose a memory poison detector ioctl to user space Jue Wang
2022-04-26 15:40 ` Dave Hansen
2022-04-26 17:57   ` Jue Wang
2022-04-26 18:02     ` Jue Wang
2022-04-26 18:21       ` Dave Hansen
2022-04-26 19:25         ` Jue Wang
2022-04-26 19:52           ` Luck, Tony
2022-04-26 20:06             ` Jue Wang
2022-04-26 18:20     ` Dave Hansen
2022-04-26 19:23       ` Jue Wang
2022-04-26 19:39         ` Dave Hansen
2022-04-26 19:50           ` Jue Wang
2022-04-28 16:15           ` Erdem Aktas
2022-04-28 16:34             ` Dave Hansen
2022-04-29 19:46               ` Jue Wang
2022-04-29 21:10                 ` Dave Hansen
2022-04-29 21:32                   ` Jue Wang
2022-04-29 21:44                     ` Jue Wang
2022-04-29 22:29                     ` Dave Hansen
2022-04-29 22:53                       ` Jue Wang
2022-05-02 15:30                 ` Dave Hansen
2022-05-02 17:19           ` David Hildenbrand
2022-05-02 17:30             ` Jue Wang
2022-05-02 17:33               ` David Hildenbrand
2022-05-02 17:36                 ` Jue Wang
2022-05-02 17:38                   ` David Hildenbrand
