* RE: [RFC PATCH 0/6] Implement initial CXL Timeout & Isolation support
From: Dan Williams @ 2024-02-15 23:43 UTC
  To: Ben Cheatham, linux-cxl, linux-pci
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, bhelgaas,
	benjamin.cheatham, linux-mm

[ add linux-mm because the implications of this feature weigh much
  heavier on mm/ than drivers/ ]

Ben Cheatham wrote:
> Implement initial support for CXL.mem Timeout & Isolation (CXL 3.0
> 12.3.2). This series implements support for CXL.mem enabling and
> programming CXL.mem transaction timeout, CXL.mem error isolation,
> and error isolation interrupts for CXL-enabled PCIe root ports that
> implement the CXL Timeout & Isolation capability.
> 
> I am operating under the assumption that recovery from error isolation
> will be more involved than just resetting the port and turning off
> isolation, so that flow is not implemented here.

That needs to be answered first.

The notification infrastructure is trivial in comparison. The use case
needs to be clear *before* we start adding infrastructure to Linux that
may end up just being dead code, and certainly before we further burden
portdrv.c with more entanglements.

I put together the write-up below in a separate context to capture my
thoughts on the error isolation capability, and why Linux really needs
an end user to stand up and say, "yes, even with all those caveats CXL
Error Isolation is still useful for our environment." The tl;dr is "are
you absolutely sure you would not just rather reboot?"

> There is also no support for CXL.cache, but I plan to eventually
> implement both.

To date there are zero upstream drivers for CXL.cache initiators, and
CXL 3.0 introduced HDM-DB to supplant HDM-D. All that to say: with zero
mass-market adopters of HDM-D (devices with upstream drivers) and zero
HDM-DB platforms available today, it seems Linux has time for this to
continue to evolve. If anyone reading this has a
CXL.cache initiator driver waiting in the wings, do reach out on the
list to clarify what commons belong in the Linux CXL and/or PCI core.

> The series also introduces a PCIe port bus driver dependency on the CXL
> core. I expect to be able to remove that when my team submits
> patches for a future rework of the PCIe port bus driver.

We have time to wait for that work to settle. Do not make the job harder
in the near term by adding one more dependency to unwind.

> I have done some testing using QEMU by adding the isolation registers
> and a hacked-up QMP command to test the interrupt flow, but I *DID NOT*
> implement the actual isolation feature and the subsequent device
> behavior. I'd be willing to share these changes (and my config) if
> anyone is interested in testing this.
> 
> Any thoughts/comments would be greatly appreciated!

---
Memory Error Isolation is a mechanism that *might* allow recovery of CXL
Root Port Link Down conditions or CXL transaction timeouts. When the
event occurs, outstanding writes are dropped, and outstanding reads
terminate with machine check. In order to exit Isolation, the link must
transition through a "link down" status. Effectively, Isolation behaves
like a machine check storm until system software can evacuate all
users, followed by a surprise hot-removal (unplug) of the memory before
the device can be recovered, where recovery is effectively a hot
re-plug of the device with a full re-enumeration thereafter. From
the Linux perspective all memory contents are considered forfeited. This
poses several challenges for how to utilize the memory to achieve
reliable recovery. The criterion for evaluating the Linux upstream
maintenance cost of overcoming those challenges is whether the sum
total of those mitigations remains an improvement over a system-reboot
to recover from the same Isolation event. Add to that the fact that
failing to fully account for all the mentioned challenges still
results in a reboot, via kernel panic.

In order to understand the limits of Isolation recovery relative to
reboot recovery, it is important to understand the fundamental
limitations of Linux machine check recovery and memory-hotremove.
Machine checks are typically only recoverable when they hit user memory.
Roughly, if a machine check occurs in a kernel-mapped page, the machine
check handler triggers a kernel panic. Machine checks are often
recoverable because failures are limited to a few cachelines at a time
and the page allocation distribution is heavily skewed towards
user-memory.

CXL Isolation takes down an entire CXL root port, and with interleaving
can cause a region spanning multiple ports to be taken down. Even if
the interleave configuration forfeited bandwidth to contain isolation
events to a single CXL root port, it is still on the order of 100s of
GBs that will start throwing machine checks on access all at once. If
that memory is being used by the kernel as typical System-RAM, some of
it is likely to be kernel mapped. Marking all CXL memory as ZONE_MOVABLE
to avoid kernel allocations is not a reliable mitigation for this
scenario as the kernel always needs some ratio of ZONE_NORMAL to access
ZONE_MOVABLE memory.

The "kernel memory" problem similarly effects surprise removal events.
The kernel can only reliably unplug memory that it can force offline and
there is no facility to force-offline ZONE_NORMAL memory. Long term
memory pinning like guest VMs with assigned devices or RDMA, or even
short-term pins from transient device DMA, can hold off memory removal
indefinitely, which means recovery may be held off indefinitely. For
System-RAM there is no facility to notify all users to evacuate; instead
the memory hot-removal code walks every page in the range to be offlined
and aborts if any single page has an elevated reference count.
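
To make that concrete, a minimal sketch of the userspace side of an
offline attempt (the memory block number is hypothetical; a real tool
would walk /sys/devices/system/memory/ for the blocks backing the CXL
range, and the same "state" file takes "online_movable" for the policy
mentioned above):

    /* attempt to offline one hypothetical memory block via sysfs */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            const char *path =
                    "/sys/devices/system/memory/memory512/state";
            int fd = open(path, O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* a single elevated page refcount in this block is enough
             * for the kernel to abort the whole offline request
             */
            if (write(fd, "offline", strlen("offline")) < 0)
                    fprintf(stderr, "offline: %s\n",
                            strerror(errno)); /* typically EBUSY */

            close(fd);
            return 0;
    }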

It follows from the above that Isolation requires that the CXL memory
ranges to be recovered never be online as System-RAM; the kernel
cannot offer typical memory management services for memory that is
subject to "surprise removal". Instead, device-dax is a facility that
has some properties that may allow recovery to be reliable. The
device-dax facility arranges for a given memory range to be mappable via
a device-file. This effectively allows userspace management of that
memory, but at the cost of application changes.
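
As an illustration, a minimal sketch of that userspace model, assuming
a device-dax instance at /dev/dax0.0 (device-dax only accepts
MAP_SHARED mappings, sized and aligned to the instance alignment,
typically 2MiB):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t len = 1UL << 30; /* multiple of the alignment */
            int fd = open("/dev/dax0.0", O_RDWR);
            char *addr;

            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* MAP_PRIVATE is rejected for device-dax */
            addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
            if (addr == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            addr[0] = 1; /* direct device access, no page cache */

            munmap(addr, len);
            close(fd);
            return 0;
    }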

If, for example, the intended use case of Isolation-capable CXL memory
is to place VMs that can die during an Isolation event while keeping the
rest of the system up, then that could be achieved with device-dax. For
example, allocate a device-dax instance per-VM and specify it as a
"memory-backend-file" to qemu-kvm.

Again, the loss of typical core mm memory semantics and the need for
application changes raises the question of whether reboot is preferred
over Isolation recovery. Unlike System-RAM that supports anonymous
mappings disconnected from the physical memory device, device-dax
implements file-backed mappings, which include methods to reverse-map
all users and revoke their access to the memory range that has gone into
Isolation.
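
Roughly, that revocation amounts to zapping every PTE that maps the
instance so that the next access faults and can be failed cleanly. A
sketch (kernel context, modeled on the dev_dax teardown path, not a
drop-in implementation):

    #include <linux/dax.h>
    #include <linux/mm.h>

    /* revoke all userspace mappings of a dev_dax instance */
    static void revoke_dev_dax_mappings(struct dax_device *dax_dev)
    {
            struct inode *inode = dax_inode(dax_dev);

            /* zap every PTE; even_cows == 1 as on dev_dax unregister */
            unmap_mapping_range(inode->i_mapping, 0, 0, 1);
    }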
---

So if someone says, "yes I can tolerate losing a root port at a time and
I can tolerate deploying my workloads with userspace memory management,
and this is preferable to a reboot", then maybe Linux should entertain
CXL Error Isolation. Until such an end use case gains clear uptake it
seems too early to worry about plumbing the notification mechanism.



* Re: [RFC PATCH 0/6] Implement initial CXL Timeout & Isolation support
From: Ben Cheatham @ 2024-03-25 15:15 UTC
  To: Dan Williams, linux-cxl, linux-pci
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, bhelgaas, linux-mm

Hey Dan,

Sorry that I went radio silent on this. I've been thinking about the approach to take
from here on out. More below!

On 2/15/24 5:43 PM, Dan Williams wrote:

[snip]

>> The series also introduces a PCIe port bus driver dependency on the CXL
>> core. I expect to be able to remove that when my team submits
>> patches for a future rework of the PCIe port bus driver.
> 
> We have time to wait for that work to settle. Do not make the job harder
> in the near term by adding one more dependency to unwind.
> 

For sure, I'll withhold any work I do that touches the port bus driver
until that's sent upstream. I'll send anything that doesn't touch
the port driver, i.e. CXL region changes, as RFC as well.

[snip]

> So if someone says, "yes I can tolerate losing a root port at a time and
> I can tolerate deploying my workloads with userspace memory management,
> and this is preferable to a reboot", then maybe Linux should entertain
> CXL Error Isolation. Until such an end use case gains clear uptake it
> seems too early to worry about plumbing the notification mechanism.

So in my mind the primary use case for this feature is in a server/data center environment.
Losing a root port and the attached memory is definitely preferable to a reboot in this environment,
so I think that isolation would still be useful even if it isn't plug-and-play.

I agree with your assessment of what has to happen before we can make this feature work. From what I understand
these are the main requirements for making isolation work:
	1. The memory region can't be onlined as System RAM
	2. The region needs to be mapped as device-dax
	3. There need to be safeguards such that someone can't remap the region to System RAM with
	error isolation enabled (added by me)

Considering these all have to do with mm, I think that's a good place to start. What I'm thinking of starting with
is adding a module parameter (or config option) to enable isolation for CXL dax regions by default, as well as
a sysfs option for toggling error isolation for the CXL region. When the module parameter is specified, the CXL
region driver would create the region as a device-dax region by default instead of onlining the region as system RAM.
The sysfs would allow toggling error isolation for CXL regions that are already device-dax.

That would cover requirements 1 & 2, but would still allow someone to reconfigure the region as a system RAM region
with error isolation still enabled using sysfs/daxctl. To make sure this doesn't happen, my plan is to have the
CXL region driver automatically disable error isolation when the underlying memory region is going offline, then
check to make sure the memory isn't onlined as System RAM before re-enabling isolation.

So, with that in mind, isolation would most likely go from something that is enabled by default when compiled in, to a
feature for correctly-configured CXL regions that has to be enabled by the end user. If that sounds good, here's
what my roadmap would be going forward:

	1. Enable creating device-dax mapped CXL regions by default
	2. Add support for checking a CXL region meets the requirements for enabling isolation (i.e. all devices in
	dax region(s) are device-dax)
	3. Add back in the error isolation enablement and notification mechanisms in this patch series

And then after those are in place:
	4. Add back in timeout mechanisms
	5. Recovery flows/support

I'll also be dropping CXL.cache support until there is broader upstream adoption. I'd like to hear your (or anyone else's) thoughts
on this!

Thanks,
Ben



* Re: [RFC PATCH 0/6] Implement initial CXL Timeout & Isolation support
From: Dan Williams @ 2024-03-25 15:54 UTC
  To: Ben Cheatham, Dan Williams, linux-cxl, linux-pci
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, bhelgaas, linux-mm

Ben Cheatham wrote:
> Hey Dan,
> 
> Sorry that I went radio silent on this. I've been thinking about the approach to take
> from here on out. More below!
> 
> On 2/15/24 5:43 PM, Dan Williams wrote:
> 
> [snip]
> 
> >> The series also introduces a PCIe port bus driver dependency on the CXL
> >> core. I expect to be able to remove that when my team submits
> >> patches for a future rework of the PCIe port bus driver.
> > 
> > We have time to wait for that work to settle. Do not make the job harder
> > in the near term by adding one more dependency to unwind.
> > 
> 
> For sure, I'll withhold any work I do that touches the port bus driver
> until that's sent upstream. I'll send anything that doesn't touch
> the port driver, i.e. CXL region changes, as RFC as well.
> 
> [snip]
> 
> > So if someone says, "yes I can tolerate losing a root port at a time and
> > I can tolerate deploying my workloads with userspace memory management,
> > and this is preferable to a reboot", then maybe Linux should entertain
> > CXL Error Isolation. Until such an end use case gains clear uptake it
> > seems too early to worry about plumbing the notification mechanism.
> 
> So in my mind the primary use case for this feature is in a
> server/data center environment.  Losing a root port and the attached
> memory is definitely preferable to a reboot in this environment, so I
> think that isolation would still be useful even if it isn't
> plug-and-play.

That is not sufficient. This feature needs an implementation partner to
work through the caveats.

> I agree with your assessment of what has to happen before we can make
> this feature work. From what I understand these are the main
> requirements for making isolation work:

Requirement 1 is "Find customer".

> 	1. The memory region can't be onlined as System RAM
> 	2. The region needs to be mapped as device-dax
> 	3. There need to be safeguards such that someone can't remap the
> 	region to System RAM with error isolation enabled (added by me)

No, this policy does not belong in the kernel.

> Considering these all have to do with mm, I think that's a good place
> to start. What I'm thinking of starting with is adding a module
> parameter (or config option) to enable isolation for CXL dax regions
> by default, as well as a sysfs option for toggling error isolation for
> the CXL region. When the module parameter is specified, the CXL region
> driver would create the region as a device-dax region by default
> instead of onlining the region as system RAM.  The sysfs would allow
> toggling error isolation for CXL regions that are already device-dax.

No, definitely not going to invent module parameter policy for this. The
policy to not online dax regions is already available.
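
For example (assuming an instance name of dax0.0), the existing knob is
along the lines of:

    # keep the instance in device-dax mode rather than handing it to
    # the kmem driver as System-RAM
    daxctl reconfigure-device --mode=devdax dax0.0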

> That would cover requirements 1 & 2, but would still allow someone to
> reconfigure the region as a system RAM region with error isolation
> still enabled using sysfs/daxctl. To make sure this doesn't
> happen, my plan is to have the CXL region driver automatically disable
> error isolation when the underlying memory region is going offline,
> then check to make sure the memory isn't onlined as System RAM before
> re-enabling isolation.

Why would you ever need to disable isolation? If it is present, it at
least nominally allows software to keep running vs hanging awaiting a
watchdog-timer reboot.

> So, with that in mind, isolation would most likely go from something
> that is enabled by default when compiled in, to a feature for
> correctly-configured CXL regions that has to be enabled by the end
> user. If that sounds good, here's what my roadmap would be going
> forward:

Again, this needs a customer to weigh the mitigations that the kernel
needs to carry.

> 	1. Enable creating device-dax mapped CXL regions by default

Already exists.

> 	2. Add support for checking a CXL region meets the requirements
> 	for enabling isolation (i.e. all devices in
> 	dax region(s) are device-dax)

Regions should not know or care if isolation is enabled; the
implications are identical to force unbinding the region driver.

> 	3. Add back in the error isolation enablement and notification
> 	mechanisms in this patch series

Not until it is clear that this feature has deployment value.



* Re: [RFC PATCH 0/6] Implement initial CXL Timeout & Isolation support
From: Ben Cheatham @ 2024-04-01 19:41 UTC
  To: Dan Williams, linux-cxl, linux-pci
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, bhelgaas, linux-mm

Sorry for the delay, I got a bit busy for a while there. Responses inline.

On 3/25/24 10:54 AM, Dan Williams wrote:
> Ben Cheatham wrote:
>> Hey Dan,
>>
>> Sorry that I went radio silent on this. I've been thinking about the approach to take
>> from here on out. More below!
>>
>> On 2/15/24 5:43 PM, Dan Williams wrote:
>>
>> [snip]
>>
>>>> The series also introduces a PCIe port bus driver dependency on the CXL
>>>> core. I expect to be able to remove that when my team submits
>>>> patches for a future rework of the PCIe port bus driver.
>>>
>>> We have time to wait for that work to settle. Do not make the job harder
>>> in the near term by adding one more dependency to unwind.
>>>
>>
>> For sure, I'll withhold any work I do that touches the port bus driver
>> until that's sent upstream. I'll send anything that doesn't touch
>> the port driver, i.e. CXL region changes, as RFC as well.
>>
>> [snip]
>>
>>> So if someone says, "yes I can tolerate losing a root port at a time and
>>> I can tolerate deploying my workloads with userspace memory management,
>>> and this is preferable to a reboot", then maybe Linux should entertain
>>> CXL Error Isolation. Until such an end use case gains clear uptake it
>>> seems too early to worry about plumbing the notification mechanism.
>>
>> So in my mind the primary use case for this feature is in a
>> server/data center environment.  Losing a root port and the attached
>> memory is definitely preferable to a reboot in this environment, so I
>> think that isolation would still be useful even if it isn't
>> plug-and-play.
> 
> That is not sufficient. This feature needs an implementation partner to
> work through the caveats.
> 

Got it, I'll come back to this if/when I have a customer willing to commit
to this.

>> I agree with your assessment of what has to happen before we can make
>> this feature work. From what I understand these are the main
>> requirements for making isolation work:
> 
> Requirement 1 is "Find customer".
> 
>> 	1. The memory region can't be onlined as System RAM
>> 	2. The region needs to be mapped as device-dax
>> 	3. There need to be safeguards such that someone can't remap the
>> 	region to System RAM with error isolation enabled (added by me)
> 
> No, this policy does not belong in the kernel.
> 

Understood.

>> Considering these all have to do with mm, I think that's a good place
>> to start. What I'm thinking of starting with is adding a module
>> parameter (or config option) to enable isolation for CXL dax regions
>> by default, as well as a sysfs option for toggling error isolation for
>> the CXL region. When the module parameter is specified, the CXL region
>> driver would create the region as a device-dax region by default
>> instead of onlining the region as system RAM.  The sysfs would allow
>> toggling error isolation for CXL regions that are already device-dax.
> 
> No, definitely not going to invent module parameter policy for this. The
> policy to not online dax regions is already available.
> 
>> That would cover requirements 1 & 2, but would still allow someone to
>> reconfigure the region as a system RAM region with error isolation
>> still enabled using sysfs/daxctl. To make sure this doesn't
>> happen, my plan is to have the CXL region driver automatically disable
>> error isolation when the underlying memory region is going offline,
>> then check to make sure the memory isn't onlined as System RAM before
>> re-enabling isolation.
> 
> Why would you ever need to disable isolation? If it is present it at
> least nominally allows software to keep running vs hang awaiting a
> watchdog-timer reboot.
> 

That's a good point. I think I was just looking at this too long and
got a bit lost in the sauce. I'll keep this in mind for if/when I
come back to this.

>> So, with that in mind, isolation would most likely go from something
>> that is enabled by default when compiled in, to a feature for
>> correctly-configured CXL regions that has to be enabled by the end
>> user. If that sounds good, here's what my roadmap would be going
>> forward:
> 
> Again, this needs a customer to weigh the mitigations that the kernel
> needs to carry.
> 
>> 	1. Enable creating device-dax mapped CXL regions by default
> 
> Already exists.
> 
>> 	2. Add support for checking a CXL region meets the requirements
>> 	for enabling isolation (i.e. all devices in
>> 	dax region(s) are device-dax)
> 
> Regions should not know or care if isolation is enabled; the
> implications are identical to force unbinding the region driver.
> 

Got it.

Thanks,
Ben

>> 	3. Add back in the error isolation enablement and notification
>> 	mechanisms in this patch series
> 
> Not until it is clear that this feature has deployment value.


