From: Alistair Popple <apopple@nvidia.com>
To: Vikram Sethi <vsethi@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
David Hildenbrand <david@redhat.com>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
"Natu, Mahesh" <mahesh.natu@intel.com>,
"Rudoff, Andy" <andy.rudoff@intel.com>,
Jeff Smith <JSMITH@nvidia.com>,
Mark Hairgrove <mhairgrove@nvidia.com>,
"jglisse@redhat.com" <jglisse@redhat.com>,
Linux MM <linux-mm@kvack.org>,
Linux ACPI <linux-acpi@vger.kernel.org>,
"Anshuman Khandual" <anshuman.khandual@arm.com>,
"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
Samer El-Haj-Mahmoud <Samer.El-Haj-Mahmoud@arm.com>,
Shanker Donthineni <sdonthineni@nvidia.com>,
Joao Martins <joao.m.martins@oracle.com>
Subject: Re: Onlining CXL Type2 device coherent memory
Date: Tue, 3 Nov 2020 14:56:20 +1100
Message-ID: <6645807.zUfqAqQW0h@nvdebian>
In-Reply-To: <BL0PR12MB2532F7D105A1DC2E41B13DF2BD100@BL0PR12MB2532.namprd12.prod.outlook.com>
On Tuesday, 3 November 2020 6:25:23 AM AEDT Vikram Sethi wrote:
> > > > be sufficient, but depending if driver had done the add_memory in
> > > > probe, it perhaps would be onerous to have to remove_memory as well
> > > > before reset, and then add it back after reset. I realize you’re
> > > > saying such a procedure would be abusing hotplug framework, and we
> > > > could perhaps require that memory be removed prior to reset, but not
> > > > clear to me that it *must* be removed for correctness.
I'm not sure exactly what you meant by "unavailable", but on some platforms
(e.g. PowerPC) it must be removed for correctness if hardware access to the
memory is going away for any period of time. remove_memory() is what makes it
safe to physically remove the memory, as it triggers things like cache
flushing. Without this, PPC would see memory-failure machine checks if it ever
tried to write back any dirty cache lines to the now inaccessible memory.
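To make that lifecycle concrete, a driver that onlines HDM in probe would pair
it with a teardown path along these lines. This is only a hedged sketch against
the in-kernel hotplug API; the cxl_dev structure and its fields are invented
for illustration, the resource name is an assumption, and the exact signatures
of add_memory_driver_managed()/offline_and_remove_memory() vary across kernel
versions:

```c
/* Hedged kernel-side sketch, not a buildable driver: struct cxl_dev
 * and its fields are made up for illustration. */
#include <linux/memory_hotplug.h>

static int cxl_probe_online_hdm(struct cxl_dev *dev)
{
	/* Expose the device's HDM range to the buddy allocator. The
	 * "driver managed" variant marks the range so kexec and
	 * friends don't treat it as ordinary System RAM. */
	return add_memory_driver_managed(dev->nid, dev->hdm_base,
					 dev->hdm_size,
					 "System RAM (cxl)", MHP_NONE);
}

static void cxl_pre_reset_offline_hdm(struct cxl_dev *dev)
{
	/* Before a reset makes the HDM window inaccessible, tear the
	 * range fully down. On PPC this teardown is what flushes
	 * dirty cache lines; skipping it risks machine checks on a
	 * later writeback to the dead window. */
	offline_and_remove_memory(dev->hdm_base, dev->hdm_size);
}
```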
> > > > Another usecase of offlining without removing HDM could be around
> > > > Virtualization/passing entire device with its memory to a VM. If
> > > > device was being used in the host kernel, and is then unbound, and
> > > > bound to vfio-pci (vfio-cxl?), would we expect vfio-pci to
> > > > add_memory_driver_managed?
> > >
> > > At least for passing through memory to VMs (via KVM), you don't actually
> > > need struct pages / memory exposed to the buddy via
> > > add_memory_driver_managed(). Actually, doing that sounds like the wrong
> > > approach.
> > >
> > > E.g., you would "allocate" the memory via devdax/dax_hmat and directly
> > > map the resulting device into guest address space. At least that's what
> > > some people are doing with
>
> How does memory_failure forwarding to guest work in that case?
> IIUC it doesn't without a struct page in the host.
> For normal memory, when VM consumes poison, host kernel signals
> Userspace with SIGBUS and si-code that says Action Required, which
> QEMU injects to the guest.
> IBM had done something like you suggest with coherent GPU memory and IIUC
> memory_failure forwarding to guest VM does not work there.
>
> kernel https://lkml.org/lkml/2018/12/20/103
> QEMU: https://patchwork.kernel.org/patch/10831455/
The above patches simply allow the coherent GPU physical memory ranges to get
mapped into a guest VM in a similar way to an MMIO range (ie. without a struct
page in the host). So you are correct in that they do not deal with forwarding
failures to a guest VM.
Any GPU memory failure on PPC would currently get sent to the host in the same
way as a normal system memory failure (ie. machine check). So in theory
notification to a guest would work the same as a normal system memory failure.
I say in theory because, when I last looked at this some time back, a guest
kernel on PPC was not notified of memory errors.
- Alistair
> I would think we *do want* memory errors to be sent to a VM.
>
> >
> > ...and Joao is working to see if the host kernel can skip allocating
> > 'struct page' or do it on demand if the guest ever requests host
> > kernel services on its memory. Typically it does not so host 'struct
> > page' space for devdax memory ranges goes wasted.
> Is memory_failure forwarded to and handled by guest?
>
Thread overview: 17+ messages
2020-10-28 23:05 Onlining CXL Type2 device coherent memory Vikram Sethi
2020-10-29 14:50 ` Ben Widawsky
2020-10-30 20:37 ` Dan Williams
2020-10-30 20:59 ` Matthew Wilcox
2020-10-30 23:38 ` Dan Williams
2020-10-30 22:39 ` Vikram Sethi
2020-11-02 17:47 ` Dan Williams
2020-10-31 10:21 ` David Hildenbrand
2020-10-31 16:51 ` Dan Williams
2020-11-02 9:51 ` David Hildenbrand
2020-11-02 16:17 ` Vikram Sethi
2020-11-02 17:53 ` David Hildenbrand
2020-11-02 18:03 ` Dan Williams
2020-11-02 19:25 ` Vikram Sethi
2020-11-02 19:45 ` Dan Williams
2020-11-03 3:56 ` Alistair Popple [this message]
2020-11-02 18:34 ` Jonathan Cameron