RE: Questions about CXL device (type 3 memory) hotplug

From: Vikram Sethi <vsethi@nvidia.com>
To: Dan Williams <dan.j.williams@intel.com>,
	"Yasunori Gotou (Fujitsu)" <y-goto@fujitsu.com>,
	"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	"catalin.marinas@arm.com" <Catalin.Marinas@arm.com>,
	James Morse <james.morse@arm.com>
Cc: "Natu, Mahesh" <mahesh.natu@intel.com>
Subject: RE: Questions about CXL device (type 3 memory) hotplug
Date: Wed, 24 May 2023 14:47:31 +0000
Message-ID: <BN8PR12MB3330831F2E666E9BB1319E66BD419@BN8PR12MB3330.namprd12.prod.outlook.com>
In-Reply-To: <646d8c76811cb_250e29456@dwillia2-mobl3.amr.corp.intel.com.notmuch>

> From: Dan Williams <dan.j.williams@intel.com>
> Sent: Tuesday, May 23, 2023 11:03 PM
> To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>;
> linux-cxl@vger.kernel.org; catalin.marinas@arm.com; James Morse
> <james.morse@arm.com>
> Cc: Natu, Mahesh <mahesh.natu@intel.com>
> Subject: RE: Questions about CXL device (type 3 memory) hotplug
> Vikram Sethi wrote:
> > > From: Dan Williams <dan.j.williams@intel.com>
> > > Sent: Tuesday, May 23, 2023 1:40 PM
> > > To: Vikram Sethi <vsethi@nvidia.com>; Dan Williams
> > > <dan.j.williams@intel.com>; Yasunori Gotou (Fujitsu)
> > > <y-goto@fujitsu.com>; linux-cxl@vger.kernel.org;
> > > catalin.marinas@arm.com; James Morse <james.morse@arm.com>
> > > Cc: Natu, Mahesh <mahesh.natu@intel.com>
> > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > >
> > > Vikram Sethi wrote:
> > > > Hi Dan,
> > > >
> > > > > From: Dan Williams <dan.j.williams@intel.com>
> > > > > Sent: Monday, May 22, 2023 7:12 PM
> > > > > To: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>; linux-
> > > > > cxl@vger.kernel.org
> > > > > Cc: 'Dan Williams' <dan.j.williams@intel.com>
> > > > > Subject: RE: Questions about CXL device (type 3 memory) hotplug
> > > > >
> > > > > > > Q4) Current CXL drivers/tools support Hot-removal request from PCIe?
> > > > > > >
> > > > > > >     CXL specification says "In a managed Hot-Remove flow, software is
> > > > > > >     notified of a hot removal request."
> > > > >
> > > > > Currently there is a requirement that:
> > > > >
> > > > > cxl disable-memdev
> > > > >
> > > > > ...is run before the device can be removed. There is no warning
> > > > > from the PCI hotplug driver. Which means that if end user does
> > > > > the wrong sequence they can crash the kernel / remove memory
> > > > > > that may still be in active use.
> > > > >
> > > > > Is there any notion of a cache flush when memory is removed (or in
> > > > > future CXL reset)?
> > >
> > > No.
> > >
> > > > Generally, CPU caches must be flushed when memory is removed
> > > > because any evictions when the memory isn't present can cause
> > > > > async errors which can be fatal to the system or at least to VMs,
> > > > > depending on ISA.
> > >
> > > This seems incompatible with memory hotplug. The cache flushing is
> > > only done on the subsequent reuse of physical address range to make
> > > sure that any pending evictions are complete before the newly
> > > constituted address range is put into service, or that any prior
> > > > clean cache lines of old content are dropped. See
> > > > cxl_region_invalidate_memregion() for where this is called.
> > >
> > > > If the kernel does the cache flush, it must be done with only
> > > > > uncacheable mappings present to prevent speculative fetches after
> > > > > the cache flush.
> > >
> > > This is why the invalidation is done after physical address range is
> > > populated by new devices. To flush any speculative fetches to the
> > > old composition of the address range.
> > >
> > I don't think invalidate on the probe path will always work for
> > devices with snoop filters, including HDM-DB memory devices or CXL
> > type2 accelerators.  After CXL reset or hot plug insertion, a HDM-DB
> > device's snoop filter isn't tracking any lines checked out by the
> > host. Even if those were just clean lines in CPU caches, hosts can
> > send drop notifications in CXL in response to the cache flush
> > (MemClnEvict).  A device that isn't expecting this evict notification
> > can go into error state and optionally raise a device internal error
> > interrupt. So you could end up with a non functional device.
> 
> I don't understand this failure mode. Accelerator is added, driver sets up an
> HDM decode range and triggers CPU cache invalidation before mapping the
> memory into page tables. Wouldn't the device, upon receiving an invalidation
> request, just snoop its caches and say "nothing for me to do"?

The device's snoop filter is in a clean reset/power-on state. It is not tracking anything checked out by the host CPU or a peer.
If it starts receiving writebacks or even CleanEvicts for its memory, that certainly looks like an unexpected coherency message, and I
know of at least one implementation that triggers an error interrupt in response. I don't know of a statement
in the specification saying that this is expected and that implementations should ignore it. If there is such a statement, could you
please point me to it?
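
To make the failure mode concrete, here is a toy model of it. This is only a sketch under assumptions: it does not describe any real device or the specification's coherency state machine, and every name in it (snoop_filter, sf_reset, sf_track, sf_host_evict, SF_ENTRIES) is hypothetical. It assumes a device that latches an internal error when it sees an evict for a line it is not tracking.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SF_ENTRIES 8	/* hypothetical size, kept tiny for illustration */

/* Toy device-side snoop filter: which lines the host has checked out. */
struct snoop_filter {
	uint64_t line[SF_ENTRIES];
	bool valid[SF_ENTRIES];
	bool error;	/* latched "device internal error" (assumed behavior) */
};

/* Reset or power-on: the filter forgets every line and clears the error. */
static void sf_reset(struct snoop_filter *sf)
{
	for (size_t i = 0; i < SF_ENTRIES; i++)
		sf->valid[i] = false;
	sf->error = false;
}

/* Host checks out a line; returns false if the filter has no free entry. */
static bool sf_track(struct snoop_filter *sf, uint64_t addr)
{
	for (size_t i = 0; i < SF_ENTRIES; i++) {
		if (!sf->valid[i]) {
			sf->line[i] = addr;
			sf->valid[i] = true;
			return true;
		}
	}
	return false;
}

/*
 * Host sends a writeback or clean-evict notification for addr.
 * A tracked line is released; an untracked one latches the error,
 * which is the assumption this sketch is built on.
 */
static bool sf_host_evict(struct snoop_filter *sf, uint64_t addr)
{
	for (size_t i = 0; i < SF_ENTRIES; i++) {
		if (sf->valid[i] && sf->line[i] == addr) {
			sf->valid[i] = false;
			return true;
		}
	}
	sf->error = true;
	return false;
}
```

In this model, sf_track() followed by sf_host_evict() on the same address is harmless. If sf_reset() happens in between while the host still holds the line, the later evict finds nothing tracked and the error is latched, which is the sequence a flush on the probe path can produce.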

Removing memory needs a cache flush IMO, done in a way that prevents speculative fetches.
This can be done in the kernel with only uncacheable mappings present, if possible in the arch callback, or via a FW call.