Re: [RFC] ACPI Code First ECR: Generic Target

From: Dan Williams <dan.j.williams@intel.com>
To: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: linux-cxl@vger.kernel.org,
	Linux ACPI <linux-acpi@vger.kernel.org>,
	"Natu, Mahesh" <mahesh.natu@intel.com>,
	Chet R Douglas <chet.r.douglas@intel.com>,
	Ben Widawsky <ben.widawsky@intel.com>,
	Vishal L Verma <vishal.l.verma@intel.com>
Subject: Re: [RFC] ACPI Code First ECR: Generic Target
Date: Thu, 11 Feb 2021 09:06:51 -0800
Message-ID: <CAPcyv4j0Wce-76OfgqTSkveukgDXB_p2VZZpgM8XjDFd+Q-0Ww@mail.gmail.com>
In-Reply-To: <20210211094222.000048ae@Huawei.com>

On Thu, Feb 11, 2021 at 1:44 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Wed, 10 Feb 2021 08:24:51 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
>
> > On Wed, Feb 10, 2021 at 3:24 AM Jonathan Cameron
> > <Jonathan.Cameron@huawei.com> wrote:
> > >
> > > On Tue, 9 Feb 2021 19:55:05 -0800
> > > Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > > While the platform BIOS is able to describe the performance
> > > > characteristics of CXL memory that is present at boot, it is unable to
> > > > statically enumerate the performance of CXL memory hot inserted
> > > > post-boot. The OS can enumerate most of the characteristics from link
> > > > registers and CDAT, but the performance from the CPU to the host
> > > > bridge, for example, is not enumerated by PCIE or CXL. Introduce an
> > > > ACPI mechanism for this purpose. Critically this is achieved with a
> > > > small tweak to how the existing Generic Initiator proximity domain is
> > > > utilized in the HMAT.
> > >
> > > Hi Dan,
> > >
> > > Agree there is a hole here, but I think the proposed solution has some
> > > issues for backwards compatibility.
> > >
> > > Just to clarify, I believe CDAT from root ports is sufficient for the
> > > other direction (GI on CXL, memory in host).  I wondered initially if
> > > this was a two way issue, but after a reread, I think that is fine
> > > with the root port providing CDAT or potentially treating the root
> > > port as a GI (though that runs into the same naming / representation issue
> > > as below and I think would need some clarifying text in UEFI GI description)
> > >
> > > http://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> > >
> > > > For the case you are dealing with here, we 'could' potentially add something
> > > > to CDAT as an alternative to changing SRAT, but it would be more complex
> > > so your approach here makes more sense to me.
> >
> > CDAT seems the wrong mechanism because it identifies target
> > performance once you're at the front door of the device, not
> > performance relative to a given initiator.
>
> I'd argue you could make CDAT a more symmetric representation, but it would
> end up replicating a lot of info already in HMAT.  Didn't say it was a good
> idea!

CDAT describes points; HMAT describes edges on the performance graph.
It would be confusing if CDAT tried to supplant HMAT.
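
To make the points-versus-edges distinction concrete, here is a rough
sketch of how an OS might stitch the pieces together. It is only an
illustration: the proximity domain numbers, the device name "mem0", the
link figures, and the rule "latencies add, bandwidth is the minimum
along the path" are all assumptions made for the example, not anything
defined by ACPI, CDAT, or existing kernel code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Perf:
    latency_ns: float      # access latency contributed by this segment
    bandwidth_mbps: float  # bandwidth this segment can sustain


def combine(*segments):
    """Fold path segments into one end-to-end figure.

    Assumption for illustration: latencies accumulate along the path and
    the narrowest segment bounds the bandwidth.
    """
    if not segments:
        raise ValueError("need at least one path segment")
    return Perf(
        latency_ns=sum(s.latency_ns for s in segments),
        bandwidth_mbps=min(s.bandwidth_mbps for s in segments),
    )


# HMAT-style data: an edge, keyed by (initiator domain, target domain).
# Domain 2 stands in for the host bridge "generic target" (hypothetical).
hmat_edges = {(0, 2): Perf(latency_ns=50.0, bandwidth_mbps=64000.0)}

# CDAT-style data: a point, i.e. performance seen from the device's front door.
cdat_points = {"mem0": Perf(latency_ns=150.0, bandwidth_mbps=32000.0)}

# Link contribution, as would be derived from link registers (made-up numbers).
link = Perf(latency_ns=20.0, bandwidth_mbps=32000.0)


def end_to_end(initiator, bridge_domain, device):
    """CPU -> host bridge (HMAT edge) -> link -> device (CDAT point)."""
    return combine(hmat_edges[(initiator, bridge_domain)], link, cdat_points[device])
```

The HMAT table is indexed by a pair of domains while the CDAT table is
indexed by a single device, which is why the CPU-to-host-bridge fragment
has nowhere natural to live in CDAT.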

>
> That's an odd situation that it sort of 'half' manages it in the BIOS.
> We probably need some supplementary docs around this topic
> as the OS would need to be aware of that possibility and explicitly check
> for it before doing its normal build based on CDAT + what you are proposing
> here.  Maybe code is enough but given this is cross OS stuff I'd argue
> it probably isn't.
>
> I guess we could revisit this draft UEFI white paper and add a bunch
> of examples around this use case: https://github.com/hisilicon/acpi-numa-whitepaper

Thanks for the reference, I'll take a look.

>
> >
> > >
> > > >
> > > > # Impact of the Change
> > > > The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
> > > > 5.2.16.6) already contains all the fields necessary to enumerate a
> > > > generic target proximity domain. All that is missing is the
> > > > interpretation of that proximity domain optionally as a target
> > > > identifier in the HMAT.
> > > >
> > > > > Given that the OS still needs to dynamically enumerate and instantiate
> > > > > the memory ranges behind the host bridge, the assumption is that
> > > > operating systems that do not support native CXL enumeration will ignore
> > > > this data in the HMAT, while CXL native enumeration aware environments
> > > > will use this fragment of the performance path to calculate the
> > > > performance characteristics.
> > >
> > > I don't think it is true that OS not supporting native CXL will ignore the
> > > data.
> >
> > True, I should have chosen more careful words like s/ignore/not
> > regress upon seeing/
>
> It's a sticky corner and I suspect it is likely to come up in the ACPI WG - what is
> being proposed here isn't backwards compatible

It seems our definitions of backwards compatible are divergent. Please
correct me if I'm wrong, but I understand your position to be "any
perceptible OS behavior change breaks backwards compatibility",
whereas I'm advocating that backwards compatibility is relative to
regressing real-world use cases. That said, I do need to go mock this
up in QEMU and verify how much disturbance it causes.

> even if the impacts in Linux are small.

I'd note the kernel would grind to a halt if the criteria for
"backwards compatible" was zero perceptible behavior change.

> Mostly it's infrastructure bring up that won't get used
> (fallback lists and similar for a node which will never be specified in
> allocations) and some confusing userspace ABI (which is more than a little
> confusing already).

Fallback lists are established relative to online nodes. These generic
targets are not onlined as memory.

> > > Linux will create a small amount of infrastructure to reflect them (more or
> > > less the same as a memoryless node) and also they will appear in places
> > > like access0 as a possible initiator of transactions.  It's small stuff,
> > > but I'd rather the impact on legacy was zero.
> >
> > I'm failing to see that small collision as fatal to the proposal. The
> > HMAT parsing had a significant bug for multiple kernel releases and no
> > one noticed. This quirk is minor in comparison.
>
> True, there is a lag in HMAT adoption - though for ACPI tables, not that long
> (only a couple of years :)
>
> >
> > >
> > > So my gut feeling here is we shouldn't reuse the generic initiator, but
> > > should invent something new.  Would look similar to GI, but with a different
> > > ID - to ensure legacy OS ignores it.
> >
> > A new id introduces more problems than it solves. Set aside the ACPICA
> > thrash, it does not allow a clean identity mapping of a point in a
> > system topology being both initiator and target. The SRAT does not
> > need more data structures to convey this information. At most I would
> > advocate for an OSC bit for the OS to opt into allowing this new usage
> > in the HMAT, but that still feels like overkill absent a clear
> > regression in legacy environments.
>
> OSC for this case doesn't work. You can't necessarily evaluate it
> early enough in the boot - in Linux the node setup is before AML parsing
> comes up.  HMAT is evaluated a lot later, but SRAT is too early.  + in theory
> another OS is allowed to evaluate HMAT before OSC is available.

The Linux node setup for online memory is before OSC parsing, but
there's nothing to "online" with a GI/GT entry. Also, if this was a
problem, it would already be impacting the OS today because this
proposal only changes HMAT, not SRAT. Lastly there *is* an OSC bit for
GI, so either that's vestigial and needs to be removed, or OSC is
relevant for this case.

>
> > The fact that hardly anyone is
> > using HMAT (as indicated by the bug I mentioned) gives me confidence
> > that perfection is more "enemy of the good" than required here.
>
> How about taking this another way
>
> 1) Assume that the costs of 'false' GI nodes on legacy system as a result
>    of this is minor - so just live with it.  (probably true, but as ever
>    need to confirm with other OS)
>
> 2) Try to remove the cost of pointless infrastructure on 'aware' kernels.
>    Add a flag to the GI entry to say it's a bridge and not expected to,
>    in and of itself, represent an initiator or a target.
>    In Linux we then don't create the node infrastructure etc or assign
>    any devices to have the non existent NUMA node.
>
> The information is still there to combine with device info (CDAT) etc
> and build what we eventually want in the way of a representation of
> the topology that Linux can use.
>
> Now we just have the 'small' problem of figuring out how to actually implement
> hotplugging of NUMA nodes.

I think it's tiny. Just pad the "possible" nodes past what SRAT enumerates.
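
As a rough sketch of what I mean by padding (illustrative only; the
helper name, the `spare` count, and the `max_nodes` ceiling are
hypothetical stand-ins, not existing kernel interfaces):

```python
def pad_possible_nodes(srat_nodes, spare, max_nodes=64):
    """Return the set of "possible" node IDs: everything SRAT enumerated,
    plus up to `spare` extra IDs reserved for nodes hot-added later.

    `max_nodes` is a hypothetical upper bound on node IDs; padding never
    goes past it.
    """
    possible = set(srat_nodes)
    # Start padding just past the highest ID that SRAT enumerated.
    next_id = max(possible, default=-1) + 1
    limit = min(next_id + spare, max_nodes)
    possible.update(range(next_id, limit))
    return possible
```

For example, with SRAT enumerating nodes {0, 1, 3} and two spare slots,
the possible set becomes {0, 1, 3, 4, 5}, leaving IDs 4 and 5 available
for memory that shows up after boot.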