All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] ACPI Code First ECR: Generic Target
@ 2021-02-10  3:55 Dan Williams
  2021-02-10 11:23 ` Jonathan Cameron
  2021-02-10 17:02 ` Vikram Sethi
  0 siblings, 2 replies; 18+ messages in thread
From: Dan Williams @ 2021-02-10  3:55 UTC (permalink / raw)
  To: linux-cxl, Linux ACPI
  Cc: Natu, Mahesh, Chet R Douglas, Ben Widawsky, Vishal L Verma

While the platform BIOS is able to describe the performance
characteristics of CXL memory that is present at boot, it is unable to
statically enumerate the performance of CXL memory hot inserted
post-boot. The OS can enumerate most of the characteristics from link
registers and CDAT, but the performance from the CPU to the host
bridge, for example, is not enumerated by PCIE or CXL. Introduce an
ACPI mechanism for this purpose. Critically this is achieved with a
small tweak to how the existing Generic Initiator proximity domain is
utilized in the HMAT.

---

# Title: Introduce a Generic Target for CXL

# Status: Draft

# Document: ACPI Specification 6.4

# License
SPDX-License Identifier: CC-BY-4.0

# Submitter:
* Sponsor: Dan Williams, Intel
* Creators/Contributors:
    * Mahesh Natu, Intel
    * Chet Douglas, Intel
    * Deepak Shivakumar, Intel

# Summary of the Change
Introduce a "Generic Target" concept to the SRAT to describe the root
performance parameters in the path to dynamically discovered (outside of
ACPI enumeration) CXL memory target endpoints.

# Benefits of the Change
Consider the case of a system with a set of CXL host bridges (ACPI0016),
and no devices attached at initial system power-on. In this scenario
platform firmware is unable to perform the end-to-end enumeration
necessary to populate SRAT and HMAT for the endpoints that may be
hot-inserted behind those bridges post power-on. The address-range is
unknown so SRAT can not be pre-populated, the performance is unknown (no
CDAT nor interleave configuration) so HMAT can not be pre-populated.

However, what is known to platform firmware that generates the SRAT and
HMAT is the performance characteristics of the path between CPU and
Generic Initiators to the CXL host bridge target. With either
CPU-to-Generic-Target, or Generic-Initiator-to-Generic-Target entries in
the HMAT the OS CXL subsystem can enumerate the remaining details (PCIE
link status, device CDAT, interleave configuration) to calculate the
bandwidth and latency of a dynamically discovered CXL memory target.

# Impact of the Change
The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
5.2.16.6) already contains all the fields necessary to enumerate a
generic target proximity domain. All that is missing is the
interpretation of that proximity domain optionally as a target
identifier in the HMAT.

Given that the OS still needs to dynamically enumerate and instantiate
the memory ranges behind the host bridge. The assumption is that
operating systems that do not support native CXL enumeration will ignore
this data in the HMAT, while CXL native enumeration aware environments
will use this fragment of the performance path to calculate the
performance characteristics.

# References
* Compute Express Link Specification v2.0,
<https://www.computeexpresslink.org/>

# Detailed Description of the Change

* Replace "Generic Initiator" with "Generic Initiator / Target" in all
locations except where an "initiator" or "target" is implied.
Specifically 5.2.27.3 "Memory Proximity Domain Attributes Structure"
need not replace occurrences of "generic initiator" in field: "Proximity
Domain for Attached Initiator". Additionally field: "Proximity Domain
for the Memory" must be renamed to "Proximity Domain for the Memory /
Generic Target" with a new description "Integer that represents the
memory / generic target proximity domain to which this memory belongs."

* Revise "5.2.16.6 Generic Initiator Affinity Structure" to make it
  consistent with being referenced as either a target or initiator.

        * Description: (replace all text)

        > The Generic Initiator / Target Affinity Structure provides the
        > association between a Generic Initiator and a Memory Proximity
        > Domain, or another Generic Target Proximity Domain. The
        > distinction as to whether this structure represents an
        > Initiator, a Target, or both depends on how it is referenced
        > in the HMAT. See Section 5.2.27.3 for details.

        > Support of Generic Initiator / Target Affinity Structures by
        > OSPM is optional, and the platform may query whether the OS
        > supports it via the _OSC method. See Section 6.2.11.2.

        * Architectural transactions: (append after current text)

        > If this proximity domain is referenced as a target then it
        > supports all the transaction types inferred above.

        * Other updates are simple Initiator => Initiator / Target
          replacements.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] ACPI Code First ECR: Generic Target
  2021-02-10  3:55 [RFC] ACPI Code First ECR: Generic Target Dan Williams
@ 2021-02-10 11:23 ` Jonathan Cameron
  2021-02-10 15:18   ` Natu, Mahesh
  2021-02-10 16:24   ` Dan Williams
  2021-02-10 17:02 ` Vikram Sethi
  1 sibling, 2 replies; 18+ messages in thread
From: Jonathan Cameron @ 2021-02-10 11:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Linux ACPI, Natu, Mahesh, Chet R Douglas,
	Ben Widawsky, Vishal L Verma

On Tue, 9 Feb 2021 19:55:05 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> While the platform BIOS is able to describe the performance
> characteristics of CXL memory that is present at boot, it is unable to
> statically enumerate the performance of CXL memory hot inserted
> post-boot. The OS can enumerate most of the characteristics from link
> registers and CDAT, but the performance from the CPU to the host
> bridge, for example, is not enumerated by PCIE or CXL. Introduce an
> ACPI mechanism for this purpose. Critically this is achieved with a
> small tweak to how the existing Generic Initiator proximity domain is
> utilized in the HMAT.

Hi Dan,

Agree there is a hole here, but I think the proposed solution has some
issues for backwards compatibility.

Just to clarify, I believe CDAT from root ports is sufficient for the
other direction (GI on CXL, memory in host).  I wondered initially if
this was a two way issue, but after a reread, I think that is fine
with the root port providing CDAT or potentially treating the root
port as a GI (though that runs into the same naming / representation issue
as below and I think would need some clarifying text in UEFI GI description)

http://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf

For the case you are dealing with here potentially we 'could' add something
to CDAT as alternative to changing SRAT, but it would be more complex
so your approach here makes more sense to me.

> 
> ---
> 
> # Title: Introduce a Generic Target for CXL
> 
> # Status: Draft
> 
> # Document: ACPI Specification 6.4
> 
> # License
> SPDX-License Identifier: CC-BY-4.0
> 
> # Submitter:
> * Sponsor: Dan Williams, Intel
> * Creators/Contributors:
>     * Mahesh Natu, Intel
>     * Chet Douglas, Intel
>     * Deepak Shivakumar, Intel
> 
> # Summary of the Change
> Introduce a "Generic Target" concept to the SRAT to describe the root
> performance parameters in the path to dynamically discovered (outside of
> ACPI enumeration) CXL memory target endpoints.
> 
> # Benefits of the Change
> Consider the case of a system with a set of CXL host bridges (ACPI0016),

Superficially feels like this new SRAT entry might reference the CXL 2.0 Root
ports or the host bridge.

> and no devices attached at initial system power-on. In this scenario
> platform firmware is unable to perform the end-to-end enumeration
> necessary to populate SRAT and HMAT for the endpoints that may be
> hot-inserted behind those bridges post power-on. The address-range is
> unknown so SRAT can not be pre-populated, the performance is unknown (no
> CDAT nor interleave configuration) so HMAT can not be pre-populated.
> 
> However, what is known to platform firmware that generates the SRAT and
> HMAT is the performance characteristics of the path between CPU and
> Generic Initiators to the CXL host bridge target. With either
> CPU-to-Generic-Target, or Generic-Initiator-to-Generic-Target entries in
> the HMAT the OS CXL subsystem can enumerate the remaining details (PCIE
> link status, device CDAT, interleave configuration) to calculate the
> bandwidth and latency of a dynamically discovered CXL memory target.

I'm wondering if the term "generic target" is a good name.
Would something like "generic target bridge" be clearer?
The point being this isn't an actual target but a point along the way.
Mind you this is close to bike shedding.

As mentioned above, maybe "generic bridge" that can give us a node to hang
data off for both, a) GI on CXL to host memory, and b) Initiator in host to CXL memory
and hence give cleaner representation.

> 
> # Impact of the Change
> The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
> 5.2.16.6) already contains all the fields necessary to enumerate a
> generic target proximity domain. All that is missing is the
> interpretation of that proximity domain optionally as a target
> identifier in the HMAT.
> 
> Given that the OS still needs to dynamically enumerate and instantiate
> the memory ranges behind the host bridge. The assumption is that
> operating systems that do not support native CXL enumeration will ignore
> this data in the HMAT, while CXL native enumeration aware environments
> will use this fragment of the performance path to calculate the
> performance characteristics.

I don't think it is true that OS not supporting native CXL will ignore the
data.

Linux will create a small amount of infrastructure to reflect them (more or
less the same as a memoryless node) and also they will appear in places
like access0 as a possible initiator of transactions.  It's small stuff,
but I'd rather the impact on legacy was zero.

So my gut feeling here is we shouldn't reuse the generic initiator, but
should invent something new.  Would look similar to GI, but with a different
ID - to ensure legacy OS ignores it.

Unfortunately we can't just add a flag because backwards compatibility
with old OS would mean it was ignored.  Hence I think this needs to be
a new type.

If we define a new node type rather than extend GI, we need to be careful
around the same issue with _PXM that we had when introducing Generic
Initiators (not sure the protections on that made it back to stable)
so might need to modify DSDT _PXM responses based on appropriate _OSC.
May well be fine but I'm not convinced yet.  Perhaps we need to say
that using _PXM to place anything in a node defined only via this new means
is not valid.

Jonathan

> 
> # References
> * Compute Express Link Specification v2.0,
> <https://www.computeexpresslink.org/>
> 
> # Detailed Description of the Change
> 
> * Replace "Generic Initiator" with "Generic Initiator / Target" in all
> locations except where an "initiator" or "target" is implied.
> Specifically 5.2.27.3 "Memory Proximity Domain Attributes Structure"
> need not replace occurrences of "generic initiator" in field: "Proximity
> Domain for Attached Initiator". Additionally field: "Proximity Domain
> for the Memory" must be renamed to "Proximity Domain for the Memory /
> Generic Target" with a new description "Integer that represents the
> memory / generic target proximity domain to which this memory belongs."
> 
> * Revise "5.2.16.6 Generic Initiator Affinity Structure" to make it
>   consistent with being referenced as either a target or initiator.
> 
>         * Description: (replace all text)
> 
>         > The Generic Initiator / Target Affinity Structure provides the
>         > association between a Generic Initiator and a Memory Proximity
>         > Domain, or another Generic Target Proximity Domain. The
>         > distinction as to whether this structure represents an
>         > Initiator, a Target, or both depends on how it is referenced
>         > in the HMAT. See Section 5.2.27.3 for details.  
> 
>         > Support of Generic Initiator / Target Affinity Structures by
>         > OSPM is optional, and the platform may query whether the OS
>         > supports it via the _OSC method. See Section 6.2.11.2.  
> 
>         * Architectural transactions: (append after current text)
> 
>         > If this proximity domain is referenced as a target then it
>         > supports all the transaction types inferred above.  
> 
>         * Other updates are simple Initiator => Initiator / Target
>           replacements.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [RFC] ACPI Code First ECR: Generic Target
  2021-02-10 11:23 ` Jonathan Cameron
@ 2021-02-10 15:18   ` Natu, Mahesh
  2021-02-10 16:02     ` Jonathan Cameron
  2021-02-10 16:24   ` Dan Williams
  1 sibling, 1 reply; 18+ messages in thread
From: Natu, Mahesh @ 2021-02-10 15:18 UTC (permalink / raw)
  To: Jonathan Cameron, Williams, Dan J
  Cc: linux-cxl, Linux ACPI, Douglas, Chet R, Widawsky, Ben, Verma, Vishal L

Hi Jonathan,

First of all, it is a two-way problem because of the possibility of peer to peer traffic between root ports.

CDAT works very well for enumerable components with clean boundaries. It was not designed to handle Root complex objects. For example, it would have to account for CPU vendor specific coherent links between CPUs and at that point it starts looking like "HMAT".

Thank you,
Mahesh Natu
Datacenter Platform Architect
Intel Corporation

-----Original Message-----
From: Jonathan Cameron <Jonathan.Cameron@Huawei.com> 
Sent: Wednesday, February 10, 2021 3:24 AM
To: Williams, Dan J <dan.j.williams@intel.com>
Cc: linux-cxl@vger.kernel.org; Linux ACPI <linux-acpi@vger.kernel.org>; Natu, Mahesh <mahesh.natu@intel.com>; Douglas, Chet R <chet.r.douglas@intel.com>; Widawsky, Ben <ben.widawsky@intel.com>; Verma, Vishal L <vishal.l.verma@intel.com>
Subject: Re: [RFC] ACPI Code First ECR: Generic Target

On Tue, 9 Feb 2021 19:55:05 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> While the platform BIOS is able to describe the performance 
> characteristics of CXL memory that is present at boot, it is unable to 
> statically enumerate the performance of CXL memory hot inserted 
> post-boot. The OS can enumerate most of the characteristics from link 
> registers and CDAT, but the performance from the CPU to the host 
> bridge, for example, is not enumerated by PCIE or CXL. Introduce an 
> ACPI mechanism for this purpose. Critically this is achieved with a 
> small tweak to how the existing Generic Initiator proximity domain is 
> utilized in the HMAT.

Hi Dan,

Agree there is a hole here, but I think the proposed solution has some issues for backwards compatibility.

Just to clarify, I believe CDAT from root ports is sufficient for the other direction (GI on CXL, memory in host).  I wondered initially if this was a two way issue, but after a reread, I think that is fine with the root port providing CDAT or potentially treating the root port as a GI (though that runs into the same naming / representation issue as below and I think would need some clarifying text in UEFI GI description)

http://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf

For the case you are dealing with here potentially we 'could' add something to CDAT as alternative to changing SRAT, but it would be more complex so your approach here makes more sense to me.

> 
> ---
> 
> # Title: Introduce a Generic Target for CXL
> 
> # Status: Draft
> 
> # Document: ACPI Specification 6.4
> 
> # License
> SPDX-License Identifier: CC-BY-4.0
> 
> # Submitter:
> * Sponsor: Dan Williams, Intel
> * Creators/Contributors:
>     * Mahesh Natu, Intel
>     * Chet Douglas, Intel
>     * Deepak Shivakumar, Intel
> 
> # Summary of the Change
> Introduce a "Generic Target" concept to the SRAT to describe the root 
> performance parameters in the path to dynamically discovered (outside 
> of ACPI enumeration) CXL memory target endpoints.
> 
> # Benefits of the Change
> Consider the case of a system with a set of CXL host bridges 
> (ACPI0016),

Superficially feels like this new SRAT entry might reference the CXL 2.0 Root ports or the host bridge.

> and no devices attached at initial system power-on. In this scenario 
> platform firmware is unable to perform the end-to-end enumeration 
> necessary to populate SRAT and HMAT for the endpoints that may be 
> hot-inserted behind those bridges post power-on. The address-range is 
> unknown so SRAT can not be pre-populated, the performance is unknown 
> (no CDAT nor interleave configuration) so HMAT can not be pre-populated.
> 
> However, what is known to platform firmware that generates the SRAT 
> and HMAT is the performance characteristics of the path between CPU 
> and Generic Initiators to the CXL host bridge target. With either 
> CPU-to-Generic-Target, or Generic-Initiator-to-Generic-Target entries 
> in the HMAT the OS CXL subsystem can enumerate the remaining details 
> (PCIE link status, device CDAT, interleave configuration) to calculate 
> the bandwidth and latency of a dynamically discovered CXL memory target.

I'm wondering if the term "generic target" is a good name.
Would something like "generic target bridge" be clearer?
The point being this isn't an actual target but a point along the way.
Mind you this is close to bike shedding.

As mentioned above, maybe "generic bridge" that can give us a node to hang data off for both, a) GI on CXL to host memory, and b) Initiator in host to CXL memory and hence give cleaner representation.

> 
> # Impact of the Change
> The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
> 5.2.16.6) already contains all the fields necessary to enumerate a 
> generic target proximity domain. All that is missing is the 
> interpretation of that proximity domain optionally as a target 
> identifier in the HMAT.
> 
> Given that the OS still needs to dynamically enumerate and instantiate 
> the memory ranges behind the host bridge. The assumption is that 
> operating systems that do not support native CXL enumeration will 
> ignore this data in the HMAT, while CXL native enumeration aware 
> environments will use this fragment of the performance path to 
> calculate the performance characteristics.

I don't think it is true that OS not supporting native CXL will ignore the data.

Linux will create a small amount of infrastructure to reflect them (more or less the same as a memoryless node) and also they will appear in places like access0 as a possible initiator of transactions.  It's small stuff, but I'd rather the impact on legacy was zero.

So my gut feeling here is we shouldn't reuse the generic initiator, but should invent something new.  Would look similar to GI, but with a different ID - to ensure legacy OS ignores it.

Unfortunately we can't just add a flag because backwards compatibility with old OS would mean it was ignored.  Hence I think this needs to be a new type.

If we define a new node type rather than extend GI, we need to be careful around the same issue with _PXM that we had when introducing Generic Initiators (not sure the protections on that made it back to stable) so might need to modify DSDT _PXM responses based on appropriate _OSC.
May well be fine but I'm not convinced yet.  Perhaps we need to say that using _PXM to place anything in a node defined only via this new means is not valid.

Jonathan

> 
> # References
> * Compute Express Link Specification v2.0, 
> <https://www.computeexpresslink.org/>
> 
> # Detailed Description of the Change
> 
> * Replace "Generic Initiator" with "Generic Initiator / Target" in all 
> locations except where an "initiator" or "target" is implied.
> Specifically 5.2.27.3 "Memory Proximity Domain Attributes Structure"
> need not replace occurrences of "generic initiator" in field: 
> "Proximity Domain for Attached Initiator". Additionally field: 
> "Proximity Domain for the Memory" must be renamed to "Proximity Domain 
> for the Memory / Generic Target" with a new description "Integer that 
> represents the memory / generic target proximity domain to which this memory belongs."
> 
> * Revise "5.2.16.6 Generic Initiator Affinity Structure" to make it
>   consistent with being referenced as either a target or initiator.
> 
>         * Description: (replace all text)
> 
>         > The Generic Initiator / Target Affinity Structure provides the
>         > association between a Generic Initiator and a Memory Proximity
>         > Domain, or another Generic Target Proximity Domain. The
>         > distinction as to whether this structure represents an
>         > Initiator, a Target, or both depends on how it is referenced
>         > in the HMAT. See Section 5.2.27.3 for details.  
> 
>         > Support of Generic Initiator / Target Affinity Structures by
>         > OSPM is optional, and the platform may query whether the OS
>         > supports it via the _OSC method. See Section 6.2.11.2.  
> 
>         * Architectural transactions: (append after current text)
> 
>         > If this proximity domain is referenced as a target then it
>         > supports all the transaction types inferred above.  
> 
>         * Other updates are simple Initiator => Initiator / Target
>           replacements.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] ACPI Code First ECR: Generic Target
  2021-02-10 15:18   ` Natu, Mahesh
@ 2021-02-10 16:02     ` Jonathan Cameron
  0 siblings, 0 replies; 18+ messages in thread
From: Jonathan Cameron @ 2021-02-10 16:02 UTC (permalink / raw)
  To: Natu, Mahesh
  Cc: Williams, Dan J, linux-cxl, Linux ACPI, Douglas, Chet R,
	Widawsky, Ben, Verma, Vishal L

On Wed, 10 Feb 2021 15:18:38 +0000
"Natu, Mahesh" <mahesh.natu@intel.com> wrote:

> Hi Jonathan,
Hi Mahesh,

> 
> First of all, it is a two-way problem because of the possibility of peer to peer traffic between root ports.

Good point.  So in that case it seems we need to be able to report characteristics between pairs
of CXL ports (in generic terms 'boundaries' of the system where stuff might be connected later.)
As such they need to be represented in topology as something that can be both
a target and initiator (or a bridge to targets and initiators anyway).

> 
> CDAT works very well for enumerable components with clean boundaries. It was not designed to handle Root complex objects. For example, it would have to account for CPU vendor specific coherent links between CPUs and at that point it starts looking like "HMAT".

Agreed.  Clear distinction between what 'could' be done and what makes sense
to do in CDAT.  It's definitely feels like the wrong solution here.

Jonathan

> 
> Thank you,
> Mahesh Natu
> Datacenter Platform Architect
> Intel Corporation
> 
> -----Original Message-----
> From: Jonathan Cameron <Jonathan.Cameron@Huawei.com> 
> Sent: Wednesday, February 10, 2021 3:24 AM
> To: Williams, Dan J <dan.j.williams@intel.com>
> Cc: linux-cxl@vger.kernel.org; Linux ACPI <linux-acpi@vger.kernel.org>; Natu, Mahesh <mahesh.natu@intel.com>; Douglas, Chet R <chet.r.douglas@intel.com>; Widawsky, Ben <ben.widawsky@intel.com>; Verma, Vishal L <vishal.l.verma@intel.com>
> Subject: Re: [RFC] ACPI Code First ECR: Generic Target
> 
> On Tue, 9 Feb 2021 19:55:05 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > While the platform BIOS is able to describe the performance 
> > characteristics of CXL memory that is present at boot, it is unable to 
> > statically enumerate the performance of CXL memory hot inserted 
> > post-boot. The OS can enumerate most of the characteristics from link 
> > registers and CDAT, but the performance from the CPU to the host 
> > bridge, for example, is not enumerated by PCIE or CXL. Introduce an 
> > ACPI mechanism for this purpose. Critically this is achieved with a 
> > small tweak to how the existing Generic Initiator proximity domain is 
> > utilized in the HMAT.  
> 
> Hi Dan,
> 
> Agree there is a hole here, but I think the proposed solution has some issues for backwards compatibility.
> 
> Just to clarify, I believe CDAT from root ports is sufficient for the other direction (GI on CXL, memory in host).  I wondered initially if this was a two way issue, but after a reread, I think that is fine with the root port providing CDAT or potentially treating the root port as a GI (though that runs into the same naming / representation issue as below and I think would need some clarifying text in UEFI GI description)
> 
> http://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> 
> For the case you are dealing with here potentially we 'could' add something to CDAT as alternative to changing SRAT, but it would be more complex so your approach here makes more sense to me.
> 
> > 
> > ---
> > 
> > # Title: Introduce a Generic Target for CXL
> > 
> > # Status: Draft
> > 
> > # Document: ACPI Specification 6.4
> > 
> > # License
> > SPDX-License Identifier: CC-BY-4.0
> > 
> > # Submitter:
> > * Sponsor: Dan Williams, Intel
> > * Creators/Contributors:
> >     * Mahesh Natu, Intel
> >     * Chet Douglas, Intel
> >     * Deepak Shivakumar, Intel
> > 
> > # Summary of the Change
> > Introduce a "Generic Target" concept to the SRAT to describe the root 
> > performance parameters in the path to dynamically discovered (outside 
> > of ACPI enumeration) CXL memory target endpoints.
> > 
> > # Benefits of the Change
> > Consider the case of a system with a set of CXL host bridges 
> > (ACPI0016),  
> 
> Superficially feels like this new SRAT entry might reference the CXL 2.0 Root ports or the host bridge.
> 
> > and no devices attached at initial system power-on. In this scenario 
> > platform firmware is unable to perform the end-to-end enumeration 
> > necessary to populate SRAT and HMAT for the endpoints that may be 
> > hot-inserted behind those bridges post power-on. The address-range is 
> > unknown so SRAT can not be pre-populated, the performance is unknown 
> > (no CDAT nor interleave configuration) so HMAT can not be pre-populated.
> > 
> > However, what is known to platform firmware that generates the SRAT 
> > and HMAT is the performance characteristics of the path between CPU 
> > and Generic Initiators to the CXL host bridge target. With either 
> > CPU-to-Generic-Target, or Generic-Initiator-to-Generic-Target entries 
> > in the HMAT the OS CXL subsystem can enumerate the remaining details 
> > (PCIE link status, device CDAT, interleave configuration) to calculate 
> > the bandwidth and latency of a dynamically discovered CXL memory target.  
> 
> I'm wondering if the term "generic target" is a good name.
> Would something like "generic target bridge" be clearer?
> The point being this isn't an actual target but a point along the way.
> Mind you this is close to bike shedding.
> 
> As mentioned above, maybe "generic bridge" that can give us a node to hang data off for both, a) GI on CXL to host memory, and b) Initiator in host to CXL memory and hence give cleaner representation.
> 
> > 
> > # Impact of the Change
> > The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
> > 5.2.16.6) already contains all the fields necessary to enumerate a 
> > generic target proximity domain. All that is missing is the 
> > interpretation of that proximity domain optionally as a target 
> > identifier in the HMAT.
> > 
> > Given that the OS still needs to dynamically enumerate and instantiate 
> > the memory ranges behind the host bridge. The assumption is that 
> > operating systems that do not support native CXL enumeration will 
> > ignore this data in the HMAT, while CXL native enumeration aware 
> > environments will use this fragment of the performance path to 
> > calculate the performance characteristics.  
> 
> I don't think it is true that OS not supporting native CXL will ignore the data.
> 
> Linux will create a small amount of infrastructure to reflect them (more or less the same as a memoryless node) and also they will appear in places like access0 as a possible initiator of transactions.  It's small stuff, but I'd rather the impact on legacy was zero.
> 
> So my gut feeling here is we shouldn't reuse the generic initiator, but should invent something new.  Would look similar to GI, but with a different ID - to ensure legacy OS ignores it.
> 
> Unfortunately we can't just add a flag because backwards compatibility with old OS would mean it was ignored.  Hence I think this needs to be a new type.
> 
> If we define a new node type rather than extend GI, we need to be careful around the same issue with _PXM that we had when introducing Generic Initiators (not sure the protections on that made it back to stable) so might need to modify DSDT _PXM responses based on appropriate _OSC.
> May well be fine but I'm not convinced yet.  Perhaps we need to say that using _PXM to place anything in a node defined only via this new means is not valid.
> 
> Jonathan
> 
> > 
> > # References
> > * Compute Express Link Specification v2.0, 
> > <https://www.computeexpresslink.org/>
> > 
> > # Detailed Description of the Change
> > 
> > * Replace "Generic Initiator" with "Generic Initiator / Target" in all 
> > locations except where an "initiator" or "target" is implied.
> > Specifically 5.2.27.3 "Memory Proximity Domain Attributes Structure"
> > need not replace occurrences of "generic initiator" in field: 
> > "Proximity Domain for Attached Initiator". Additionally field: 
> > "Proximity Domain for the Memory" must be renamed to "Proximity Domain 
> > for the Memory / Generic Target" with a new description "Integer that 
> > represents the memory / generic target proximity domain to which this memory belongs."
> > 
> > * Revise "5.2.16.6 Generic Initiator Affinity Structure" to make it
> >   consistent with being referenced as either a target or initiator.
> > 
> >         * Description: (replace all text)
> >   
> >         > The Generic Initiator / Target Affinity Structure provides the
> >         > association between a Generic Initiator and a Memory Proximity
> >         > Domain, or another Generic Target Proximity Domain. The
> >         > distinction as to whether this structure represents an
> >         > Initiator, a Target, or both depends on how it is referenced
> >         > in the HMAT. See Section 5.2.27.3 for details.    
> >   
> >         > Support of Generic Initiator / Target Affinity Structures by
> >         > OSPM is optional, and the platform may query whether the OS
> >         > supports it via the _OSC method. See Section 6.2.11.2.    
> > 
> >         * Architectural transactions: (append after current text)
> >   
> >         > If this proximity domain is referenced as a target then it
> >         > supports all the transaction types inferred above.    
> > 
> >         * Other updates are simple Initiator => Initiator / Target
> >           replacements.  
> 
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] ACPI Code First ECR: Generic Target
  2021-02-10 11:23 ` Jonathan Cameron
  2021-02-10 15:18   ` Natu, Mahesh
@ 2021-02-10 16:24   ` Dan Williams
  2021-02-11  9:42     ` Jonathan Cameron
  1 sibling, 1 reply; 18+ messages in thread
From: Dan Williams @ 2021-02-10 16:24 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, Linux ACPI, Natu, Mahesh, Chet R Douglas,
	Ben Widawsky, Vishal L Verma

On Wed, Feb 10, 2021 at 3:24 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Tue, 9 Feb 2021 19:55:05 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
>
> > While the platform BIOS is able to describe the performance
> > characteristics of CXL memory that is present at boot, it is unable to
> > statically enumerate the performance of CXL memory hot inserted
> > post-boot. The OS can enumerate most of the characteristics from link
> > registers and CDAT, but the performance from the CPU to the host
> > bridge, for example, is not enumerated by PCIE or CXL. Introduce an
> > ACPI mechanism for this purpose. Critically this is achieved with a
> > small tweak to how the existing Generic Initiator proximity domain is
> > utilized in the HMAT.
>
> Hi Dan,
>
> Agree there is a hole here, but I think the proposed solution has some
> issues for backwards compatibility.
>
> Just to clarify, I believe CDAT from root ports is sufficient for the
> other direction (GI on CXL, memory in host).  I wondered initially if
> this was a two way issue, but after a reread, I think that is fine
> with the root port providing CDAT or potentially treating the root
> port as a GI (though that runs into the same naming / representation issue
> as below and I think would need some clarifying text in UEFI GI description)
>
> http://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
>
> For the case you are dealing with here potentially we 'could' add something
> to CDAT as alternative to changing SRAT, but it would be more complex
> so your approach here makes more sense to me.

CDAT seems the wrong mechanism because it identifies target
performance once you're at the front door of the device, not
performance relative to a given initiator.

>
> >
> > ---
> >
> > # Title: Introduce a Generic Target for CXL
> >
> > # Status: Draft
> >
> > # Document: ACPI Specification 6.4
> >
> > # License
> > SPDX-License Identifier: CC-BY-4.0
> >
> > # Submitter:
> > * Sponsor: Dan Williams, Intel
> > * Creators/Contributors:
> >     * Mahesh Natu, Intel
> >     * Chet Douglas, Intel
> >     * Deepak Shivakumar, Intel
> >
> > # Summary of the Change
> > Introduce a "Generic Target" concept to the SRAT to describe the root
> > performance parameters in the path to dynamically discovered (outside of
> > ACPI enumeration) CXL memory target endpoints.
> >
> > # Benefits of the Change
> > Consider the case of a system with a set of CXL host bridges (ACPI0016),
>
> Superficially feels like this new SRAT entry might reference the CXL 2.0 Root
> ports or the host bridge.
>
> > and no devices attached at initial system power-on. In this scenario
> > platform firmware is unable to perform the end-to-end enumeration
> > necessary to populate SRAT and HMAT for the endpoints that may be
> > hot-inserted behind those bridges post power-on. The address-range is
> > unknown so SRAT can not be pre-populated, the performance is unknown (no
> > CDAT nor interleave configuration) so HMAT can not be pre-populated.
> >
> > However, what is known to platform firmware that generates the SRAT and
> > HMAT is the performance characteristics of the path between CPU and
> > Generic Initiators to the CXL host bridge target. With either
> > CPU-to-Generic-Target, or Generic-Initiator-to-Generic-Target entries in
> > the HMAT the OS CXL subsystem can enumerate the remaining details (PCIE
> > link status, device CDAT, interleave configuration) to calculate the
> > bandwidth and latency of a dynamically discovered CXL memory target.
>
> I'm wondering if the term "generic target" is a good name.
> Would something like "generic target bridge" be clearer?
> The point being this isn't an actual target but a point along the way.
> Mind you this is close to bike shedding.
>
> As mentioned above, maybe "generic bridge" that can give us a node to hang
> data off for both, a) GI on CXL to host memory, and b) Initiator in host to CXL memory
> and hence give cleaner representation.

"Target" in the sense of its role in the HMAT. This is conceptually
not limited to bridges. Imagine a CXL endpoint that the BIOS lets the
OS map into the memory address space, but describes the performance in
HMAT.

>
> >
> > # Impact of the Change
> > The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
> > 5.2.16.6) already contains all the fields necessary to enumerate a
> > generic target proximity domain. All that is missing is the
> > interpretation of that proximity domain optionally as a target
> > identifier in the HMAT.
> >
> > Given that the OS still needs to dynamically enumerate and instantiate
> > the memory ranges behind the host bridge. The assumption is that
> > operating systems that do not support native CXL enumeration will ignore
> > this data in the HMAT, while CXL native enumeration aware environments
> > will use this fragment of the performance path to calculate the
> > performance characteristics.
>
> I don't think it is true that OS not supporting native CXL will ignore the
> data.

True, I should have chosen more careful words like s/ignore/not
regress upon seeing/

>
> Linux will create a small amount of infrastructure to reflect them (more or
> less the same as a memoryless node) and also they will appear in places
> like access0 as a possible initiator of transactions.  It's small stuff,
> but I'd rather the impact on legacy was zero.

I'm failing to see that small collision as fatal to the proposal. The
HMAT parsing had a significant bug for multiple kernel releases and no
one noticed. This quirk is minor in comparison.

>
> So my gut feeling here is we shouldn't reuse the generic initiator, but
> should invent something new.  Would look similar to GI, but with a different
> ID - to ensure legacy OS ignores it.

A new id introduces more problems than it solves. Set aside the ACPICA
thrash, it does not allow a clean identity mapping of a point in a
system topology being both initiator and target. The SRAT does not
need more data structures to convey this information. At most I would
advocate for an OSC bit for the OS to opt into allowing this new usage
in the HMAT, but that still feels like overkill absent a clear
regression in legacy environments. The fact that hardly anyone is
using HMAT (as indicated by the bug I mentioned) gives me confidence
that perfection is more "enemy of the good" than required here.

>
> Unfortunately we can't just add a flag because backwards compatibility
> with old OS would mean it was ignored.  Hence I think this needs to be
> a new type.
>
> If we define a new node type rather than extend GI, we need to be careful
> around the same issue with _PXM that we had when introducing Generic
> Initiators (not sure the protections on that made it back to stable)
> so might need to modify DSDT _PXM responses based on appropriate _OSC.
> May well be fine but I'm not convinced yet.  Perhaps we need to say
> that using _PXM to place anything in a node defined only via this new means
> is not valid.
>
> Jonathan
>
> >
> > # References
> > * Compute Express Link Specification v2.0,
> > <https://www.computeexpresslink.org/>
> >
> > # Detailed Description of the Change
> >
> > * Replace "Generic Initiator" with "Generic Initiator / Target" in all
> > locations except where an "initiator" or "target" is implied.
> > Specifically 5.2.27.3 "Memory Proximity Domain Attributes Structure"
> > need not replace occurrences of "generic initiator" in field: "Proximity
> > Domain for Attached Initiator". Additionally field: "Proximity Domain
> > for the Memory" must be renamed to "Proximity Domain for the Memory /
> > Generic Target" with a new description "Integer that represents the
> > memory / generic target proximity domain to which this memory belongs."
> >
> > * Revise "5.2.16.6 Generic Initiator Affinity Structure" to make it
> >   consistent with being referenced as either a target or initiator.
> >
> >         * Description: (replace all text)
> >
> >         > The Generic Initiator / Target Affinity Structure provides the
> >         > association between a Generic Initiator and a Memory Proximity
> >         > Domain, or another Generic Target Proximity Domain. The
> >         > distinction as to whether this structure represents an
> >         > Initiator, a Target, or both depends on how it is referenced
> >         > in the HMAT. See Section 5.2.27.3 for details.
> >
> >         > Support of Generic Initiator / Target Affinity Structures by
> >         > OSPM is optional, and the platform may query whether the OS
> >         > supports it via the _OSC method. See Section 6.2.11.2.
> >
> >         * Architectural transactions: (append after current text)
> >
> >         > If this proximity domain is referenced as a target then it
> >         > supports all the transaction types inferred above.
> >
> >         * Other updates are simple Initiator => Initiator / Target
> >           replacements.
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [RFC] ACPI Code First ECR: Generic Target
  2021-02-10  3:55 [RFC] ACPI Code First ECR: Generic Target Dan Williams
  2021-02-10 11:23 ` Jonathan Cameron
@ 2021-02-10 17:02 ` Vikram Sethi
  2021-02-12  0:13   ` Natu, Mahesh
  1 sibling, 1 reply; 18+ messages in thread
From: Vikram Sethi @ 2021-02-10 17:02 UTC (permalink / raw)
  To: Dan Williams, linux-cxl, Linux ACPI
  Cc: Natu, Mahesh, Chet R Douglas, Ben Widawsky, Vishal L Verma

Hi Dan,

> -----Original Message-----
> From: Dan Williams <dan.j.williams@intel.com>
> Sent: Tuesday, February 9, 2021 9:55 PM
> To: linux-cxl@vger.kernel.org; Linux ACPI <linux-acpi@vger.kernel.org>
> Cc: Natu, Mahesh <mahesh.natu@intel.com>; Chet R Douglas
> <chet.r.douglas@intel.com>; Ben Widawsky <ben.widawsky@intel.com>; Vishal L
> Verma <vishal.l.verma@intel.com>
> Subject: [RFC] ACPI Code First ECR: Generic Target
> 
> External email: Use caution opening links or attachments
> 
> 
> While the platform BIOS is able to describe the performance
> characteristics of CXL memory that is present at boot, it is unable to
> statically enumerate the performance of CXL memory hot inserted
> post-boot. The OS can enumerate most of the characteristics from link
> registers and CDAT, but the performance from the CPU to the host
> bridge, for example, is not enumerated by PCIE or CXL. Introduce an
> ACPI mechanism for this purpose. Critically this is achieved with a
> small tweak to how the existing Generic Initiator proximity domain is
> utilized in the HMAT.
> 
> ---
> 
> # Title: Introduce a Generic Target for CXL
> 
> # Status: Draft
> 
> # Document: ACPI Specification 6.4
> 
> # License
> SPDX-License Identifier: CC-BY-4.0
> 
> # Submitter:
> * Sponsor: Dan Williams, Intel
> * Creators/Contributors:
>     * Mahesh Natu, Intel
>     * Chet Douglas, Intel
>     * Deepak Shivakumar, Intel
> 
> # Summary of the Change
> Introduce a "Generic Target" concept to the SRAT to describe the root
> performance parameters in the path to dynamically discovered (outside of
> ACPI enumeration) CXL memory target endpoints.
> 
> # Benefits of the Change
> Consider the case of a system with a set of CXL host bridges (ACPI0016),
> and no devices attached at initial system power-on. In this scenario
> platform firmware is unable to perform the end-to-end enumeration
> necessary to populate SRAT and HMAT for the endpoints that may be
> hot-inserted behind those bridges post power-on. The address-range is
> unknown so SRAT can not be pre-populated, the performance is unknown (no
> CDAT nor interleave configuration) so HMAT can not be pre-populated.
> 
> However, what is known to platform firmware that generates the SRAT and
> HMAT is the performance characteristics of the path between CPU and
> Generic Initiators to the CXL host bridge target. With either
> CPU-to-Generic-Target, or Generic-Initiator-to-Generic-Target entries in
> the HMAT the OS CXL subsystem can enumerate the remaining details (PCIE
> link status, device CDAT, interleave configuration) to calculate the
> bandwidth and latency of a dynamically discovered CXL memory target.
> 
What if there is a CXL switch with a Generic Initiator and CXL type 3 memory
which could be hotplugged in. 
I forget if the GI to type 3 memory path is ONLY through the host bridge today in 2.0 
or also allowed through the switch. In future we would want it allowed through
the switch for sure, just like PCIe p2p. 
So how would the switch route latency/BW be discovered?
Also, an example with numbers of what would be in HMAT may help understand the case
where everything is only connected via host bridge also. 

> # Impact of the Change
> The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
> 5.2.16.6) already contains all the fields necessary to enumerate a
> generic target proximity domain. All that is missing is the
> interpretation of that proximity domain optionally as a target
> identifier in the HMAT.
> 
> Given that the OS still needs to dynamically enumerate and instantiate
> the memory ranges behind the host bridge. The assumption is that
> operating systems that do not support native CXL enumeration will ignore
> this data in the HMAT, while CXL native enumeration aware environments
> will use this fragment of the performance path to calculate the
> performance characteristics.
> 
> # References
> * Compute Express Link Specification v2.0,
> <https://www.computeexpresslink.org/>
> 
> # Detailed Description of the Change
> 
> * Replace "Generic Initiator" with "Generic Initiator / Target" in all
> locations except where an "initiator" or "target" is implied.
> Specifically 5.2.27.3 "Memory Proximity Domain Attributes Structure"
> need not replace occurrences of "generic initiator" in field: "Proximity
> Domain for Attached Initiator". Additionally field: "Proximity Domain
> for the Memory" must be renamed to "Proximity Domain for the Memory /
> Generic Target" with a new description "Integer that represents the
> memory / generic target proximity domain to which this memory belongs."
> 
> * Revise "5.2.16.6 Generic Initiator Affinity Structure" to make it
>   consistent with being referenced as either a target or initiator.
> 
>         * Description: (replace all text)
> 
>         > The Generic Initiator / Target Affinity Structure provides the
>         > association between a Generic Initiator and a Memory Proximity
>         > Domain, or another Generic Target Proximity Domain. The
>         > distinction as to whether this structure represents an
>         > Initiator, a Target, or both depends on how it is referenced
>         > in the HMAT. See Section 5.2.27.3 for details.
> 
>         > Support of Generic Initiator / Target Affinity Structures by
>         > OSPM is optional, and the platform may query whether the OS
>         > supports it via the _OSC method. See Section 6.2.11.2.
> 
>         * Architectural transactions: (append after current text)
> 
>         > If this proximity domain is referenced as a target then it
>         > supports all the transaction types inferred above.
> 
>         * Other updates are simple Initiator => Initiator / Target
>           replacements.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] ACPI Code First ECR: Generic Target
  2021-02-10 16:24   ` Dan Williams
@ 2021-02-11  9:42     ` Jonathan Cameron
  2021-02-11 17:06       ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Cameron @ 2021-02-11  9:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Linux ACPI, Natu, Mahesh, Chet R Douglas,
	Ben Widawsky, Vishal L Verma

On Wed, 10 Feb 2021 08:24:51 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> On Wed, Feb 10, 2021 at 3:24 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Tue, 9 Feb 2021 19:55:05 -0800
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >  
> > > While the platform BIOS is able to describe the performance
> > > characteristics of CXL memory that is present at boot, it is unable to
> > > statically enumerate the performance of CXL memory hot inserted
> > > post-boot. The OS can enumerate most of the characteristics from link
> > > registers and CDAT, but the performance from the CPU to the host
> > > bridge, for example, is not enumerated by PCIE or CXL. Introduce an
> > > ACPI mechanism for this purpose. Critically this is achieved with a
> > > small tweak to how the existing Generic Initiator proximity domain is
> > > utilized in the HMAT.  
> >
> > Hi Dan,
> >
> > Agree there is a hole here, but I think the proposed solution has some
> > issues for backwards compatibility.
> >
> > Just to clarify, I believe CDAT from root ports is sufficient for the
> > other direction (GI on CXL, memory in host).  I wondered initially if
> > this was a two way issue, but after a reread, I think that is fine
> > with the root port providing CDAT or potentially treating the root
> > port as a GI (though that runs into the same naming / representation issue
> > as below and I think would need some clarifying text in UEFI GI description)
> >
> > http://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> >
> > For the case you are dealing with here potentially we 'could' add something
> > to CDAT as alternative to changing SRAT, but it would be more complex
> > so your approach here makes more sense to me.  
> 
> CDAT seems the wrong mechanism because it identifies target
> performance once you're at the front door of the device, not
> performance relative to a given initiator.

I'd argue you could make CDAT a more symmetric representation, but it would
end up replicating a lot of info already in HMAT.  Didn't say it was a good
idea!

> 
> >  
> > >
> > > ---
> > >
> > > # Title: Introduce a Generic Target for CXL
> > >
> > > # Status: Draft
> > >
> > > # Document: ACPI Specification 6.4
> > >
> > > # License
> > > SPDX-License Identifier: CC-BY-4.0
> > >
> > > # Submitter:
> > > * Sponsor: Dan Williams, Intel
> > > * Creators/Contributors:
> > >     * Mahesh Natu, Intel
> > >     * Chet Douglas, Intel
> > >     * Deepak Shivakumar, Intel
> > >
> > > # Summary of the Change
> > > Introduce a "Generic Target" concept to the SRAT to describe the root
> > > performance parameters in the path to dynamically discovered (outside of
> > > ACPI enumeration) CXL memory target endpoints.
> > >
> > > # Benefits of the Change
> > > Consider the case of a system with a set of CXL host bridges (ACPI0016),  
> >
> > Superficially feels like this new SRAT entry might reference the CXL 2.0 Root
> > ports or the host bridge.
> >  
> > > and no devices attached at initial system power-on. In this scenario
> > > platform firmware is unable to perform the end-to-end enumeration
> > > necessary to populate SRAT and HMAT for the endpoints that may be
> > > hot-inserted behind those bridges post power-on. The address-range is
> > > unknown so SRAT can not be pre-populated, the performance is unknown (no
> > > CDAT nor interleave configuration) so HMAT can not be pre-populated.
> > >
> > > However, what is known to platform firmware that generates the SRAT and
> > > HMAT is the performance characteristics of the path between CPU and
> > > Generic Initiators to the CXL host bridge target. With either
> > > CPU-to-Generic-Target, or Generic-Initiator-to-Generic-Target entries in
> > > the HMAT the OS CXL subsystem can enumerate the remaining details (PCIE
> > > link status, device CDAT, interleave configuration) to calculate the
> > > bandwidth and latency of a dynamically discovered CXL memory target.  
> >
> > I'm wondering if the term "generic target" is a good name.
> > Would something like "generic target bridge" be clearer?
> > The point being this isn't an actual target but a point along the way.
> > Mind you this is close to bike shedding.
> >
> > As mentioned above, maybe "generic bridge" that can give us a node to hang
> > data off for both, a) GI on CXL to host memory, and b) Initiator in host to CXL memory
> > and hence give cleaner representation.  
> 
> "Target" in the sense of its role in the HMAT. This is conceptually
> not limited to bridges. Imagine a CXL endpoint that the BIOS lets the
> OS map into the memory address space, but describes the performance in
> HMAT.

That's an odd situation that it sort of 'half' manages it in the BIOS.
We probably need some supplementary additional docs around this topic
as the OS would need to be aware of that possibility and explicitly check
for it before doing its normal build based on CDAT + what you are proposing
here.  Maybe code is enough but given this is cross OS stuff I'd argue
it probably isn't.

I guess could revisit this draft Uefi white paper perhaps and add a bunch
of examples around this usecase https://github.com/hisilicon/acpi-numa-whitepaper

> 
> >  
> > >
> > > # Impact of the Change
> > > The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
> > > 5.2.16.6) already contains all the fields necessary to enumerate a
> > > generic target proximity domain. All that is missing is the
> > > interpretation of that proximity domain optionally as a target
> > > identifier in the HMAT.
> > >
> > > Given that the OS still needs to dynamically enumerate and instantiate
> > > the memory ranges behind the host bridge. The assumption is that
> > > operating systems that do not support native CXL enumeration will ignore
> > > this data in the HMAT, while CXL native enumeration aware environments
> > > will use this fragment of the performance path to calculate the
> > > performance characteristics.  
> >
> > I don't think it is true that OS not supporting native CXL will ignore the
> > data.  
> 
> True, I should have chosen more careful words like s/ignore/not
> regress upon seeing/

It's a sticky corner and I suspect likely to come up at in ACPI WG - what is
being proposed here isn't backwards compatible, even if the impacts in Linux
are small.  Mostly it's infrastructure bring up that won't get used
(fallback lists and similar for a node which will never be specified in
allocations) and some confusing userspace ABI (which is more than a little
confusing already). 

> 
> >
> > Linux will create a small amount of infrastructure to reflect them (more or
> > less the same as a memoryless node) and also they will appear in places
> > like access0 as a possible initiator of transactions.  It's small stuff,
> > but I'd rather the impact on legacy was zero.  
> 
> I'm failing to see that small collision as fatal to the proposal. The
> HMAT parsing had a significant bug for multiple kernel releases and no
> one noticed. This quirk is minor in comparison.

True, there is a lag in HMAT adoption - though for ACPI tables, not that long
(only a couple of years :)

> 
> >
> > So my gut feeling here is we shouldn't reuse the generic initiator, but
> > should invent something new.  Would look similar to GI, but with a different
> > ID - to ensure legacy OS ignores it.  
> 
> A new id introduces more problems than it solves. Set aside the ACPICA
> thrash, it does not allow a clean identity mapping of a point in a
> system topology being both initiator and target. The SRAT does not
> need more data structures to convey this information. At most I would
> advocate for an OSC bit for the OS to opt into allowing this new usage
> in the HMAT, but that still feels like overkill absent a clear
> regression in legacy environments.

OSC for this case doesn't work. You can't necessarily evaluate it
early enough in the boot - in Linux the node setup is before AML parsing
comes up.  HMAT is evaluated a lot later, but SRAT is too early.  + in theory
another OS is allowed to evaluate HMAT before OSC is available.
 
> The fact that hardly anyone is
> using HMAT (as indicated by the bug I mentioned) gives me confidence
> that perfection is more "enemy of the good" than required here.

How about taking this another way

1) Assume that the costs of 'false' GI nodes on legacy system as a result
   of this is minor - so just live with it.  (probably true, but as ever
   need to confirm with other OS)

2) Try to remove the cost of pointless infrastructure on 'aware' kernels.
   Add a flag to the GI entry to say it's a bridge and not expected to,
   in of itself, represent an initiator or a target.
   In Linux we then don't create the node intrastructure etc or assign
   any devices to have the non existent NUMA node.

The information is still there to combine with device info (CDAT) etc
and build what we eventually want in the way of a representation of
the topology that Linux can use.

Now we just have the 'small' problem of figuring out how actually implement
hotplugging of NUMA nodes.

Jonathan

> 
> >
> > Unfortunately we can't just add a flag because backwards compatibility
> > with old OS would mean it was ignored.  Hence I think this needs to be
> > a new type.
> >
> > If we define a new node type rather than extend GI, we need to be careful
> > around the same issue with _PXM that we had when introducing Generic
> > Initiators (not sure the protections on that made it back to stable)
> > so might need to modify DSDT _PXM responses based on appropriate _OSC.
> > May well be fine but I'm not convinced yet.  Perhaps we need to say
> > that using _PXM to place anything in a node defined only via this new means
> > is not valid.
> >
> > Jonathan
> >  
> > >
> > > # References
> > > * Compute Express Link Specification v2.0,
> > > <https://www.computeexpresslink.org/>
> > >
> > > # Detailed Description of the Change
> > >
> > > * Replace "Generic Initiator" with "Generic Initiator / Target" in all
> > > locations except where an "initiator" or "target" is implied.
> > > Specifically 5.2.27.3 "Memory Proximity Domain Attributes Structure"
> > > need not replace occurrences of "generic initiator" in field: "Proximity
> > > Domain for Attached Initiator". Additionally field: "Proximity Domain
> > > for the Memory" must be renamed to "Proximity Domain for the Memory /
> > > Generic Target" with a new description "Integer that represents the
> > > memory / generic target proximity domain to which this memory belongs."
> > >
> > > * Revise "5.2.16.6 Generic Initiator Affinity Structure" to make it
> > >   consistent with being referenced as either a target or initiator.
> > >
> > >         * Description: (replace all text)
> > >  
> > >         > The Generic Initiator / Target Affinity Structure provides the
> > >         > association between a Generic Initiator and a Memory Proximity
> > >         > Domain, or another Generic Target Proximity Domain. The
> > >         > distinction as to whether this structure represents an
> > >         > Initiator, a Target, or both depends on how it is referenced
> > >         > in the HMAT. See Section 5.2.27.3 for details.  
> > >  
> > >         > Support of Generic Initiator / Target Affinity Structures by
> > >         > OSPM is optional, and the platform may query whether the OS
> > >         > supports it via the _OSC method. See Section 6.2.11.2.  
> > >
> > >         * Architectural transactions: (append after current text)
> > >  
> > >         > If this proximity domain is referenced as a target then it
> > >         > supports all the transaction types inferred above.  
> > >
> > >         * Other updates are simple Initiator => Initiator / Target
> > >           replacements.  
> >
> >  


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] ACPI Code First ECR: Generic Target
  2021-02-11  9:42     ` Jonathan Cameron
@ 2021-02-11 17:06       ` Dan Williams
  2021-02-12 12:24         ` Jonathan Cameron
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2021-02-11 17:06 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, Linux ACPI, Natu, Mahesh, Chet R Douglas,
	Ben Widawsky, Vishal L Verma

On Thu, Feb 11, 2021 at 1:44 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Wed, 10 Feb 2021 08:24:51 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
>
> > On Wed, Feb 10, 2021 at 3:24 AM Jonathan Cameron
> > <Jonathan.Cameron@huawei.com> wrote:
> > >
> > > On Tue, 9 Feb 2021 19:55:05 -0800
> > > Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > > While the platform BIOS is able to describe the performance
> > > > characteristics of CXL memory that is present at boot, it is unable to
> > > > statically enumerate the performance of CXL memory hot inserted
> > > > post-boot. The OS can enumerate most of the characteristics from link
> > > > registers and CDAT, but the performance from the CPU to the host
> > > > bridge, for example, is not enumerated by PCIE or CXL. Introduce an
> > > > ACPI mechanism for this purpose. Critically this is achieved with a
> > > > small tweak to how the existing Generic Initiator proximity domain is
> > > > utilized in the HMAT.
> > >
> > > Hi Dan,
> > >
> > > Agree there is a hole here, but I think the proposed solution has some
> > > issues for backwards compatibility.
> > >
> > > Just to clarify, I believe CDAT from root ports is sufficient for the
> > > other direction (GI on CXL, memory in host).  I wondered initially if
> > > this was a two way issue, but after a reread, I think that is fine
> > > with the root port providing CDAT or potentially treating the root
> > > port as a GI (though that runs into the same naming / representation issue
> > > as below and I think would need some clarifying text in UEFI GI description)
> > >
> > > http://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> > >
> > > For the case you are dealing with here potentially we 'could' add something
> > > to CDAT as alternative to changing SRAT, but it would be more complex
> > > so your approach here makes more sense to me.
> >
> > CDAT seems the wrong mechanism because it identifies target
> > performance once you're at the front door of the device, not
> > performance relative to a given initiator.
>
> I'd argue you could make CDAT a more symmetric representation, but it would
> end up replicating a lot of info already in HMAT.  Didn't say it was a good
> idea!

CDAT describes points, HMAT describes edges on the performance graph,
it would be confusing if CDAT tried to supplant HMAT.

>
> That's an odd situation that it sort of 'half' manages it in the BIOS.
> We probably need some supplementary additional docs around this topic
> as the OS would need to be aware of that possibility and explicitly check
> for it before doing its normal build based on CDAT + what you are proposing
> here.  Maybe code is enough but given this is cross OS stuff I'd argue
> it probably isn't.
>
> I guess could revisit this draft Uefi white paper perhaps and add a bunch
> of examples around this usecase https://github.com/hisilicon/acpi-numa-whitepaper

Thanks for the reference, I'll take a look.

>
> >
> > >
> > > >
> > > > # Impact of the Change
> > > > The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
> > > > 5.2.16.6) already contains all the fields necessary to enumerate a
> > > > generic target proximity domain. All that is missing is the
> > > > interpretation of that proximity domain optionally as a target
> > > > identifier in the HMAT.
> > > >
> > > > Given that the OS still needs to dynamically enumerate and instantiate
> > > > the memory ranges behind the host bridge. The assumption is that
> > > > operating systems that do not support native CXL enumeration will ignore
> > > > this data in the HMAT, while CXL native enumeration aware environments
> > > > will use this fragment of the performance path to calculate the
> > > > performance characteristics.
> > >
> > > I don't think it is true that OS not supporting native CXL will ignore the
> > > data.
> >
> > True, I should have chosen more careful words like s/ignore/not
> > regress upon seeing/
>
> It's a sticky corner and I suspect likely to come up at in ACPI WG - what is
> being proposed here isn't backwards compatible

It seems our definitions of backwards compatible are divergent. Please
correct me if I'm wrong, but I understand your position to be "any
perceptible OS behavior change breaks backwards compatibility",
whereas I'm advocating that backwards compatibility is relative
regressing real world use cases. That said, I do need to go mock this
up in QEMU and verify how much disturbance it causes.

> even if the impacts in Linux are small.

I'd note the kernel would grind to a halt if the criteria for
"backwards compatible" was zero perceptible behavior change.

> Mostly it's infrastructure bring up that won't get used
> (fallback lists and similar for a node which will never be specified in
> allocations) and some confusing userspace ABI (which is more than a little
> confusing already).

Fallback lists are established relative to online nodes. These generic
targets are not onlined as memory.

> > > Linux will create a small amount of infrastructure to reflect them (more or
> > > less the same as a memoryless node) and also they will appear in places
> > > like access0 as a possible initiator of transactions.  It's small stuff,
> > > but I'd rather the impact on legacy was zero.
> >
> > I'm failing to see that small collision as fatal to the proposal. The
> > HMAT parsing had a significant bug for multiple kernel releases and no
> > one noticed. This quirk is minor in comparison.
>
> True, there is a lag in HMAT adoption - though for ACPI tables, not that long
> (only a couple of years :)
>
> >
> > >
> > > So my gut feeling here is we shouldn't reuse the generic initiator, but
> > > should invent something new.  Would look similar to GI, but with a different
> > > ID - to ensure legacy OS ignores it.
> >
> > A new id introduces more problems than it solves. Set aside the ACPICA
> > thrash, it does not allow a clean identity mapping of a point in a
> > system topology being both initiator and target. The SRAT does not
> > need more data structures to convey this information. At most I would
> > advocate for an OSC bit for the OS to opt into allowing this new usage
> > in the HMAT, but that still feels like overkill absent a clear
> > regression in legacy environments.
>
> OSC for this case doesn't work. You can't necessarily evaluate it
> early enough in the boot - in Linux the node setup is before AML parsing
> comes up.  HMAT is evaluated a lot later, but SRAT is too early.  + in theory
> another OS is allowed to evaluate HMAT before OSC is available.

The Linux node setup for online memory is before OSC parsing, but
there's nothing to "online" with a GI/GT entry. Also, if this was a
problem, it would already be impacting the OS today because this
proposal only changes HMAT, not SRAT. Lastly there *is* an OSC bit for
GI, so either that's vestigial and needs to be removed, or OSC is
relevant for this case.

>
> > The fact that hardly anyone is
> > using HMAT (as indicated by the bug I mentioned) gives me confidence
> > that perfection is more "enemy of the good" than required here.
>
> How about taking this another way
>
> 1) Assume that the costs of 'false' GI nodes on legacy system as a result
>    of this is minor - so just live with it.  (probably true, but as ever
>    need to confirm with other OS)
>
> 2) Try to remove the cost of pointless infrastructure on 'aware' kernels.
>    Add a flag to the GI entry to say it's a bridge and not expected to,
>    in of itself, represent an initiator or a target.
>    In Linux we then don't create the node intrastructure etc or assign
>    any devices to have the non existent NUMA node.
>
> The information is still there to combine with device info (CDAT) etc
> and build what we eventually want in the way of a representation of
> the topology that Linux can use.
>
> Now we just have the 'small' problem of figuring out how actually implement
> hotplugging of NUMA nodes.

I think it's tiny. Just pad the "possible" nodes past what SRAT enumerates.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [RFC] ACPI Code First ECR: Generic Target
  2021-02-10 17:02 ` Vikram Sethi
@ 2021-02-12  0:13   ` Natu, Mahesh
  0 siblings, 0 replies; 18+ messages in thread
From: Natu, Mahesh @ 2021-02-12  0:13 UTC (permalink / raw)
  To: Vikram Sethi, Williams, Dan J, linux-cxl, Linux ACPI
  Cc: Douglas, Chet R, Widawsky, Ben, Verma, Vishal L

What if there is a CXL switch with a Generic Initiator and CXL type 3 memory which could be hotplugged in. 
I forget if the GI to type 3 memory path is ONLY through the host bridge today in 2.0 or also allowed through the switch. In future we would want it allowed through the switch for sure, just like PCIe p2p. 
So how would the switch route latency/BW be discovered?


In the current CDAT spec, a coherent switch is modelled as having two or more bidirectional ports that carry traffic between host CPUs, coherent accelerators, and coherent memory devices. Switch CDAT returns latency and BW between various port pairs and can describe what you are looking for.



Thank you,
Mahesh Natu
Datacenter Platform Architect
Intel Corporation

-----Original Message-----
From: Vikram Sethi <vsethi@nvidia.com> 
Sent: Wednesday, February 10, 2021 9:02 AM
To: Williams, Dan J <dan.j.williams@intel.com>; linux-cxl@vger.kernel.org; Linux ACPI <linux-acpi@vger.kernel.org>
Cc: Natu, Mahesh <mahesh.natu@intel.com>; Douglas, Chet R <chet.r.douglas@intel.com>; Widawsky, Ben <ben.widawsky@intel.com>; Verma, Vishal L <vishal.l.verma@intel.com>
Subject: RE: [RFC] ACPI Code First ECR: Generic Target

Hi Dan,

> -----Original Message-----
> From: Dan Williams <dan.j.williams@intel.com>
> Sent: Tuesday, February 9, 2021 9:55 PM
> To: linux-cxl@vger.kernel.org; Linux ACPI <linux-acpi@vger.kernel.org>
> Cc: Natu, Mahesh <mahesh.natu@intel.com>; Chet R Douglas 
> <chet.r.douglas@intel.com>; Ben Widawsky <ben.widawsky@intel.com>; 
> Vishal L Verma <vishal.l.verma@intel.com>
> Subject: [RFC] ACPI Code First ECR: Generic Target
> 
> External email: Use caution opening links or attachments
> 
> 
> While the platform BIOS is able to describe the performance 
> characteristics of CXL memory that is present at boot, it is unable to 
> statically enumerate the performance of CXL memory hot inserted 
> post-boot. The OS can enumerate most of the characteristics from link 
> registers and CDAT, but the performance from the CPU to the host 
> bridge, for example, is not enumerated by PCIE or CXL. Introduce an 
> ACPI mechanism for this purpose. Critically this is achieved with a 
> small tweak to how the existing Generic Initiator proximity domain is 
> utilized in the HMAT.
> 
> ---
> 
> # Title: Introduce a Generic Target for CXL
> 
> # Status: Draft
> 
> # Document: ACPI Specification 6.4
> 
> # License
> SPDX-License Identifier: CC-BY-4.0
> 
> # Submitter:
> * Sponsor: Dan Williams, Intel
> * Creators/Contributors:
>     * Mahesh Natu, Intel
>     * Chet Douglas, Intel
>     * Deepak Shivakumar, Intel
> 
> # Summary of the Change
> Introduce a "Generic Target" concept to the SRAT to describe the root 
> performance parameters in the path to dynamically discovered (outside 
> of ACPI enumeration) CXL memory target endpoints.
> 
> # Benefits of the Change
> Consider the case of a system with a set of CXL host bridges 
> (ACPI0016), and no devices attached at initial system power-on. In 
> this scenario platform firmware is unable to perform the end-to-end 
> enumeration necessary to populate SRAT and HMAT for the endpoints that 
> may be hot-inserted behind those bridges post power-on. The 
> address-range is unknown so SRAT can not be pre-populated, the 
> performance is unknown (no CDAT nor interleave configuration) so HMAT can not be pre-populated.
> 
> However, what is known to platform firmware that generates the SRAT 
> and HMAT is the performance characteristics of the path between CPU 
> and Generic Initiators to the CXL host bridge target. With either 
> CPU-to-Generic-Target, or Generic-Initiator-to-Generic-Target entries 
> in the HMAT the OS CXL subsystem can enumerate the remaining details 
> (PCIE link status, device CDAT, interleave configuration) to calculate 
> the bandwidth and latency of a dynamically discovered CXL memory target.
> 
What if there is a CXL switch with a Generic Initiator and CXL type 3 memory which could be hotplugged in. 
I forget if the GI to type 3 memory path is ONLY through the host bridge today in 2.0 or also allowed through the switch. In future we would want it allowed through the switch for sure, just like PCIe p2p. 
So how would the switch route latency/BW be discovered?
Also, an example with numbers of what would be in HMAT may help understand the case where everything is only connected via host bridge also. 

> # Impact of the Change
> The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
> 5.2.16.6) already contains all the fields necessary to enumerate a 
> generic target proximity domain. All that is missing is the 
> interpretation of that proximity domain optionally as a target 
> identifier in the HMAT.
> 
> Given that the OS still needs to dynamically enumerate and instantiate 
> the memory ranges behind the host bridge. The assumption is that 
> operating systems that do not support native CXL enumeration will 
> ignore this data in the HMAT, while CXL native enumeration aware 
> environments will use this fragment of the performance path to 
> calculate the performance characteristics.
> 
> # References
> * Compute Express Link Specification v2.0, 
> <https://www.computeexpresslink.org/>
> 
> # Detailed Description of the Change
> 
> * Replace "Generic Initiator" with "Generic Initiator / Target" in all 
> locations except where an "initiator" or "target" is implied.
> Specifically 5.2.27.3 "Memory Proximity Domain Attributes Structure"
> need not replace occurrences of "generic initiator" in field: 
> "Proximity Domain for Attached Initiator". Additionally field: 
> "Proximity Domain for the Memory" must be renamed to "Proximity Domain 
> for the Memory / Generic Target" with a new description "Integer that 
> represents the memory / generic target proximity domain to which this memory belongs."
> 
> * Revise "5.2.16.6 Generic Initiator Affinity Structure" to make it
>   consistent with being referenced as either a target or initiator.
> 
>         * Description: (replace all text)
> 
>         > The Generic Initiator / Target Affinity Structure provides the
>         > association between a Generic Initiator and a Memory Proximity
>         > Domain, or another Generic Target Proximity Domain. The
>         > distinction as to whether this structure represents an
>         > Initiator, a Target, or both depends on how it is referenced
>         > in the HMAT. See Section 5.2.27.3 for details.
> 
>         > Support of Generic Initiator / Target Affinity Structures by
>         > OSPM is optional, and the platform may query whether the OS
>         > supports it via the _OSC method. See Section 6.2.11.2.
> 
>         * Architectural transactions: (append after current text)
> 
>         > If this proximity domain is referenced as a target then it
>         > supports all the transaction types inferred above.
> 
>         * Other updates are simple Initiator => Initiator / Target
>           replacements.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] ACPI Code First ECR: Generic Target
  2021-02-11 17:06       ` Dan Williams
@ 2021-02-12 12:24         ` Jonathan Cameron
  2021-02-12 23:51           ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Cameron @ 2021-02-12 12:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Linux ACPI, Natu, Mahesh, Chet R Douglas,
	Ben Widawsky, Vishal L Verma

On Thu, 11 Feb 2021 09:06:51 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> On Thu, Feb 11, 2021 at 1:44 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Wed, 10 Feb 2021 08:24:51 -0800
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >  
> > > On Wed, Feb 10, 2021 at 3:24 AM Jonathan Cameron
> > > <Jonathan.Cameron@huawei.com> wrote:  
> > > >
> > > > On Tue, 9 Feb 2021 19:55:05 -0800
> > > > Dan Williams <dan.j.williams@intel.com> wrote:
> > > >  
> > > > > While the platform BIOS is able to describe the performance
> > > > > characteristics of CXL memory that is present at boot, it is unable to
> > > > > statically enumerate the performance of CXL memory hot inserted
> > > > > post-boot. The OS can enumerate most of the characteristics from link
> > > > > registers and CDAT, but the performance from the CPU to the host
> > > > > bridge, for example, is not enumerated by PCIE or CXL. Introduce an
> > > > > ACPI mechanism for this purpose. Critically this is achieved with a
> > > > > small tweak to how the existing Generic Initiator proximity domain is
> > > > > utilized in the HMAT.  
> > > >
> > > > Hi Dan,
> > > >
> > > > Agree there is a hole here, but I think the proposed solution has some
> > > > issues for backwards compatibility.
> > > >
> > > > Just to clarify, I believe CDAT from root ports is sufficient for the
> > > > other direction (GI on CXL, memory in host).  I wondered initially if
> > > > this was a two way issue, but after a reread, I think that is fine
> > > > with the root port providing CDAT or potentially treating the root
> > > > port as a GI (though that runs into the same naming / representation issue
> > > > as below and I think would need some clarifying text in UEFI GI description)
> > > >
> > > > http://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> > > >
> > > > For the case you are dealing with here potentially we 'could' add something
> > > > to CDAT as alternative to changing SRAT, but it would be more complex
> > > > so your approach here makes more sense to me.  
> > >
> > > CDAT seems the wrong mechanism because it identifies target
> > > performance once you're at the front door of the device, not
> > > performance relative to a given initiator.  
> >
> > I'd argue you could make CDAT a more symmetric representation, but it would
> > end up replicating a lot of info already in HMAT.  Didn't say it was a good
> > idea!  
> 
> CDAT describes points, HMAT describes edges on the performance graph,
> it would be confusing if CDAT tried to supplant HMAT.

Entirely agreed.  Note I'm not disagreeing with approach here at all, simply
trying to point out where I think you'll get questions taking this to ACPI.

> 
> >
> > That's an odd situation that it sort of 'half' manages it in the BIOS.
> > We probably need some supplementary additional docs around this topic
> > as the OS would need to be aware of that possibility and explicitly check
> > for it before doing its normal build based on CDAT + what you are proposing
> > here.  Maybe code is enough but given this is cross OS stuff I'd argue
> > it probably isn't.
> >
> > I guess could revisit this draft Uefi white paper perhaps and add a bunch
> > of examples around this usecase https://github.com/hisilicon/acpi-numa-whitepaper  
> 
> Thanks for the reference, I'll take a look.
> 
> >  
> > >  
> > > >  
> > > > >
> > > > > # Impact of the Change
> > > > > The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
> > > > > 5.2.16.6) already contains all the fields necessary to enumerate a
> > > > > generic target proximity domain. All that is missing is the
> > > > > interpretation of that proximity domain optionally as a target
> > > > > identifier in the HMAT.
> > > > >
> > > > > Given that the OS still needs to dynamically enumerate and instantiate
> > > > > the memory ranges behind the host bridge. The assumption is that
> > > > > operating systems that do not support native CXL enumeration will ignore
> > > > > this data in the HMAT, while CXL native enumeration aware environments
> > > > > will use this fragment of the performance path to calculate the
> > > > > performance characteristics.  
> > > >
> > > > I don't think it is true that OS not supporting native CXL will ignore the
> > > > data.  
> > >
> > > True, I should have chosen more careful words like s/ignore/not
> > > regress upon seeing/  
> >
> > It's a sticky corner and I suspect likely to come up at in ACPI WG - what is
> > being proposed here isn't backwards compatible  
> 
> It seems our definitions of backwards compatible are divergent. Please
> correct me if I'm wrong, but I understand your position to be "any
> perceptible OS behavior change breaks backwards compatibility",
> whereas I'm advocating that backwards compatibility is relative
> regressing real world use cases. That said, I do need to go mock this
> up in QEMU and verify how much disturbance it causes.
> 
> > even if the impacts in Linux are small.  
> 
> I'd note the kernel would grind to a halt if the criteria for
> "backwards compatible" was zero perceptible behavior change.

Key thing here is the difference between Linux backwards compatibility and
ACPI backwards compatibility.  Linux impacts are easily understood because
we can go look.  ACPI change impacts in general are rather trickier to
define.

Note currently I'm fine with this change, though think perhaps it could
be improved - simply raising that others may not be.

> 
> > Mostly it's infrastructure bring up that won't get used
> > (fallback lists and similar for a node which will never be specified in
> > allocations) and some confusing userspace ABI (which is more than a little
> > confusing already).  
> 
> Fallback lists are established relative to online nodes. These generic
> targets are not onlined as memory.

This doesn't really matter beyond the point that there is stuff done for
a GI only node currently that doesn't make any sense for this new usecase.

Still, you got me digging...

I may have the  wrong term there, but GI nodes have a zone list
generated that reflects where an allocation will go if an allocation
happens that is requested on the the GI node.

devm_kzalloc() for example for a device in one of these nodes has to know
where to allocate memory from.  Similar true for other resources that don't
exist in the GI node.
 
Check this worked as expected led me down a rabbit hole.  I knew the end
to end test did what I expected but wasn't certainly why...

Anyhow, at least for my test case on x86 doing an 8k devm_kzalloc() it works
as follows. 

Some way down the nest of calls we get a call to
__alloc_pages_nodemask()
-> prepare_alloc_pages() // this finds the zonelist
   -> node_listlist(gi_node, ...)
      -> NODE_DATA(gi_node)->node_zonelists + gfp_zonelist(flags)
	 node_data[gi_node]->node_zonelists + gfp_zonelist(flags)

The node_data entry for the gi_node has it's own zonelist which is used in this
path.  The first entry of which is an appropriate zone on the nearest node which
actually has some memory.

The build is referenced by a comment in x86/mm/numa.c under init_memory_less_node()
"All zonelists will be built later in start_kernel() after per cpu areas are initialized"

build_zonelists() does the expected building based on SLIT distances.


> 
> > > > Linux will create a small amount of infrastructure to reflect them (more or
> > > > less the same as a memoryless node) and also they will appear in places
> > > > like access0 as a possible initiator of transactions.  It's small stuff,
> > > > but I'd rather the impact on legacy was zero.  
> > >
> > > I'm failing to see that small collision as fatal to the proposal. The
> > > HMAT parsing had a significant bug for multiple kernel releases and no
> > > one noticed. This quirk is minor in comparison.  
> >
> > True, there is a lag in HMAT adoption - though for ACPI tables, not that long
> > (only a couple of years :)
> >  
> > >  
> > > >
> > > > So my gut feeling here is we shouldn't reuse the generic initiator, but
> > > > should invent something new.  Would look similar to GI, but with a different
> > > > ID - to ensure legacy OS ignores it.  
> > >
> > > A new id introduces more problems than it solves. Set aside the ACPICA
> > > thrash, it does not allow a clean identity mapping of a point in a
> > > system topology being both initiator and target. The SRAT does not
> > > need more data structures to convey this information. At most I would
> > > advocate for an OSC bit for the OS to opt into allowing this new usage
> > > in the HMAT, but that still feels like overkill absent a clear
> > > regression in legacy environments.  
> >
> > OSC for this case doesn't work. You can't necessarily evaluate it
> > early enough in the boot - in Linux the node setup is before AML parsing
> > comes up.  HMAT is evaluated a lot later, but SRAT is too early.  + in theory
> > another OS is allowed to evaluate HMAT before OSC is available.  
> 
> The Linux node setup for online memory is before OSC parsing, but
> there's nothing to "online" with a GI/GT entry. Also, if this was a
> problem, it would already be impacting the OS today because this
> proposal only changes HMAT, not SRAT.

It is not just changing HMAT - you've just redefined the meaning of a GI
which is an entry in SRAT.  So there may be impacts in early boot..
None of them are fatal as far as I can see, but they are undesirable
(in the sense that they are pointless for this new use).
 
> Lastly there *is* an OSC bit for
> GI, so either that's vestigial and needs to be removed, or OSC is
> relevant for this case.

No. The GI OSC bit has a very specific purpose and is absolutely necessary.

It's not about presenting GI in SRAT.  We can always present GI in SRAT
irrespective of whether the OS has a clue what it is because an OS is guaranteed
to not take any notice of entries it doesn't understand. (anything else is
just a bug). Note that in this proposal the issue is the OS 'thinks' it understands
the node, but is wrong...

The GI OSC is to allow DSDT _PXM entries to be modified (which are AML obviously
so can be modified based on _OSC).  Thus if we don't support GI we can move
the apparent location of the initiator to the best node of a type we do support.
(it also worked around Linux crashing if you supplied a _PXM for a proximity
 domain it didn't now about).

Similar case doesn't apply here because what we'd need to modify if the OS
didn't understand this new usecase would be static tables, not AML.

> 
> >  
> > > The fact that hardly anyone is
> > > using HMAT (as indicated by the bug I mentioned) gives me confidence
> > > that perfection is more "enemy of the good" than required here.  
> >
> > How about taking this another way
> >
> > 1) Assume that the costs of 'false' GI nodes on legacy system as a result
> >    of this is minor - so just live with it.  (probably true, but as ever
> >    need to confirm with other OS)
> >
> > 2) Try to remove the cost of pointless infrastructure on 'aware' kernels.
> >    Add a flag to the GI entry to say it's a bridge and not expected to,
> >    in of itself, represent an initiator or a target.
> >    In Linux we then don't create the node intrastructure etc or assign
> >    any devices to have the non existent NUMA node.
> >
> > The information is still there to combine with device info (CDAT) etc
> > and build what we eventually want in the way of a representation of
> > the topology that Linux can use.
> >
> > Now we just have the 'small' problem of figuring out how actually implement
> > hotplugging of NUMA nodes.  
> 
> I think it's tiny. Just pad the "possible" nodes past what SRAT enumerates.

Maybe, not tried it yet.  Neither _HMA nor _SLI have ever been implemented
in Linux (as far as I know) which are the ACPI equivalents of the updates
we need to make on such a hotplug.  Might not be too bad, but needs some prototyping
fun once we get the emulation up and running to provide all the info needed
+ changes like you have here to fill in the gaps.

Jonathan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] ACPI Code First ECR: Generic Target
  2021-02-12 12:24         ` Jonathan Cameron
@ 2021-02-12 23:51           ` Dan Williams
  2021-02-16 11:06             ` Jonathan Cameron
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2021-02-12 23:51 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, Linux ACPI, Natu, Mahesh, Chet R Douglas,
	Ben Widawsky, Vishal L Verma

On Fri, Feb 12, 2021 at 4:26 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Thu, 11 Feb 2021 09:06:51 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
>
> > On Thu, Feb 11, 2021 at 1:44 AM Jonathan Cameron
> > <Jonathan.Cameron@huawei.com> wrote:
> > >
> > > On Wed, 10 Feb 2021 08:24:51 -0800
> > > Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > > On Wed, Feb 10, 2021 at 3:24 AM Jonathan Cameron
> > > > <Jonathan.Cameron@huawei.com> wrote:
> > > > >
> > > > > On Tue, 9 Feb 2021 19:55:05 -0800
> > > > > Dan Williams <dan.j.williams@intel.com> wrote:
> > > > >
> > > > > > While the platform BIOS is able to describe the performance
> > > > > > characteristics of CXL memory that is present at boot, it is unable to
> > > > > > statically enumerate the performance of CXL memory hot inserted
> > > > > > post-boot. The OS can enumerate most of the characteristics from link
> > > > > > registers and CDAT, but the performance from the CPU to the host
> > > > > > bridge, for example, is not enumerated by PCIE or CXL. Introduce an
> > > > > > ACPI mechanism for this purpose. Critically this is achieved with a
> > > > > > small tweak to how the existing Generic Initiator proximity domain is
> > > > > > utilized in the HMAT.
> > > > >
> > > > > Hi Dan,
> > > > >
> > > > > Agree there is a hole here, but I think the proposed solution has some
> > > > > issues for backwards compatibility.
> > > > >
> > > > > Just to clarify, I believe CDAT from root ports is sufficient for the
> > > > > other direction (GI on CXL, memory in host).  I wondered initially if
> > > > > this was a two way issue, but after a reread, I think that is fine
> > > > > with the root port providing CDAT or potentially treating the root
> > > > > port as a GI (though that runs into the same naming / representation issue
> > > > > as below and I think would need some clarifying text in UEFI GI description)
> > > > >
> > > > > http://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> > > > >
> > > > > For the case you are dealing with here potentially we 'could' add something
> > > > > to CDAT as alternative to changing SRAT, but it would be more complex
> > > > > so your approach here makes more sense to me.
> > > >
> > > > CDAT seems the wrong mechanism because it identifies target
> > > > performance once you're at the front door of the device, not
> > > > performance relative to a given initiator.
> > >
> > > I'd argue you could make CDAT a more symmetric representation, but it would
> > > end up replicating a lot of info already in HMAT.  Didn't say it was a good
> > > idea!
> >
> > CDAT describes points, HMAT describes edges on the performance graph,
> > it would be confusing if CDAT tried to supplant HMAT.
>
> Entirely agreed.  Note I'm not disagreeing with approach here at all, simply
> trying to point out where I think you'll get questions taking this to ACPI.

Understood.

>
> >
> > >
> > > That's an odd situation that it sort of 'half' manages it in the BIOS.
> > > We probably need some supplementary additional docs around this topic
> > > as the OS would need to be aware of that possibility and explicitly check
> > > for it before doing its normal build based on CDAT + what you are proposing
> > > here.  Maybe code is enough but given this is cross OS stuff I'd argue
> > > it probably isn't.
> > >
> > > I guess could revisit this draft Uefi white paper perhaps and add a bunch
> > > of examples around this usecase https://github.com/hisilicon/acpi-numa-whitepaper
> >
> > Thanks for the reference, I'll take a look.
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > # Impact of the Change
> > > > > > The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
> > > > > > 5.2.16.6) already contains all the fields necessary to enumerate a
> > > > > > generic target proximity domain. All that is missing is the
> > > > > > interpretation of that proximity domain optionally as a target
> > > > > > identifier in the HMAT.
> > > > > >
> > > > > > Given that the OS still needs to dynamically enumerate and instantiate
> > > > > > the memory ranges behind the host bridge. The assumption is that
> > > > > > operating systems that do not support native CXL enumeration will ignore
> > > > > > this data in the HMAT, while CXL native enumeration aware environments
> > > > > > will use this fragment of the performance path to calculate the
> > > > > > performance characteristics.
> > > > >
> > > > > I don't think it is true that OS not supporting native CXL will ignore the
> > > > > data.
> > > >
> > > > True, I should have chosen more careful words like s/ignore/not
> > > > regress upon seeing/
> > >
> > > It's a sticky corner and I suspect likely to come up at in ACPI WG - what is
> > > being proposed here isn't backwards compatible
> >
> > It seems our definitions of backwards compatible are divergent. Please
> > correct me if I'm wrong, but I understand your position to be "any
> > perceptible OS behavior change breaks backwards compatibility",
> > whereas I'm advocating that backwards compatibility is relative
> > regressing real world use cases. That said, I do need to go mock this
> > up in QEMU and verify how much disturbance it causes.
> >
> > > even if the impacts in Linux are small.
> >
> > I'd note the kernel would grind to a halt if the criteria for
> > "backwards compatible" was zero perceptible behavior change.
>
> Key thing here is the difference between Linux backwards compatibility and
> ACPI backwards compatibility.  Linux impacts are easily understood because
> we can go look.  ACPI change impacts in general are rather trickier to
> define.
>
> Note currently I'm fine with this change, though think perhaps it could
> be improved - simply raising that others may not be.

I appreciate that, and as the originator of GI I hold your feedback in
high regard.

Yes, the criteria that it needs to be a functional regression vs a
behavior change may be difficult argument to carry with others.

>
> >
> > > Mostly it's infrastructure bring up that won't get used
> > > (fallback lists and similar for a node which will never be specified in
> > > allocations) and some confusing userspace ABI (which is more than a little
> > > confusing already).
> >
> > Fallback lists are established relative to online nodes. These generic
> > targets are not onlined as memory.
>
> This doesn't really matter beyond the point that there is stuff done for
> a GI only node currently that doesn't make any sense for this new usecase.
>
> Still, you got me digging...
>
> I may have the  wrong term there, but GI nodes have a zone list
> generated that reflects where an allocation will go if an allocation
> happens that is requested on the the GI node.
>
> devm_kzalloc() for example for a device in one of these nodes has to know
> where to allocate memory from.  Similar true for other resources that don't
> exist in the GI node.
>
> Check this worked as expected led me down a rabbit hole.  I knew the end
> to end test did what I expected but wasn't certainly why...
>
> Anyhow, at least for my test case on x86 doing an 8k devm_kzalloc() it works
> as follows.
>
> Some way down the nest of calls we get a call to
> __alloc_pages_nodemask()
> -> prepare_alloc_pages() // this finds the zonelist
>    -> node_listlist(gi_node, ...)
>       -> NODE_DATA(gi_node)->node_zonelists + gfp_zonelist(flags)
>          node_data[gi_node]->node_zonelists + gfp_zonelist(flags)
>
> The node_data entry for the gi_node has it's own zonelist which is used in this
> path.  The first entry of which is an appropriate zone on the nearest node which
> actually has some memory.
>
> The build is referenced by a comment in x86/mm/numa.c under init_memory_less_node()
> "All zonelists will be built later in start_kernel() after per cpu areas are initialized"
>
> build_zonelists() does the expected building based on SLIT distances.

Ugh, I had been holding onto the mental model  that GI support was
limited to the HMAT sysfs representation. What is the use case for
init_gi_nodes()? I would expect GI information is only useful for
performance calculations and that can be done without touching the
node infrastructure.

At first glance that metadata manipulation looks like a mistake to me,
I wish I had paid more attention as that went in... regardless this
does indeed change my thoughts of Generic Target being a minor impact.

Why does GI need anything more than acpi_map_pxm_to_node() to have a
node number assigned?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] ACPI Code First ECR: Generic Target
  2021-02-12 23:51           ` Dan Williams
@ 2021-02-16 11:06             ` Jonathan Cameron
  2021-02-16 16:29               ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Cameron @ 2021-02-16 11:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Linux ACPI, Natu, Mahesh, Chet R Douglas,
	Ben Widawsky, Vishal L Verma

On Fri, 12 Feb 2021 15:51:42 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> On Fri, Feb 12, 2021 at 4:26 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Thu, 11 Feb 2021 09:06:51 -0800
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >  
> > > On Thu, Feb 11, 2021 at 1:44 AM Jonathan Cameron
> > > <Jonathan.Cameron@huawei.com> wrote:  
> > > >
> > > > On Wed, 10 Feb 2021 08:24:51 -0800
> > > > Dan Williams <dan.j.williams@intel.com> wrote:
> > > >  
> > > > > On Wed, Feb 10, 2021 at 3:24 AM Jonathan Cameron
> > > > > <Jonathan.Cameron@huawei.com> wrote:  
> > > > > >
> > > > > > On Tue, 9 Feb 2021 19:55:05 -0800
> > > > > > Dan Williams <dan.j.williams@intel.com> wrote:
> > > > > >  
> > > > > > > While the platform BIOS is able to describe the performance
> > > > > > > characteristics of CXL memory that is present at boot, it is unable to
> > > > > > > statically enumerate the performance of CXL memory hot inserted
> > > > > > > post-boot. The OS can enumerate most of the characteristics from link
> > > > > > > registers and CDAT, but the performance from the CPU to the host
> > > > > > > bridge, for example, is not enumerated by PCIE or CXL. Introduce an
> > > > > > > ACPI mechanism for this purpose. Critically this is achieved with a
> > > > > > > small tweak to how the existing Generic Initiator proximity domain is
> > > > > > > utilized in the HMAT.  
> > > > > >
> > > > > > Hi Dan,
> > > > > >
> > > > > > Agree there is a hole here, but I think the proposed solution has some
> > > > > > issues for backwards compatibility.
> > > > > >
> > > > > > Just to clarify, I believe CDAT from root ports is sufficient for the
> > > > > > other direction (GI on CXL, memory in host).  I wondered initially if
> > > > > > this was a two way issue, but after a reread, I think that is fine
> > > > > > with the root port providing CDAT or potentially treating the root
> > > > > > port as a GI (though that runs into the same naming / representation issue
> > > > > > as below and I think would need some clarifying text in UEFI GI description)
> > > > > >
> > > > > > http://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> > > > > >
> > > > > > For the case you are dealing with here potentially we 'could' add something
> > > > > > to CDAT as alternative to changing SRAT, but it would be more complex
> > > > > > so your approach here makes more sense to me.  
> > > > >
> > > > > CDAT seems the wrong mechanism because it identifies target
> > > > > performance once you're at the front door of the device, not
> > > > > performance relative to a given initiator.  
> > > >
> > > > I'd argue you could make CDAT a more symmetric representation, but it would
> > > > end up replicating a lot of info already in HMAT.  Didn't say it was a good
> > > > idea!  
> > >
> > > CDAT describes points, HMAT describes edges on the performance graph,
> > > it would be confusing if CDAT tried to supplant HMAT.  
> >
> > Entirely agreed.  Note I'm not disagreeing with approach here at all, simply
> > trying to point out where I think you'll get questions taking this to ACPI.  
> 
> Understood.
> 
> >  
> > >  
> > > >
> > > > That's an odd situation that it sort of 'half' manages it in the BIOS.
> > > > We probably need some supplementary additional docs around this topic
> > > > as the OS would need to be aware of that possibility and explicitly check
> > > > for it before doing its normal build based on CDAT + what you are proposing
> > > > here.  Maybe code is enough but given this is cross OS stuff I'd argue
> > > > it probably isn't.
> > > >
> > > > I guess could revisit this draft Uefi white paper perhaps and add a bunch
> > > > of examples around this usecase https://github.com/hisilicon/acpi-numa-whitepaper  
> > >
> > > Thanks for the reference, I'll take a look.
> > >  
> > > >  
> > > > >  
> > > > > >  
> > > > > > >
> > > > > > > # Impact of the Change
> > > > > > > The existing Generic Initiator Affinity Structure (ACPI 6.4 Section
> > > > > > > 5.2.16.6) already contains all the fields necessary to enumerate a
> > > > > > > generic target proximity domain. All that is missing is the
> > > > > > > interpretation of that proximity domain optionally as a target
> > > > > > > identifier in the HMAT.
> > > > > > >
> > > > > > > Given that the OS still needs to dynamically enumerate and instantiate
> > > > > > > the memory ranges behind the host bridge. The assumption is that
> > > > > > > operating systems that do not support native CXL enumeration will ignore
> > > > > > > this data in the HMAT, while CXL native enumeration aware environments
> > > > > > > will use this fragment of the performance path to calculate the
> > > > > > > performance characteristics.  
> > > > > >
> > > > > > I don't think it is true that OS not supporting native CXL will ignore the
> > > > > > data.  
> > > > >
> > > > > True, I should have chosen more careful words like s/ignore/not
> > > > > regress upon seeing/  
> > > >
> > > > It's a sticky corner and I suspect likely to come up at in ACPI WG - what is
> > > > being proposed here isn't backwards compatible  
> > >
> > > It seems our definitions of backwards compatible are divergent. Please
> > > correct me if I'm wrong, but I understand your position to be "any
> > > perceptible OS behavior change breaks backwards compatibility",
> > > whereas I'm advocating that backwards compatibility is relative
> > > regressing real world use cases. That said, I do need to go mock this
> > > up in QEMU and verify how much disturbance it causes.
> > >  
> > > > even if the impacts in Linux are small.  
> > >
> > > I'd note the kernel would grind to a halt if the criteria for
> > > "backwards compatible" was zero perceptible behavior change.  
> >
> > Key thing here is the difference between Linux backwards compatibility and
> > ACPI backwards compatibility.  Linux impacts are easily understood because
> > we can go look.  ACPI change impacts in general are rather trickier to
> > define.
> >
> > Note currently I'm fine with this change, though think perhaps it could
> > be improved - simply raising that others may not be.  
> 
> I appreciate that, and as the originator of GI I hold your feedback in
> high regard.
> 
> Yes, the criteria that it needs to be a functional regression vs a
> behavior change may be difficult argument to carry with others.
> 
> >  
> > >  
> > > > Mostly it's infrastructure bring up that won't get used
> > > > (fallback lists and similar for a node which will never be specified in
> > > > allocations) and some confusing userspace ABI (which is more than a little
> > > > confusing already).  
> > >
> > > Fallback lists are established relative to online nodes. These generic
> > > targets are not onlined as memory.  
> >
> > This doesn't really matter beyond the point that there is stuff done for
> > a GI only node currently that doesn't make any sense for this new usecase.
> >
> > Still, you got me digging...
> >
> > I may have the  wrong term there, but GI nodes have a zone list
> > generated that reflects where an allocation will go if an allocation
> > happens that is requested on the the GI node.
> >
> > devm_kzalloc() for example for a device in one of these nodes has to know
> > where to allocate memory from.  Similar true for other resources that don't
> > exist in the GI node.
> >
> > Check this worked as expected led me down a rabbit hole.  I knew the end
> > to end test did what I expected but wasn't certainly why...
> >
> > Anyhow, at least for my test case on x86 doing an 8k devm_kzalloc() it works
> > as follows.
> >
> > Some way down the nest of calls we get a call to
> > __alloc_pages_nodemask()  
> > -> prepare_alloc_pages() // this finds the zonelist
> >    -> node_listlist(gi_node, ...)
> >       -> NODE_DATA(gi_node)->node_zonelists + gfp_zonelist(flags)  
> >          node_data[gi_node]->node_zonelists + gfp_zonelist(flags)
> >
> > The node_data entry for the gi_node has it's own zonelist which is used in this
> > path.  The first entry of which is an appropriate zone on the nearest node which
> > actually has some memory.
> >
> > The build is referenced by a comment in x86/mm/numa.c under init_memory_less_node()
> > "All zonelists will be built later in start_kernel() after per cpu areas are initialized"
> >
> > build_zonelists() does the expected building based on SLIT distances.  
> 
> Ugh, I had been holding onto the mental model  that GI support was
> limited to the HMAT sysfs representation. What is the use case for
> init_gi_nodes()? I would expect GI information is only useful for
> performance calculations and that can be done without touching the
> node infrastructure.
> 
> At first glance that metadata manipulation looks like a mistake to me,
> I wish I had paid more attention as that went in... regardless this
> does indeed change my thoughts of Generic Target being a minor impact.

It's still fairly minor I think, but definitely a harder case to sell to
the ACPI WG.  Unused infrastructure shouldn't be a major issue.  No
worse than creating a GI node with say a NIC in it but never loading the
NIC driver.

> 
> Why does GI need anything more than acpi_map_pxm_to_node() to have a
> node number assigned?

It might have been possible (with limitations) to do it by making multiple
proximity domains map to a single numa node, along with some additional
functionality to allow it to retrieve the real node for aware drivers,
but seeing as we already had the memoryless node infrastructure in place,
it fitted more naturally into that scheme.  GI introduction to the
ACPI spec, and indeed the kernel was originally driven by the needs of
CCIX (before CXL was public) with CCIX's symmetric view of initiators
(CPU or other) + a few other existing situations where we'd been
papering over the topology for years and paying a cost in custom
load balancing in drivers etc. That more symmetric view meant that the
natural approach was to treat these as memoryless nodes.

The full handling of nodes is needed to deal with situations like
the following contrived setup. With a few interconnect
links I haven't bothered drawing, there are existing systems where
a portion of the topology looks like this:


    RAM                              RAM             RAM
     |                                |               |
 --------        ---------        --------        --------
| a      |      | b       |      | c      |      | d      |
|   CPUs |------|  PCI RC |------| CPUs   |------|  CPUs  |
|        |      |         |      |        |      |        |
 --------        ---------        --------        --------
                     |
                  PCI EP

We need the GI representation to allow an "aware" driver to understand
that the PCI EP is equal distances from CPUs and RAM on (a) and (c),
(and that using allocations from (d) is a a bad idea).  This would be
the same as a driver running on an PCI RC attached to a memoryless
CPU node (you would hope no one would build one of those, but I've seen
them occasionally).  Such an aware driver carefully places both memory
and processing threads / interrupts etc to balance the load.

In pre GI days, can just drop (b) into (a or c) and not worry about it, but
that comes with a large performance cost (20% plus on network throughput
on some of our more crazy systems, due to it appearing that balancing
memory load across (a) and (c) doesn't make sense).  Also, if we happened
to drop it into (c) then once we run out of space on (c) we'll start
using (d) which is a bad idea.

With GI nodes, you need an unaware PCI driver to work well and they
will use allocations linked to the particular NUMA node that are in.
The kernel needs to know a reasonable place to shunt them to and in
more complex topologies the zone list may not correspond to that of
any other node.  In a CCIX world for example, a GI can sit between
a pair of Home Agents with memory, and the host on the other side of
them.  We had a lot of fun working through these cases back when drawing
up the ACPI changes to support them. :)

There are corner cases where you have to tweak a driver to
make it more aware though or you can get unhelpful things like all memory
use on (a) and all interrupts on (c) - if things filled up in a particularly
pathological case.  However that can happen anyway whenever memory runs out
on the local node.

Jonathan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] ACPI Code First ECR: Generic Target
  2021-02-16 11:06             ` Jonathan Cameron
@ 2021-02-16 16:29               ` Dan Williams
  2021-02-16 18:06                 ` Jonathan Cameron
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2021-02-16 16:29 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, Linux ACPI, Natu, Mahesh, Chet R Douglas,
	Ben Widawsky, Vishal L Verma

On Tue, Feb 16, 2021 at 3:08 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
[..]
> > Why does GI need anything more than acpi_map_pxm_to_node() to have a
> > node number assigned?
>
> It might have been possible (with limitations) to do it by making multiple
> proximity domains map to a single numa node, along with some additional
> functionality to allow it to retrieve the real node for aware drivers,
> but seeing as we already had the memoryless node infrastructure in place,
> it fitted more naturally into that scheme.  GI introduction to the
> ACPI spec, and indeed the kernel was originally driven by the needs of
> CCIX (before CXL was public) with CCIX's symmetric view of initiators
> (CPU or other) + a few other existing situations where we'd been
> papering over the topology for years and paying a cost in custom
> load balancing in drivers etc. That more symmetric view meant that the
> natural approach was to treat these as memoryless nodes.
>
> The full handling of nodes is needed to deal with situations like
> the following contrived setup. With a few interconnect
> links I haven't bothered drawing, there are existing systems where
> a portion of the topology looks like this:
>
>
>     RAM                              RAM             RAM
>      |                                |               |
>  --------        ---------        --------        --------
> | a      |      | b       |      | c      |      | d      |
> |   CPUs |------|  PCI RC |------| CPUs   |------|  CPUs  |
> |        |      |         |      |        |      |        |
>  --------        ---------        --------        --------
>                      |
>                   PCI EP
>
> We need the GI representation to allow an "aware" driver to understand
> that the PCI EP is equal distances from CPUs and RAM on (a) and (c),
> (and that using allocations from (d) is a a bad idea).  This would be
> the same as a driver running on an PCI RC attached to a memoryless
> CPU node (you would hope no one would build one of those, but I've seen
> them occasionally).  Such an aware driver carefully places both memory
> and processing threads / interrupts etc to balance the load.

That's an explanation for why GI exists, not an explanation for why a
GI needs to be anything more than translated to a Linux numa node
number and an api to lookup distance.

>
> In pre GI days, can just drop (b) into (a or c) and not worry about it, but
> that comes with a large performance cost (20% plus on network throughput
> on some of our more crazy systems, due to it appearing that balancing
> memory load across (a) and (c) doesn't make sense).  Also, if we happened
> to drop it into (c) then once we run out of space on (c) we'll start
> using (d) which is a bad idea.
>
> With GI nodes, you need an unaware PCI driver to work well and they
> will use allocations linked to the particular NUMA node that are in.
> The kernel needs to know a reasonable place to shunt them to and in
> more complex topologies the zone list may not correspond to that of
> any other node.

The kernel "needs", no it doesn't. Look at the "target_node" handling
for PMEM. Those nodes are offline, the distance can be determined, and
only when they become memory does the node become online.

The only point I can see GI needing anything more than the equivalent
of "target_node" is when the scheduler can submit jobs to GI
initiators like a CPU. Otherwise, GI is just a seed for a node number
plus numa distance.

>   In a CCIX world for example, a GI can sit between
> a pair of Home Agents with memory, and the host on the other side of
> them.  We had a lot of fun working through these cases back when drawing
> up the ACPI changes to support them. :)
>

Yes, I can imagine several interesting ACPI cases, but still
struggling to justify the GI zone list metadata.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] ACPI Code First ECR: Generic Target
  2021-02-16 16:29               ` Dan Williams
@ 2021-02-16 18:06                 ` Jonathan Cameron
  2021-02-16 18:22                   ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Cameron @ 2021-02-16 18:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Linux ACPI, Natu, Mahesh, Chet R Douglas,
	Ben Widawsky, Vishal L Verma

On Tue, 16 Feb 2021 08:29:01 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> On Tue, Feb 16, 2021 at 3:08 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> [..]
> > > Why does GI need anything more than acpi_map_pxm_to_node() to have a
> > > node number assigned?  
> >
> > It might have been possible (with limitations) to do it by making multiple
> > proximity domains map to a single numa node, along with some additional
> > functionality to allow it to retrieve the real node for aware drivers,
> > but seeing as we already had the memoryless node infrastructure in place,
> > it fitted more naturally into that scheme.  GI introduction to the
> > ACPI spec, and indeed the kernel was originally driven by the needs of
> > CCIX (before CXL was public) with CCIX's symmetric view of initiators
> > (CPU or other) + a few other existing situations where we'd been
> > papering over the topology for years and paying a cost in custom
> > load balancing in drivers etc. That more symmetric view meant that the
> > natural approach was to treat these as memoryless nodes.
> >
> > The full handling of nodes is needed to deal with situations like
> > the following contrived setup. With a few interconnect
> > links I haven't bothered drawing, there are existing systems where
> > a portion of the topology looks like this:
> >
> >
> >     RAM                              RAM             RAM
> >      |                                |               |
> >  --------        ---------        --------        --------
> > | a      |      | b       |      | c      |      | d      |
> > |   CPUs |------|  PCI RC |------| CPUs   |------|  CPUs  |
> > |        |      |         |      |        |      |        |
> >  --------        ---------        --------        --------
> >                      |
> >                   PCI EP
> >
> > We need the GI representation to allow an "aware" driver to understand
> > that the PCI EP is equal distances from CPUs and RAM on (a) and (c),
> > (and that using allocations from (d) is a a bad idea).  This would be
> > the same as a driver running on an PCI RC attached to a memoryless
> > CPU node (you would hope no one would build one of those, but I've seen
> > them occasionally).  Such an aware driver carefully places both memory
> > and processing threads / interrupts etc to balance the load.  
> 
> That's an explanation for why GI exists, not an explanation for why a
> GI needs to be anything more than translated to a Linux numa node
> number and an api to lookup distance.

Why should a random driver need to know it needs to do something special?

Random drivers don't lookup distance, they just allocate memory based on their
current numa_node. devm_kzalloc() does this under the hood (an optimization
that rather took me by surprise at the time).
Sure we could add a bunch of new infrastructure to solve that problem
but why not use what is already there?

> 
> >
> > In pre GI days, can just drop (b) into (a or c) and not worry about it, but
> > that comes with a large performance cost (20% plus on network throughput
> > on some of our more crazy systems, due to it appearing that balancing
> > memory load across (a) and (c) doesn't make sense).  Also, if we happened
> > to drop it into (c) then once we run out of space on (c) we'll start
> > using (d) which is a bad idea.
> >
> > With GI nodes, you need an unaware PCI driver to work well and they
> > will use allocations linked to the particular NUMA node that are in.
> > The kernel needs to know a reasonable place to shunt them to and in
> > more complex topologies the zone list may not correspond to that of
> > any other node.  
> 
> The kernel "needs", no it doesn't. Look at the "target_node" handling
> for PMEM. Those nodes are offline, the distance can be determined, and
> only when they become memory does the node become online.

Indeed, custom code for specific cases can work just fine (we've carried
plenty of it in the past to get best performance from systems), but for GIs
the intent was they would just work.  We don't want to have to go and change
stuff in PCI drivers every time we plug a new card into such a system.

> 
> The only point I can see GI needing anything more than the equivalent
> of "target_node" is when the scheduler can submit jobs to GI
> initiators like a CPU. Otherwise, GI is just a seed for a node number
> plus numa distance.

That would be true if Linux didn't already make heavy use of numa_node
for driver allocations.  We could carry a parallel value of 'real_numa_node'
or something like that, but you can't safely use numa_node without the
node being online and zone lists present.
Another way of looking at it is that zone list is a cache solving the
question of where to allocate memory, which you could also solve using
the node number and distances (at the cost of custom handling).

It is of course advantageous to do cleverer things for particular drivers
but the vast majority need to just work.

> 
> >   In a CCIX world for example, a GI can sit between
> > a pair of Home Agents with memory, and the host on the other side of
> > them.  We had a lot of fun working through these cases back when drawing
> > up the ACPI changes to support them. :)
> >  
> 
> Yes, I can imagine several interesting ACPI cases, but still
> struggling to justify the GI zone list metadata.

It works. It solves the problem. It's very little extra code and it
exercises zero paths not already exercised by memoryless nodes.
We certainly wouldn't have invented something as complex as zone lists
if we couldn't leverage what was there of course.

So I have the opposite view point. I can't see why the minor overhead
of zone list metadata for GIs isn't a sensible choice vs cost of
maintaining something entirely different.  This only changes with the
intent to use them to represent something different.

Jonathan




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] ACPI Code First ECR: Generic Target
  2021-02-16 18:06                 ` Jonathan Cameron
@ 2021-02-16 18:22                   ` Dan Williams
  2021-02-16 18:58                     ` Jonathan Cameron
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2021-02-16 18:22 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, Linux ACPI, Natu, Mahesh, Chet R Douglas,
	Ben Widawsky, Vishal L Verma

On Tue, Feb 16, 2021 at 10:08 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Tue, 16 Feb 2021 08:29:01 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
>
> > On Tue, Feb 16, 2021 at 3:08 AM Jonathan Cameron
> > <Jonathan.Cameron@huawei.com> wrote:
> > [..]
> > > > Why does GI need anything more than acpi_map_pxm_to_node() to have a
> > > > node number assigned?
> > >
> > > It might have been possible (with limitations) to do it by making multiple
> > > proximity domains map to a single numa node, along with some additional
> > > functionality to allow it to retrieve the real node for aware drivers,
> > > but seeing as we already had the memoryless node infrastructure in place,
> > > it fitted more naturally into that scheme.  GI introduction to the
> > > ACPI spec, and indeed the kernel was originally driven by the needs of
> > > CCIX (before CXL was public) with CCIX's symmetric view of initiators
> > > (CPU or other) + a few other existing situations where we'd been
> > > papering over the topology for years and paying a cost in custom
> > > load balancing in drivers etc. That more symmetric view meant that the
> > > natural approach was to treat these as memoryless nodes.
> > >
> > > The full handling of nodes is needed to deal with situations like
> > > the following contrived setup. With a few interconnect
> > > links I haven't bothered drawing, there are existing systems where
> > > a portion of the topology looks like this:
> > >
> > >
> > >     RAM                              RAM             RAM
> > >      |                                |               |
> > >  --------        ---------        --------        --------
> > > | a      |      | b       |      | c      |      | d      |
> > > |   CPUs |------|  PCI RC |------| CPUs   |------|  CPUs  |
> > > |        |      |         |      |        |      |        |
> > >  --------        ---------        --------        --------
> > >                      |
> > >                   PCI EP
> > >
> > > We need the GI representation to allow an "aware" driver to understand
> > > that the PCI EP is equal distances from CPUs and RAM on (a) and (c),
> > > (and that using allocations from (d) is a a bad idea).  This would be
> > > the same as a driver running on an PCI RC attached to a memoryless
> > > CPU node (you would hope no one would build one of those, but I've seen
> > > them occasionally).  Such an aware driver carefully places both memory
> > > and processing threads / interrupts etc to balance the load.
> >
> > That's an explanation for why GI exists, not an explanation for why a
> > GI needs to be anything more than translated to a Linux numa node
> > number and an api to lookup distance.
>
> Why should a random driver need to know it needs to do something special?
>
> Random drivers don't lookup distance, they just allocate memory based on their
> current numa_node. devm_kzalloc() does this under the hood (an optimization
> that rather took me by surprise at the time).
> Sure we could add a bunch of new infrastructure to solve that problem
> but why not use what is already there?
>
> >
> > >
> > > In pre GI days, can just drop (b) into (a or c) and not worry about it, but
> > > that comes with a large performance cost (20% plus on network throughput
> > > on some of our more crazy systems, due to it appearing that balancing
> > > memory load across (a) and (c) doesn't make sense).  Also, if we happened
> > > to drop it into (c) then once we run out of space on (c) we'll start
> > > using (d) which is a bad idea.
> > >
> > > With GI nodes, you need an unaware PCI driver to work well and they
> > > will use allocations linked to the particular NUMA node that are in.
> > > The kernel needs to know a reasonable place to shunt them to and in
> > > more complex topologies the zone list may not correspond to that of
> > > any other node.
> >
> > The kernel "needs", no it doesn't. Look at the "target_node" handling
> > for PMEM. Those nodes are offline, the distance can be determined, and
> > only when they become memory does the node become online.
>
> Indeed, custom code for specific cases can work just fine (we've carried
> plenty of it in the past to get best performance from systems), but for GIs
> the intent was they would just work.  We don't want to have to go and change
> stuff in PCI drivers every time we plug a new card into such a system.
>
> >
> > The only point I can see GI needing anything more than the equivalent
> > of "target_node" is when the scheduler can submit jobs to GI
> > initiators like a CPU. Otherwise, GI is just a seed for a node number
> > plus numa distance.
>
> That would be true if Linux didn't already make heavy use of numa_node
> for driver allocations.  We could carry a parallel value of 'real_numa_node'
> or something like that, but you can't safely use numa_node without the
> node being online and zone lists present.
> Another way of looking at it is that zone list is a cache solving the
> question of where to allocate memory, which you could also solve using
> the node number and distances (at the cost of custom handling).
>
> It is of course advantageous to do cleverer things for particular drivers
> but the vast majority need to just work.
>
> >
> > >   In a CCIX world for example, a GI can sit between
> > > a pair of Home Agents with memory, and the host on the other side of
> > > them.  We had a lot of fun working through these cases back when drawing
> > > up the ACPI changes to support them. :)
> > >
> >
> > Yes, I can imagine several interesting ACPI cases, but still
> > struggling to justify the GI zone list metadata.
>
> It works. It solves the problem. It's very little extra code and it
> exercises zero paths not already exercised by memoryless nodes.
> We certainly wouldn't have invented something as complex as zone lists
> if we couldn't leverage what was there of course.
>
> So I have the opposite view point. I can't see why the minor overhead
> of zone list metadata for GIs isn't a sensible choice vs cost of
> maintaining something entirely different.  This only changes with the
> intent to use them to represent something different.

What I am missing is what zone-list metadata offers beyond just
assigning the device-numa-node to the closest online memory node, and
let the HMAT-sysfs representation enumerate the next level? For
example, the persistent memory enabling assigns the closest online
memory node for the pmem device. That achieves the traditional
behavior of the device-driver allocating from "local" memory by
default. However the HMAT-sysfs representation indicates the numa node
that pmem represents itself were it to be online. So the question is
why does GI need more than that? To me a GI is "offline" in terms
Linux node representations because numactl can't target it, "closest
online" is good enough for a GI device driver, but if userspace needs
the next level of detail of the performance properties that's what
HMEM sysfs is providing.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] ACPI Code First ECR: Generic Target
  2021-02-16 18:22                   ` Dan Williams
@ 2021-02-16 18:58                     ` Jonathan Cameron
  2021-02-16 19:41                       ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Cameron @ 2021-02-16 18:58 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Linux ACPI, Natu, Mahesh, Chet R Douglas,
	Ben Widawsky, Vishal L Verma

On Tue, 16 Feb 2021 10:22:28 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> On Tue, Feb 16, 2021 at 10:08 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Tue, 16 Feb 2021 08:29:01 -0800
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >  
> > > On Tue, Feb 16, 2021 at 3:08 AM Jonathan Cameron
> > > <Jonathan.Cameron@huawei.com> wrote:
> > > [..]  
> > > > > Why does GI need anything more than acpi_map_pxm_to_node() to have a
> > > > > node number assigned?  
> > > >
> > > > It might have been possible (with limitations) to do it by making multiple
> > > > proximity domains map to a single numa node, along with some additional
> > > > functionality to allow it to retrieve the real node for aware drivers,
> > > > but seeing as we already had the memoryless node infrastructure in place,
> > > > it fitted more naturally into that scheme.  GI introduction to the
> > > > ACPI spec, and indeed the kernel was originally driven by the needs of
> > > > CCIX (before CXL was public) with CCIX's symmetric view of initiators
> > > > (CPU or other) + a few other existing situations where we'd been
> > > > papering over the topology for years and paying a cost in custom
> > > > load balancing in drivers etc. That more symmetric view meant that the
> > > > natural approach was to treat these as memoryless nodes.
> > > >
> > > > The full handling of nodes is needed to deal with situations like
> > > > the following contrived setup. With a few interconnect
> > > > links I haven't bothered drawing, there are existing systems where
> > > > a portion of the topology looks like this:
> > > >
> > > >
> > > >     RAM                              RAM             RAM
> > > >      |                                |               |
> > > >  --------        ---------        --------        --------
> > > > | a      |      | b       |      | c      |      | d      |
> > > > |   CPUs |------|  PCI RC |------| CPUs   |------|  CPUs  |
> > > > |        |      |         |      |        |      |        |
> > > >  --------        ---------        --------        --------
> > > >                      |
> > > >                   PCI EP
> > > >
> > > > We need the GI representation to allow an "aware" driver to understand
> > > > that the PCI EP is equal distances from CPUs and RAM on (a) and (c),
> > > > (and that using allocations from (d) is a a bad idea).  This would be
> > > > the same as a driver running on an PCI RC attached to a memoryless
> > > > CPU node (you would hope no one would build one of those, but I've seen
> > > > them occasionally).  Such an aware driver carefully places both memory
> > > > and processing threads / interrupts etc to balance the load.  
> > >
> > > That's an explanation for why GI exists, not an explanation for why a
> > > GI needs to be anything more than translated to a Linux numa node
> > > number and an api to lookup distance.  
> >
> > Why should a random driver need to know it needs to do something special?
> >
> > Random drivers don't lookup distance, they just allocate memory based on their
> > current numa_node. devm_kzalloc() does this under the hood (an optimization
> > that rather took me by surprise at the time).
> > Sure we could add a bunch of new infrastructure to solve that problem
> > but why not use what is already there?
> >  
> > >  
> > > >
> > > > In pre GI days, can just drop (b) into (a or c) and not worry about it, but
> > > > that comes with a large performance cost (20% plus on network throughput
> > > > on some of our more crazy systems, due to it appearing that balancing
> > > > memory load across (a) and (c) doesn't make sense).  Also, if we happened
> > > > to drop it into (c) then once we run out of space on (c) we'll start
> > > > using (d) which is a bad idea.
> > > >
> > > > With GI nodes, you need an unaware PCI driver to work well and they
> > > > will use allocations linked to the particular NUMA node that are in.
> > > > The kernel needs to know a reasonable place to shunt them to and in
> > > > more complex topologies the zone list may not correspond to that of
> > > > any other node.  
> > >
> > > The kernel "needs", no it doesn't. Look at the "target_node" handling
> > > for PMEM. Those nodes are offline, the distance can be determined, and
> > > only when they become memory does the node become online.  
> >
> > Indeed, custom code for specific cases can work just fine (we've carried
> > plenty of it in the past to get best performance from systems), but for GIs
> > the intent was they would just work.  We don't want to have to go and change
> > stuff in PCI drivers every time we plug a new card into such a system.
> >  
> > >
> > > The only point I can see GI needing anything more than the equivalent
> > > of "target_node" is when the scheduler can submit jobs to GI
> > > initiators like a CPU. Otherwise, GI is just a seed for a node number
> > > plus numa distance.  
> >
> > That would be true if Linux didn't already make heavy use of numa_node
> > for driver allocations.  We could carry a parallel value of 'real_numa_node'
> > or something like that, but you can't safely use numa_node without the
> > node being online and zone lists present.
> > Another way of looking at it is that zone list is a cache solving the
> > question of where to allocate memory, which you could also solve using
> > the node number and distances (at the cost of custom handling).
> >
> > It is of course advantageous to do cleverer things for particular drivers
> > but the vast majority need to just work.
> >  
> > >  
> > > >   In a CCIX world for example, a GI can sit between
> > > > a pair of Home Agents with memory, and the host on the other side of
> > > > them.  We had a lot of fun working through these cases back when drawing
> > > > up the ACPI changes to support them. :)
> > > >  
> > >
> > > Yes, I can imagine several interesting ACPI cases, but still
> > > struggling to justify the GI zone list metadata.  
> >
> > It works. It solves the problem. It's very little extra code and it
> > exercises zero paths not already exercised by memoryless nodes.
> > We certainly wouldn't have invented something as complex as zone lists
> > if we couldn't leverage what was there of course.
> >
> > So I have the opposite view point. I can't see why the minor overhead
> > of zone list metadata for GIs isn't a sensible choice vs cost of
> > maintaining something entirely different.  This only changes with the
> > intent to use them to represent something different.  
> 
> What I am missing is what zone-list metadata offers beyond just
> assigning the device-numa-node to the closest online memory node, and
> let the HMAT-sysfs representation enumerate the next level?

Zone list should not necessarily be the same as it is for the nearest node
(as per the example above) so if you can't get memory in nearest node
your second choice may well not be the nearest node's nearest node.
How bad that is depends on the complexity of the topology.  In simple
cases it can double the latency over the better choice.

> For
> example, the persistent memory enabling assigns the closest online
> memory node for the pmem device. That achieves the traditional
> behavior of the device-driver allocating from "local" memory by
> default. However the HMAT-sysfs representation indicates the numa node
> that pmem represents itself were it to be online. So the question is
> why does GI need more than that? To me a GI is "offline" in terms
> Linux node representations because numactl can't target it,

That's fair. It does exist in an intermediate world. Whether the
internal representation of online vs offline should have anything much
to do with numactl rather than whether than numactl being based on
whether a node has online memory or CPUs isn't clear to me.
It already has to distinguish whether a node has CPUs and / or memory
so this isn't a huge conceptual extension.

> "closest
> online" is good enough for a GI device driver, 

So that's the point. Is it 'good enough'?  Maybe - in some cases.

> but if userspace needs
> the next level of detail of the performance properties that's what
> HMEM sysfs is providing.

sysfs is fine if you are doing userspace allocations or placement
decisions. For GIs it can be relevant if you are using userspace drivers 
(or partially userspace drivers). 

In the GI case, from my point of view the sysfs stuff was a nice addition
but not actually that useful.  The info in HMAT is useful, but too little
of it was exposed.  What I don't yet have (due to lack of time), is a good
set of examples that show more info is needed.  Maybe I'll get to that
one day!

Jonathan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] ACPI Code First ECR: Generic Target
  2021-02-16 18:58                     ` Jonathan Cameron
@ 2021-02-16 19:41                       ` Dan Williams
  2021-02-17  9:53                         ` Jonathan Cameron
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2021-02-16 19:41 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, Linux ACPI, Natu, Mahesh, Chet R Douglas,
	Ben Widawsky, Vishal L Verma

On Tue, Feb 16, 2021 at 11:00 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
[..]
> > For
> > example, the persistent memory enabling assigns the closest online
> > memory node for the pmem device. That achieves the traditional
> > behavior of the device-driver allocating from "local" memory by
> > default. However the HMAT-sysfs representation indicates the numa node
> > that pmem represents itself were it to be online. So the question is
> > why does GI need more than that? To me a GI is "offline" in terms
> > Linux node representations because numactl can't target it,
>
> That's fair. It does exist in an intermediate world. Whether the
> internal representation of online vs offline should have anything much
> to do with numactl rather than whether than numactl being based on
> whether a node has online memory or CPUs isn't clear to me.
> It already has to distinguish whether a node has CPUs and / or memory
> so this isn't a huge conceptual extension.
>
> > "closest
> > online" is good enough for a GI device driver,
>
> So that's the point. Is it 'good enough'?  Maybe - in some cases.
>
> > but if userspace needs
> > the next level of detail of the performance properties that's what
> > HMEM sysfs is providing.
>
> sysfs is fine if you are doing userspace allocations or placement
> decisions. For GIs it can be relevant if you are using userspace drivers
> (or partially userspace drivers).

That's unfortunate, please tell me there's another use for this
infrastructure than userspace drivers? The kernel should not be
carrying core-mm debt purely on behalf of userspace drivers.

> In the GI case, from my point of view the sysfs stuff was a nice addition
> but not actually that useful.  The info in HMAT is useful, but too little
> of it was exposed.  What I don't yet have (due to lack of time), is a good
> set of examples that show more info is needed.  Maybe I'll get to that
> one day!

See the "migration in lieu of discard" [1] series as an attempt to
make HMAT-like info more useful. The node-demotion infrastructure puts
HMAT data to more use in the sense that the next phase of memory
tiering enabling is to have the kernel automatically manage it rather
than have a few enlightened apps consume the HMEM sysfs directly.

[1]: http://lore.kernel.org/r/20210126003411.2AC51464@viggo.jf.intel.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] ACPI Code First ECR: Generic Target
  2021-02-16 19:41                       ` Dan Williams
@ 2021-02-17  9:53                         ` Jonathan Cameron
  0 siblings, 0 replies; 18+ messages in thread
From: Jonathan Cameron @ 2021-02-17  9:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-cxl, Linux ACPI, Natu, Mahesh, Chet R Douglas,
	Ben Widawsky, Vishal L Verma

On Tue, 16 Feb 2021 11:41:40 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> On Tue, Feb 16, 2021 at 11:00 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> [..]
> > > For
> > > example, the persistent memory enabling assigns the closest online
> > > memory node for the pmem device. That achieves the traditional
> > > behavior of the device-driver allocating from "local" memory by
> > > default. However the HMAT-sysfs representation indicates the numa node
> > > that pmem represents itself were it to be online. So the question is
> > > why does GI need more than that? To me a GI is "offline" in terms
> > > Linux node representations because numactl can't target it,  
> >
> > That's fair. It does exist in an intermediate world. Whether the
> > internal representation of online vs offline should have anything much
> > to do with numactl rather than whether than numactl being based on
> > whether a node has online memory or CPUs isn't clear to me.
> > It already has to distinguish whether a node has CPUs and / or memory
> > so this isn't a huge conceptual extension.
> >  
> > > "closest
> > > online" is good enough for a GI device driver,  
> >
> > So that's the point. Is it 'good enough'?  Maybe - in some cases.
> >  
> > > but if userspace needs
> > > the next level of detail of the performance properties that's what
> > > HMEM sysfs is providing.  
> >
> > sysfs is fine if you are doing userspace allocations or placement
> > decisions. For GIs it can be relevant if you are using userspace drivers
> > (or partially userspace drivers).  
> 
> That's unfortunate, please tell me there's another use for this
> infrastructure than userspace drivers? The kernel should not be
> carrying core-mm debt purely on behalf of userspace drivers.

Sorry I wasn't clear - that's the exact opposite of what I was trying to say
and perhaps I'd managed to confuse the HMAT sysfs interface with HMEM
(though google suggests they are different names for the same thing).

Firstly I'll note that the inclusion of GIs in that sysfs representation
was a compromise in the first place (I didn't initially do so, because of
the lack of proven use cases for this information in usespace
- we compromised on the access0 vs access1 distinction.)

The HMAT presentation via sysfs interface, for GIs, is useful
for userspace drivers. It can also be used for steering stuff in kernel
drivers where there are sufficiently rich interfaces (some networking
stuff + GPU drivers as I understand it). In those cases, userspace
can directly influence kernel driver placement decisions.
Userspace applications may also be able to place themselves locally to
devices they know they make heavy use of, but I'm not yet aware of any
application actually doing so using information from this interface
(as opposed to hand tuned setups).

Of course in kernel use cases exist for HMAT, but those are not
AFAIK common yet - at least partly because few existing
systems ship with HMAT.  It will be a while yet before we can rely on HMAT
being present in enough systems.

The core GI stuff - i.e. zone lists etc is useful for kernel drivers today
and is only tangentially connected to HMAT in that it is a new
form of initiator node alongside CPUs.

I would argue we did not add any core-mm debt to enable GI nodes as we
did. It uses support that is already there for memoryless nodes.  If you look
back at the patch adding GI there aren't any core-mm changes at all.
There is one extra registration pass needed on x86 because it has different
ordering of node onlining to arm64 (which didn't need any arch changes at all).

Sure, we can argue we are possibly complicating some potential changes in
core-mm sometime in the future, though the most I can think of is we would
need to 'not enable' something CPU specific for memoryless and CPU less
nodes.

> 
> > In the GI case, from my point of view the sysfs stuff was a nice addition
> > but not actually that useful.  The info in HMAT is useful, but too little
> > of it was exposed.  What I don't yet have (due to lack of time), is a good
> > set of examples that show more info is needed.  Maybe I'll get to that
> > one day!  
> 
> See the "migration in lieu of discard" [1] series as an attempt to
> make HMAT-like info more useful. The node-demotion infrastructure puts
> HMAT data to more use in the sense that the next phase of memory
> tiering enabling is to have the kernel automatically manage it rather
> than have a few enlightened apps consume the HMEM sysfs directly.
> 
> [1]: http://lore.kernel.org/r/20210126003411.2AC51464@viggo.jf.intel.com

We've been following that series (and related ones) with a great deal
of interest.  As I understand it that code still uses SLIT rather than
HMAT data - I guess that's why you said HMAT-like ;)

Absolutely agree though that using this stuff in kernel is going
to get us a lot more wins than exposing it to userspace is likely
to provide.

Jonathan




^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-02-17  9:55 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-10  3:55 [RFC] ACPI Code First ECR: Generic Target Dan Williams
2021-02-10 11:23 ` Jonathan Cameron
2021-02-10 15:18   ` Natu, Mahesh
2021-02-10 16:02     ` Jonathan Cameron
2021-02-10 16:24   ` Dan Williams
2021-02-11  9:42     ` Jonathan Cameron
2021-02-11 17:06       ` Dan Williams
2021-02-12 12:24         ` Jonathan Cameron
2021-02-12 23:51           ` Dan Williams
2021-02-16 11:06             ` Jonathan Cameron
2021-02-16 16:29               ` Dan Williams
2021-02-16 18:06                 ` Jonathan Cameron
2021-02-16 18:22                   ` Dan Williams
2021-02-16 18:58                     ` Jonathan Cameron
2021-02-16 19:41                       ` Dan Williams
2021-02-17  9:53                         ` Jonathan Cameron
2021-02-10 17:02 ` Vikram Sethi
2021-02-12  0:13   ` Natu, Mahesh

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.