Re: [RFC] ACPI Code First ECR: Generic Target

From: Dan Williams <dan.j.williams@intel.com>
To: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: linux-cxl@vger.kernel.org,
	Linux ACPI <linux-acpi@vger.kernel.org>,
	"Natu, Mahesh" <mahesh.natu@intel.com>,
	Chet R Douglas <chet.r.douglas@intel.com>,
	Ben Widawsky <ben.widawsky@intel.com>,
	Vishal L Verma <vishal.l.verma@intel.com>
Subject: Re: [RFC] ACPI Code First ECR: Generic Target
Date: Tue, 16 Feb 2021 10:22:28 -0800
Message-ID: <CAPcyv4h=e_a-YD2pAzY5k8Qc-+EMeBNyfzLfpuC01Jey6_sQ5g@mail.gmail.com>
In-Reply-To: <20210216180634.00007178@Huawei.com>

On Tue, Feb 16, 2021 at 10:08 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Tue, 16 Feb 2021 08:29:01 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
>
> > On Tue, Feb 16, 2021 at 3:08 AM Jonathan Cameron
> > <Jonathan.Cameron@huawei.com> wrote:
> > [..]
> > > > Why does GI need anything more than acpi_map_pxm_to_node() to have a
> > > > node number assigned?
> > >
> > > It might have been possible (with limitations) to do it by making multiple
> > > proximity domains map to a single numa node, along with some additional
> > > functionality to allow it to retrieve the real node for aware drivers,
> > > but seeing as we already had the memoryless node infrastructure in place,
> > > > it fitted more naturally into that scheme.  The introduction of GI to the
> > > > ACPI spec, and indeed the kernel, was originally driven by the needs of
> > > CCIX (before CXL was public) with CCIX's symmetric view of initiators
> > > (CPU or other) + a few other existing situations where we'd been
> > > papering over the topology for years and paying a cost in custom
> > > load balancing in drivers etc. That more symmetric view meant that the
> > > natural approach was to treat these as memoryless nodes.
> > >
> > > The full handling of nodes is needed to deal with situations like
> > > the following contrived setup. With a few interconnect
> > > links I haven't bothered drawing, there are existing systems where
> > > a portion of the topology looks like this:
> > >
> > >
> > >     RAM                              RAM             RAM
> > >      |                                |               |
> > >  --------        ---------        --------        --------
> > > | a      |      | b       |      | c      |      | d      |
> > > |   CPUs |------|  PCI RC |------| CPUs   |------|  CPUs  |
> > > |        |      |         |      |        |      |        |
> > >  --------        ---------        --------        --------
> > >                      |
> > >                   PCI EP
> > >
> > > > We need the GI representation to allow an "aware" driver to understand
> > > > that the PCI EP is equidistant from the CPUs and RAM on (a) and (c)
> > > > (and that using allocations from (d) is a bad idea).  This would be
> > > > the same as a driver running on a PCI RC attached to a memoryless
> > > > CPU node (you would hope no one would build one of those, but I've seen
> > > > them occasionally).  Such an aware driver carefully places both memory
> > > > and processing threads / interrupts etc. to balance the load.
> >
> > That's an explanation for why GI exists, not an explanation for why a
> > GI needs to be anything more than translated to a Linux numa node
> > number and an API to look up distance.
>
> Why should a random driver need to know it needs to do something special?
>
> Random drivers don't look up distance, they just allocate memory based on their
> current numa_node. devm_kzalloc() does this under the hood (an optimization
> that rather took me by surprise at the time).
> Sure we could add a bunch of new infrastructure to solve that problem
> but why not use what is already there?
>
> >
> > >
> > > > In pre-GI days, we could just drop (b) into (a) or (c) and not worry about it, but
> > > that comes with a large performance cost (20% plus on network throughput
> > > on some of our more crazy systems, due to it appearing that balancing
> > > memory load across (a) and (c) doesn't make sense).  Also, if we happened
> > > to drop it into (c) then once we run out of space on (c) we'll start
> > > using (d) which is a bad idea.
> > >
> > > > With GI nodes, you need unaware PCI drivers to work well, and they
> > > > will use allocations linked to the particular NUMA node that they are in.
> > > The kernel needs to know a reasonable place to shunt them to and in
> > > more complex topologies the zone list may not correspond to that of
> > > any other node.
> >
> > The kernel "needs"? No, it doesn't. Look at the "target_node" handling
> > for PMEM. Those nodes are offline, the distance can be determined, and
> > only when they become memory does the node become online.
>
> Indeed, custom code for specific cases can work just fine (we've carried
> plenty of it in the past to get best performance from systems), but for GIs
> the intent was they would just work.  We don't want to have to go and change
> stuff in PCI drivers every time we plug a new card into such a system.
>
> >
> > The only point I can see GI needing anything more than the equivalent
> > of "target_node" is when the scheduler can submit jobs to GI
> > initiators like a CPU. Otherwise, GI is just a seed for a node number
> > plus numa distance.
>
> That would be true if Linux didn't already make heavy use of numa_node
> for driver allocations.  We could carry a parallel value of 'real_numa_node'
> or something like that, but you can't safely use numa_node without the
> node being online and zone lists present.
> Another way of looking at it is that zone list is a cache solving the
> question of where to allocate memory, which you could also solve using
> the node number and distances (at the cost of custom handling).
>
> It is of course advantageous to do cleverer things for particular drivers
> but the vast majority need to just work.
>
> >
> > >   In a CCIX world for example, a GI can sit between
> > > a pair of Home Agents with memory, and the host on the other side of
> > > them.  We had a lot of fun working through these cases back when drawing
> > > up the ACPI changes to support them. :)
> > >
> >
> > Yes, I can imagine several interesting ACPI cases, but still
> > struggling to justify the GI zone list metadata.
>
> It works. It solves the problem. It's very little extra code and it
> exercises zero paths not already exercised by memoryless nodes.
> We certainly wouldn't have invented something as complex as zone lists
> if we couldn't leverage what was there of course.
>
> So I have the opposite viewpoint. I can't see why the minor overhead
> of zone list metadata for GIs isn't a sensible choice vs cost of
> maintaining something entirely different.  This only changes with the
> intent to use them to represent something different.

What I am missing is what zone-list metadata offers beyond just
assigning the device-numa-node to the closest online memory node and
letting the HMAT-sysfs representation enumerate the next level. For
example, the persistent memory enabling assigns the closest online
memory node for the pmem device. That achieves the traditional
behavior of the device-driver allocating from "local" memory by
default. However, the HMAT-sysfs representation indicates the numa node
that the pmem itself would represent were it to be online. So the
question is why does GI need more than that? To me a GI is "offline" in
terms of Linux node representations because numactl can't target it.
"Closest online" is good enough for a GI device driver, but if userspace
needs the next level of detail of the performance properties, that's
what HMEM sysfs is providing.
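
To make the two positions concrete, here is a toy userspace model of the
(a)-(b)-(c)-(d) diagram quoted above. It is only a sketch, not kernel code:
the distance values are hypothetical SLIT-style numbers (10 for local, +10 per
hop), and the helper names fallback_order() and closest_memory_node() are
invented for illustration. It compares a per-node fallback order for the GI
node (b) against the order (b) would inherit if it were folded into a
neighbouring memory node.

```python
# Toy model of the quoted topology; every value here is an assumption.
NODES = ["a", "b", "c", "d"]
HAS_MEMORY = {"a": True, "b": False, "c": True, "d": True}  # b is the GI node

# Hypothetical SLIT-style distances: 10 = local, +10 per interconnect hop.
DISTANCE = {
    "a": {"a": 10, "b": 20, "c": 30, "d": 40},
    "b": {"a": 20, "b": 10, "c": 20, "d": 30},
    "c": {"a": 30, "b": 20, "c": 10, "d": 20},
    "d": {"a": 40, "b": 30, "c": 20, "d": 10},
}


def fallback_order(node):
    """Memory nodes sorted by distance from `node`.

    sorted() is stable, so equidistant nodes keep their NODES order.
    """
    candidates = [n for n in NODES if HAS_MEMORY[n]]
    return sorted(candidates, key=lambda n: DISTANCE[node][n])


def closest_memory_node(node):
    """The single 'closest online memory node' for `node`."""
    return fallback_order(node)[0]


if __name__ == "__main__":
    print("order seen from b itself:", fallback_order("b"))
    print("closest memory node to b:", closest_memory_node("b"))
    print("order if b is folded into c:", fallback_order("c"))
```

In this model the order computed from (b) itself puts (a) and (c) ahead of
(d), whereas folding (b) into (c) inherits an order that reaches (d) before
(a). Whether that difference justifies per-GI zone lists, or whether "closest
online" plus the HMAT-sysfs data is sufficient, is exactly the open question in
this thread; the model does not settle it.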