From: Dan Williams <dan.j.williams@intel.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>,
	Ben Widawsky <ben.widawsky@intel.com>,
	linux-cxl@vger.kernel.org, Linux PCI <linux-pci@vger.kernel.org>,
	Alison Schofield <alison.schofield@intel.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Stuart Hayes <Stuart.Hayes@dell.com>
Subject: Re: [PATCH 20/23] cxl/port: Introduce a port driver
Date: Thu, 2 Dec 2021 17:38:17 -0800	[thread overview]
Message-ID: <CAPcyv4jrm+9UmV=6UHpBDkDd+5TTFvWo-jqDKQ6JZyaygmeB+g@mail.gmail.com> (raw)
In-Reply-To: <20211202212405.GA2918514@bhelgaas>

[ add Stuart for NPEM feedback ]

On Thu, Dec 2, 2021 at 1:24 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Tue, Nov 23, 2021 at 11:17:55PM -0800, Dan Williams wrote:
> > On Tue, Nov 23, 2021 at 10:33 PM Christoph Hellwig <hch@lst.de> wrote:
> > > On Tue, Nov 23, 2021 at 04:40:06PM -0800, Dan Williams wrote:
> > > > Let me ask a clarifying question coming from the other direction that
> > > > resulted in the creation of the auxiliary bus architecture. Some
> > > > background. RDMA is a protocol that may run on top of Ethernet.
> > >
> > > No, RDMA is a concept.  Linux supports 2 and a half RDMA protocols
> > > that run over ethernet (RoCE v1 and v2 and iWarp).
> >
> > Yes, I was being too coarse, point taken. However, I don't think that
> > changes the observation that multiple vendors are using aux bus to
> > share a feature driver across multiple base Ethernet drivers.
> >
> > > > Consider the case where you have multiple generations of Ethernet
> > > > adapter devices, but they all support common RDMA functionality. You
> > > > only have the one PCI device to attach a unique Ethernet driver. What
> > > > is an idiomatic way to deploy a module that automatically loads and
> > > > attaches to the exported common functionality across adapters that
> > > > otherwise have a unique native driver for the hardware device?
> > >
> > > The whole aux bus drama is mostly because the intel design for these
> > > is really fucked up.  All the sane HCAs do not use this model.  All
> > > this attachment crap really should not be there.
> >
> > I am missing the counter-proposal in both Bjorn's and your distaste
> > for aux bus and PCIe portdrv.
>
> For the case of PCIe portdrv, the functionality involved is Power
> Management Events (PME), Advanced Error Reporting (AER), PCIe native
> hotplug, Downstream Port Containment (DPC), and Bandwidth
> Notifications.
>
> Currently each has a separate "port service driver" with .probe(),
> .remove(), .suspend(), .resume(), etc.
>
> The services share interrupt vectors.  It's quite complicated to set
> them up, and it has to be done in the portdrv, not in the individual
> drivers.
>
> They also share power state (D0, D3hot, etc).
>
> In my mind these are not separate devices from the underlying PCI
> device, and I don't think splitting the support into "service drivers"
> made things better.  I think it would be simpler if these were just
> added to pci_init_capabilities() like other optional pieces of PCI
> functionality.
>
> Sysfs looks like this:
>
>   /sys/devices/pci0000:00/0000:00:1c.0/                       # Root Port
>   /sys/devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie002/  # AER "device"
>   /sys/devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie010/  # BW notif
>
>   /sys/bus/pci/devices/0000:00:1c.0 -> ../../../devices/pci0000:00/0000:00:1c.0/
>   /sys/bus/pci_express/devices/0000:00:1c.0:pcie002 -> ../../../devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie002/
>
> The "pcie002" names (hex for PCIE_PORT_SERVICE_AER, etc.) are
> unintelligible.  I don't know why we have a separate
> /sys/bus/pci_express hierarchy.
>
> IIUC, CXL devices will be enumerated by the usual PCI enumeration, so
> there will be a struct pci_dev for them, and they will appear under
> /sys/devices/pci*/.
>
> They will have the usual PCI Power Management, MSI, AER, DPC, and
> similar Capabilities, so the PCI core will manage them.
>
> CXL devices have lots of fancy additional features.  Does that merit
> making a separate struct device and a separate sysfs hierarchy for
> them?  I don't know.

Thank you, this framing really helps.

So, if I look at this through the lens of a "can this just be
handled by pci_init_capabilities()?" test, where the capability is
device-scoped and shares resources that a driver for the same
device would use, then yes, PCIe port services do not merit a
separate bus.
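
To make sure we are talking about the same thing, here is a sketch of
what I understand that to mean (pci_aer_init()/pci_dpc_init() are
what probe.c already does today; the other helpers are hypothetical
names, not proposed API):

static void pci_init_capabilities(struct pci_dev *dev)
{
	/* ... existing capability setup (MSI, EA, VPD, ...) ... */
	pci_aer_init(dev);	/* already present today */
	pci_dpc_init(dev);	/* already present today */

	/* hypothetical: fold the remaining port services in */
	pci_pme_init(dev);
	pci_hp_init(dev);
	pci_bwnotif_init(dev);
}

Each feature then shares the device's interrupt vectors and power
state naturally, because there is only one struct pci_dev.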

CXL memory expansion is distinctly not in that category. CXL is a
distinct specification (unlike PCIe port services, which are
literally within the PCI SIG's purview), a distinct transport and
data layer (effectively a different physical connection on the
wire), and it is composable in ways that PCIe is not.

For example, if you trigger FLR on a PCI device you would expect
that everything pci_init_capabilities() set up needs to be saved
and restored; CXL state, by contrast, is not affected by FLR.
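
A sketch of that asymmetry, using the existing interfaces
(pci_save_state(), pcie_flr(), and pci_restore_state() are real;
the CXL comment is the point):

int example_reset(struct pci_dev *pdev)
{
	int rc;

	rc = pci_save_state(pdev);	/* everything PCI knows about */
	if (rc)
		return rc;

	rc = pcie_flr(pdev);

	pci_restore_state(pdev);

	/*
	 * No CXL fixup here: HDM decoder programming and active
	 * CXL.mem regions live behind a separate link layer and
	 * survive the function-level reset.
	 */
	return rc;
}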

The separate link layer for CXL means that mere device visibility
is not sufficient to determine CXL operation. With a PCI driver, if
you can see the device you know it is ready to be the target of
config, I/O, and MMIO cycles; CXL.mem operation, on the other hand,
may not yet be available when the device becomes visible to
enumeration.
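
Concretely, a CXL driver has to do something like the following
after the device is already fully enumerable (a sketch against the
CXL 2.0 device DVSEC; the define names are local shorthand, and
@dvsec would be the offset found via pci_find_dvsec_capability()):

#define CXL_DVSEC_RANGE1_SIZE_LOW	0x18
#define CXL_DVSEC_MEM_ACTIVE		BIT(1)	/* Memory_Active */

static int cxl_await_media_ready(struct pci_dev *pdev, int dvsec)
{
	unsigned long timeout = jiffies + 60 * HZ; /* spec allows minutes */
	u32 val;

	do {
		pci_read_config_dword(pdev,
				dvsec + CXL_DVSEC_RANGE1_SIZE_LOW, &val);
		if (val & CXL_DVSEC_MEM_ACTIVE)
			return 0;	/* CXL.mem is usable */
		msleep(100);
	} while (time_before(jiffies, timeout));

	return -ETIMEDOUT;	/* visible to PCI, not ready for CXL.mem */
}

There is no equivalent wait in a plain PCI driver's probe path.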

The composability of CXL is what most strongly demands an
independent 'struct bus_type' and CXL devices registered alongside
their corresponding PCIe devices. The bus is the rendezvous point
for all the moving pieces needed to validate and provision a CXL
memory region that may span multiple endpoints, switches, and host
bridges. An action like reprogramming the memory decode of an
endpoint needs to be coordinated across the entire topology.
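
In code terms the rendezvous point is just an independent bus plus
port objects that shadow the PCIe topology (abbreviated from what
drivers/cxl/ registers, so treat the field details as illustrative):

struct bus_type cxl_bus_type = {
	.name	= "cxl",
	.match	= cxl_bus_match,
	.probe	= cxl_bus_probe,
	.remove	= cxl_bus_remove,
};

/* one of these per CXL-capable port: host bridge, switch, endpoint */
struct cxl_port {
	struct device dev;
	struct device *uport;		/* the PCI/ACPI device it shadows */
	int id;
	struct list_head dports;	/* downstream links to walk */
};

Reprogramming an endpoint decoder then becomes a walk of cxl_port
objects from the root down, not something a single pci_dev-scoped
driver can do.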

The fact that the PCI core remains blissfully unaware of all these
interdependencies is a feature, because CXL is effectively a
superset of PCIe for fast-path CXL.mem operation, even though it is
derivative of PCIe for enumeration and slow-path manageability.

So I hope you can see CXL's need to create some PCIe-symbiotic
objects in order to compose CXL objects (like interleaved memory
regions) that have no PCIe analog.

> > > > Another example, the Native PCIe Enclosure Management (NPEM)
> > > > specification defines a handful of registers that can appear anywhere
> > > > in the PCIe hierarchy. How can you write a common driver that is
> > > > generically applicable to any given NPEM instance?
> > >
> > > Another totally messed up spec.  But then pretty much everything coming
> > > from the PCIe SIG in terms of interface tends to be really, really
> > > broken lately.
>
> Hotplug is more central to PCI than NPEM is, but NPEM is a little bit
> like PCIe native hotplug in concept: hotplug has a few registers that
> control downstream indicators, interlock, and power controller; NPEM
> has registers that control downstream indicators.
>
> Both are prescribed by the PCIe spec and presumably designed to work
> alongside the usual device-specific drivers for bridges, SSDs, etc.
>
> I would at least explore the idea of doing common support by
> integrating NPEM into the PCI core.  There would have to be some hook
> for the enclosure-specific bits, but I think it's fair for the details
> of sending commands and polling for command completed to be part of
> the PCI core.

The problem with NPEM, compared to hotplug LED signaling, is its
strange property that an NPEM instance higher in the topology may
override one lower in the topology. This exists to support things
like NVMe enclosures, where the NVMe device itself may have an NPEM
instance, but that instance is of course not suitable for signaling
that the device is absent. Instead, the system may be designed with
an NPEM instance in the upstream switch port and the NPEM instances
in the devices disabled. Platform firmware decides which NPEM is in
use.
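
So common NPEM code needs a policy walk before it touches anything,
roughly like this (a sketch: 0x29 is NPEM's extended capability ID
from the PCIe spec but is not in pci_regs.h today, and
npem_enabled() is a hypothetical stand-in for the enable check):

#define PCI_EXT_CAP_ID_NPEM	0x29	/* not yet in pci_regs.h */

static struct pci_dev *npem_find_owner(struct pci_dev *dev)
{
	struct pci_dev *p, *owner = NULL;

	for (p = dev; p; p = pci_upstream_bridge(p)) {
		u16 pos = pci_find_ext_capability(p, PCI_EXT_CAP_ID_NPEM);

		if (pos && npem_enabled(p, pos))	/* hypothetical */
			owner = p;	/* higher instances win */
	}
	return owner;	/* may still lose out to ACPI, see below */
}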

On top of that, it has the "fun" property of being overridable by ACPI.

Stuart, please have a look at collapsing the auxiliary-device
approach into pci_init_capabilities() and at whether the result can
still coordinate with the enclosure code.

One of the nice properties of the auxiliary-device organization is
its well-defined module boundaries. Would NPEM support in the PCI
core now force CONFIG_ENCLOSURE_SERVICES to be built-in? That seems
like an unwanted side effect to me.
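
For comparison, the split the auxiliary device gives us looks
roughly like this (the "npem" naming and probe hooks are
hypothetical; auxiliary_device_init()/auxiliary_device_add() and
module_auxiliary_driver() are the real interfaces):

/* built-in side (PCI core): publish the capability, nothing more */
static int pci_npem_add(struct pci_dev *pdev)
{
	struct auxiliary_device *adev;
	int rc;

	adev = kzalloc(sizeof(*adev), GFP_KERNEL);
	if (!adev)
		return -ENOMEM;

	adev->name = "npem";
	adev->id = pci_dev_id(pdev);
	adev->dev.parent = &pdev->dev;
	adev->dev.release = pci_npem_release;	/* hypothetical */

	rc = auxiliary_device_init(adev);
	if (rc) {
		kfree(adev);
		return rc;
	}

	rc = auxiliary_device_add(adev);
	if (rc)
		auxiliary_device_uninit(adev);
	return rc;
}

/* module side: only this piece needs ENCLOSURE_SERVICES */
static struct auxiliary_driver npem_driver = {
	.probe		= npem_probe,		/* hypothetical */
	.id_table	= npem_id_table,	/* hypothetical */
};
module_auxiliary_driver(npem_driver);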

> > DVSEC and DOE is more of the same in terms of composing add-on
> > features into devices. Hardware vendors want to mix multiple hard-IPs
> > into a single device, aux bus is one response. Topology specific buses
> > like /sys/bus/cxl are another.
>
> VSEC and DVSEC are pretty much wild cards since the PCIe spec says
> nothing about what registers they may contain or how they should work.
>
> DOE *is* specified by PCIe, at least in terms of the data transfer
> protocol (interrupt usage, read/write mailbox, etc).  I think that,
> and the fact that it's not specific to CXL, means we need some kind of
> PCI core interface to do the transfers.

DOE transfer code belongs in drivers/pci/, period, but does it
really need to be in drivers/pci/built-in.a for everyone,
regardless of whether it is ever used?
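
For scale, the whole transfer protocol is a couple of config
registers and a mailbox poll. A sketch (offsets and bits
transcribed from the DOE ECN as local shorthand; no interrupt
support, and a real version would bound the waits):

#define DOE_CTRL		0x08
#define  DOE_CTRL_GO		BIT(31)
#define DOE_STATUS		0x0c
#define  DOE_STATUS_ERROR	BIT(2)
#define  DOE_STATUS_READY	BIT(31)	/* Data Object Ready */
#define DOE_WRITE_MBOX		0x10
#define DOE_READ_MBOX		0x14

static int doe_exchange(struct pci_dev *pdev, int doe,
			const u32 *req, size_t req_dw,
			u32 *rsp, size_t rsp_dw)
{
	u32 val;
	size_t i;

	for (i = 0; i < req_dw; i++)
		pci_write_config_dword(pdev, doe + DOE_WRITE_MBOX, req[i]);
	pci_write_config_dword(pdev, doe + DOE_CTRL, DOE_CTRL_GO);

	do {	/* real code bounds this wait */
		pci_read_config_dword(pdev, doe + DOE_STATUS, &val);
		if (val & DOE_STATUS_ERROR)
			return -EIO;
	} while (!(val & DOE_STATUS_READY));

	for (i = 0; i < rsp_dw; i++) {
		pci_read_config_dword(pdev, doe + DOE_READ_MBOX, &rsp[i]);
		/* a write to the read mailbox pops the current dword */
		pci_write_config_dword(pdev, doe + DOE_READ_MBOX, 0);
	}
	return 0;
}

That is small enough to live happily in a module that only loads
when a DOE-bearing driver asks for it.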
