* RFC: Restricting userspace interfaces for CXL fabric management
@ 2024-03-21 17:44 Jonathan Cameron
  2024-03-21 21:41 ` Sreenivas Bagalkote
  2024-04-06  0:04 ` Dan Williams
  0 siblings, 2 replies; 23+ messages in thread
From: Jonathan Cameron @ 2024-03-21 17:44 UTC (permalink / raw)
  To: linux-cxl
  Cc: Sreenivas Bagalkote, Brett Henning, Harold Johnson,
	Sumanesh Samanta, Williams, Dan J, linux-kernel, Davidlohr Bueso,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm,
	linux-api, Lorenzo Pieralisi, Natu, Mahesh

Hi All,

This has come up in a number of discussions, both on list and in private,
so I wanted to lay out a potential set of rules to apply when deciding
whether or not to provide a user space interface for a particular feature
of CXL Fabric Management. The intent is to drive discussion, not to simply
tell people a set of rules. I've brought this to the public lists as it's
a Linux kernel policy discussion, not a standards one.

Whilst I'm writing the RFC, this is my attempt to summarize a possible
position rather than necessarily being my personal view.

It's a straw man - shoot at it!

Not everyone in this discussion is familiar with the relevant kernel or
CXL concepts, so I've provided more background than I normally would.

First some background:
======================

CXL has two different types of Fabric. The comments here refer to both,
but for now the kernel stack is focused on the simpler VCS fabric, not
the more recent Port Based Routing (PBR) fabrics. A typical example of
2 hosts connected to a common switch looks something like:

 ________________              _______________
|                |            |               |    Hosts - each sees
|     HOST A     |            |     HOST B    |    a PCIe style tree
|                |            |               |    but from a fabric config
|  |Root Port|   |            |  |Root Port|  |    point of view it's more
--------|---------            -------|---------    complex.
        |                            |
        |                            |
 _______|____________________________|__________
|  USP (SW-CCI)                     USP         |  Switch can have lots of
|    |                               |          |  Upstream Ports. Each one
|  __|__________              _______|______    |  has a virtual hierarchy.
| |    |        |            |       |      |   |
| vPPB vPPB   vPPB          vPPB          vPPB  |  There are virtual
|  x    |       |            |             |    |  "downstream ports" (vPPBs)
|       \        \          /             /     |  that can be bound to real
|        \        \        /             /      |  downstream ports.
|         \        \      /             /       |
|          \        \    /             /        |  Multi Logical Devices
|         DSP0      DSP1            DSP2        |  support more than one vPPB
-------------------------------------------------  bound to a single physical
           |         |                |            DSP (transactions are tagged
           |         |                |            with an LD-ID)
         SLD0      MLD0             SLD1

Some typical fabric management activities:

1) Bind/Unbind vPPB to physical DSP (results in hotplug / unplug events)
2) Access config space or BAR space of endpoints below the switch.
3) Tunneling messages through to devices downstream (e.g. Dynamic Capacity
   Forced Remove, which will blow away some memory even if a host is
   using it).
4) Non destructive stuff like status read back.

Given the hosts may be using the Type 3 hosted memory (either a Single
Logical Device - SLD, or an LD on a Multi Logical Device - MLD) as normal
memory, unbinding a device that is in use can result in memory being
removed from under a different host. The 'blast radius' is perhaps a rack
of servers. This discussion applies equally to FM-API commands sent to
Multi Head Devices (see CXL r3.1).

The Fabric Management actions are done using the CXL spec defined Fabric
Management API (FM-API), which is transported over various means, including
OoB MCTP over your favourite transport (I2C, PCIe-VDM...) or via normal
PCIe read/write to a Switch-CCI. A Switch-CCI is a mailbox in PCI BAR
space on a function found alongside one of the switch upstream ports;
this mailbox is very similar to the MMPT definition found in PCIe r6.2.

In many cases this Switch-CCI / MCTP connection is used by a BMC rather
than a normal host, but there have been some questions raised about whether
a general purpose server OS would have a valid reason to use this interface
(beyond debug and testing) to configure the switch or an MHD.

If people have a use case for this, please reply to this thread to give
more details.
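To make the "blast radius" point concrete, the four activity classes above
can be sketched as a classification helper. This mirrors only the
categorisation in this mail - it is not spec or kernel code, the enum names
are mine, and real filtering would have to classify individual opcodes:

```c
#include <stdbool.h>

/* Illustrative classes matching the list above; names are invented here. */
enum fm_activity {
	FM_BIND_UNBIND_VPPB,	 /* 1) visible to a host as hotplug/unplug */
	FM_EP_CONFIG_BAR_ACCESS, /* 2) poking config/BAR space below the switch */
	FM_TUNNELED_MSG,	 /* 3) e.g. Dynamic Capacity Forced Remove */
	FM_STATUS_READBACK,	 /* 4) non destructive queries */
};

/*
 * Could this action be observed by *another* host as memory or topology
 * disappearing underneath it? That is what drives the filtering debate.
 */
static bool fm_activity_potentially_destructive(enum fm_activity a)
{
	switch (a) {
	case FM_BIND_UNBIND_VPPB:
	case FM_EP_CONFIG_BAR_ACCESS:
	case FM_TUNNELED_MSG:
		return true;
	case FM_STATUS_READBACK:
	default:
		return false;
	}
}
```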
The most recently posted CXL Switch-CCI support only provided the RAW CXL
command IOCTL interface that is already available for Type 3 memory
devices. That allows for unfettered control of the switch but, because it
is extremely easy to shoot yourself in the foot and cause unsolvable bug
reports, it taints the kernel. There have been several requests to provide
this interface without the taint for these switch configuration mailboxes.

Last posted series:
https://lore.kernel.org/all/20231016125323.18318-1-Jonathan.Cameron@huawei.com/
Note there are unrelated reasons why that code hasn't been updated since
v6.6 time, but I am planning to get back to it shortly.

Similar issues will occur for other uses of PCIe MMPT (a new mailbox in
PCIe that is sometimes used for similarly destructive activity such as
PLDM based firmware update).

On to the proposed rules:

1) Kernel space use of the various mailboxes, or filtered controls from user space.
==================================================================================

Absolutely fine - no one worries about this, but the mediated traffic will
be filtered for potentially destructive side effects. E.g. it will reject
attempts to change anything routing related if the kernel either knows a
host is using memory that will be blown away, or has no way to know (so
potentially affecting routing to another host). This includes blocking
'all' vendor defined messages, as we have no idea what they do. Note this
means the kernel has an allow list, and new commands are not initially
allowed.

This isn't currently enabled for Switch CCIs because they are only really
interesting if the potentially destructive stuff is available (an earlier
version did enable query commands, but it wasn't particularly useful to
know what your switch could do whilst not being allowed to do any of it).
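A minimal sketch of the default-deny allow list described in (1). The
opcode values and the vendor-space floor below are placeholders I made up
for illustration - they are not taken from the CXL spec or from the posted
series:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration: treat opcodes at/above this as vendor defined. */
#define FM_VENDOR_OPCODE_FLOOR	0xC000

/* Default deny: only commands explicitly listed here get through. */
static const uint16_t fm_allowed_opcodes[] = {
	0x5100,	/* placeholder: a non destructive switch query */
	0x5101,	/* placeholder: another status readback command */
};

static bool fm_command_allowed(uint16_t opcode)
{
	size_t i;

	/* Block 'all' vendor defined messages - we have no idea what they do. */
	if (opcode >= FM_VENDOR_OPCODE_FLOOR)
		return false;

	for (i = 0; i < sizeof(fm_allowed_opcodes) / sizeof(fm_allowed_opcodes[0]); i++)
		if (fm_allowed_opcodes[i] == opcode)
			return true;

	/* New commands are not initially allowed. */
	return false;
}
```

A real implementation would also gate allowed-but-risky commands on
runtime state (e.g. is any host bound to the memory in question), which a
static table cannot express.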
If you take the MMPT use case of PLDM firmware update, the filtering would
check that the device was in a state where a firmware update won't rip
memory out from under a host, which would be messy if that host is doing
the update.

2) Unfiltered userspace use of mailbox for Fabric Management - BMC kernels
==========================================================================

(This would just be a kernel option that we'd advise normal server
distributions not to turn on. Would be enabled by openBMC etc.)

This is fine - there is some work to do, but the switch-cci PCI driver
will hopefully be ready for upstream merge soon. There is no filtering of
accesses. Think of this as similar to all the damage you can do via MCTP
from a BMC. Similarly, it is likely that much of the complexity of the
actual commands will be left to user space tooling:
https://gitlab.com/jic23/cxl-fmapi-tests has some test examples.

Whether Kconfig help text is strong enough to ensure this only gets
enabled for BMC targeted distros is an open question we can address
alongside an updated patch set.

(On to the one that the "debate" is about.)

3) Unfiltered user space use of mailbox for Fabric Management - Distro kernels
==============================================================================
(General purpose Linux server distro (Red Hat, SUSE etc.))

This is the equivalent of RAW command support on CXL Type 3 memory devices.
You can enable those in a distro kernel build despite the scary config
help text, but if you use them the kernel is tainted. The result of the
taint is to add a flag to bug reports and print a big message to say that
you've used a feature that might result in you shooting yourself in the
foot.

The taint is there because software is not at first written to deal
smoothly with everything that can happen (e.g. surprise removal). It's
hard to survive some of these events, so they are never on the initial
feature list for any bus; this flag is just there to indicate we have
entered a world where almost all bets are off with respect to stability.
We might not know what a command does, so we can't assess the impact (and
no one trusts vendor commands to report their effects correctly in the
Command Effects Log - which in theory tells you if a command can result
in problems).

A concern was raised about GAE/FAST/LDST tables for CXL Fabrics
(an r3.1 feature) but, as I understand it, these are intended for a
host to configure and should not have side effects on other hosts?
My working assumption is that the kernel driver stack will handle
these (once we catch up with the current feature backlog!). Currently
we have no visibility of what the OS driver stack for fabrics will
actually look like - the spec is just the starting point for that.
(patches welcome ;)

The various CXL upstream developers and maintainers may have
differing views of course, but my current understanding is that we want
to support 1 and 2, but are very resistant to 3!

General Notes
=============

One side aspect of why we really don't like unfiltered userspace access to
any of these devices is that people start building non-standard hacks on
top and we lose the ecosystem advantages. Forcing a considered discussion
+ patches to let a particular command be supported drives standardization.

https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
provides some history on vendor specific extensions and why in general we
won't support them upstream.

To address another question raised in an earlier discussion:
putting these Fabric Management interfaces behind guard rails of some type
(e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not increase the risk of
non-standard interfaces, because we will be even less likely to accept
those upstream!

If anyone needs more details on any aspect of this please ask.
There are a lot of things involved and I've only tried to give a fairly
minimal illustration to drive the discussion. I may well have missed
something crucial.

Jonathan

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: RFC: Restricting userspace interfaces for CXL fabric management
  2024-03-21 17:44 RFC: Restricting userspace interfaces for CXL fabric management Jonathan Cameron
@ 2024-03-21 21:41 ` Sreenivas Bagalkote
  2024-03-22  9:32   ` Jonathan Cameron
  2024-04-06  0:04 ` Dan Williams
  1 sibling, 1 reply; 23+ messages in thread
From: Sreenivas Bagalkote @ 2024-03-21 21:41 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, Brett Henning, Harold Johnson, Sumanesh Samanta,
	Williams, Dan J, linux-kernel, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api,
	Lorenzo Pieralisi, Natu, Mahesh

Thank you for kicking off this discussion, Jonathan.

We need guidance from the community:

1. Datacenter customers must be able to manage PCIe switches in-band.
2. Management of switches includes getting health, performance, and error
   telemetry.
3. These telemetry functions are not yet part of the CXL standard.
4. We built the CCI mailboxes into our PCIe switches per the CXL spec and
   developed our management scheme around them.

If the Linux community does not allow a CXL spec-compliant switch to be
managed via the CXL spec-defined CCI mailbox, then please guide us on the
right approach. Please tell us how you propose we manage our switches
in-band.

Thank you
Sreeni
* Re: RFC: Restricting userspace interfaces for CXL fabric management
  2024-03-21 21:41 ` Sreenivas Bagalkote
@ 2024-03-22  9:32   ` Jonathan Cameron
  2024-03-22 13:24     ` Sreenivas Bagalkote
  0 siblings, 1 reply; 23+ messages in thread
From: Jonathan Cameron @ 2024-03-22  9:32 UTC (permalink / raw)
  To: Sreenivas Bagalkote
  Cc: linux-cxl, Brett Henning, Harold Johnson, Sumanesh Samanta,
	Williams, Dan J, linux-kernel, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api,
	Lorenzo Pieralisi, Natu, Mahesh, Ariel.Sibley

On Thu, 21 Mar 2024 14:41:00 -0700
Sreenivas Bagalkote <sreenivas.bagalkote@broadcom.com> wrote:

> Thank you for kicking off this discussion, Jonathan.

Hi Sreenivas,

> We need guidance from the community.
>
> 1. Datacenter customers must be able to manage PCIe switches in-band.

What is the use case? My understanding so far is that clouds and similar
sometimes use an in-band path, but it would be from a management-only
host, not a general purpose host running other software. Sure, that
control host just connects to a different upstream port so, from a switch
point of view, it's the same as any other host. From a host software
point of view, it's not running general cloud workloads or (at least in
most cases) a general purpose OS distribution. This is the key question
behind this discussion.

> 2. Management of switches includes getting health, performance, and error
> telemetry.

For telemetry (subject to any odd corners like commands that might lock
the interface up for a long time, which we've seen with commands in the
spec!) I don't see any problem supporting those on all host software.
They should be non destructive to other hosts etc.

> 3. These telemetry functions are not yet part of the CXL standard

OK, so we should try to pin down the boundaries around this.
The thread linked below lays out the reasoning behind a general rule of
not accepting vendor defined commands, but perhaps there are routes to
answer some of those concerns.

'Maybe', if you were to publish a specification for those particular
vendor defined commands, it might be fine to add them to the allow list
for the switch-cci. Key here is that Broadcom would be committing to not
using those particular opcodes from the vendor space for anything else in
the future (so we could match on VID + opcode). This is similar to some
DVSEC usage in PCIe (and why DVSEC is different from VSEC). Effectively
you'd be publishing an additional specification building on CXL. Those
are expected to surface anyway from various standards orgs - should we
treat a company published one differently? I don't see why. Exactly how
this would work might take some figuring out (in main code, separate
driver module etc?)

That specification would be expected to provide a similar level of detail
to CXL spec defined commands (ideally the less vague ones, but meh, up to
you, as long as any side effects are clearly documented!)

Speaking for myself, I'd consider this approach. That's particularly true
if I see clear effort in the standards org to push these into future
specifications, as that shows Broadcom is trying to enhance the ecosystem.

> 4. We built the CCI mailboxes into our PCIe switches per CXL spec and
> developed our management scheme around them.
>
> If the Linux community does not allow a CXL spec-compliant switch to be
> managed via the CXL spec-defined CCI mailbox, then please guide us on
> the right approach. Please tell us how you propose we manage our switches
> in-band.

The Linux community is fine supporting this in the kernel (the BMC or
Fabric Management only host case - option 2 below, so the code will be
there); the question here is what advice we offer to the general purpose
distributions and what protections we need to put in place to mitigate
the 'blast radius' concerns.
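The "publish a spec, then allow-list VID + opcode" idea above could look
something like the sketch below. The vendor ID and opcode values are
entirely hypothetical placeholders; the point is only that a vendor
command is accepted solely when the (VID, opcode) pair is backed by a
published specification:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per vendor command that has a published spec behind it. */
struct fm_vendor_cmd {
	uint16_t vid;		/* PCI-SIG vendor ID the spec is published under */
	uint16_t opcode;	/* opcode within the vendor defined space */
};

/* Placeholder entries - not real vendor IDs or opcode assignments. */
static const struct fm_vendor_cmd fm_published_vendor_cmds[] = {
	{ 0x1234, 0xC100 },	/* hypothetical: telemetry read */
	{ 0x1234, 0xC101 },	/* hypothetical: error log dump */
};

static bool fm_vendor_cmd_allowed(uint16_t vid, uint16_t opcode)
{
	size_t i;

	for (i = 0; i < sizeof(fm_published_vendor_cmds) /
		    sizeof(fm_published_vendor_cmds[0]); i++) {
		if (fm_published_vendor_cmds[i].vid == vid &&
		    fm_published_vendor_cmds[i].opcode == opcode)
			return true;
	}
	/* Unpublished vendor commands stay blocked. */
	return false;
}
```

Matching on the pair, rather than on opcode alone, is what makes the
vendor's commitment enforceable: the same opcode from a different VID, or
a new unpublished opcode from the same VID, is still rejected.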
Jonathan > > Thank you > Sreeni > > On Thu, Mar 21, 2024 at 10:44 AM Jonathan Cameron < > Jonathan.Cameron@huawei.com> wrote: > > > Hi All, > > > > This is has come up in a number of discussions both on list and in private, > > so I wanted to lay out a potential set of rules when deciding whether or > > not > > to provide a user space interface for a particular feature of CXL Fabric > > Management. The intent is to drive discussion, not to simply tell people > > a set of rules. I've brought this to the public lists as it's a Linux > > kernel > > policy discussion, not a standards one. > > > > Whilst I'm writing the RFC this my attempt to summarize a possible > > position rather than necessarily being my personal view. > > > > It's a straw man - shoot at it! > > > > Not everyone in this discussion is familiar with relevant kernel or CXL > > concepts > > so I've provided more info than I normally would. > > > > First some background: > > ====================== > > > > CXL has two different types of Fabric. The comments here refer to both, but > > for now the kernel stack is focused on the simpler VCS fabric, not the more > > recent Port Based Routing (PBR) Fabrics. A typical example for 2 hosts > > connected to a common switch looks something like: > > > > ________________ _______________ > > | | | | Hosts - each sees > > | HOST A | | HOST B | a PCIe style tree > > | | | | but from a fabric > > config > > | |Root Port| | | |Root Port| | point of view it's more > > -------|-------- -------|------- complex. > > | | > > | | > > _______|______________________________|________ > > | USP (SW-CCI) USP | Switch can have lots of > > | | | | Upstream Ports. Each one > > | ____|________ _______|______ | has a virtual hierarchy. > > | | | | | | > > | vPPB vPPB vPPB vPPB| There are virtual > > | x | | | | "downstream > > ports."(vPPBs) > > | \ / / | That can be bound to real > > | \ / / | downstream ports. 
> > | \ / / | > > | \ / / | Multi Logical Devices are > > | DSP0 DSP1 DSP 2 | support more than one > > vPPB > > ------------------------------------------------ bound to a single > > physical > > | | | DSP (transactions are > > tagged > > | | | with an LD-ID) > > SLD0 MLD0 SLD1 > > > > Some typical fabric management activities: > > 1) Bind/Unbind vPPB to physical DSP (Results in hotplug / unplug events) > > 2) Access config space or BAR space of End Points below the switch. > > 3) Tunneling messages through to devices downstream (e.g Dynamic Capacity > > Forced Remove that will blow away some memory even if a host is using > > it). > > 4) Non destructive stuff like status read back. > > > > Given the hosts may be using the Type 3 hosted memory (either Single > > Logical > > Device - SLD, or an LD on a Multi logical Device - MLD) as normal memory, > > unbinding a device in use can result in the memory access from a > > different host being removed. The 'blast radius' is perhaps a rack of > > servers. This discussion applies equally to FM-API commands sent to Multi > > Head Devices (see CXL r3.1). > > > > The Fabric Management actions are done using the CXL spec defined Fabric > > Management API, (FM-API) which is transported over various means including > > OoB MCTP over your favourite transport (I2C, PCIe-VDM...) or via normal > > PCIe read/write to a Switch-CCI. A Switch-CCI is mailbox in PCI BAR > > space on a function found alongside one of the switch upstream ports; > > this mailbox is very similar to the MMPT definition found in PCIe r6.2. > > > > In many cases this switch CCI / MCTP connection is used by a BMC rather > > than a normal host, but there have been some questions raised about whether > > a general purpose server OS would have a valid reason to use this interface > > (beyond debug and testing) to configure the switch or an MHD. > > > > If people have a use case for this, please reply to this thread to give > > more details. 
> > > > The most recently posted CXL Switch-CCI support only provided the RAW CXL > > command IOCTL interface that is already available for Type 3 memory > > devices. > > That allows for unfettered control of the switch but, because it is > > extremely easy to shoot yourself in the foot and cause unsolvable bug > > reports, > > it taints the kernel. There have been several requests to provide this > > interface > > without the taint for these switch configuration mailboxes. > > > > Last posted series: > > > > https://lore.kernel.org/all/20231016125323.18318-1-Jonathan.Cameron@huawei.com/ > > Note there are unrelated reasons why that code hasn't been updated since > > v6.6 time, > > but I am planning to get back to it shortly. > > > > Similar issues will occur for other uses of PCIe MMPT (new mailbox in PCI > > that > > sometimes is used for similarly destructive activity such as PLDM based > > firmware update). > > > > > > On to the proposed rules: > > > > 1) Kernel space use of the various mailboxes, or filtered controls from > > user space. > > > > ================================================================================== > > > > Absolutely fine - no one worries about this, but the mediated traffic will > > be filtered for potentially destructive side effects. E.g. it will reject > > attempts to change anything routing related if the kernel either knows a > > host is > > using memory that will be blown away, or has no way to know (so affecting > > routing to another host). This includes blocking 'all' vendor defined > > messages as we have no idea what the do. Note this means the kernel has > > an allow list and new commands are not initially allowed. > > > > This isn't currently enabled for Switch CCIs because they are only really > > interesting if the potentially destructive stuff is available (an earlier > > version did enable query commands, but it wasn't particularly useful to > > know what your switch could do but not be allowed to do any of it). 
> > If you take a MMPT usecase of PLDM firmware update, the filtering would > > check that the device was in a state where a firmware update won't rip > > memory out from under a host, which would be messy if that host is > > doing the update. > > > > 2) Unfiltered userspace use of mailbox for Fabric Management - BMC kernels > > ========================================================================== > > > > (This would just be a kernel option that we'd advise normal server > > distributions not to turn on. Would be enabled by openBMC etc) > > > > This is fine - there is some work to do, but the switch-cci PCI driver > > will hopefully be ready for upstream merge soon. There is no filtering of > > accesses. Think of this as similar to all the damage you can do via > > MCTP from a BMC. Similarly it is likely that much of the complexity > > of the actual commands will be left to user space tooling: > > https://gitlab.com/jic23/cxl-fmapi-tests has some test examples. > > > > Whether Kconfig help text is strong enough to ensure this only gets > > enabled for BMC targeted distros is an open question we can address > > alongside an updated patch set. > > > > (On to the one that the "debate" is about) > > > > 3) Unfiltered user space use of mailbox for Fabric Management - Distro > > kernels > > > > ============================================================================= > > (General purpose Linux Server Distro (Redhat, Suse etc)) > > > > This is equivalent of RAW command support on CXL Type 3 memory devices. > > You can enable those in a distro kernel build despite the scary config > > help text, but if you use it the kernel is tainted. The result > > of the taint is to add a flag to bug reports and print a big message to say > > that you've used a feature that might result in you shooting yourself > > in the foot. > > > > The taint is there because software is not at first written to deal with > > everything that can happen smoothly (e.g. 
surprise removal). It's hard > > to survive some of these events, so is never on the initial feature list > > for any bus, so this flag is just to indicate we have entered a world > > where almost all bets are off wrt stability. We might not know what > > a command does so we can't assess the impact (and no one trusts vendor > > commands to report effects right in the Command Effects Log - which > > in theory tells you if a command can result in problems). > > > > A concern was raised about GAE/FAST/LDST tables for CXL Fabrics > > (a r3.1 feature) but, as I understand it, these are intended for a > > host to configure and should not have side effects on other hosts? > > My working assumption is that the kernel driver stack will handle > > these (once we catch up with the current feature backlog!) Currently > > we have no visibility of what the OS driver stack for fabrics will > > actually look like - the spec is just the starting point for that. > > (patches welcome ;) > > > > The various CXL upstream developers and maintainers may have > > differing views of course, but my current understanding is we want > > to support 1 and 2, but are very resistant to 3! > > > > General Notes > > ============= > > > > One side aspect of why we really don't like unfiltered userspace access to > > any > > of these devices is that people start building non standard hacks in and we > > lose the ecosystem advantages. Forcing a considered discussion + patches > > to let a particular command be supported, drives standardization. > > > > > > https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/ > > provides some history on vendor specific extensions and why in general we > > won't support them upstream. > > > > To address another question raised in an earlier discussion: > > Putting these Fabric Management interfaces behind guard rails of some type > > (e.g. 
CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not increase the risk > > of non standard interfaces, because we will be even less likely to accept > > those upstream! > > > > If anyone needs more details on any aspect of this please ask. > > There are a lot of things involved and I've only tried to give a fairly > > minimal illustration to drive the discussion. I may well have missed > > something crucial. > > > > Jonathan > > > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-03-22 9:32 ` Jonathan Cameron @ 2024-03-22 13:24 ` Sreenivas Bagalkote 2024-04-01 16:51 ` Sreenivas Bagalkote 0 siblings, 1 reply; 23+ messages in thread From: Sreenivas Bagalkote @ 2024-03-22 13:24 UTC (permalink / raw) To: Jonathan Cameron Cc: linux-cxl, Brett Henning, Harold Johnson, Sumanesh Samanta, Williams, Dan J, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh, Ariel.Sibley [-- Attachment #1.1: Type: text/plain, Size: 17763 bytes --] Jonathan, > > What is the use case? My understanding so far is that clouds and > similar sometimes use an in band path but it would be from a management > only host, not a general purpose host running other software > The overwhelming majority of the PCIe switches get deployed in a single server. Typically four to eight switches are connected to two or more root complexes in one or two CPUs. The deployment scenario you have in mind - multiple physical hosts running general workloads and a management-only host - exists. But it is insignificant. > > For telemetry (subject to any odd corners like commands that might lock > the interface up for a long time, which we've seen with commands in the > Spec!) I don't see any problem supporting those on all host software. > They should be non destructive to other hosts etc. > Thank you. As you do this, please keep in mind that your concern about not affecting "other" hosts is theoretically valid but doesn't exist in the real world beyond science experiments. If there are real-world deployments, they are insignificant. I urge you all to make your stuff work with 99.99% of the deployments. > > 'Maybe' if you were to publish a specification for those particular > vendor defined commands, it might be fine to add them to the allow list > for the switch-cci. > Your proposal sounds reasonable. 
I will let you all experts figure out how to support the vendor-defined commands. CXL spec has them for a reason and they need to be supported. Sreeni On Fri, Mar 22, 2024 at 2:32 AM Jonathan Cameron < Jonathan.Cameron@huawei.com> wrote: > On Thu, 21 Mar 2024 14:41:00 -0700 > Sreenivas Bagalkote <sreenivas.bagalkote@broadcom.com> wrote: > > > Thank you for kicking off this discussion, Jonathan. > > Hi Sreenivas, > > > > > We need guidance from the community. > > > > 1. Datacenter customers must be able to manage PCIe switches in-band. > > What is the use case? My understanding so far is that clouds and > similar sometimes use an in band path but it would be from a management > only host, not a general purpose host running other software. Sure > that control host just connects to a different upstream port so, from > a switch point of view, it's the same as any other host. From a host > software point of view it's not running general cloud workloads or > (at least in most cases) a general purpose OS distribution. > > This is the key question behind this discussion. > > > 2. Management of switches includes getting health, performance, and error > > telemetry. > > For telemetry (subject to any odd corners like commands that might lock > the interface up for a long time, which we've seen with commands in the > Spec!) I don't see any problem supporting those on all host software. > They should be non destructive to other hosts etc. > > > 3. These telemetry functions are not yet part of the CXL standard > > Ok, so this is what we should try to pin down the boundaries around. > The thread linked below lays out the reasoning behind a general rule > of not accepting vendor defined commands, but perhaps there are routes > to answer some of those concerns. > > 'Maybe' if you were to publish a specification for those particular > vendor defined commands, it might be fine to add them to the allow list > for the switch-cci. 
Key here is that Broadcom would be committing to not > using those particular opcodes from the vendor space for anything else > in the future (so we could match on VID + opcode). This is similar to > some DVSEC usage in PCIe (and why DVSEC is different from VSEC). > > Effectively you'd be publishing an additional specification building on > CXL. > Those are expected to surface anyway from various standards orgs - should > we treat a company published one differently? I don't see why. > Exactly how this would work might take some figuring out (in main code, > separate driver module etc?) > > That specification would be expected to provide a similar level of detail > to CXL spec defined commands (ideally the less vague ones, but meh, up to > you as long as any side effects are clearly documented!) > > Speaking for myself, I'd consider this approach. > Particularly true if I see clear effort in the standards org to push > these into future specifications as that shows Broadcom is trying to > enhance the ecosystem. > > > > 4. We built the CCI mailboxes into our PCIe switches per CXL spec and > > developed our management scheme around them. > > > > If the Linux community does not allow a CXL spec-compliant switch to be > > managed via the CXL spec-defined CCI mailbox, then please guide us on > > the right approach. Please tell us how you propose we manage our switches > > in-band. > > The Linux community is fine supporting this in the kernel (the BMC or > Fabric Management only host case - option 2 below, so the code will be > there) > the question here is what advice we offer to the general purpose > distributions and what protections we need to put in place to mitigate the > 'blast radius' concerns. 
> > Jonathan
> >
> > > Thank you
> > > Sreeni
> > >
> > > On Thu, Mar 21, 2024 at 10:44 AM Jonathan Cameron <
> > > Jonathan.Cameron@huawei.com> wrote:
> > >
> > > > [snip: full re-quote of the original RFC trimmed]

[-- Attachment #1.2: Type: text/html, Size: 22928 bytes --] [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 4230 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-03-22 13:24 ` Sreenivas Bagalkote @ 2024-04-01 16:51 ` Sreenivas Bagalkote 0 siblings, 0 replies; 23+ messages in thread From: Sreenivas Bagalkote @ 2024-04-01 16:51 UTC (permalink / raw) To: Jonathan Cameron, Williams, Dan J, Natu, Mahesh Cc: linux-cxl, Brett Henning, Harold Johnson, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Ariel.Sibley [-- Attachment #1.1: Type: text/plain, Size: 19024 bytes --] Hello Dan, Jonathan - > > > > 'Maybe' if you were to publish a specification for those particular > > vendor defined commands, it might be fine to add them to the allow list > > for the switch-cci. > > > > Your proposal sounds reasonable. I will let you all experts figure out how to support the vendor-defined commands. CXL spec has them for a reason and they need to be supported. Please confirm that you will support an "allow list" of the vendor-defined commands. We will publish it once you confirm. Thank you Sreeni Sreenivas Bagalkote <Sreenivas.Bagalkote@broadcom.com> Product Planning & Management Broadcom Datacenter Solutions Group On Fri, Mar 22, 2024 at 6:54 PM Sreenivas Bagalkote < sreenivas.bagalkote@broadcom.com> wrote: > Jonathan, > > > > > What is the use case? My understanding so far is that clouds and > > similar sometimes use an in band path but it would be from a management > > only host, not a general purpose host running other software > > > > The overwhelming majority of the PCIe switches get deployed in a single > server. Typically four to eight switches are connected to two or more root > complexes in one or two CPUs. The deployment scenario you have in mind - > multiple physical hosts running general workloads and a management-only > host - exists. But it is insignificant. 
> > > > > For telemetry(subject to any odd corners like commands that might lock > > the interface up for a long time, which we've seen with commands in the > > Spec!) I don't see any problem supporting those on all host software. > > They should be non destructive to other hosts etc. > > > > Thank you. As you do this, please keep in mind that your concern about not > affecting "other" hosts is theoretically valid but doesn't exist in the > real world beyond science experiments. If there are real-world deployments, > they are insignificant. I urge you all to make your stuff work with 99.99% > of the deployments. > > > > > 'Maybe' if you were to publish a specification for those particular > > vendor defined commands, it might be fine to add them to the allow list > > for the switch-cci. > > > > Your proposal sounds reasonable. I will let you all experts figure out how > to support the vendor-defined commands. CXL spec has them for a reason and > they need to be supported. > > Sreeni > > On Fri, Mar 22, 2024 at 2:32 AM Jonathan Cameron < > Jonathan.Cameron@huawei.com> wrote: > >> On Thu, 21 Mar 2024 14:41:00 -0700 >> Sreenivas Bagalkote <sreenivas.bagalkote@broadcom.com> wrote: >> >> > Thank you for kicking off this discussion, Jonathan. >> >> Hi Sreenivas, >> >> > >> > We need guidance from the community. >> > >> > 1. Datacenter customers must be able to manage PCIe switches in-band. >> >> What is the use case? My understanding so far is that clouds and >> similar sometimes use an in band path but it would be from a management >> only host, not a general purpose host running other software. Sure >> that control host just connects to a different upstream port so, from >> a switch point of view, it's the same as any other host. From a host >> software point of view it's not running general cloud workloads or >> (at least in most cases) a general purpose OS distribution. >> >> This is the key question behind this discussion. >> >> > 2. 
Management of switches includes getting health, performance, and >> error >> > telemetry. >> >> For telemetry(subject to any odd corners like commands that might lock >> the interface up for a long time, which we've seen with commands in the >> Spec!) I don't see any problem supporting those on all host software. >> They should be non destructive to other hosts etc. >> >> > 3. These telemetry functions are not yet part of the CXL standard >> >> Ok, so this we should try to pin down the boundaries around this. >> The thread linked below lays out the reasoning behind a general rule >> of not accepting vendor defined commands, but perhaps there are routes >> to answer some of those concerns. >> >> 'Maybe' if you were to publish a specification for those particular >> vendor defined commands, it might be fine to add them to the allow list >> for the switch-cci. Key here is that Broadcom would be committing to not >> using those particular opcodes from the vendor space for anything else >> in the future (so we could match on VID + opcode). This is similar to >> some DVSEC usage in PCIe (and why DVSEC is different from VSEC). >> >> Effectively you'd be publishing an additional specification building on >> CXL. >> Those are expected to surface anyway from various standards orgs - should >> we treat a company published one differently? I don't see why. >> Exactly how this would work might take some figuring out (in main code, >> separate driver module etc?) >> >> That specification would be expected to provide a similar level of detail >> to CXL spec defined commands (ideally the less vague ones, but meh, up to >> you as long as any side effects are clearly documented!) >> >> Speaking for myself, I'd consider this approach. >> Particularly true if I see clear effort in the standards org to push >> these into future specifications as that shows broadcom are trying to >> enhance the ecosystems. >> >> >> > 4. 
We built the CCI mailboxes into our PCIe switches per CXL spec and >> > developed our management scheme around them. >> > >> > If the Linux community does not allow a CXL spec-compliant switch to be >> > managed via the CXL spec-defined CCI mailbox, then please guide us on >> > the right approach. Please tell us how you propose we manage our >> switches >> > in-band. >> >> The Linux community is fine supporting this in the kernel (the BMC or >> Fabric Management only host case - option 2 below, so the code will be >> there) >> the question here is what advice we offer to the general purpose >> distributions and what protections we need to put in place to mitigate the >> 'blast radius' concerns. >> >> Jonathan >> > >> > Thank you >> > Sreeni >> > >> > On Thu, Mar 21, 2024 at 10:44 AM Jonathan Cameron < >> > Jonathan.Cameron@huawei.com> wrote: >> > >> > > Hi All, >> > > >> > > This is has come up in a number of discussions both on list and in >> private, >> > > so I wanted to lay out a potential set of rules when deciding whether >> or >> > > not >> > > to provide a user space interface for a particular feature of CXL >> Fabric >> > > Management. The intent is to drive discussion, not to simply tell >> people >> > > a set of rules. I've brought this to the public lists as it's a Linux >> > > kernel >> > > policy discussion, not a standards one. >> > > >> > > Whilst I'm writing the RFC this my attempt to summarize a possible >> > > position rather than necessarily being my personal view. >> > > >> > > It's a straw man - shoot at it! >> > > >> > > Not everyone in this discussion is familiar with relevant kernel or >> CXL >> > > concepts >> > > so I've provided more info than I normally would. >> > > >> > > First some background: >> > > ====================== >> > > >> > > CXL has two different types of Fabric. 
The comments here refer to >> both, but >> > > for now the kernel stack is focused on the simpler VCS fabric, not >> the more >> > > recent Port Based Routing (PBR) Fabrics. A typical example for 2 hosts >> > > connected to a common switch looks something like: >> > > >> > > ________________ _______________ >> > > | | | | Hosts - each sees >> > > | HOST A | | HOST B | a PCIe style tree >> > > | | | | but from a fabric >> > > config >> > > | |Root Port| | | |Root Port| | point of view >> it's more >> > > -------|-------- -------|------- complex. >> > > | | >> > > | | >> > > _______|______________________________|________ >> > > | USP (SW-CCI) USP | Switch can have >> lots of >> > > | | | | Upstream Ports. >> Each one >> > > | ____|________ _______|______ | has a virtual >> hierarchy. >> > > | | | | | | >> > > | vPPB vPPB vPPB vPPB| There are virtual >> > > | x | | | | "downstream >> > > ports."(vPPBs) >> > > | \ / / | That can be bound >> to real >> > > | \ / / | downstream ports. >> > > | \ / / | >> > > | \ / / | Multi Logical >> Devices are >> > > | DSP0 DSP1 DSP 2 | support more than >> one >> > > vPPB >> > > ------------------------------------------------ bound to a single >> > > physical >> > > | | | DSP (transactions >> are >> > > tagged >> > > | | | with an LD-ID) >> > > SLD0 MLD0 SLD1 >> > > >> > > Some typical fabric management activities: >> > > 1) Bind/Unbind vPPB to physical DSP (Results in hotplug / unplug >> events) >> > > 2) Access config space or BAR space of End Points below the switch. >> > > 3) Tunneling messages through to devices downstream (e.g Dynamic >> Capacity >> > > Forced Remove that will blow away some memory even if a host is >> using >> > > it). >> > > 4) Non destructive stuff like status read back. 
>> > > >> > > Given the hosts may be using the Type 3 hosted memory (either Single >> > > Logical >> > > Device - SLD, or an LD on a Multi logical Device - MLD) as normal >> memory, >> > > unbinding a device in use can result in the memory access from a >> > > different host being removed. The 'blast radius' is perhaps a rack of >> > > servers. This discussion applies equally to FM-API commands sent to >> Multi >> > > Head Devices (see CXL r3.1). >> > > >> > > The Fabric Management actions are done using the CXL spec defined >> Fabric >> > > Management API, (FM-API) which is transported over various means >> including >> > > OoB MCTP over your favourite transport (I2C, PCIe-VDM...) or via >> normal >> > > PCIe read/write to a Switch-CCI. A Switch-CCI is mailbox in PCI BAR >> > > space on a function found alongside one of the switch upstream ports; >> > > this mailbox is very similar to the MMPT definition found in PCIe >> r6.2. >> > > >> > > In many cases this switch CCI / MCTP connection is used by a BMC >> rather >> > > than a normal host, but there have been some questions raised about >> whether >> > > a general purpose server OS would have a valid reason to use this >> interface >> > > (beyond debug and testing) to configure the switch or an MHD. >> > > >> > > If people have a use case for this, please reply to this thread to >> give >> > > more details. >> > > >> > > The most recently posted CXL Switch-CCI support only provided the RAW >> CXL >> > > command IOCTL interface that is already available for Type 3 memory >> > > devices. >> > > That allows for unfettered control of the switch but, because it is >> > > extremely easy to shoot yourself in the foot and cause unsolvable bug >> > > reports, >> > > it taints the kernel. There have been several requests to provide this >> > > interface >> > > without the taint for these switch configuration mailboxes. 
>> > > >> > > Last posted series: >> > > >> > > >> https://lore.kernel.org/all/20231016125323.18318-1-Jonathan.Cameron@huawei.com/ >> > > Note there are unrelated reasons why that code hasn't been updated >> since >> > > v6.6 time, >> > > but I am planning to get back to it shortly. >> > > >> > > Similar issues will occur for other uses of PCIe MMPT (new mailbox in >> PCI >> > > that >> > > sometimes is used for similarly destructive activity such as PLDM >> based >> > > firmware update). >> > > >> > > >> > > On to the proposed rules: >> > > >> > > 1) Kernel space use of the various mailboxes, or filtered controls >> from >> > > user space. >> > > >> > > >> ================================================================================== >> > > >> > > Absolutely fine - no one worries about this, but the mediated traffic >> will >> > > be filtered for potentially destructive side effects. E.g. it will >> reject >> > > attempts to change anything routing related if the kernel either >> knows a >> > > host is >> > > using memory that will be blown away, or has no way to know (so >> affecting >> > > routing to another host). This includes blocking 'all' vendor defined >> > > messages as we have no idea what the do. Note this means the kernel >> has >> > > an allow list and new commands are not initially allowed. >> > > >> > > This isn't currently enabled for Switch CCIs because they are only >> really >> > > interesting if the potentially destructive stuff is available (an >> earlier >> > > version did enable query commands, but it wasn't particularly useful >> to >> > > know what your switch could do but not be allowed to do any of it). >> > > If you take a MMPT usecase of PLDM firmware update, the filtering >> would >> > > check that the device was in a state where a firmware update won't rip >> > > memory out from under a host, which would be messy if that host is >> > > doing the update. 
>> > > >> > > 2) Unfiltered userspace use of mailbox for Fabric Management - BMC >> kernels >> > > >> ========================================================================== >> > > >> > > (This would just be a kernel option that we'd advise normal server >> > > distributions not to turn on. Would be enabled by openBMC etc) >> > > >> > > This is fine - there is some work to do, but the switch-cci PCI driver >> > > will hopefully be ready for upstream merge soon. There is no >> filtering of >> > > accesses. Think of this as similar to all the damage you can do via >> > > MCTP from a BMC. Similarly it is likely that much of the complexity >> > > of the actual commands will be left to user space tooling: >> > > https://gitlab.com/jic23/cxl-fmapi-tests has some test examples. >> > > >> > > Whether Kconfig help text is strong enough to ensure this only gets >> > > enabled for BMC targeted distros is an open question we can address >> > > alongside an updated patch set. >> > > >> > > (On to the one that the "debate" is about) >> > > >> > > 3) Unfiltered user space use of mailbox for Fabric Management - Distro >> > > kernels >> > > >> > > >> ============================================================================= >> > > (General purpose Linux Server Distro (Redhat, Suse etc)) >> > > >> > > This is equivalent of RAW command support on CXL Type 3 memory >> devices. >> > > You can enable those in a distro kernel build despite the scary config >> > > help text, but if you use it the kernel is tainted. The result >> > > of the taint is to add a flag to bug reports and print a big message >> to say >> > > that you've used a feature that might result in you shooting yourself >> > > in the foot. >> > > >> > > The taint is there because software is not at first written to deal >> with >> > > everything that can happen smoothly (e.g. 
surprise removal). It's hard >> > > to survive some of these events, so is never on the initial feature list >> > > for any bus, so this flag is just to indicate we have entered a world >> > > where almost all bets are off wrt stability. We might not know what >> > > a command does so we can't assess the impact (and no one trusts vendor >> > > commands to report effects correctly in the Command Effects Log - which >> > > in theory tells you if a command can result in problems). >> > > >> > > A concern was raised about GAE/FAST/LDST tables for CXL Fabrics >> > > (a r3.1 feature) but, as I understand it, these are intended for a >> > > host to configure and should not have side effects on other hosts? >> > > My working assumption is that the kernel driver stack will handle >> > > these (once we catch up with the current feature backlog!) Currently >> > > we have no visibility of what the OS driver stack for a fabric will >> > > actually look like - the spec is just the starting point for that. >> > > (patches welcome ;) >> > > >> > > The various CXL upstream developers and maintainers may have >> > > differing views of course, but my current understanding is we want >> > > to support 1 and 2, but are very resistant to 3! >> > > >> > > General Notes >> > > ============= >> > > >> > > One side aspect of why we really don't like unfiltered userspace access to any >> > > of these devices is that people start building non standard hacks in and we >> > > lose the ecosystem advantages. Forcing a considered discussion + patches >> > > to let a particular command be supported, drives standardization. >> > > >> > > https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/ >> > > provides some history on vendor specific extensions and why in general we >> > > won't support them upstream.
>> > > >> > > To address another question raised in an earlier discussion: >> > > Putting these Fabric Management interfaces behind guard rails of some type >> > > (e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not encourage the risk >> > > of non standard interfaces, because we will be even less likely to accept >> > > those upstream! >> > > >> > > If anyone needs more details on any aspect of this please ask. >> > > There are a lot of things involved and I've only tried to give a fairly >> > > minimal illustration to drive the discussion. I may well have missed >> > > something crucial. >> > > >> > > Jonathan ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: RFC: Restricting userspace interfaces for CXL fabric management 2024-03-21 17:44 RFC: Restricting userspace interfaces for CXL fabric management Jonathan Cameron 2024-03-21 21:41 ` Sreenivas Bagalkote @ 2024-04-06 0:04 ` Dan Williams 2024-04-10 11:45 ` Jonathan Cameron 1 sibling, 1 reply; 23+ messages in thread From: Dan Williams @ 2024-04-06 0:04 UTC (permalink / raw) To: Jonathan Cameron, linux-cxl Cc: Sreenivas Bagalkote, Brett Henning, Harold Johnson, Sumanesh Samanta, Williams, Dan J, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh Jonathan Cameron wrote: > Hi All, > > This is has come up in a number of discussions both on list and in private, > so I wanted to lay out a potential set of rules when deciding whether or not > to provide a user space interface for a particular feature of CXL Fabric > Management. The intent is to drive discussion, not to simply tell people > a set of rules. I've brought this to the public lists as it's a Linux kernel > policy discussion, not a standards one. > > Whilst I'm writing the RFC this my attempt to summarize a possible > position rather than necessarily being my personal view. > > It's a straw man - shoot at it! > > Not everyone in this discussion is familiar with relevant kernel or CXL concepts > so I've provided more info than I normally would. Thanks for writing this up Jonathan! [..] > 2) Unfiltered userspace use of mailbox for Fabric Management - BMC kernels > ========================================================================== > > (This would just be a kernel option that we'd advise normal server > distributions not to turn on. Would be enabled by openBMC etc) > > This is fine - there is some work to do, but the switch-cci PCI driver > will hopefully be ready for upstream merge soon. There is no filtering of > accesses. Think of this as similar to all the damage you can do via > MCTP from a BMC. 
Similarly it is likely that much of the complexity > of the actual commands will be left to user space tooling: > https://gitlab.com/jic23/cxl-fmapi-tests has some test examples. > > Whether Kconfig help text is strong enough to ensure this only gets > enabled for BMC targeted distros is an open question we can address > alongside an updated patch set. It is not clear to me that this material makes sense to house in drivers/ vs tools/ or even out-of-tree just for maintenance burden relief of keeping the universes separated. What does the Linux kernel project get out of carrying this in mainline alongside the inband code? I do think the mailbox refactoring to support non-CXL use cases is interesting, but only so far as refactoring is consumed for inband use cases like RAS API. > (On to the one that the "debate" is about) > > 3) Unfiltered user space use of mailbox for Fabric Management - Distro kernels > ============================================================================= > (General purpose Linux Server Distro (Redhat, Suse etc)) > > This is equivalent of RAW command support on CXL Type 3 memory devices. > You can enable those in a distro kernel build despite the scary config > help text, but if you use it the kernel is tainted. The result > of the taint is to add a flag to bug reports and print a big message to say > that you've used a feature that might result in you shooting yourself > in the foot. > > The taint is there because software is not at first written to deal with > everything that can happen smoothly (e.g. surprise removal) It's hard > to survive some of these events, so is never on the initial feature list > for any bus, so this flag is just to indicate we have entered a world > where almost all bets are off wrt to stability. 
We might not know what > a command does so we can't assess the impact (and no one trusts vendor > commands to report effects correctly in the Command Effects Log - which > in theory tells you if a command can result in problems). That is a secondary reason that the taint is there. Yes, it helps upstream not waste their time on bug reports from proprietary use cases, but the effect of that is to make "raw" command mode unattractive for deploying solutions at scale. It clarifies that this interface is a debug-tool that enterprise environments need not worry about. The more salient reason for the taint, speaking only for myself as a Linux kernel community member not for $employer, is to encourage open collaboration. Take firmware-update, for example: a standard command with known side effects that is inaccessible via the ioctl() path. It is placed behind an ABI that is easier to maintain and reason about. Everyone has the firmware update tool if they have the 'cat' command. Distros appreciate the fact that they do not need to ship yet another vendor device-update tool, vendors get free tooling, and end users also appreciate one flow for all devices. As I alluded here [1], I am not against innovation outside of the specification, but it needs to be open, and it needs to plausibly become if not a de jure standard at least a de facto standard. [1]: https://lore.kernel.org/all/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/ > A concern was raised about GAE/FAST/LDST tables for CXL Fabrics > (a r3.1 feature) but, as I understand it, these are intended for a > host to configure and should not have side effects on other hosts? > My working assumption is that the kernel driver stack will handle > these (once we catch up with the current feature backlog!) Currently > we have no visibility of what the OS driver stack for a fabric will > actually look like - the spec is just the starting point for that.
> (patches welcome ;) > > The various CXL upstream developers and maintainers may have > differing views of course, but my current understanding is we want > to support 1 and 2, but are very resistant to 3! 1, yes, 2, need to see the patches, and agree on 3. > General Notes > ============= > > One side aspect of why we really don't like unfiltered userspace access to any > of these devices is that people start building non standard hacks in and we > lose the ecosystem advantages. Forcing a considered discussion + patches > to let a particular command be supported, drives standardization. Like I said above, I think this is not a side aspect. It is fundamental to the viability of Linux as a project. This project only works because organizations with competing goals realize they need some common infrastructure and that there is little to be gained by competing on the commons. > https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/ > provides some history on vendor specific extensions and why in general we > won't support them upstream. Oh, you linked my writeup... I will leave the commentary I added here in case restating it helps. > To address another question raised in an earlier discussion: > Putting these Fabric Management interfaces behind guard rails of some type > (e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not encourage the risk > of non standard interfaces, because we will be even less likely to accept > those upstream! > > If anyone needs more details on any aspect of this please ask. > There are a lot of things involved and I've only tried to give a fairly > minimal illustration to drive the discussion. I may well have missed > something crucial. You captured it well, and this is open source so I may have missed something crucial as well.
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-06 0:04 ` Dan Williams @ 2024-04-10 11:45 ` Jonathan Cameron 2024-04-15 20:09 ` Sreenivas Bagalkote 2024-04-24 0:07 ` Dan Williams 0 siblings, 2 replies; 23+ messages in thread From: Jonathan Cameron @ 2024-04-10 11:45 UTC (permalink / raw) To: Dan Williams Cc: linux-cxl, Sreenivas Bagalkote, Brett Henning, Harold Johnson, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh On Fri, 5 Apr 2024 17:04:34 -0700 Dan Williams <dan.j.williams@intel.com> wrote: Hi Dan, > Jonathan Cameron wrote: > > Hi All, > > > > This is has come up in a number of discussions both on list and in private, > > so I wanted to lay out a potential set of rules when deciding whether or not > > to provide a user space interface for a particular feature of CXL Fabric > > Management. The intent is to drive discussion, not to simply tell people > > a set of rules. I've brought this to the public lists as it's a Linux kernel > > policy discussion, not a standards one. > > > > Whilst I'm writing the RFC this my attempt to summarize a possible > > position rather than necessarily being my personal view. > > > > It's a straw man - shoot at it! > > > > Not everyone in this discussion is familiar with relevant kernel or CXL concepts > > so I've provided more info than I normally would. > > Thanks for writing this up Jonathan! > > [..] > > 2) Unfiltered userspace use of mailbox for Fabric Management - BMC kernels > > ========================================================================== > > > > (This would just be a kernel option that we'd advise normal server > > distributions not to turn on. Would be enabled by openBMC etc) > > > > This is fine - there is some work to do, but the switch-cci PCI driver > > will hopefully be ready for upstream merge soon. There is no filtering of > > accesses. 
Think of this as similar to all the damage you can do via > > MCTP from a BMC. Similarly it is likely that much of the complexity > > of the actual commands will be left to user space tooling: > > https://gitlab.com/jic23/cxl-fmapi-tests has some test examples. > > > > Whether Kconfig help text is strong enough to ensure this only gets > > enabled for BMC targeted distros is an open question we can address > > alongside an updated patch set. > > It is not clear to me that this material makes sense to house in > drivers/ vs tools/ or even out-of-tree just for maintenance burden > relief of keeping the universes separated. What does the Linux kernel > project get out of carrying this in mainline alongside the inband code? I'm not sure what you mean by in band. The aim here was to discuss in-band drivers for switch CCI etc. Same reason from a kernel point of view for why we include embedded drivers. I'll interpret "in band" as host driven and "not in band" as FM-API stuff. > I do think the mailbox refactoring to support non-CXL use cases is > interesting, but only so far as refactoring is consumed for inband use > cases like RAS API. If I read this right, I disagree with the 'only so far' bit. In all substantial ways we should support the BMC use case of the Linux kernel at a similar level to how we support other forms of Linux distros. It may not be our target market as developers for particular parts of our companies, but we should not block those who want to support it. We should support them in drivers/ - maybe with example userspace code in tools. Linux distros on BMCs are a big market; there are a number of different distros using (and in some cases contributing to) the upstream kernel. Not everyone is using openBMC so there is not one common place where downstream patches could be carried. From a personal point of view, I like that for the same reasons that I like there being multiple Linux server focused distros.
It's a sign of a healthy ecosystem to have diverse options taking the mainline kernel as their starting point. BMCs are just another embedded market, and like other embedded markets we want to encourage upstream first etc. openBMC has a policy on this: https://github.com/openbmc/docs/blob/master/kernel-development.md "The OpenBMC project maintains a kernel tree for use by the project. The tree's general development policy is that code must be upstream first." There are paths to bypass that for openBMC so it's a little more relaxed than some enterprise distros (though their policies used to look very similar to this), but we should not be telling them they need to carry support downstream. If we are going to tell them that, we need to be able to point at a major sticking point for maintenance burden. So far I don't see the additional complexity as remotely close to reaching that bar. So I think we do want switch-cci support, and for that matter the equivalent for MHDs, in the upstream kernel. One place I think there is some wiggle room is the taint on use of raw commands. Leaving removal of that for BMC kernels as a patch they need to carry downstream doesn't seem too burdensome. I'm sure they'll push back if it is a problem for them! So I think we can kick that question into the future. Addressing maintenance burden, there is a question of where we split the stack. Ignore MHDs for now (I won't go into why in this forum...)
The current proposal is (simplified to ignore some sharing in lookup code etc that I can rip out if we think it might be a long term problem)

 _____________          _____________________
|             |        |                     |
| Switch CCI  |        | Type 3 Driver stack |
|_____________|        |_____________________|
       |___________________________|      Whatever GPU etc
        _______|_______          _______|______
       |               |        |              |
       |   CXL MBOX    |        | RAS API etc  |
       |_______________|        |______________|
               |______________________________|
                        |
                _________|______
               |                |
               |   MMPT mbox    |
               |________________|

Switch CCI Driver: PCI driver doing everything beyond the CXL mbox specific bit.
Type 3 Stack: All the normal stack, just with the CXL mailbox specific stuff factored out. Note we can move different amounts of shared logic in here, but in essence it deals with the extra layer on top of the raw MMPT mbox.
MMPT Mbox: Mailbox as per the PCI spec.
RAS API: Shared RAS API specific infrastructure used by other drivers.

If we see a significant maintenance burden, maybe we duplicate the CXL specific MBOX layer - I can see advantages in that as there is some stuff not relevant to the Switch CCI. There will be some duplication of logic however, such as background command support (which is CXL only, IIUC). We can even use a different IOCTL number so the two can diverge if needed in the long run. e.g. If it makes it easier to get upstream, we can merrily duplicate code so that only the bit common with RAS API etc is shared (assuming they actually end up with MMPT, not the CXL mailbox which is what their current publicly available spec talks about and I assume is a pre-MMPT leftover?)
 _____________          _____________________
|             |        |                     |
| Switch CCI  |        | Type 3 Driver stack |
|_____________|        |_____________________|
        |                         |        Whatever GPU etc
 _______|_______          _______|_______         ______|_______
|               |        |               |       |              |
|   CXL MBOX    |        |   CXL MBOX    |       | RAS API etc  |
|_______________|        |_______________|       |______________|
        |________________________|_______________________|
                         |
                 ________|______
                |               |
                |   MMPT mbox   |
                |_______________|

> > (On to the one that the "debate" is about) > > > > 3) Unfiltered user space use of mailbox for Fabric Management - Distro kernels > > ============================================================================= > > (General purpose Linux Server Distro (Redhat, Suse etc)) > > > > This is the equivalent of RAW command support on CXL Type 3 memory devices. > > You can enable those in a distro kernel build despite the scary config > > help text, but if you use it the kernel is tainted. The result > > of the taint is to add a flag to bug reports and print a big message to say > > that you've used a feature that might result in you shooting yourself > > in the foot. > > > > The taint is there because software is not at first written to deal with > > everything that can happen smoothly (e.g. surprise removal). It's hard > > to survive some of these events, so is never on the initial feature list > > for any bus, so this flag is just to indicate we have entered a world > > where almost all bets are off wrt stability. We might not know what > > a command does so we can't assess the impact (and no one trusts vendor > > commands to report effects correctly in the Command Effects Log - which > > in theory tells you if a command can result in problems). > > That is a secondary reason that the taint is there. Yes, it helps > upstream not waste their time on bug reports from proprietary use cases, > but the effect of that is to make "raw" command mode unattractive for > deploying solutions at scale.
It clarifies that this interface is a > debug-tool that enterprise environment need not worry about. > > The more salient reason for the taint, speaking only for myself as a > Linux kernel community member not for $employer, is to encourage open > collaboration. Take firmware-update for example that is a standard > command with known side effects that is inaccessible via the ioctl() > path. It is placed behind an ABI that is easier to maintain and reason > about. Everyone has the firmware update tool if they have the 'cat' > command. Distros appreciate the fact that they do not need ship yet > another vendor device-update tool, vendors get free tooling and end > users also appreciate one flow for all devices. > > As I alluded here [1], I am not against innovation outside of the > specification, but it needs to be open, and it needs to plausibly become > if not a de jure standard at least a de facto standard. > > [1]: https://lore.kernel.org/all/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/ Agree with all this. > > > A concern was raised about GAE/FAST/LDST tables for CXL Fabrics > > (a r3.1 feature) but, as I understand it, these are intended for a > > host to configure and should not have side effects on other hosts? > > My working assumption is that the kernel driver stack will handle > > these (once we catch up with the current feature backlog!) Currently > > we have no visibility of what the OS driver stack for a fabrics will > > actually look like - the spec is just the starting point for that. > > (patches welcome ;) > > > > The various CXL upstream developers and maintainers may have > > differing views of course, but my current understanding is we want > > to support 1 and 2, but are very resistant to 3! > > 1, yes, 2, need to see the patches, and agree on 3. If we end up with top architecture of the diagrams above, 2 will look pretty similar to last version of the switch-cci patches. So raw commands only + taint. 
Factoring out MMPT is another layer that doesn't make that much difference in practice to this discussion. Good to have, but the reuse here would be one layer above that. Or we just say go for the second proposed architecture with zero impact on the CXL specific code, just reuse of the MMPT layer. I'd imagine people will get grumpy on code duplication (and we'll spend years rejecting patch sets that try to share the code) but there should be no maintenance burden as a result. > > > General Notes > > ============= > > > > One side aspect of why we really don't like unfiltered userspace access to any > > of these devices is that people start building non standard hacks in and we > > lose the ecosystem advantages. Forcing a considered discussion + patches > > to let a particular command be supported, drives standardization. > > Like I said above, I think this is not a side aspect. It is fundamental > to the viability of Linux as a project. This project only works because > organizations with competing goals realize they need some common > infrastructure and that there is little to be gained by competing on the > commons. > > > https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/ > > provides some history on vendor specific extensions and why in general we > > won't support them upstream. > > Oh, you linked my writeup... I will leave the commentary I added here in case > restating it helps. > > > To address another question raised in an earlier discussion: > > Putting these Fabric Management interfaces behind guard rails of some type > > (e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not encourage the risk > > of non standard interfaces, because we will be even less likely to accept > > those upstream! > > > > If anyone needs more details on any aspect of this please ask. > > There are a lot of things involved and I've only tried to give a fairly > > minimal illustration to drive the discussion.
I may well have missed > > something crucial. > > You captured it well, and this is open source so I may have missed > something crucial as well. > Thanks for the detailed reply! Jonathan
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-10 11:45 ` Jonathan Cameron @ 2024-04-15 20:09 ` Sreenivas Bagalkote 2024-04-23 22:44 ` Sreenivas Bagalkote 2024-04-24 0:07 ` Dan Williams 1 sibling, 1 reply; 23+ messages in thread From: Sreenivas Bagalkote @ 2024-04-15 20:09 UTC (permalink / raw) To: Jonathan Cameron, Dan Williams Cc: linux-cxl, Brett Henning, Harold Johnson, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh

Hello,

>> We need guidance from the community.
>> 1. Datacenter customers must be able to manage PCIe switches in-band.
>> 2. Management of switches includes getting health, performance, and error telemetry.
>> 3. These telemetry functions are not yet part of the CXL standard.
>> 4. We built the CCI mailboxes into our PCIe switches per CXL spec and developed our management scheme around them.
>>
>> If the Linux community does not allow a CXL spec-compliant switch to be
>> managed via the CXL spec-defined CCI mailbox, then please guide us on
>> the right approach. Please tell us how you propose we manage our switches
>> in-band.

I am still looking for your guidance. We need to be able to manage our switch via the CCI mailbox. We need to use vendor-defined commands per the CXL spec. You talked about whitelisting commands (allow-list), which we agreed to. Would you please confirm that you will allow the vendor-defined allow-list of commands?
Thank you Sreeni On Wed, Apr 10, 2024 at 5:45 AM Jonathan Cameron < Jonathan.Cameron@huawei.com> wrote: > On Fri, 5 Apr 2024 17:04:34 -0700 > Dan Williams <dan.j.williams@intel.com> wrote: > > Hi Dan, > > > Jonathan Cameron wrote: > > > Hi All, > > > > > > This is has come up in a number of discussions both on list and in > private, > > > so I wanted to lay out a potential set of rules when deciding whether > or not > > > to provide a user space interface for a particular feature of CXL > Fabric > > > Management. The intent is to drive discussion, not to simply tell > people > > > a set of rules. I've brought this to the public lists as it's a Linux > kernel > > > policy discussion, not a standards one. > > > > > > Whilst I'm writing the RFC this my attempt to summarize a possible > > > position rather than necessarily being my personal view. > > > > > > It's a straw man - shoot at it! > > > > > > Not everyone in this discussion is familiar with relevant kernel or > CXL concepts > > > so I've provided more info than I normally would. > > > > Thanks for writing this up Jonathan! > > > > [..] > > > 2) Unfiltered userspace use of mailbox for Fabric Management - BMC > kernels > > > > ========================================================================== > > > > > > (This would just be a kernel option that we'd advise normal server > > > distributions not to turn on. Would be enabled by openBMC etc) > > > > > > This is fine - there is some work to do, but the switch-cci PCI driver > > > will hopefully be ready for upstream merge soon. There is no filtering > of > > > accesses. Think of this as similar to all the damage you can do via > > > MCTP from a BMC. Similarly it is likely that much of the complexity > > > of the actual commands will be left to user space tooling: > > > https://gitlab.com/jic23/cxl-fmapi-tests has some test examples. 
> > > > > > Whether Kconfig help text is strong enough to ensure this only gets > > > enabled for BMC targeted distros is an open question we can address > > > alongside an updated patch set. > > > > It is not clear to me that this material makes sense to house in > > drivers/ vs tools/ or even out-of-tree just for maintenance burden > > relief of keeping the universes separated. What does the Linux kernel > > project get out of carrying this in mainline alongside the inband code? > > I'm not sure what you mean by in band. Aim here was to discuss > in-band drivers for switch CCI etc. Same reason from a kernel point of > view for why we include embedded drivers. I'll interpret in band > as host driven and not inband as FM-API stuff. > > > I do think the mailbox refactoring to support non-CXL use cases is > > interesting, but only so far as refactoring is consumed for inband use > > cases like RAS API. > > If I read this right, I disagree with the 'only so far' bit. > > In all substantial ways we should support BMC use case of the Linux Kernel > at a similar level to how we support forms of Linux Distros. It may > not be our target market as developers for particular parts of our > companies, > but we should not block those who want to support it. > > We should support them in drivers/ - maybe with example userspace code > in tools. Linux distros on BMCs is a big market, there are a number > of different distros using (and in some cases contributing to) the > upstream kernel. Not everyone is using openBMC so there is not one > common place where downstream patches could be carried. > From a personal point of view, I like that for the same reasons that > I like there being multiple Linux sever focused distros. It's a sign > of a healthy ecosystem to have diverse options taking the mainline > kernel as their starting point. > > BMCs are just another embedded market, and like other embedded markets > we want to encourage upstream first etc. 
> openBMC has a policy on this: > https://github.com/openbmc/docs/blob/master/kernel-development.md > "The OpenBMC project maintains a kernel tree for use by the project. > The tree's general development policy is that code must be upstream > first." There are paths to bypass that for openBMC so it's a little > more relaxed than some enterprise distros (today, their policies used > to look very similar to this) but we should not be telling > them they need to carry support downstream. If we are > going to tell them that, we need to be able to point at a major > sticking point for maintenance burden. So far I don't see the > additional complexity as remotely close reaching that bar. > > So I think we do want switch-cci support and for that matter the equivalent > for MHDs in the upstream kernel. > > One place I think there is some wiggle room is the taint on use of raw > commands. Leaving removal of that for BMC kernels as a patch they need > to carry downstream doesn't seem too burdensome. I'm sure they'll push > back if it is a problem for them! So I think we can kick that question > into the future. > > Addressing maintenance burden, there is a question of where we split > the stack. Ignore MHDs for now (I won't go into why in this forum...) > > The current proposal is (simplified to ignore some sharing in lookup code > etc > that I can rip out if we think it might be a long term problem) > > _____________ _____________________ > | | | | > | Switch CCI | | Type 3 Driver stack| > |_____________| |_____________________| > |___________________________| Whatever GPU etc > _______|_______ _______|______ > | | | | > | CXL MBOX | | RAS API etc | > |_______________| |______________| > |_____________________________| > | > _________|______ > | | > | MMPT mbox | > |________________| > > Switch CCI Driver: PCI driver doing everything beyond the CXL mbox > specific bit. > Type 3 Stack: All the normal stack just with the CXL Mailbox specific > stuff factored > out. 
Note we can move different amounts of shared logic in > here, but > in essence it deals with the extra layer on top of the raw > MMPT mbox. > MMPT Mbox: Mailbox as per the PCI spec. > RAS API: Shared RAS API specific infrastructure used by other drivers. > > If we see a significant maintenance burden, maybe we duplicate the CXL > specific > MBOX layer - I can see advantages in that as there is some stuff not > relevant > to the Switch CCI. There will be some duplication of logic, however, such > as background command support (which is CXL-only IIUC). We can even use > a different IOCTL number so the two can diverge if needed in the long run. > > e.g. If it makes it easier to get upstream, we can merrily duplicate code > so that only the bit common with RAS API etc is shared (assuming they > actually end up with MMPT, not the CXL mailbox, which is what their current > publicly available spec talks about and I assume is a pre-MMPT leftover?) > > _____________ _____________________ > | | | | > | Switch CCI | | Type 3 Driver stack| > |_____________| |_____________________| > | | Whatever GPU etc > _______|_______ _______|_______ ______|_______ > | | | | | | > | CXL MBOX | | CXL MBOX | | RAS API etc | > |_______________| |_______________| |______________| > |_____________________________|____________________| > | > ________|______ > | | > | MMPT mbox | > |_______________| > > > > > (On to the one that the "debate" is about) > > > > > > 3) Unfiltered user space use of mailbox for Fabric Management - Distro > kernels > > > > ============================================================================= > > > (General purpose Linux Server Distro (Redhat, Suse etc)) > > > > > > This is the equivalent of RAW command support on CXL Type 3 memory devices. > > > You can enable those in a distro kernel build despite the scary config > > > help text, but if you use it the kernel is tainted. 
The result > > > of the taint is to add a flag to bug reports and print a big message > to say > > > that you've used a feature that might result in you shooting yourself > > > in the foot. > > > > > > The taint is there because software is not at first written to deal > with > > > everything that can happen smoothly (e.g. surprise removal). It's hard > > > to survive some of these events, so is never on the initial feature > list > > > for any bus, so this flag is just to indicate we have entered a world > > > where almost all bets are off wrt stability. We might not know what > > > a command does so we can't assess the impact (and no one trusts vendor > > > commands to report effects right in the Command Effects Log - which > > > in theory tells you if a command can result in problems). > > > > That is a secondary reason that the taint is there. Yes, it helps > > upstream not waste their time on bug reports from proprietary use cases, > > but the effect of that is to make "raw" command mode unattractive for > > deploying solutions at scale. It clarifies that this interface is a > > debug tool that enterprise environments need not worry about. > > > > The more salient reason for the taint, speaking only for myself as a > > Linux kernel community member not for $employer, is to encourage open > > collaboration. Take firmware-update, for example: a standard > > command with known side effects that is inaccessible via the ioctl() > > path. It is placed behind an ABI that is easier to maintain and reason > > about. Everyone has the firmware update tool if they have the 'cat' > > command. Distros appreciate the fact that they do not need to ship yet > > another vendor device-update tool, vendors get free tooling, and end > > users also appreciate one flow for all devices. 
> > > > As I alluded here [1], I am not against innovation outside of the > > specification, but it needs to be open, and it needs to plausibly become > > if not a de jure standard at least a de facto standard. > > > > [1]: > https://lore.kernel.org/all/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/ > > Agree with all this. > > > > > > A concern was raised about GAE/FAST/LDST tables for CXL Fabrics > > > (a r3.1 feature) but, as I understand it, these are intended for a > > > host to configure and should not have side effects on other hosts? > > > My working assumption is that the kernel driver stack will handle > > > these (once we catch up with the current feature backlog!) Currently > > > we have no visibility of what the OS driver stack for a fabric will > > > actually look like - the spec is just the starting point for that. > > > (patches welcome ;) > > > > > > The various CXL upstream developers and maintainers may have > > > differing views of course, but my current understanding is we want > > > to support 1 and 2, but are very resistant to 3! > > > > 1, yes, 2, need to see the patches, and agree on 3. > > If we end up with the top architecture of the diagrams above, 2 will look > pretty > similar to the last version of the switch-cci patches. So raw commands only + > taint. > Factoring out MMPT is another layer that doesn't make that much difference > in > practice to this discussion. Good to have, but the reuse here would be one > layer > above that. > > Or we just say go for the second proposed architecture and 0 impact on the > CXL specific code, just reuse of the MMPT layer. I'd imagine people will > get > grumpy about code duplication (and we'll spend years rejecting patch sets that > try to share the code) but there should be no maintenance burden as > a result. 
> > > > > > General Notes > > > ============= > > > > > > One side aspect of why we really don't like unfiltered userspace > access to any > > > of these devices is that people start building non standard hacks in > and we > > > lose the ecosystem advantages. Forcing a considered discussion + > patches > > > to let a particular command be supported drives standardization. > > > > Like I said above, I think this is not a side aspect. It is fundamental > > to the viability of Linux as a project. This project only works because > > organizations with competing goals realize they need some common > > infrastructure and that there is little to be gained by competing on the > > commons. > > > > > > https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/ > > > provides some history on vendor specific extensions and why in general > we > > > won't support them upstream. > > > > Oh, you linked my writeup... I will leave the commentary I added here in > case > > restating it helps. > > > > > To address another question raised in an earlier discussion: > > > Putting these Fabric Management interfaces behind guard rails of some > type > > > (e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not encourage the risk > > > of non standard interfaces, because we will be even less likely to > accept > > > those upstream! > > > > > > If anyone needs more details on any aspect of this please ask. > > > There are a lot of things involved and I've only tried to give a fairly > > > minimal illustration to drive the discussion. I may well have missed > > > something crucial. > > > > You captured it well, and this is open source so I may have missed > > something crucial as well. > > > > Thanks for the detailed reply! 
> > Jonathan > > > -- This electronic communication and the information and any files transmitted with it, or attached to it, are confidential and are intended solely for the use of the individual or entity to whom it is addressed and may contain information that is confidential, legally privileged, protected by privacy laws, or otherwise restricted from disclosure to anyone else. If you are not the intended recipient or the person responsible for delivering the e-mail to the intended recipient, you are hereby notified that any use, copying, distributing, dissemination, forwarding, printing, or copying of this e-mail is strictly prohibited. If you received this e-mail in error, please return the e-mail to the sender, delete it from your computer, and destroy any printed copy of it. [-- Attachment #1.2: Type: text/html, Size: 20734 bytes --] [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 4230 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-15 20:09 ` Sreenivas Bagalkote @ 2024-04-23 22:44 ` Sreenivas Bagalkote 2024-04-23 23:24 ` Greg KH 0 siblings, 1 reply; 23+ messages in thread From: Sreenivas Bagalkote @ 2024-04-23 22:44 UTC (permalink / raw) To: Jonathan Cameron, Dan Williams Cc: linux-cxl, Brett Henning, Harold Johnson, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh [-- Attachment #1.1: Type: text/plain, Size: 16895 bytes --] Can somebody please at least acknowledge that you are getting my emails? Thank you Sreeni Sreenivas Bagalkote <Sreenivas.Bagalkote@broadcom.com> Product Planning & Management Broadcom Datacenter Solutions Group On Mon, Apr 15, 2024 at 2:09 PM Sreenivas Bagalkote < sreenivas.bagalkote@broadcom.com> wrote: > Hello, > > >> We need guidance from the community. > > >> 1. Datacenter customers must be able to manage PCIe switches in-band. >> 2. Management of switches includes getting health, performance, and error telemetry. >> 3. These telemetry functions are not yet part of the CXL standard. >> 4. We built the CCI mailboxes into our PCIe switches per the CXL spec and developed our management scheme around them. >> >> If the Linux community does not allow a CXL spec-compliant switch to be >> managed via the CXL spec-defined CCI mailbox, then please guide us on >> the right approach. Please tell us how you propose we manage our switches >> in-band. > > I am still looking for your guidance. We need to be able to manage our > switch via the CCI mailbox. We need to use vendor-defined commands per the CXL > spec. > > You talked about whitelisting commands (allow-list), which we agreed to. > Would you please confirm that you will allow the vendor-defined allow-list > of commands? 
> > Thank you > Sreeni > > On Wed, Apr 10, 2024 at 5:45 AM Jonathan Cameron < > Jonathan.Cameron@huawei.com> wrote: > >> [full quote of Jonathan's 2024-04-10 reply snipped - identical to the message earlier in the thread] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-23 22:44 ` Sreenivas Bagalkote @ 2024-04-23 23:24 ` Greg KH 0 siblings, 0 replies; 23+ messages in thread From: Greg KH @ 2024-04-23 23:24 UTC (permalink / raw) To: Sreenivas Bagalkote Cc: Jonathan Cameron, Dan Williams, linux-cxl, Brett Henning, Harold Johnson, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh On Tue, Apr 23, 2024 at 04:44:37PM -0600, Sreenivas Bagalkote wrote: > This electronic communication and the information and any files transmitted > with it, or attached to it, are confidential and are intended solely for > the use of the individual or entity to whom it is addressed and may contain > information that is confidential, legally privileged, protected by privacy > laws, or otherwise restricted from disclosure to anyone else. If you are > not the intended recipient or the person responsible for delivering the > e-mail to the intended recipient, you are hereby notified that any use, > copying, distributing, dissemination, forwarding, printing, or copying of > this e-mail is strictly prohibited. If you received this e-mail in error, > please return the e-mail to the sender, delete it from your computer, and > destroy any printed copy of it. Now deleted. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-10 11:45 ` Jonathan Cameron 2024-04-15 20:09 ` Sreenivas Bagalkote @ 2024-04-24 0:07 ` Dan Williams 2024-04-25 11:33 ` Jonathan Cameron 1 sibling, 1 reply; 23+ messages in thread From: Dan Williams @ 2024-04-24 0:07 UTC (permalink / raw) To: Jonathan Cameron, Dan Williams Cc: linux-cxl, Sreenivas Bagalkote, Brett Henning, Harold Johnson, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh Jonathan Cameron wrote: [..] > > It is not clear to me that this material makes sense to house in > > drivers/ vs tools/ or even out-of-tree just for maintenance burden > > relief of keeping the universes separated. What does the Linux kernel > > project get out of carrying this in mainline alongside the inband code? > > I'm not sure what you mean by in band. Aim here was to discuss > in-band drivers for switch CCI etc. Same reason from a kernel point of > view for why we include embedded drivers. I'll interpret in band > as host driven and not inband as FM-API stuff. > > > I do think the mailbox refactoring to support non-CXL use cases is > > interesting, but only so far as refactoring is consumed for inband use > > cases like RAS API. > > If I read this right, I disagree with the 'only so far' bit. > > In all substantial ways we should support BMC use case of the Linux Kernel > at a similar level to how we support forms of Linux Distros. I think we need to talk in terms of specifics, because in the general case I do not see the blockage. OpenBMC currently is based on v6.6.28 and carries 136 patches. An additional patch to turn off raw command restrictions over there would not even be noticed. > It may not be our target market as developers for particular parts of > our companies, but we should not block those who want to support it. 
It is also the case that there is a responsibility to build maintainable kernel interfaces that can be reasoned about, especially with devices as powerful as CXL that are trusted to host system memory and be caching agents. For example, I do not want to be in the position of auditing whether proposed tunnels and passthroughs violate lockdown expectations. Also, given that these kernels will be built with CONFIG_SECURITY_LOCKDOWN_LSM=n and likely CONFIG_STRICT_DEVMEM=n, the entire user-mode driver ABI is available for use. CXL commands are simple polled MMIO, so does Linux really benefit from carrying drivers in the kernel that the kernel itself does not care about? [..] > Switch CCI Driver: PCI driver doing everything beyond the CXL mbox specific bit. > Type 3 Stack: All the normal stack just with the CXL Mailbox specific stuff factored > out. Note we can move different amounts of shared logic in here, but > in essence it deals with the extra layer on top of the raw MMPT mbox. > MMPT Mbox: Mailbox as per the PCI spec. > RAS API: Shared RAS API specific infrastructure used by other drivers. Once the CXL mailbox core is turned into a library for kernel internal consumers, like RAS API, or CXL accelerators, then it becomes easier to add a Switch CCI consumer (perhaps as an out-of-tree module in tools/), but it is still not clear why the kernel benefits from that arrangement. This is less about blocking developers that have different goals; it is about finding the right projects / places to solve the problem, especially when disjoint design goals are in play and user space drivers might be in reach. [..] > > > The various CXL upstream developers and maintainers may have > > > differing views of course, but my current understanding is we want > > > to support 1 and 2, but are very resistant to 3! 
> > If we end up with the top architecture of the diagrams above, 2 will look pretty > similar to the last version of the switch-cci patches. So raw commands only + taint. > Factoring out MMPT is another layer that doesn't make that much difference in > practice to this discussion. Good to have, but the reuse here would be one layer > above that. > > Or we just say go for the second proposed architecture and 0 impact on the > CXL specific code, just reuse of the MMPT layer. I'd imagine people will get > grumpy about code duplication (and we'll spend years rejecting patch sets that > try to share the code) but there should be no maintenance burden as > a result. I am assuming that the shared code between MMPT and CXL will happen and that all of the command infrastructure is where centralized policy cannot keep up. If OpenBMC wants to land a driver that consumes the MMPT core in tools/, that would seem to satisfy both concerns: mainline not shipping ABI that host kernels need to strictly reason about, while OpenBMC does not need to carry out-of-tree patches indefinitely. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-24 0:07 ` Dan Williams @ 2024-04-25 11:33 ` Jonathan Cameron 2024-04-25 16:18 ` Dan Williams 0 siblings, 1 reply; 23+ messages in thread From: Jonathan Cameron @ 2024-04-25 11:33 UTC (permalink / raw) To: Dan Williams Cc: linux-cxl, Sreenivas Bagalkote, Brett Henning, Harold Johnson, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh On Tue, 23 Apr 2024 17:07:58 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > Jonathan Cameron wrote: > [..] > > > It is not clear to me that this material makes sense to house in > > > drivers/ vs tools/ or even out-of-tree just for maintenance burden > > > relief of keeping the universes separated. What does the Linux kernel > > > project get out of carrying this in mainline alongside the inband code? > > > > I'm not sure what you mean by in band. Aim here was to discuss > > in-band drivers for switch CCI etc. Same reason from a kernel point of > > view for why we include embedded drivers. I'll interpret in band > > as host driven and not inband as FM-API stuff. > > > > > I do think the mailbox refactoring to support non-CXL use cases is > > > interesting, but only so far as refactoring is consumed for inband use > > > cases like RAS API. > > > > If I read this right, I disagree with the 'only so far' bit. > > > > In all substantial ways we should support BMC use case of the Linux Kernel > > at a similar level to how we support forms of Linux Distros. > > I think we need to talk in terms of specifics, because in the general > case I do not see the blockage. OpenBMC currently is based on v6.6.28 > and carries 136 patches. An additional patch to turn off raw commands > restrictions over there would not even be noticed. 
Hi Dan, That I'm fine with - it's a reasonable middle ground where we ensure they have a sensible upstream solution, but just patch around the taint etc in the downstream projects. Note 136 patches is tiny for a distro and reflects their hard work upstreaming stuff. > > > It may not be our target market as developers for particular parts of > > our companies, but we should not block those who want to support it. > > It is also the case that there is a responsibility to build maintainable > kernel interfaces that can be reasoned about, especially with devices as > powerful as CXL that are trusted to host system memory and be caching > agents. For example, I do not want to be in the position of auditing > whether proposed tunnels and passthroughs violate lockdown expectations. I agree with that - this can be made dependent on not locking down in the same way lots of other somewhat dangerous interfaces are. We can relax that restriction as things do get audited - sure tunnels aren't going to be on that allowed list in the short term. > > Also, the assertion that these kernels will be built with > CONFIG_SECURITY_LOCKDOWN_LSM=n and likely CONFIG_STRICT_DEVMEM=n, then > the entire user-mode driver ABI is available for use. CXL commands are > simple polled mmio, does Linux really benefit from carrying drivers in > the kernel that the kernel itself does not care about? Sure we could do it in userspace... It's bad engineering, limits the design to polling only and uses a bunch of interfaces we put a lot of effort into telling people not to use except for debug. I really don't see the advantage in pushing a project/group of projects all of which are picking the upstream kernel up directly, to do a dirty hack. We lose all the advantages of a proper well maintained kernel driver purely on the argument that one use model is not the same as this one. 
Sensible security lockdown requirements are fine (along with all the other kernel features that must be disabled for that to work), making open kernel development for a large Linux market harder is not. > > [..] > > Switch CCI Driver: PCI driver doing everything beyond the CXL mbox specific bit. > > Type 3 Stack: All the normal stack just with the CXL Mailbox specific stuff factored > > out. Note we can move different amounts of shared logic in here, but > > in essence it deals with the extra layer on top of the raw MMPT mbox. > > MMPT Mbox: Mailbox as per the PCI spec. > > RAS API: Shared RAS API specific infrastructure used by other drivers. > > Once the CXL mailbox core is turned into a library for kernel internal > consumers, like RAS API, or CXL accelerators, then it becomes easier to > add a Switch CCI consumer (perhaps as an out-of-tree module in tools/), > but it is still not clear why the kernel benefits from that arrangement. We can argue later on this. But from my point of view, in tree module not in tools is a must. Doesn't have to be in drivers/cxl if it's simply the association aspect that is a blocker. > > This is less about blocking developers that have different goals it is > about finding the right projects / places to solve the problem > especially when disjoint design goals are in play and user space drivers > might be in reach. Key here is this is not a case of openBMC is the one true distro on which all Linux BMCs and fabric management platforms are based. So we are really talking a random out of tree driver with all the maintenance overhead that brings. Yuck. So I don't see there being any good solution outside of upstream support other than pushing this to be a userspace hack. > > [..] > > > > The various CXL upstream developers and maintainers may have > > > > differing views of course, but my current understanding is we want > > > > to support 1 and 2, but are very resistant to 3! 
> > > > > > 1, yes, 2, need to see the patches, and agree on 3. > > > > If we end up with top architecture of the diagrams above, 2 will look pretty > > similar to last version of the switch-cci patches. So raw commands only + taint. > > Factoring out MMPT is another layer that doesn't make that much difference in > > practice to this discussion. Good to have, but the reuse here would be one layer > > above that. > > > > Or we just say go for second proposed architecture and 0 impact on the > > CXL specific code, just reuse of the MMPT layer. I'd imagine people will get > > grumpy on code duplication (and we'll spend years rejecting patch sets that > > try to share the code) but there should be no maintenance burden as > > a result. > > I am assuming that the shared code between MMPT and CXL will happen and > that all of the command infrastructure is where centralized policy can > not keep up. There is actually very little to MMPT, but sure there will be some sharing of code and the policy won't sit in that shared part as it is protocol specific. > If OpenBMC wants to land a driver that consumes the MMPT > core in tools/ that would seem to satisfy both the concerns of mainline > not shipping ABI that host kernels need to strictly reason about while > letting OpenBMC not need to carry out-of-tree patches indefinitely. We can argue that detail later, but tools is not in my opinion a valid solution to supporting properly maintained upstream drivers. It's a hack for test and example modules only. The path of just blocking this driver in any locked down situation seems much more in line with kernel norms. It would also set an extremely bad precedent. Jonathan p.s. I don't care in the slightest about openBMC (other than general warm fuzzy feelings about a good open source project), I do care rather more about BMCs and other fabric managers. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-25 11:33 ` Jonathan Cameron @ 2024-04-25 16:18 ` Dan Williams 2024-04-25 17:26 ` Jonathan Cameron 0 siblings, 1 reply; 23+ messages in thread From: Dan Williams @ 2024-04-25 16:18 UTC (permalink / raw) To: Jonathan Cameron, Dan Williams Cc: linux-cxl, Sreenivas Bagalkote, Brett Henning, Harold Johnson, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh Jonathan Cameron wrote: [..] > > Also, the assertion that these kernels will be built with > > CONFIG_SECURITY_LOCKDOWN_LSM=n and likely CONFIG_STRICT_DEVMEM=n, then > > the entire user-mode driver ABI is available for use. CXL commands are > > simple polled mmio, does Linux really benefit from carrying drivers in > > the kernel that the kernel itself does not care about? > > Sure we could it in userspace... It's bad engineering, limits the design > to polling only and uses a bunch of interfaces we put a lot of effort into > telling people not to use except for debug. > > I really don't see the advantage in pushing a project/group of projects > all of which are picking the upstream kernel up directly, to do a dirty > hack. We loose all the advantages of a proper well maintained kernel > driver purely on the argument that one use model is not the same as > this one. Sensible security lockdown requirements is fine (along > with all the other kernel features that must be disable for that > to work), making open kernel development on for a large Linux > market harder is not. The minimum requirement for justifying an in kernel driver is that something else in the kernel consumes that facility. So, again, I want to get back to specifics what else in the kernel is going to leverage the Switch CCI mailbox? 
The generic-Type-3-device mailbox has an in kernel driver because the kernel has need to send mailbox commands internally and it is fundamental to RAS and provisioning flows that the kernel have this coordination. What are the motivations for an in-band Switch CCI command submission path? It could be the case that you have a self-evident example in mind that I have thus far failed to realize. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-25 16:18 ` Dan Williams @ 2024-04-25 17:26 ` Jonathan Cameron 2024-04-25 19:25 ` Dan Williams 0 siblings, 1 reply; 23+ messages in thread From: Jonathan Cameron @ 2024-04-25 17:26 UTC (permalink / raw) To: Dan Williams Cc: linux-cxl, Sreenivas Bagalkote, Brett Henning, Harold Johnson, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh, gregkh On Thu, 25 Apr 2024 09:18:53 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > Jonathan Cameron wrote: > [..] > > > Also, the assertion that these kernels will be built with > > > CONFIG_SECURITY_LOCKDOWN_LSM=n and likely CONFIG_STRICT_DEVMEM=n, then > > > the entire user-mode driver ABI is available for use. CXL commands are > > > simple polled mmio, does Linux really benefit from carrying drivers in > > > the kernel that the kernel itself does not care about? > > > > Sure we could it in userspace... It's bad engineering, limits the design > > to polling only and uses a bunch of interfaces we put a lot of effort into > > telling people not to use except for debug. > > > > I really don't see the advantage in pushing a project/group of projects > > all of which are picking the upstream kernel up directly, to do a dirty > > hack. We loose all the advantages of a proper well maintained kernel > > driver purely on the argument that one use model is not the same as > > this one. Sensible security lockdown requirements is fine (along > > with all the other kernel features that must be disable for that > > to work), making open kernel development on for a large Linux > > market harder is not. > > The minimum requirement for justifying an in kernel driver is that > something else in the kernel consumes that facility. So, again, I want > to get back to specifics what else in the kernel is going to leverage > the Switch CCI mailbox? Why? 
I've never heard of such a requirement and numerous drivers provide fairly direct access to hardware. Sometimes there is a subsystem aiding the data formatting etc, but fundamentally that's a convenience. Taking this to a silly level, on this basis all networking drivers would not be in the kernel. They are there mainly to provide userspace access to a network. Any of the hardware access subsystems such as hwmon, input, IIO etc are primarily about providing a convenient way to get data to/from a device. They are kernel drivers because that is the cleaner path for data marshaling, interrupt handling etc. In kernel users are a perfectly valid reason to have a kernel driver, but it's far from the only one. None of the AI accelerators have in kernel users today (maybe they will in future). Sure there are other arguments that mean only a few such devices have been upstreamed, but it's not that they need in kernel users. If it's really an issue I'll just submit it to driver/misc and Greg can take a view on whether it's an acceptable device to have driver for... (after he's asked the obvious question of why aren't the CXL folk taking it!) +cc Greg to save providing info later. For background this is a PCI function with a mailbox used for switch configuration. The mailbox is identical to the one found on CXL type3 devices. Whole thing defined in the CXL spec. It gets a little complex because you can tunnel commands to devices connected to the switch, potentially affecting other hosts. Typical Linux device doing this would be a BMC, but there have been repeated questions about providing a subset of access to any Linux system (avoiding the foot guns) Whole thing fully discoverable - proposal is a standard PCI driver. > > The generic-Type-3-device mailbox has an in kernel driver because the > kernel has need to send mailbox commands internally and it is > fundamental to RAS and provisioning flows that the kernel have this > coordination. 
What are the motivations for an in-band Switch CCI command > submission path? > > It could be the case that you have a self-evident example in mind that I > have thus far failed to realize. > There are possibilities, but for now it's a transport driver just like MCTP etc with a well defined chardev interface, with documented ioctl interface etc (which I'd keep in line with the one the CXL mailbox uses just to avoid reinventing the wheel - I'd prefer to use that directly to avoid divergence but I don't care that much). As far as I can see, with the security / blast radius concern alleviated by disabling this if lockdown is in use + taint for unaudited commands (and a nasty sounding config similar to the cxl mailbox one) there is little reason not to take such a driver into the kernel. It has next to no maintenance impact outside of itself and a bit of library code which I've proposed pushing down to the level of MMPT (so PCI not CXL) if you think that is necessary. We want interrupt handling and basic access controls / command interface to userspace. Apologies if I'm grumpy - several long days of battling cpu hotplug code. Jonathan ^ permalink raw reply [flat|nested] 23+ messages in thread
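The gating Jonathan describes above (interface disabled under lockdown, taint applied to unaudited raw commands) can be sketched as a decision policy. The block below is a self-contained user-space model of that logic, not real driver code: in an actual kernel driver the stubbed fields would map to helpers like `security_locked_down()` and `add_taint()`, and the names here are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Self-contained model of the proposed gating policy for a Switch CCI
 * command path. In a real driver the checks would map to
 * security_locked_down(...) and add_taint(TAINT_USER, ...); here they
 * are plain struct fields so the policy itself can be exercised.
 */
enum cmd_class {
	CMD_AUDITED,	/* spec-defined command reviewed as safe */
	CMD_RAW,	/* unaudited passthrough command */
};

struct policy_state {
	bool lockdown_active;	/* stand-in for security_locked_down() */
	bool raw_enabled;	/* stand-in for an opt-in config/module option */
	bool tainted;		/* stand-in for the add_taint() side effect */
};

/* Returns 0 on permit, -1 (EPERM-like) on reject. */
static int submit_cmd(struct policy_state *s, enum cmd_class c)
{
	if (s->lockdown_active)
		return -1;		/* whole interface off under lockdown */
	if (c == CMD_RAW) {
		if (!s->raw_enabled)
			return -1;	/* raw path is strictly opt-in */
		s->tainted = true;	/* unaudited command taints the kernel */
	}
	return 0;
}
```

A downstream project such as OpenBMC could then carry a one-line patch flipping the `raw_enabled` default, which is exactly the "patch around the taint" middle ground discussed earlier in the thread.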
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-25 17:26 ` Jonathan Cameron @ 2024-04-25 19:25 ` Dan Williams 2024-04-26 8:45 ` Jonathan Cameron 0 siblings, 1 reply; 23+ messages in thread From: Dan Williams @ 2024-04-25 19:25 UTC (permalink / raw) To: Jonathan Cameron, Dan Williams Cc: linux-cxl, Sreenivas Bagalkote, Brett Henning, Harold Johnson, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh, gregkh Jonathan Cameron wrote: > Dan Williams <dan.j.williams@intel.com> wrote: > > The minimum requirement for justifying an in kernel driver is that > > something else in the kernel consumes that facility. So, again, I want > > to get back to specifics what else in the kernel is going to leverage > > the Switch CCI mailbox? > > Why? I've never heard of such as a requirement and numerous drivers > provide fairly direct access to hardware. Sometimes there is a subsystem > aiding the data formatting etc, but fundamentally that's a convenience. > > Taking this to a silly level, on this basis all networking drivers would > not be in the kernel. They are there mainly to provide userspace access to > a network. Networking is an odd choice to bring into this discussion because that subsystem has a long history of wrestling with the "kernel bypass" concern. It has largely been able to weather the storm of calls to get out of the way and let vendor drivers have free reign. The AF_XDP socket family was the result of finding a path to let userspace networking stacks build functionality without forfeiting the relevance and ongoing collaboration on the in-kernel stack. > Any of the hardware access subsystems such hwmon, input, IIO > etc are primarily about providing a convenient way to get data to/from > a device. They are kernel drivers because that is the cleaner path > for data marshaling, interrupt handling etc. 
Those are drivers supporting a subsystem to bring a sane kernel interface to front potentially multiple vendor implementations of similar functionality. They are not asking for kernel bypass facilities that defeat the purpose of ever talking to the kernel community again for potentially system-integrity violating functionality behind disparate vendor interfaces. > In kernel users are a perfectly valid reason to have a kernel driver, > but it's far from the only one. None of the AI accelerators have in kernel > users today (maybe they will in future). Sure there are other arguments > that mean only a few such devices have been upstreamed, but it's not > that they need in kernel users. If it's really an issue I'll just submit > it to driver/misc and Greg can take a view on whether it's an acceptable > device to have driver for... (after he's asked the obvious question of > why aren't the CXL folk taking it!) +cc Greg to save providing info later. AI accelerators are heavy consumers of the core-mm; you can not reasonably coordinate with the core-mm from userspace. If the proposal is to build a new CXL Fabric Management subsystem with proper ABIs and openly defined command sets that will sit behind thought out kernel interfaces then I can get on board with that. Where I am stuck currently is the assertion that step 1 is "build ioctl passthrough tunnels with 'do anything you want and get away with it' semantics". Recall that the current restriction for raw commands was to encourage vendor collaboration and building sane kernel interfaces, and that distros would enable it in their "debug" kernels to enable hardware validation test benches. If the assertion is "that's too restrictive, enable a vendor ecosystem based on kernel bypass" that goes too far. > For background this is a PCI function with a mailbox used for switch > configuration. The mailbox is identical to the one found on CXL type3 > devices. Whole thing defined in the CXL spec. 
It gets a little complex > because you can tunnel commands to devices connected to the switch, > potentially affecting other hosts. Typical Linux device doing this > would be a BMC, but there have been repeated questions about providing > a subset of access to any Linux system (avoiding the foot guns) > Whole thing fully discoverable - proposal is a standard PCI driver. > > > The generic-Type-3-device mailbox has an in kernel driver because the > > kernel has need to send mailbox commands internally and it is > > fundamental to RAS and provisioning flows that the kernel have this > > coordination. What are the motivations for an in-band Switch CCI command > > submission path? > > > > It could be the case that you have a self-evident example in mind that I > > have thus far failed to realize. > > > > There are possibilities, but for now it's a transport driver just like > MCTP etc with a well defined chardev interface, with documented ioctl > interface etc (which I'd keep inline with one the CXL mailbox uses > just to avoid reinventing the wheel - I'd prefer to use that directly > to avoid divergence but I don't care that much). > > As far as I can see, with the security / blast radius concern alleviated > by disabling this if lockdown is in use + taint for unaudited commands > (and a nasty sounding config similar to the cxl mailbox one) > there is little reason not to take such a driver into the kernel. > It has next to no maintenance impact outside of itself and a bit of > library code which I've proposed pushing down to the level of MMPT > (so PCI not CXL) if you think that is necessary. > > We want interrupt handling and basic access controls / command > interface to userspace. > > Apologies if I'm grumpy - several long days of battling cpu hotplug code. Again, can we please get back to the specifics of the commands to be enabled here? 
I am open to CXL Fabric Management as a first class citizen, I am not currently open to CXL Fabric Management gets to live in the corner of the kernel that is unreviewable because all it does is take opaque ioctl blobs and marshal them to hardware. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-25 19:25 ` Dan Williams @ 2024-04-26 8:45 ` Jonathan Cameron 2024-04-26 16:16 ` Dan Williams 0 siblings, 1 reply; 23+ messages in thread From: Jonathan Cameron @ 2024-04-26 8:45 UTC (permalink / raw) To: Dan Williams Cc: linux-cxl, Sreenivas Bagalkote, Brett Henning, Harold Johnson, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh, gregkh On Thu, 25 Apr 2024 12:25:35 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > Jonathan Cameron wrote: > > Dan Williams <dan.j.williams@intel.com> wrote: > > > The minimum requirement for justifying an in kernel driver is that > > > something else in the kernel consumes that facility. So, again, I want > > > to get back to specifics what else in the kernel is going to leverage > > > the Switch CCI mailbox? > > > > Why? I've never heard of such as a requirement and numerous drivers > > provide fairly direct access to hardware. Sometimes there is a subsystem > > aiding the data formatting etc, but fundamentally that's a convenience. > > > > Taking this to a silly level, on this basis all networking drivers would > > not be in the kernel. They are there mainly to provide userspace access to > > a network. > > Networking is an odd choice to bring into this discussion because that > subsystem has a long history of wrestling with the "kernel bypass" > concern. It has largely been able to weather the storm of calls to get > out of the way and let vendor drivers have free reign. > > The AF_XDP socket family was the result of finding a path to let > userspace networking stacks build functionality without forfeiting the > relevance and ongoing collaboration on the in-kernel stack. This first chunk of my reply was all about poking holes in the 'must have an in kernel user' not trying to make any broader points. 
That argument is one bullet of perhaps 20 good reasons for in-kernel support, you've listed quite a few more so point hopefully made. I fully agree with the other reasons you've given for why things are in the kernel, but key here is we don't need to meet the in-kernel user requirement. Here the focus is on encouraging standards adoption. > > > Any of the hardware access subsystems such hwmon, input, IIO > > etc are primarily about providing a convenient way to get data to/from > > a device. They are kernel drivers because that is the cleaner path > > for data marshaling, interrupt handling etc. > > Those are drivers supporting a subsystem to bring a sane kernel > interface to front potentially multiple vendor implementations of > similar functionality. I have painful memories of early feedback on IIO that was very similar to arguments here - it wasn't far off killing off the whole effort. > > They are not asking for kernel bypass facilities that defeat the purpose > of ever talking to the kernel community again for potentially > system-integrity violating functionality behind disparate vendor > interfaces. I'm not asking for that at all. This discussion got a bit derailed though so I can see why you might have taken that impression away. > > > In kernel users are a perfectly valid reason to have a kernel driver, > > but it's far from the only one. None of the AI accelerators have in kernel > > users today (maybe they will in future). Sure there are other arguments > > that mean only a few such devices have been upstreamed, but it's not > > that they need in kernel users. If it's really an issue I'll just submit > > it to driver/misc and Greg can take a view on whether it's an acceptable > > device to have driver for... (after he's asked the obvious question of > > why aren't the CXL folk taking it!) +cc Greg to save providing info later. > > AI accelerators are heavy consumers of the core-mm you can not > reasonably coordinate with the core-mm from userspace. 
> > If the proposal is to build a new CXL Fabric Management subsystem with > proper ABIs and openly defined command sets that will sit behind thought > out kernel interfaces then I can get on board with that. > > Where I am stuck currently is the assertion that step 1 is "build ioctl > passthrough tunnels with 'do anything you want and get away with it' > semantics". If you think step 1 is the end goal, then indeed I see your problem. Reality is that this stuff is needed today and similar to type 3 there will be people who use the raw interface to enable their custom stuff - we can enable that but put similar measures in place to encourage upstreaming / standardization. I should perhaps not have diverted into the fact openBMC 'might' patch out the taint. I suspect some server distros will do that for type 3 devices too, but the advantages to vendors of playing nicely with upstream still apply. I do however want to encourage openBMC folk to collaborate on support and if that means letting them play fast and loose with what they do downstream for now that's fine - they will not want to carry such patches long term anyway. The end goal is to enable multiple things: 1) Standardization: That's exactly why I proposed that if Broadcom want to have upstream support for their interfaces they should do two things: a) Take a proposal to the consortium to extend the standard definition to incorporate those features (we want to see them playing the game so everyone benefits!) b) Provide clear description of the vendor defined commands so that the safe ones can be audited and enabled as 'de facto' standards. We do this with lots of other things - if you are playing the game nicely we relax the rules a little to encourage you to keep doing so. This is the path being proposed for the upstream driver. 2) A path sharing that same standard infrastructure for the more destructive commands. That can have all the protections we want to add. 
For the BMC stacks some of that stuff will be vital and it's complex so there is an argument that not all that belongs in the kernel or at least not today. One of the other comments I recall for the raw command support in the type 3 driver was to allow people to test stuff that was safe, was in the spec, but was not yet audited by the driver. I'd put a lot of this in that category. We can absolutely do tunneled commands nicely and filter them for destructive activity etc - it just doesn't belong in the first patch set. Ultimately I would expect approaches to enable the destructive commands as well via standard interfaces, but those will of course need guard rails / opt in etc. I never intended to propose that the raw command path is the main one used - that's there for the same reasons we have one in CXL type3, though here I suspect the path to supporting everything needed via non raw interfaces will be longer. > > Recall that the current restriction for raw commands was to encourage > vendor collaboration and building sane kernel interfaces, and that > distros would enable it in their "debug" kernels to enable hardware > validation test benches. If the assertion is "that's too restrictive, > enable a vendor ecosystem based on kernel bypass" that goes too far. This is a clear description of the concern and it's one I fully share. My interpretation of your earlier responses was clearly wrong - this is about avoiding kernel bypass and encouraging standardization. Your comments on doing user space drivers read to me like you were telling the switch vendors to go do exactly that and that is why I was pushing back strongly! All the other responses were based on me trying to answer other concerns (maintainability for example) rather than this one. WRT to relationship to the raw commands: this is the same proposal. Much like the initial CXL mailbox support provide a raw bypass for the cases not supported and put the same guards around it (or tighter ones). 
That is better in the upstream kernel than pushing the problem downstream and basically removing any reason at all for switch vendors to do things right. Give them a path and they will come. So far what I suspect they are seeing is no path at all - driving fragmentation. (handy switch vendors in the CC list feel free to say how you are interpreting this discussion!) If we'd had upstream support already in place with those restrictions the rules for switch vendors would have been clear and (assuming they have some active ecosystem folk) their design would have taken this into account. The early switch-cci code proposal was exactly the same as the CXL mailbox. It had a bunch of commands (IIRC) including get temperature, get timestamp etc that were safe and handled in exactly the same fashion as the in kernel interfaces. The reason that I ripped that out was to reduce the amount and complexity of the first proposal so the focus would lie on the necessary refactoring of the CXL core to allow for code reuse. That stuff will come back in patch set 2 - we can easily put it back in the first patch set if that helps make the strategy clear - it's just a few lines of code. All the infrastructure is still there to support safe commands vs unsafe ones. Same logic, same approach and actually the same code as for the CXL mailbox. I fully agree with all that approach. > > > For background this is a PCI function with a mailbox used for switch > > configuration. The mailbox is identical to the one found on CXL type3 > > devices. Whole thing defined in the CXL spec. It gets a little complex > > because you can tunnel commands to devices connected to the switch, > > potentially affecting other hosts. Typical Linux device doing this > > would be a BMC, but there have been repeated questions about providing > > a subset of access to any Linux system (avoiding the foot guns) > > Whole thing fully discoverable - proposal is a standard PCI driver. 
> > > > > The generic-Type-3-device mailbox has an in kernel driver because the > > > kernel has need to send mailbox commands internally and it is > > > fundamental to RAS and provisioning flows that the kernel have this > > > coordination. What are the motivations for an in-band Switch CCI command > > > submission path? > > > > > > It could be the case that you have a self-evident example in mind that I > > > have thus far failed to realize. > > > > > > > There are possibilities, but for now it's a transport driver just like > > MCTP etc with a well defined chardev interface, with documented ioctl > > interface etc (which I'd keep inline with one the CXL mailbox uses > > just to avoid reinventing the wheel - I'd prefer to use that directly > > to avoid divergence but I don't care that much). > > > > As far as I can see, with the security / blast radius concern alleviated > > by disabling this if lockdown is in use + taint for unaudited commands > > (and a nasty sounding config similar to the cxl mailbox one) > > there is little reason not to take such a driver into the kernel. > > It has next to no maintenance impact outside of itself and a bit of > > library code which I've proposed pushing down to the level of MMPT > > (so PCI not CXL) if you think that is necessary. > > > > We want interrupt handling and basic access controls / command > > interface to userspace. > > > > Apologies if I'm grumpy - several long days of battling cpu hotplug code. > > Again, can we please get back to the specifics of the commands to be > enabled here? I am open to CXL Fabric Management as a first class > citizen, I am not currently open to CXL Fabric Management gets to live > in the corner of the kernel that is unreviewable because all it does is > take opaque ioctl blobs and marshal them to hardware. > That I'm fully on board with. 
However there are plenty of spec defined commands that are in this safe category and to show how this works we can implement those in parallel to a discussion about those defined outside the CXL specification. Encouraging people to do things via a hacky userspace driver, as I read you as doing earlier in this thread, will have given exactly the opposite impression. To give people an incentive to play the standards game we have to provide an alternative. Userspace libraries will provide some incentive to standardize if we have enough vendors (we don't today - so they will do their own libraries), but it is a lot easier to encourage if we exercise control over the interface. Jonathan ^ permalink raw reply [flat|nested] 23+ messages in thread
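The "safe commands vs unsafe ones" split described in this exchange amounts to an opcode allowlist sitting in front of the mailbox: audited, spec-defined commands go through directly, everything else falls back to the raw + taint path. A minimal self-contained sketch follows; the opcode values and command names are placeholders, not the FM-API opcodes assigned in the CXL specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of an audited-command allowlist in front of a switch mailbox.
 * Opcodes below are invented for illustration; a real driver would use
 * the FM-API opcode assignments from the CXL specification.
 */
struct cci_cmd {
	uint16_t opcode;
	size_t max_in;		/* largest accepted input payload */
};

static const struct cci_cmd safe_cmds[] = {
	{ 0x0001, 0 },		/* e.g. identify: read-only, no payload */
	{ 0x0002, 0 },		/* e.g. get timestamp */
	{ 0x0003, 8 },		/* e.g. get port status: small selector */
};

/* Returns 1 if the command may bypass the raw/taint path, 0 otherwise. */
static int cmd_is_safe(uint16_t opcode, size_t in_len)
{
	for (size_t i = 0; i < sizeof(safe_cmds) / sizeof(safe_cmds[0]); i++)
		if (safe_cmds[i].opcode == opcode)
			return in_len <= safe_cmds[i].max_in;
	return 0;		/* unknown commands fall back to raw + taint */
}
```

This mirrors the structure already used for the Type 3 mailbox: extending support for a newly audited command is just another table entry, which is what makes the "few lines of code" claim above plausible.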
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-26 8:45 ` Jonathan Cameron @ 2024-04-26 16:16 ` Dan Williams 2024-04-26 16:53 ` Jonathan Cameron 0 siblings, 1 reply; 23+ messages in thread From: Dan Williams @ 2024-04-26 16:16 UTC (permalink / raw) To: Jonathan Cameron, Dan Williams Cc: linux-cxl, Sreenivas Bagalkote, Brett Henning, Harold Johnson, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh, gregkh Jonathan Cameron wrote: [..] > To give people an incentive to play the standards game we have to > provide an alternative. Userspace libraries will provide some incentive > to standardize if we have enough vendors (we don't today - so they will > do their own libraries), but it is a lot easier to encourage if we > exercise control over the interface. Yes, and I expect you and I are not far off on what can be done here. However, let's cut to a sentiment hanging over this discussion. Referring to vendor specific commands: "CXL spec has them for a reason and they need to be supported." ...that is an aggressive "vendor specific first" sentiment that generates an aggressive "userspace drivers" reaction, because the best way to get around community discussions about what ABI makes sense is userspace drivers. Now, if we can step back to where this discussion started, where typical Linux collaboration shines, and where I think you and I are more aligned than this thread would indicate, is "vendor specific last". Let's carefully consider the vendor specific commands that are candidates to be de facto cross vendor semantics if not de jure standards. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-26 16:16 ` Dan Williams @ 2024-04-26 16:53 ` Jonathan Cameron 2024-04-26 19:25 ` Harold Johnson 0 siblings, 1 reply; 23+ messages in thread From: Jonathan Cameron @ 2024-04-26 16:53 UTC (permalink / raw) To: Dan Williams Cc: linux-cxl, Sreenivas Bagalkote, Brett Henning, Harold Johnson, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh, gregkh On Fri, 26 Apr 2024 09:16:44 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > Jonathan Cameron wrote: > [..] > > To give people an incentive to play the standards game we have to > > provide an alternative. Userspace libraries will provide some incentive > > to standardize if we have enough vendors (we don't today - so they will > > do their own libraries), but it is a lot easier to encourage if we > > exercise control over the interface. > > Yes, and I expect you and I are not far off on what can be done > here. > > However, lets cut to a sentiment hanging over this discussion. Referring > to vendor specific commands: > > "CXL spec has them for a reason and they need to be supported." > > ...that is an aggressive "vendor specific first" sentiment that > generates an aggressive "userspace drivers" reaction, because the best > way to get around community discussions about what ABI makes sense is > userspace drivers. > > Now, if we can step back to where this discussion started, where typical > Linux collaboration shines, and where I think you and I are more aligned > than this thread would indicate, is "vendor specific last". Lets > carefully consider the vendor specific commands that are candidates to > be de facto cross vendor semantics if not de jure standards. > Agreed. 
I'd go a little further and say I generally have much more warm and fuzzy feelings when what is a vendor defined command (today) maps to more or less the same bit of code for a proposed standards ECN. IP rules prevent us commenting on specific proposals, but there will be things we review quicker and with a lighter touch vs others where we ask lots of annoying questions about generality of the feature etc. Given the effort we are putting in on the kernel side we all want CXL to succeed and will do our best to encourage activities that make that more likely. There are other standards bodies available... which may make more sense for some features. Command interfaces are not a good place to compete and maintain secrecy. If vendors want to do that, then they don't get the pony of upstream support. They get to convince distros to do a custom kernel build for them: Good luck with that, some of those folk are 'blunt' in their responses to such requests. My proposal is we go forward with a bunch of the CXL spec defined commands to show the 'how' and consider specific proposals for upstream support of vendor defined commands on a case by case basis (so pretty much what you say above). Maybe after a few are done we can formalize some rules of thumb to help vendors make such proposals, though maybe some will figure out it is a better and longer term solution to do 'standards first development'. I think we do need to look at the safety filtering of tunneled commands but don't see that as a particularly tricky addition - for the simple non destructive commands at least. Jonathan ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-26 16:53 ` Jonathan Cameron @ 2024-04-26 19:25 ` Harold Johnson 2024-04-27 11:12 ` Greg KH 2024-04-29 12:18 ` Jonathan Cameron 0 siblings, 2 replies; 23+ messages in thread From: Harold Johnson @ 2024-04-26 19:25 UTC (permalink / raw) To: Jonathan Cameron, Dan Williams Cc: linux-cxl, Sreenivas Bagalkote, Brett Henning, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh, gregkh Perhaps a bit more color on a few specifics might be helpful. I think that there will always be a class of vendor specific APIs/Opcodes that are related to an implementation of a standard instead of the standard itself. I've been party to discussions on not creating CXL defined API/Opcodes that get into the realm of specifying an implementation. There is also a class of data that can be collected from a specific implementation that is helpful for debug, for health monitoring, and perhaps performance monitoring, where the implementation matters and therefore is not easily abstracted to a standard. A few examples: a) Temperature monitoring of a component or internal chip die temperatures. Could CXL define a standard OpCode to gather temperatures? Yes, it could; but is this really part of CXL? Then how many temperature elements and what does each element mean? This enters into the implementation and therefore is vendor specific. Unless the CXL spec starts to define the implementation, something along the lines of "thou shall have an average die temperature, rather than specific temperatures across a die", etc. b) Error counters, metrics, internal counters, etc. Could CXL define a set of common error counters? Absolutely. PCIe has done some of this.
However, a specific implementation may have counters and error reporting that are meaningful only to a specific design and a specific implementation rather than a "least common denominator" approach of a standards body. c) Performance counters, metrics, indicators, etc. Performance can be very implementation specific and tweaking performance is likely to be implementation specific. Yes, generic, least common denominator elements could be created, but they are likely to be limiting in realizing the maximum performance of an implementation. d) Logs, errors and debug information. In addition to spec defined logging of CXL topology errors, specific designs will have logs, crash dumps, and debug data that is very specific to an implementation. There are likely to be cases where a product that conforms to a specification like CXL may have features that don't directly have anything to do with CXL, but where a standards based management interface can be used to configure, manage, and collect data for a non-CXL feature. e) Innovation. I believe that innovation should be encouraged. There may be designs that support CXL, but that also incorporate unique and innovative features or functions that might service a niche market. The AI space is ripe for innovation and perhaps specialized features that may not make sense for the overall CXL specification. I think that in most cases vendor specific opcodes are not used to circumvent the standards, but are used when the standards group has no interest in driving into the standard certain features that are clearly either implementation specific or are vendor specific additions that have a specific appeal to a select class of customer, but yet are not relevant to a specific standard. At the end of the day, customers want products that solve a specific problem. Sometimes vendors can address market segments or niches that a standards group has no interest in supporting.
It can also take months, and in some cases years to reach an agreement on what standardized feature should look like. I also believe that there can be competitive reasons why there might be a group that wants to slow down a vendor's implementation for fear of losing market share. Thanks Harold Johnson -----Original Message----- From: Jonathan Cameron [mailto:Jonathan.Cameron@Huawei.com] Sent: Friday, April 26, 2024 11:54 AM To: Dan Williams Cc: linux-cxl@vger.kernel.org; Sreenivas Bagalkote; Brett Henning; Harold Johnson; Sumanesh Samanta; linux-kernel@vger.kernel.org; Davidlohr Bueso; Dave Jiang; Alison Schofield; Vishal Verma; Ira Weiny; linuxarm@huawei.com; linux-api@vger.kernel.org; Lorenzo Pieralisi; Natu, Mahesh; gregkh@linuxfoundation.org Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management On Fri, 26 Apr 2024 09:16:44 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > Jonathan Cameron wrote: > [..] > > To give people an incentive to play the standards game we have to > > provide an alternative. Userspace libraries will provide some incentive > > to standardize if we have enough vendors (we don't today - so they will > > do their own libraries), but it is a lot easier to encourage if we > > exercise control over the interface. > > Yes, and I expect you and I are not far off on what can be done > here. > > However, lets cut to a sentiment hanging over this discussion. Referring > to vendor specific commands: > > "CXL spec has them for a reason and they need to be supported." > > ...that is an aggressive "vendor specific first" sentiment that > generates an aggressive "userspace drivers" reaction, because the best > way to get around community discussions about what ABI makes sense is > userspace drivers. > > Now, if we can step back to where this discussion started, where typical > Linux collaboration shines, and where I think you and I are more aligned > than this thread would indicate, is "vendor specific last". 
Lets > carefully consider the vendor specific commands that are candidates to > be de facto cross vendor semantics if not de jure standards. > Agreed. I'd go a little further and say I generally have much more warm and fuzzy feelings when what is a vendor defined command (today) maps to more or less the same bit of code for a proposed standards ECN. IP rules prevent us commenting on specific proposals, but there will be things we review quicker and with a lighter touch vs others where we ask lots of annoying questions about generality of the feature etc. Given the effort we are putting in on the kernel side we all want CXL to succeed and will do our best to encourage activities that make that more likely. There are other standards bodies available... which may make more sense for some features. Command interfaces are not a good place to compete and maintain secrecy. If vendors want to do that, then they don't get the pony of upstream support. They get to convince distros to do a custom kernel build for them: Good luck with that, some of those folk are 'blunt' in their responses to such requests. My proposal is we go forward with a bunch of the CXL spec defined commands to show the 'how' and consider specific proposals for upstream support of vendor defined commands on a case by case basis (so pretty much what you say above). Maybe after a few are done we can formalize some rules of thumb help vendors makes such proposals, though maybe some will figure out it is a better and longer term solution to do 'standards first development'. I think we do need to look at the safety filtering of tunneled commands but don't see that as a particularly tricky addition - for the simple non destructive commands at least. 
Jonathan ^ permalink raw reply [flat|nested] 23+ messages in thread
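[Editorial sketch] Harold's temperature example (a) does not have to end in an opaque vendor blob: if each reading carries a small descriptor, generic tooling that knows nothing about the implementation can still enumerate and label the sensors. The record shape below (location code plus value) is entirely hypothetical; it illustrates a self-describing payload, not anything defined by CXL.

```python
# Hypothetical self-describing sensor payload: each reading is tagged
# with a location code so generic software can label it, while vendor
# specific codes fall back to an explicit "vendor-specific" label
# rather than being silently dropped. Codes below are invented.

LOCATION_NAMES = {0: "die", 1: "memory controller", 2: "FRU"}

def enumerate_sensors(records):
    """records: iterable of (location_code, temp_c) tuples."""
    out = []
    for i, (loc, temp_c) in enumerate(records):
        label = LOCATION_NAMES.get(loc, f"vendor-specific-{loc}")
        out.append({"index": i, "location": label, "temp_c": temp_c})
    return out
```

This mirrors the "broad scoping" compromise discussed later in the thread: the spec names only coarse sensor classes, the vendor keeps its fine-grained meanings, and generic monitoring software still works out of the box.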
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-26 19:25 ` Harold Johnson @ 2024-04-27 11:12 ` Greg KH 2024-04-27 16:22 ` Dan Williams 2024-04-29 12:18 ` Jonathan Cameron 1 sibling, 1 reply; 23+ messages in thread From: Greg KH @ 2024-04-27 11:12 UTC (permalink / raw) To: Harold Johnson Cc: Jonathan Cameron, Dan Williams, linux-cxl, Sreenivas Bagalkote, Brett Henning, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh On Fri, Apr 26, 2024 at 02:25:29PM -0500, Harold Johnson wrote: > A few examples: > a) Temperature monitoring of a component or internal chip die > temperatures. Could CXL define a standard OpCode to gather temperatures, > yes it could; but is this really part of CXL? Then how many temperature > elements and what does each element mean? This enters into the > implementation and therefore is vendor specific. Unless the CXL spec > starts to define the implementation, something along the lines of "thou > shall have an average die temperature, rather than specific temperatures > across a die", etc. > > b) Error counters, metrics, internal counters, etc. Could CXL define a > set of common error counters, absolutely. PCIe has done some of this. > However, a specific implementation may have counters and error reporting > that are meaningful only to a specific design and a specific > implementation rather than a "least common denominator" approach of a > standard body. > > c) Performance counters, metric, indicators, etc. Performance can be very > implementation specific and tweaking performance is likely to be > implementation specific. Yes, generic and a least common denominator > elements could be created, but are likely to limiting in realizing the > maximum performance of an implementation. > > d) Logs, errors and debug information. 
In addition to spec defined > logging of CXL topology errors, specific designs will have logs, crash > dumps, debug data that is very specific to a implementation. There are > likely to be cases where a product that conforms to a specification like > CXL, may have features that don't directly have anything to do with CXL, > but where a standards based management interface can be used to configure, > manage, and collect data for a non-CXL feature. All of the above should be able to be handled by vendor-specific KERNEL drivers that feed the needed information to the proper user/kernel apis that the kernel already provides. So while innovating at the hardware level is fine, follow the ways that everyone has done this for other specification types (USB, PCI, etc.) and just allow vendor drivers to provide the information. Don't do this in crazy userspace drivers which will circumvent the whole reason we have standard kernel/user apis in the first place for these types of things. > e) Innovation. I believe that innovation should be encouraged. There may > be designs that support CXL, but that also incorporate unique and > innovative features or functions that might service a niche market. The > AI space is ripe for innovation and perhaps specialized features that may > not make sense for the overall CXL specification. > > I think that in most cases Vendor specific opcodes are not used to > circumvent the standards, but are used when the standards group has no > interested in driving into the standard certain features that are clearly > either implementation specific or are vendor specific additions that have > a specific appeal to a select class of customer, but yet are not relevant > to a specific standard. Then fight this out in the specification groups, which are highly political, and do not push that into the kernel space please. 
Again, this is nothing new, we have all done this for specs for decades now, allow vendor additions to the spec and handle that in the kernel and all should be ok, right? Or am I missing something obvious here where we would NOT want to do what all other specs have done? thanks, greg k-h ^ permalink raw reply [flat|nested] 23+ messages in thread
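[Editorial sketch] Greg's point — a vendor-specific kernel driver feeding standard user/kernel APIs — can be made concrete with temperatures. The hwmon sysfs convention really does expose temperatures as `tempN_input` attributes in millidegrees Celsius; the vendor payload format here (one byte per sensor, 0.5 °C per LSB) is invented for illustration.

```python
# Sketch of the translation a vendor-specific kernel driver would do:
# read a vendor-defined temperature payload and expose it through the
# standard hwmon convention (tempN_input, millidegrees Celsius).
# The one-byte-per-sensor, 0.5 degC/LSB payload format is hypothetical.

def vendor_temps_to_hwmon(payload: bytes) -> dict:
    """Map a hypothetical vendor temperature payload to hwmon-style
    attribute names and millidegree values."""
    attrs = {}
    for i, raw in enumerate(payload, start=1):
        millideg = raw * 500            # hypothetical 0.5 degC per LSB
        attrs[f"temp{i}_input"] = millideg
    return attrs
```

The vendor keeps its own wire format; userspace sees only the standard interface, so `lm-sensors`-class tooling works unmodified — which is exactly the "innovate in hardware, converge at the API" pattern Greg describes.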
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-27 11:12 ` Greg KH @ 2024-04-27 16:22 ` Dan Williams 2024-04-28 4:25 ` Sirius 0 siblings, 1 reply; 23+ messages in thread From: Dan Williams @ 2024-04-27 16:22 UTC (permalink / raw) To: Greg KH, Harold Johnson Cc: Jonathan Cameron, Dan Williams, linux-cxl, Sreenivas Bagalkote, Brett Henning, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh Greg KH wrote: [..] > So while innovating at the hardware level is fine, follow the ways that > everyone has done this for other specification types (USB, PCI, etc.) > and just allow vendor drivers to provide the information. Don't do this > in crazy userspace drivers which will circumvent the whole reason we > have standard kernel/user apis in the first place for these types of > things. Right, standard kernel/user apis is the requirement. The suggestion of opaque vendor passthrough tunnels, and every vendor ships their custom tool to do what should be common flows, is where this discussion went off the rails. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-27 16:22 ` Dan Williams @ 2024-04-28 4:25 ` Sirius 0 siblings, 0 replies; 23+ messages in thread From: Sirius @ 2024-04-28 4:25 UTC (permalink / raw) To: Dan Williams Cc: Greg KH, Harold Johnson, Jonathan Cameron, linux-cxl, Sreenivas Bagalkote, Brett Henning, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh In days of yore (Sat, 27 Apr 2024), Dan Williams thus quoth: > Greg KH wrote: > [..] > > So while innovating at the hardware level is fine, follow the ways that > > everyone has done this for other specification types (USB, PCI, etc.) > > and just allow vendor drivers to provide the information. Don't do this > > in crazy userspace drivers which will circumvent the whole reason we > > have standard kernel/user apis in the first place for these types of > > things. > > Right, standard kernel/user apis is the requirement. > > The suggestion of opaque vendor passthrough tunnels, and every vendor > ships their custom tool to do what should be common flows, is where this > discussion went off the rails. One aspect of this is Fabric Management (thinking CXL3 here). It is not desirable that every vendor of CXL hardware require their own (proprietary) fabric management software. From a user perspective, that is absolutely horrible. Users can, and will, mix and match CXL hardware according to their needs (or wallet), and them having to run multiple fabric management solutions (which in worst case are conflicting with each other) to manage things is .. suboptimal. By all means - innovate - but do it in such a way that interoperability and manageability is the priority. Special sauce vendor lock-in is a surefire way to kill CXL where it stands - don't do it. -- Kind regards, /S ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RFC: Restricting userspace interfaces for CXL fabric management 2024-04-26 19:25 ` Harold Johnson 2024-04-27 11:12 ` Greg KH @ 2024-04-29 12:18 ` Jonathan Cameron 1 sibling, 0 replies; 23+ messages in thread From: Jonathan Cameron @ 2024-04-29 12:18 UTC (permalink / raw) To: Harold Johnson Cc: Dan Williams, linux-cxl, Sreenivas Bagalkote, Brett Henning, Sumanesh Samanta, linux-kernel, Davidlohr Bueso, Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny, linuxarm, linux-api, Lorenzo Pieralisi, Natu, Mahesh, gregkh On Fri, 26 Apr 2024 14:25:29 -0500 Harold Johnson <harold.johnson@broadcom.com> wrote: > Perhaps a bit more color on a few specifics might be helpful. > > I think that there will always be a class of vendor specific APIs/Opcodes > that are related to an implementation of a standard instead of the > standard itself. I've been party to discussion on not creating CXL > defined API/Opcodes that get into the realm of specifying an > implementation. There are also a class of data that can be collected from > a specific implementation that is helpful for debug, for health > monitoring, and perhaps performance monitoring where the implementation > matters and therefore are not easily abstracted to a standard. Hi Harold, Let's divert into a few specifics to give some routes to implementing these. Some of them are extensions of things already well handled. Tweaks and extensions in the 'spirit' of the existing spec are both places where adding some richness to the spec is probably not too difficult and where there may be some flexibility. In some cases the definitions in the specification almost certainly came after your design. > > A few examples: > a) Temperature monitoring of a component or internal chip die > temperatures. Could CXL define a standard OpCode to gather temperatures, > yes it could; but is this really part of CXL? Then how many temperature > elements and what does each element mean? 
This enters into the > implementation and therefore is vendor specific. Unless the CXL spec > starts to define the implementation, something along the lines of "thou > shall have an average die temperature, rather than specific temperatures > across a die", etc. There is general temperature monitoring and trip thresholds etc. in the spec via Get Health Info and event logs. That covers a single value, so not as extensive as what you refer to, but it's a start. Whilst you are right that the specification should not mandate N temperatures in specific locations, it would be a reasonable request to allow for a wider set of monitors with some level of description. The intent being that generic software can discover what is there and present that info for logging / monitoring. Whilst the exact meaning will vary by device, some broad scoping is easy enough (memory controller vs various FRUs perhaps) and that is useful for providing generic user space software. hwmon has long handled this sort of data from whole systems, where very similar aspects of 'what does this monitor actually mean' apply. Sometimes that does need device specific mapping files - so there is precedent that may be helpful here. > > b) Error counters, metrics, internal counters, etc. Could CXL define a > set of common error counters, absolutely. PCIe has done some of this. > However, a specific implementation may have counters and error reporting > that are meaningful only to a specific design and a specific > implementation rather than a "least common denominator" approach of a > standard body. > > c) Performance counters, metric, indicators, etc. Performance can be very > implementation specific and tweaking performance is likely to be > implementation specific. Yes, generic and a least common denominator > elements could be created, but are likely to limiting in realizing the > maximum performance of an implementation. These two are somewhat related. This selection falls into two broad categories.
- Opaque logging and reporting needed just for debug. There are patches on list for Vendor Debug Log. They haven't merged yet though, I think, so good to get your input on that series. Intent of that feature is opaque logs. There will never be a useful general tool for that stuff; conversely, no general userspace tools will use them as a result. It's likely vendor engineer only territory. - Counters that are useful at runtime. The CXL PMU spec is rich and flexible. In common with CPU PMUs, the perf driver allows for direct specification of events to count. The same issue with implementation specific counters occurs in CPUs. Whilst you can keep the meaning of events opaque, if you do you lose out on usable general purpose software (e.g. perf tool). Alternatively some of this can be pushed into perf tool description files. The advantage of pushing counters into the main specification is that they work out of the box as the CPMU driver will report those directly (no need for perf tool to work with raw event codes). At the moment that driver implements part of the CPMU spec. If there are features you need (free running counters come to mind) then shout / patches welcome. We decided to play wait and see with the CPMU driver as it wasn't clear if the full flexibility was needed for real devices. > > d) Logs, errors and debug information. In addition to spec defined > logging of CXL topology errors, specific designs will have logs, crash > dumps, debug data that is very specific to a implementation. There are > likely to be cases where a product that conforms to a specification like > CXL, may have features that don't directly have anything to do with CXL, > but where a standards based management interface can be used to configure, > manage, and collect data for a non-CXL feature. There are standard interfaces defined for some of this stuff as well.
Last I checked the discussion revolved around component state dump only being safe to use if the background command abort was supported, as that was needed to ensure the kernel was not locked out for an indefinite amount of time. If you are looking at other standards being used to configure CXL devices, excellent, but those standards should be accessed using whatever the appropriate kernel interfaces are. Could you give some examples for this one? > > e) Innovation. I believe that innovation should be encouraged. There may > be designs that support CXL, but that also incorporate unique and > innovative features or functions that might service a niche market. The > AI space is ripe for innovation and perhaps specialized features that may > not make sense for the overall CXL specification. Agreed - but those may need specific drivers. > > I think that in most cases Vendor specific opcodes are not used to > circumvent the standards, but are used when the standards group has no > interested in driving into the standard certain features that are clearly > either implementation specific or are vendor specific additions that have > a specific appeal to a select class of customer, but yet are not relevant > to a specific standard. > > At the end of the day, customer want products that solve a specific > problem. Sometimes vendor can address market segments or niches that a > standard group has no interest in supporting. It can also take months, > and in some cases years to reach an agreement on what standardized feature > should look like. I also believe that there can be competitive reasons > why there might be a group that wants to slow down a vendor's > implementation for fear of losing market share. Whilst I appreciate the way this can slow down adoption / kernel support it's a path that is still worth it in the end. As Dan and others have put it, there are other routes than the main CXL standard.
Those should allow tighter collaboration on smaller topics to define common standards. Thanks, Jonathan > > Thanks > Harold Johnson > > > -----Original Message----- > From: Jonathan Cameron [mailto:Jonathan.Cameron@Huawei.com] > Sent: Friday, April 26, 2024 11:54 AM > To: Dan Williams > Cc: linux-cxl@vger.kernel.org; Sreenivas Bagalkote; Brett Henning; Harold > Johnson; Sumanesh Samanta; linux-kernel@vger.kernel.org; Davidlohr Bueso; > Dave Jiang; Alison Schofield; Vishal Verma; Ira Weiny; > linuxarm@huawei.com; linux-api@vger.kernel.org; Lorenzo Pieralisi; Natu, > Mahesh; gregkh@linuxfoundation.org > Subject: Re: RFC: Restricting userspace interfaces for CXL fabric > management > > On Fri, 26 Apr 2024 09:16:44 -0700 > Dan Williams <dan.j.williams@intel.com> wrote: > > > Jonathan Cameron wrote: > > [..] > > > To give people an incentive to play the standards game we have to > > > provide an alternative. Userspace libraries will provide some > incentive > > > to standardize if we have enough vendors (we don't today - so they > will > > > do their own libraries), but it is a lot easier to encourage if we > > > exercise control over the interface. > > > > Yes, and I expect you and I are not far off on what can be done > > here. > > > > However, lets cut to a sentiment hanging over this discussion. Referring > > to vendor specific commands: > > > > "CXL spec has them for a reason and they need to be supported." > > > > ...that is an aggressive "vendor specific first" sentiment that > > generates an aggressive "userspace drivers" reaction, because the best > > way to get around community discussions about what ABI makes sense is > > userspace drivers. > > > > Now, if we can step back to where this discussion started, where typical > > Linux collaboration shines, and where I think you and I are more aligned > > than this thread would indicate, is "vendor specific last". 
Lets > > carefully consider the vendor specific commands that are candidates to > > be de facto cross vendor semantics if not de jure standards. > > > > Agreed. I'd go a little further and say I generally have much more warm > and > fuzzy feelings when what is a vendor defined command (today) maps to more > or less the same bit of code for a proposed standards ECN. > > IP rules prevent us commenting on specific proposals, but there will be > things we review quicker and with a lighter touch vs others where we > ask lots of annoying questions about generality of the feature etc. > Given the effort we are putting in on the kernel side we all want CXL > to succeed and will do our best to encourage activities that make that > more likely. There are other standards bodies available... which may > make more sense for some features. > > Command interfaces are not a good place to compete and maintain secrecy. > If vendors want to do that, then they don't get the pony of upstream > support. They get to convince distros to do a custom kernel build for > them: > Good luck with that, some of those folk are 'blunt' in their responses to > such requests. > > My proposal is we go forward with a bunch of the CXL spec defined commands > to show the 'how' and consider specific proposals for upstream support > of vendor defined commands on a case by case basis (so pretty much > what you say above). Maybe after a few are done we can formalize some > rules of thumb help vendors makes such proposals, though maybe some > will figure out it is a better and longer term solution to do 'standards > first development'. > > I think we do need to look at the safety filtering of tunneled > commands but don't see that as a particularly tricky addition - > for the simple non destructive commands at least. > > Jonathan > ^ permalink raw reply [flat|nested] 23+ messages in thread
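[Editorial sketch] Jonathan's reply points at Get Health Info as the spec's existing hook for temperature monitoring. A minimal parse of its output payload might look like the following. The field offsets follow my reading of the spec (Health Status, Media Status, Additional Status and Life Used in bytes 0-3, then a signed 16-bit little-endian Device Temperature in degrees Celsius at bytes 4-5); verify the layout against the CXL specification before depending on it.

```python
# Sketch of pulling the single spec-defined temperature value out of a
# CXL memory-device Get Health Info response. Offsets are my reading of
# the output payload layout and should be checked against the spec.

import struct

def parse_health_info(payload: bytes) -> dict:
    if len(payload) < 6:
        raise ValueError("payload too short for Get Health Info")
    health, media, additional, life_used = payload[0:4]
    (temp_c,) = struct.unpack_from("<h", payload, 4)  # signed 16-bit LE
    return {
        "health_status": health,
        "media_status": media,
        "additional_status": additional,
        "life_used_pct": life_used,
        "device_temp_c": temp_c,
    }
```

This is the "single value" Jonathan refers to: enough for a generic health readout, but a long way from Harold's per-die sensor maps, which is why the thread discusses extending the spec with described monitors rather than leaving them vendor-opaque.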
end of thread, other threads:[~2024-04-29 12:18 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-21 17:44 RFC: Restricting userspace interfaces for CXL fabric management Jonathan Cameron
2024-03-21 21:41 ` Sreenivas Bagalkote
2024-03-22  9:32 ` Jonathan Cameron
2024-03-22 13:24 ` Sreenivas Bagalkote
2024-04-01 16:51 ` Sreenivas Bagalkote
2024-04-06  0:04 ` Dan Williams
2024-04-10 11:45 ` Jonathan Cameron
2024-04-15 20:09 ` Sreenivas Bagalkote
2024-04-23 22:44 ` Sreenivas Bagalkote
2024-04-23 23:24 ` Greg KH
2024-04-24  0:07 ` Dan Williams
2024-04-25 11:33 ` Jonathan Cameron
2024-04-25 16:18 ` Dan Williams
2024-04-25 17:26 ` Jonathan Cameron
2024-04-25 19:25 ` Dan Williams
2024-04-26  8:45 ` Jonathan Cameron
2024-04-26 16:16 ` Dan Williams
2024-04-26 16:53 ` Jonathan Cameron
2024-04-26 19:25 ` Harold Johnson
2024-04-27 11:12 ` Greg KH
2024-04-27 16:22 ` Dan Williams
2024-04-28  4:25 ` Sirius
2024-04-29 12:18 ` Jonathan Cameron