* CXL Vendor Specific Capabilities @ 2021-05-19 16:45 Ariel.Sibley 2021-05-20 9:53 ` Christoph Hellwig 0 siblings, 1 reply; 4+ messages in thread From: Ariel.Sibley @ 2021-05-19 16:45 UTC (permalink / raw) To: linux-cxl Cc: Ahmad.Danesh, cpetersen, mahesh.natu, dan.j.williams, Vincent.Hache, Ariel.Sibley, Sanjay.Goyal, scolee, jhinkle, guru.anbalagane, kevin.marks, bhirano, chet.r.douglas, dcaracci, jgroves, andy.rudoff Re-opening discussion on vendor-specific capabilities on linux-cxl mailing list. Background: =========== The default/recommended value of the CONFIG_CXL_MEM_RAW_COMMANDS Kconfig knob will block passthrough of vendor defined mailbox commands. If this Kconfig knob is enabled, passthrough of vendor defined commands will still taint the kernel. There is a consensus among device and system manufacturers that vendor-specific capabilities are a necessary component of specifications and we need a solution to support these capabilities. Within reason, these capabilities need to be enabled without tainting the kernel. Key points to consider regarding handling of vendor-specific capabilities: ========================================================================== * Tainting the kernel during harmless operations is not desirable. While some users have full control over the kernel and could patch this behavior, those users would prefer to use upstreamed kernel vs. carrying patches. * There is agreement that some capabilities are candidates for standardization, but a path for enabling that class of functionality until it makes its way into the spec is needed. It takes a long time to get anything through the full process of acceptance into the spec and upstreamed. This will slow adoption and stifle innovation. * The reality is that we can’t put everything into the spec. The primary cases are vendor-specific implementations which are not suited for standardization, and proprietary/differentiating features. * Limiting the usefulness of the path for vendor capabilities through the driver will likely result in undesirable side-effects as suppliers and others won’t accept this, forcing vendors to develop workarounds that bypass the driver. If the driver provides a usable pathway that does not taint the kernel, vendors are less likely to implement such approaches. The driver will then be able to perform some policing, for instance based on CEL contents, and CXL device providers can be considered good citizens of the Linux community. Recommendation: =============== Allow root to issue vendor defined passthrough commands by default and use the CEL to determine whether a given command should taint the kernel. Commands that have no effects, or harmless effects, should not taint the kernel. Commands that have potentially harmful effects should taint the kernel. Examples: ========= Case 1: Vendor defined is needed, and the operations do not impact the normal operation of the device. The CEL contents could be used to determine there is no impact and thereby permit these commands without tainting the kernel. * Example 1.1: Interfacing with common diagnostic features which may have wide-ranging implementations and capabilities across different vendors. E.g. Signal integrity diagnostics, debug counters. * Example 1.2: Retrieve diagnostics data from proprietary value-add features. Case 2: Vendor defined is needed, but the operations impact the normal operation of the device. CEL contents can be used to determine the impact and taint the kernel. * Example 2.1: Vendor-specific diagnostics features that are intended to stress the device or memory for qualification and validation purposes. Q & A: ====== Q: Why is it a problem to have CONFIG_CXL_MEM_RAW_COMMANDS disabled by default? A: This pushes the problem to the distros. Distros are likely to follow the Kconfig recommendation by default, so industry will need to have this discussion with each major distro maintainer. If we can reach a consensus in this forum that we can default this knob to enabled, distros are more likely to also enable the functionality. Q: Why is it a problem to taint the kernel for all passthrough commands when CONFIG_CXL_MEM_RAW_COMMANDS is enabled? A: Some functionality is harmless. If the kernel is tainted when executing basic read-only diagnostics, it may be harder to get support for real issues as the state of the system will be suspect. The driver should be able to use the CEL to determine when to taint. Q: Why can't we use Vendor Log for this? A: Several reasons: * There is no ability to control functionality. Many diagnostic features require input to configure what is being read. * The device may require time to collect the diagnostic information, but must respond to Get Log in under 2 seconds (preferably much faster). * It is a single blob, which can be accessed at arbitrary offsets. The limited set of use cases that could be covered by this model would require driver changes to enable the Vendor Log to be dumped with finer granularity. But even then, the fact that the host could potentially read anywhere in the Vendor Log at any time makes robustly encoding a set of debug forensics that are populated on demand (in response to Get Log) from underlying hardware into vendor defined offsets in the Vendor Log difficult. Q: What about security? A: While vendor commands have the potential to be a security risk, it should not be the driver’s responsibility to solve that problem. The in-band path through the driver is just one path into the device. For a device to be secure, vendors and their partner customers need to lock out anything that is a security risk at the device level for production. Given that this device level lockout is necessary, the catch all lockout at the driver level is redundant from a device security point of view. Q: How about we add commands to UAPI instead? A: Vendors should not be adding hooks into a class driver which is supposed to be generic. The details of anything that goes in UAPI would need to be fully public. Any proposed additions would need to be fully explained as to why they are not able to be added to the spec. This limits the viability of UAPI as a method for unlocking value-add functionality. Conclusion: =========== I look forward to a healthy discussion on how we can move this forward, and hope that we can reach an answer that is acceptable for the kernel maintainers, CXL device vendors, and end users. Regards, Ariel ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: CXL Vendor Specific Capabilities 2021-05-19 16:45 CXL Vendor Specific Capabilities Ariel.Sibley @ 2021-05-20 9:53 ` Christoph Hellwig 2021-05-20 11:55 ` Ariel.Sibley 0 siblings, 1 reply; 4+ messages in thread From: Christoph Hellwig @ 2021-05-20 9:53 UTC (permalink / raw) To: Ariel.Sibley Cc: linux-cxl, Ahmad.Danesh, cpetersen, mahesh.natu, dan.j.williams, Vincent.Hache, Sanjay.Goyal, scolee, jhinkle, guru.anbalagane, kevin.marks, bhirano, chet.r.douglas, dcaracci, jgroves, andy.rudoff > > The default/recommended value of the CONFIG_CXL_MEM_RAW_COMMANDS Kconfig knob will block passthrough of vendor defined mailbox commands. If this Kconfig knob is enabled, passthrough of vendor defined commands will still taint the kernel. Please fix your mailer, this badly wrapped wall of text is completely unreadable. ^ permalink raw reply [flat|nested] 4+ messages in thread
* CXL Vendor Specific Capabilities 2021-05-20 9:53 ` Christoph Hellwig @ 2021-05-20 11:55 ` Ariel.Sibley 2021-06-12 2:25 ` Dan Williams 0 siblings, 1 reply; 4+ messages in thread From: Ariel.Sibley @ 2021-05-20 11:55 UTC (permalink / raw) To: linux-cxl Cc: hch, Ahmad.Danesh, cpetersen, mahesh.natu, dan.j.williams, Vincent.Hache, Sanjay.Goyal, scolee, jhinkle, guru.anbalagane, kevin.marks, bhirano, chet.r.douglas, dcaracci, jgroves, andy.rudoff, Vineeth.Pillai Re-posting with 80 char line wrap. Background: =========== The default/recommended value of the CONFIG_CXL_MEM_RAW_COMMANDS Kconfig knob will block passthrough of vendor defined mailbox commands. If this Kconfig knob is enabled, passthrough of vendor defined commands will still taint the kernel. There is a consensus among device and system manufacturers that vendor-specific capabilities are a necessary component of specifications and we need a solution to support these capabilities. Within reason, these capabilities need to be enabled without tainting the kernel. Key points to consider regarding handling of vendor-specific capabilities: ========================================================================== * Tainting the kernel during harmless operations is not desirable. While some users have full control over the kernel and could patch this behavior, those users would prefer to use upstreamed kernel vs. carrying patches. * There is agreement that some capabilities are candidates for standardization, but a path for enabling that class of functionality until it makes its way into the spec is needed. It takes a long time to get anything through the full process of acceptance into the spec and upstreamed. This will slow adoption and stifle innovation. * The reality is that we can’t put everything into the spec. The primary cases are vendor-specific implementations which are not suited for standardization, and proprietary/differentiating features. * Limiting the usefulness of the path for vendor capabilities through the driver will likely result in undesirable side-effects as suppliers and others won’t accept this, forcing vendors to develop workarounds that bypass the driver. If the driver provides a usable pathway that does not taint the kernel, vendors are less likely to implement such approaches. The driver will then be able to perform some policing, for instance based on CEL contents, and CXL device providers can be considered good citizens of the Linux community. Recommendation: =============== Allow root to issue vendor defined passthrough commands by default and use the CEL to determine whether a given command should taint the kernel. Commands that have no effects, or harmless effects, should not taint the kernel. Commands that have potentially harmful effects should taint the kernel. Examples: ========= Case 1: Vendor defined is needed, and the operations do not impact the normal operation of the device. The CEL contents could be used to determine there is no impact and thereby permit these commands without tainting the kernel. * Example 1.1: Interfacing with common diagnostic features which may have wide-ranging implementations and capabilities across different vendors. E.g. Signal integrity diagnostics, debug counters. * Example 1.2: Retrieve diagnostics data from proprietary value-add features. Case 2: Vendor defined is needed, but the operations impact the normal operation of the device. CEL contents can be used to determine the impact and taint the kernel. * Example 2.1: Vendor-specific diagnostics features that are intended to stress the device or memory for qualification and validation purposes. Q & A: ====== Q: Why is it a problem to have CONFIG_CXL_MEM_RAW_COMMANDS disabled by default? A: This pushes the problem to the distros. Distros are likely to follow the Kconfig recommendation by default, so industry will need to have this discussion with each major distro maintainer. If we can reach a consensus in this forum that we can default this knob to enabled, distros are more likely to also enable the functionality. Q: Why is it a problem to taint the kernel for all passthrough commands when CONFIG_CXL_MEM_RAW_COMMANDS is enabled? A: Some functionality is harmless. If the kernel is tainted when executing basic read-only diagnostics, it may be harder to get support for real issues as the state of the system will be suspect. The driver should be able to use the CEL to determine when to taint. Q: Why can't we use Vendor Log for this? A: Several reasons: * There is no ability to control functionality. Many diagnostic features require input to configure what is being read. * The device may require time to collect the diagnostic information, but must respond to Get Log in under 2 seconds (preferably much faster). * It is a single blob, which can be accessed at arbitrary offsets. The limited set of use cases that could be covered by this model would require driver changes to enable the Vendor Log to be dumped with finer granularity. But even then, the fact that the host could potentially read anywhere in the Vendor Log at any time makes robustly encoding a set of debug forensics that are populated on demand (in response to Get Log) from underlying hardware into vendor defined offsets in the Vendor Log difficult. Q: What about security? A: While vendor commands have the potential to be a security risk, it should not be the driver’s responsibility to solve that problem. The in-band path through the driver is just one path into the device. For a device to be secure, vendors and their partner customers need to lock out anything that is a security risk at the device level for production. Given that this device level lockout is necessary, the catch all lockout at the driver level is redundant from a device security point of view. Q: How about we add commands to UAPI instead? A: Vendors should not be adding hooks into a class driver which is supposed to be generic. The details of anything that goes in UAPI would need to be fully public. Any proposed additions would need to be fully explained as to why they are not able to be added to the spec. This limits the viability of UAPI as a method for unlocking value-add functionality. Conclusion: =========== I look forward to a healthy discussion on how we can move this forward, and hope that we can reach an answer that is acceptable for the kernel maintainers, CXL device vendors, and end users. Regards, Ariel ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: CXL Vendor Specific Capabilities 2021-05-20 11:55 ` Ariel.Sibley @ 2021-06-12 2:25 ` Dan Williams 0 siblings, 0 replies; 4+ messages in thread From: Dan Williams @ 2021-06-12 2:25 UTC (permalink / raw) To: Ariel.Sibley Cc: linux-cxl, Christoph Hellwig, Ahmad.Danesh, Chris Petersen, Natu, Mahesh, Vincent.Hache, Sanjay.Goyal, Scott Lee, jhinkle, guru.anbalagane, kevin.marks, bhirano, Chet R Douglas, dcaracci, John Groves (jgroves), Rudoff, Andy, Vineeth.Pillai Hi Ariel, thanks for being patient as I worked my way back to this question. Responses below are with my Linux kernel developer hat on, but they are my own. I.e. Linux distribution representatives might have concurring [1], or differing opinions. Other kernel developers may have different or concurring opinions. It is after all a consensus based community. [1]: http://lore.kernel.org/r/YBi9nkiu3DvMZhrs@Konrads-MacBook-Pro.local On Thu, May 20, 2021 at 4:55 AM <Ariel.Sibley@microchip.com> wrote: > > Re-posting with 80 char line wrap. > > Background: > =========== > > The default/recommended value of the CONFIG_CXL_MEM_RAW_COMMANDS Kconfig knob > will block passthrough of vendor defined mailbox commands. If this Kconfig > knob is enabled, passthrough of vendor defined commands will still taint the > kernel. Right, it's similar to the warning message that spews at the beginning of time when trace_printk() is found to be in active use. It's an indication that you are using a debug / special purpose kernel that is allowing access to unvetted vendor specific functionality. > There is a consensus among device and system manufacturers that vendor-specific > capabilities are a necessary component of specifications and we need a solution > to support these capabilities. Within reason, these capabilities need to be > enabled without tainting the kernel. > > Key points to consider regarding handling of vendor-specific capabilities: > ========================================================================== > > * Tainting the kernel during harmless operations is not desirable. While some > users have full control over the kernel and could patch this behavior, those > users would prefer to use upstreamed kernel vs. carrying patches. The dev_WARN_ONCE() was the easiest way to record the presence of non-standard CXL operation in crash logs for reports sent to the 'linux-cxl' mailing list for help. That can be replaced with a simple dev_warn_once() (lowercase version does not taint), but adds the expectation to receive the full kernel log since boot for reported problems with the driver stack. > * There is agreement that some capabilities are candidates for standardization, > but a path for enabling that class of functionality until it makes its way into > the spec is needed. It takes a long time to get anything through the full > process of acceptance into the spec and upstreamed. This will slow adoption and > stifle innovation. One of the useful mechanisms that has sped up Linux and Tianocore adoption of new EFI/ACPI features is the "Code First" process of the UEFI and ACPI Standards Working Groups. In general, if the feature is a candidate to be adopted in the standard and the time to publish impacts the feature release in Linux that seems like a problem with solving in its own right, even if it is not exactly "Code First". > * The reality is that we can’t put everything into the spec. The primary > cases are vendor-specific implementations which are not suited for > standardization, and proprietary/differentiating features. > The common generic driver is not the place to land proprietary differentiating features. The generic driver has a responsibility to assert its best effort interpretation of what capabilities may impact the confidentiality and integrity of the kernel. For example see "enum lockdown_reason" [2] for the types of interfaces that the kernel shuts down to protect the platform against root. This concept of not trusting root is relatively new compared to past drivers that did not give a second thought to allowing unfettered vendor passthrough tunnels. Vendor defined functionality is unauditable so it requires an opt-in. [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/security.h#n82 > * Limiting the usefulness of the path for vendor capabilities through the > driver will likely result in undesirable side-effects as suppliers and others > won’t accept this, forcing vendors to develop workarounds that bypass the > driver. If the driver provides a usable pathway that does not taint the kernel, > vendors are less likely to implement such approaches. The driver will then be > able to perform some policing, for instance based on CEL contents, and CXL > device providers can be considered good citizens of the Linux community. Unfortunately the driver can not trust the Command Effect Log. The authors of the CEL are not OS practitioners. One example is the topic of runtime firmware update. This document [3], for example, defines a runtime firmware update interface with a command effect log that indicates when the OS is responsible for quiescing I/O. The firmware activation claims no quiesce needed in cases where the activation time is "small". Rather than trust the vendor's definition of "small" and the CEL claims of what is safe the driver ignores CEL and sets its own policy. [3]: https://pmem.io/documents/IntelOptanePMem_DSM_Interface-V2.0.pdf > > Recommendation: > =============== > > Allow root to issue vendor defined passthrough commands by default and use the > CEL to determine whether a given command should taint the kernel. Commands > that have no effects, or harmless effects, should not taint the kernel. > Commands that have potentially harmful effects should taint the kernel. > > Examples: > ========= > > Case 1: Vendor defined is needed, and the operations do not impact the normal > operation of the device. The CEL contents could be used to determine there is > no impact and thereby permit these commands without tainting the kernel. > > * Example 1.1: Interfacing with common diagnostic features which may have > wide-ranging implementations and capabilities across different vendors. E.g. > Signal integrity diagnostics, debug counters. Signal integrity and debug counters seem to be in the realm of pre production environments where a debug kernel is expected. > > * Example 1.2: Retrieve diagnostics data from proprietary value-add features. The Vendor Debug Log is an example of allowing for vendor defined payload to travel within open transports. It's easier to say, yes it's likely safe to allow opaque blobs to be dumped to the kernel vs an open transport. > > Case 2: Vendor defined is needed, but the operations impact the normal > operation of the device. CEL contents can be used to determine the impact and > taint the kernel. > > * Example 2.1: Vendor-specific diagnostics features that are intended to stress > the device or memory for qualification and validation purposes. Qualification and validation seem to be places where the special kernel build requirement can be satisfied. > > Q & A: > ====== > > Q: Why is it a problem to have CONFIG_CXL_MEM_RAW_COMMANDS disabled by default? > A: This pushes the problem to the distros. Distros are likely to follow the > Kconfig recommendation by default, so industry will need to have this > discussion with each major distro maintainer. Have a look at something like this directory: https://kojipkgs.fedoraproject.org/packages/kernel/5.13.0/0.rc5.38.eln111/x86_64/ ...and notice that kernel-debug is built by default. I expect it would be relatively straightforward to ask distros to turn on CONFIG_CXL_MEM_RAW_COMMANDS in their equivalent of the kernel-debug package: https://kojipkgs.fedoraproject.org/packages/kernel/5.13.0/0.rc5.38.eln111/x86_64/kernel-debug-5.13.0-0.rc5.38.eln111.x86_64.rpm > If we can reach a consensus in > this forum that we can default this knob to enabled, distros are more likely to > also enable the functionality. This seems to misunderstand how the trust relationship flows in the Linux community. The distros are co-equal members of the Linux community and they also depend on subject matter experts to vet the functionality that they enable. Like you saw in the example from Konrad above there is already discomfort in the community at even allowing CONFIG_CXL_MEM_RAW_COMMANDS to be enabled at all. So, I am advocating for a compromise position that allows for vendor defined functionality, while also constraining its deployment to what the linux-cxl community can review and develop native kernel interfaces. > > Q: Why is it a problem to taint the kernel for all passthrough commands when > CONFIG_CXL_MEM_RAW_COMMANDS is enabled? > A: Some functionality is harmless. If the kernel is tainted when executing > basic read-only diagnostics, it may be harder to get support for real issues as > the state of the system will be suspect. The driver should be able to use the > CEL to determine when to taint. If a problem comes to the list with vendor specific functionality in the log the linux-cxl community is unlikely to engage. > > Q: Why can't we use Vendor Log for this? > A: Several reasons: > * There is no ability to control functionality. Many diagnostic features > require input to configure what is being read. > * The device may require time to collect the diagnostic information, but must > respond to Get Log in under 2 seconds (preferably much faster). > * It is a single blob, which can be accessed at arbitrary offsets. The limited > set of use cases that could be covered by this model would require driver > changes to enable the Vendor Log to be dumped with finer granularity. But even > then, the fact that the host could potentially read anywhere in the Vendor Log > at any time makes robustly encoding a set of debug forensics that are populated > on demand (in response to Get Log) from underlying hardware into vendor defined > offsets in the Vendor Log difficult. > > Q: What about security? > A: While vendor commands have the potential to be a security risk, it should > not be the driver’s responsibility to solve that problem. The in-band path > through the driver is just one path into the device. For a device to be secure, > vendors and their partner customers need to lock out anything that is a > security risk at the device level for production. Given that this device level > lockout is necessary, the catch all lockout at the driver level is redundant > from a device security point of view. > In an environment where kernel functionality is being audited against an untrusted root the driver must not abdicate responsibility for the portions of security it can control, but there's a maintainability aspect as well. Open collaboration yields benefits for maintainability and deployability of system software. > Q: How about we add commands to UAPI instead? > A: Vendors should not be adding hooks into a class driver which is supposed to > be generic. The details of anything that goes in UAPI would need to be fully > public. Any proposed additions would need to be fully explained as to why they > are not able to be added to the spec. This limits the viability of UAPI as a > method for unlocking value-add functionality. I'm not sure I understand the premise of this question? I will say that I am fully in favor of innovations that solve problems for CXL Linux users, but it needs to be done in a way that does not compromise maintainability (unrestricted vendor differentiation) or security (confidentiality, integrity, or stability). > Conclusion: > =========== > > I look forward to a healthy discussion on how we can move this forward, and > hope that we can reach an answer that is acceptable for the kernel maintainers, > CXL device vendors, and end users. Ariel, thank you for putting this together. ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2021-06-12 2:27 UTC | newest] Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-05-19 16:45 CXL Vendor Specific Capabilities Ariel.Sibley 2021-05-20 9:53 ` Christoph Hellwig 2021-05-20 11:55 ` Ariel.Sibley 2021-06-12 2:25 ` Dan Williams
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).