linux-cxl.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* CXL Vendor Specific Capabilities
@ 2021-05-19 16:45 Ariel.Sibley
  2021-05-20  9:53 ` Christoph Hellwig
  0 siblings, 1 reply; 4+ messages in thread
From: Ariel.Sibley @ 2021-05-19 16:45 UTC (permalink / raw)
  To: linux-cxl
  Cc: Ahmad.Danesh, cpetersen, mahesh.natu, dan.j.williams,
	Vincent.Hache, Ariel.Sibley, Sanjay.Goyal, scolee, jhinkle,
	guru.anbalagane, kevin.marks, bhirano, chet.r.douglas, dcaracci,
	jgroves, andy.rudoff

Re-opening discussion on vendor-specific capabilities on linux-cxl mailing list.

Background:
===========

The default/recommended value of the CONFIG_CXL_MEM_RAW_COMMANDS Kconfig knob will block passthrough of vendor defined mailbox commands.  If this Kconfig knob is enabled, passthrough of vendor defined commands will still taint the kernel.

There is a consensus among device and system manufacturers that vendor-specific capabilities are a necessary component of specifications and we need a solution to support these capabilities. Within reason, these capabilities need to be enabled without tainting the kernel.

Key points to consider regarding handling of vendor-specific capabilities:
==========================================================================

* Tainting the kernel during harmless operations is not desirable. While some users have full control over the kernel and could patch this behavior, those users would prefer to use upstreamed kernel vs. carrying patches.

* There is agreement that some capabilities are candidates for standardization, but a path for enabling that class of functionality until it makes its way into the spec is needed.  It takes a long time to get anything through the full process of acceptance into the spec and upstreamed. This will slow adoption and stifle innovation.

* The reality is that we can’t put everything into the spec. The primary cases are vendor-specific implementations which are not suited for standardization, and proprietary/differentiating features.

* Limiting the usefulness of the path for vendor capabilities through the driver will likely result in undesirable side-effects as suppliers and others won’t accept this, forcing vendors to develop workarounds that bypass the driver. If the driver provides a usable pathway that does not taint the kernel, vendors are less likely to implement such approaches. The driver will then be able to perform some policing, for instance based on CEL contents, and CXL device providers can be considered good citizens of the Linux community.

Recommendation:
===============

Allow root to issue vendor defined passthrough commands by default and use the CEL to determine whether a given command should taint the kernel.  Commands that have no effects, or harmless effects, should not taint the kernel.  Commands that have potentially harmful effects should taint the kernel.

Examples:
=========

Case 1: Vendor defined is needed, and the operations do not impact the normal operation of the device. The CEL contents could be used to determine there is no impact and thereby permit these commands without tainting the kernel.

* Example 1.1: Interfacing with common diagnostic features which may have wide-ranging implementations and capabilities across different vendors. E.g. Signal integrity diagnostics, debug counters.

* Example 1.2: Retrieve diagnostics data from proprietary value-add features.

Case 2: Vendor defined is needed, but the operations impact the normal operation of the device. CEL contents can be used to determine the impact and taint the kernel.

* Example 2.1: Vendor-specific diagnostics features that are intended to stress the device or memory for qualification and validation purposes. 

Q & A:
======

Q: Why is it a problem to have CONFIG_CXL_MEM_RAW_COMMANDS disabled by default?
A: This pushes the problem to the distros. Distros are likely to follow the Kconfig recommendation by default, so industry will need to have this discussion with each major distro maintainer. If we can reach a consensus in this forum that we can default this knob to enabled, distros are more likely to also enable the functionality.

Q: Why is it a problem to taint the kernel for all passthrough commands when CONFIG_CXL_MEM_RAW_COMMANDS is enabled?
A: Some functionality is harmless. If the kernel is tainted when executing basic read-only diagnostics, it may be harder to get support for real issues as the state of the system will be suspect. The driver should be able to use the CEL to determine when to taint.

Q: Why can't we use Vendor Log for this?
A: Several reasons:
* There is no ability to control functionality. Many diagnostic features require input to configure what is being read.
* The device may require time to collect the diagnostic information, but must respond to Get Log in under 2 seconds (preferably much faster).
* It is a single blob, which can be accessed at arbitrary offsets. The limited set of use cases that could be covered by this model would require driver changes to enable the Vendor Log to be dumped with finer granularity. But even then, the fact that the host could potentially read anywhere in the Vendor Log at any time makes robustly encoding a set of debug forensics that are populated on demand (in response to Get Log) from underlying hardware into vendor defined offsets in the Vendor Log difficult.

Q: What about security?
A: While vendor commands have the potential to be a security risk, it should not be the driver’s responsibility to solve that problem. The in-band path through the driver is just one path into the device. For a device to be secure, vendors and their partner customers need to lock out anything that is a security risk at the device level for production. Given that this device level lockout is necessary, the catch all lockout at the driver level is redundant from a device security point of view.

Q: How about we add commands to UAPI instead?
A: Vendors should not be adding hooks into a class driver which is supposed to be generic. The details of anything that goes in UAPI would need to be fully public. Any proposed additions would need to be fully explained as to why they are not able to be added to the spec. This limits the viability of UAPI as a method for unlocking value-add functionality.

Conclusion:
===========

I look forward to a healthy discussion on how we can move this forward, and hope that we can reach an answer that is acceptable for the kernel maintainers, CXL device vendors, and end users.

Regards,
Ariel

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: CXL Vendor Specific Capabilities
  2021-05-19 16:45 CXL Vendor Specific Capabilities Ariel.Sibley
@ 2021-05-20  9:53 ` Christoph Hellwig
  2021-05-20 11:55   ` Ariel.Sibley
  0 siblings, 1 reply; 4+ messages in thread
From: Christoph Hellwig @ 2021-05-20  9:53 UTC (permalink / raw)
  To: Ariel.Sibley
  Cc: linux-cxl, Ahmad.Danesh, cpetersen, mahesh.natu, dan.j.williams,
	Vincent.Hache, Sanjay.Goyal, scolee, jhinkle, guru.anbalagane,
	kevin.marks, bhirano, chet.r.douglas, dcaracci, jgroves,
	andy.rudoff

> 
> The default/recommended value of the CONFIG_CXL_MEM_RAW_COMMANDS Kconfig knob will block passthrough of vendor defined mailbox commands.  If this Kconfig knob is enabled, passthrough of vendor defined commands will still taint the kernel.

Please fix your mailer, this badly wrapped wall of text is completely
unreadable.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* CXL Vendor Specific Capabilities
  2021-05-20  9:53 ` Christoph Hellwig
@ 2021-05-20 11:55   ` Ariel.Sibley
  2021-06-12  2:25     ` Dan Williams
  0 siblings, 1 reply; 4+ messages in thread
From: Ariel.Sibley @ 2021-05-20 11:55 UTC (permalink / raw)
  To: linux-cxl
  Cc: hch, Ahmad.Danesh, cpetersen, mahesh.natu, dan.j.williams,
	Vincent.Hache, Sanjay.Goyal, scolee, jhinkle, guru.anbalagane,
	kevin.marks, bhirano, chet.r.douglas, dcaracci, jgroves,
	andy.rudoff, Vineeth.Pillai

Re-posting with 80 char line wrap.

Background:
===========

The default/recommended value of the CONFIG_CXL_MEM_RAW_COMMANDS Kconfig knob
will block passthrough of vendor defined mailbox commands.  If this Kconfig
knob is enabled, passthrough of vendor defined commands will still taint the
kernel.

There is a consensus among device and system manufacturers that vendor-specific
capabilities are a necessary component of specifications and we need a solution
to support these capabilities. Within reason, these capabilities need to be
enabled without tainting the kernel.

Key points to consider regarding handling of vendor-specific capabilities:
==========================================================================

* Tainting the kernel during harmless operations is not desirable. While some
users have full control over the kernel and could patch this behavior, those
users would prefer to use upstreamed kernel vs. carrying patches.

* There is agreement that some capabilities are candidates for standardization,
but a path for enabling that class of functionality until it makes its way into
the spec is needed.  It takes a long time to get anything through the full
process of acceptance into the spec and upstreamed. This will slow adoption and
stifle innovation.

* The reality is that we can’t put everything into the spec. The primary
cases are vendor-specific implementations which are not suited for
standardization, and proprietary/differentiating features.

* Limiting the usefulness of the path for vendor capabilities through the
driver will likely result in undesirable side-effects as suppliers and others
won’t accept this, forcing vendors to develop workarounds that bypass the
driver. If the driver provides a usable pathway that does not taint the kernel,
vendors are less likely to implement such approaches. The driver will then be
able to perform some policing, for instance based on CEL contents, and CXL
device providers can be considered good citizens of the Linux community.

Recommendation:
===============

Allow root to issue vendor defined passthrough commands by default and use the
CEL to determine whether a given command should taint the kernel.  Commands
that have no effects, or harmless effects, should not taint the kernel.
Commands that have potentially harmful effects should taint the kernel.

Examples:
=========

Case 1: Vendor defined is needed, and the operations do not impact the normal
operation of the device. The CEL contents could be used to determine there is
no impact and thereby permit these commands without tainting the kernel.

* Example 1.1: Interfacing with common diagnostic features which may have
wide-ranging implementations and capabilities across different vendors. E.g.
Signal integrity diagnostics, debug counters.

* Example 1.2: Retrieve diagnostics data from proprietary value-add features.

Case 2: Vendor defined is needed, but the operations impact the normal
operation of the device. CEL contents can be used to determine the impact and
taint the kernel.

* Example 2.1: Vendor-specific diagnostics features that are intended to stress
the device or memory for qualification and validation purposes.

Q & A:
======

Q: Why is it a problem to have CONFIG_CXL_MEM_RAW_COMMANDS disabled by default?
A: This pushes the problem to the distros. Distros are likely to follow the
Kconfig recommendation by default, so industry will need to have this
discussion with each major distro maintainer. If we can reach a consensus in
this forum that we can default this knob to enabled, distros are more likely to
also enable the functionality.

Q: Why is it a problem to taint the kernel for all passthrough commands when
CONFIG_CXL_MEM_RAW_COMMANDS is enabled?
A: Some functionality is harmless. If the kernel is tainted when executing
basic read-only diagnostics, it may be harder to get support for real issues as
the state of the system will be suspect. The driver should be able to use the
CEL to determine when to taint.

Q: Why can't we use Vendor Log for this?
A: Several reasons:
* There is no ability to control functionality. Many diagnostic features
require input to configure what is being read.
* The device may require time to collect the diagnostic information, but must
respond to Get Log in under 2 seconds (preferably much faster).
* It is a single blob, which can be accessed at arbitrary offsets. The limited
set of use cases that could be covered by this model would require driver
changes to enable the Vendor Log to be dumped with finer granularity. But even
then, the fact that the host could potentially read anywhere in the Vendor Log
at any time makes robustly encoding a set of debug forensics that are populated
on demand (in response to Get Log) from underlying hardware into vendor defined
offsets in the Vendor Log difficult.

Q: What about security?
A: While vendor commands have the potential to be a security risk, it should
not be the driver’s responsibility to solve that problem. The in-band path
through the driver is just one path into the device. For a device to be secure,
vendors and their partner customers need to lock out anything that is a
security risk at the device level for production. Given that this device level
lockout is necessary, the catch all lockout at the driver level is redundant
from a device security point of view.

Q: How about we add commands to UAPI instead?
A: Vendors should not be adding hooks into a class driver which is supposed to
be generic. The details of anything that goes in UAPI would need to be fully
public. Any proposed additions would need to be fully explained as to why they
are not able to be added to the spec. This limits the viability of UAPI as a
method for unlocking value-add functionality.

Conclusion:
===========

I look forward to a healthy discussion on how we can move this forward, and
hope that we can reach an answer that is acceptable for the kernel maintainers,
CXL device vendors, and end users.

Regards,
Ariel

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: CXL Vendor Specific Capabilities
  2021-05-20 11:55   ` Ariel.Sibley
@ 2021-06-12  2:25     ` Dan Williams
  0 siblings, 0 replies; 4+ messages in thread
From: Dan Williams @ 2021-06-12  2:25 UTC (permalink / raw)
  To: Ariel.Sibley
  Cc: linux-cxl, Christoph Hellwig, Ahmad.Danesh, Chris Petersen, Natu,
	Mahesh, Vincent.Hache, Sanjay.Goyal, Scott Lee, jhinkle,
	guru.anbalagane, kevin.marks, bhirano, Chet R Douglas, dcaracci,
	John Groves (jgroves),
	Rudoff, Andy, Vineeth.Pillai

Hi Ariel, thanks for being patient as I worked my way back to this question.

Responses below are with my Linux kernel developer hat on, but they
are my own. I.e. Linux distribution representatives might have
concurring [1], or differing opinions. Other kernel developers may
have different or concurring opinions. It is after all a consensus
based community.

[1]: http://lore.kernel.org/r/YBi9nkiu3DvMZhrs@Konrads-MacBook-Pro.local

On Thu, May 20, 2021 at 4:55 AM <Ariel.Sibley@microchip.com> wrote:
>
> Re-posting with 80 char line wrap.
>
> Background:
> ===========
>
> The default/recommended value of the CONFIG_CXL_MEM_RAW_COMMANDS Kconfig knob
> will block passthrough of vendor defined mailbox commands.  If this Kconfig
> knob is enabled, passthrough of vendor defined commands will still taint the
> kernel.

Right, it's similar to the warning message that spews at the beginning
of time when trace_printk() is found to be in active use. It's an
indication that you are using a debug / special purpose kernel that is
allowing access to unvetted vendor specific functionality.

> There is a consensus among device and system manufacturers that vendor-specific
> capabilities are a necessary component of specifications and we need a solution
> to support these capabilities. Within reason, these capabilities need to be
> enabled without tainting the kernel.
>
> Key points to consider regarding handling of vendor-specific capabilities:
> ==========================================================================
>
> * Tainting the kernel during harmless operations is not desirable. While some
> users have full control over the kernel and could patch this behavior, those
> users would prefer to use upstreamed kernel vs. carrying patches.

The dev_WARN_ONCE() was the easiest way to record the presence of
non-standard CXL operation in crash logs for reports sent to the
'linux-cxl' mailing list for help. That can be replaced with a simple
dev_warn_once() (lowercase version does not taint), but adds the
expectation to receive the full kernel log since boot for reported
problems with the driver stack.

> * There is agreement that some capabilities are candidates for standardization,
> but a path for enabling that class of functionality until it makes its way into
> the spec is needed.  It takes a long time to get anything through the full
> process of acceptance into the spec and upstreamed. This will slow adoption and
> stifle innovation.

One of the useful mechanisms that has sped up Linux and Tianocore
adoption of new EFI/ACPI features is the "Code First" process of the
UEFI and ACPI Standards Working Groups. In general, if the feature is
a candidate to be adopted in the standard and the time to publish
impacts the feature release in Linux that seems like a problem with
solving in its own right, even if it is not exactly "Code First".

> * The reality is that we can’t put everything into the spec. The primary
> cases are vendor-specific implementations which are not suited for
> standardization, and proprietary/differentiating features.
>

The common generic driver is not the place to land proprietary
differentiating features. The generic driver has a responsibility to
assert its best effort interpretation of what capabilities may impact
the confidentiality and integrity of the kernel. For example see "enum
lockdown_reason" [2] for the types of interfaces that the kernel shuts
down to protect the platform against root. This concept of not
trusting root is relatively new compared to past drivers that did not
give a second thought to allowing unfettered vendor passthrough
tunnels. Vendor defined functionality is unauditable so it requires an
opt-in.

[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/security.h#n82

> * Limiting the usefulness of the path for vendor capabilities through the
> driver will likely result in undesirable side-effects as suppliers and others
> won’t accept this, forcing vendors to develop workarounds that bypass the
> driver. If the driver provides a usable pathway that does not taint the kernel,
> vendors are less likely to implement such approaches. The driver will then be
> able to perform some policing, for instance based on CEL contents, and CXL
> device providers can be considered good citizens of the Linux community.

Unfortunately the driver can not trust the Command Effect Log. The
authors of the CEL are not OS practitioners. One example is the topic
of runtime firmware update. This document [3], for example, defines a
runtime firmware update interface with a command effect log that
indicates when the OS is responsible for quiescing I/O. The firmware
activation claims no quiesce needed in cases where the activation time
is "small". Rather than trust the vendor's definition of "small" and
the CEL claims of what is safe the driver ignores CEL and sets its own
policy.

[3]: https://pmem.io/documents/IntelOptanePMem_DSM_Interface-V2.0.pdf

>
> Recommendation:
> ===============
>
> Allow root to issue vendor defined passthrough commands by default and use the
> CEL to determine whether a given command should taint the kernel.  Commands
> that have no effects, or harmless effects, should not taint the kernel.
> Commands that have potentially harmful effects should taint the kernel.
>
> Examples:
> =========
>
> Case 1: Vendor defined is needed, and the operations do not impact the normal
> operation of the device. The CEL contents could be used to determine there is
> no impact and thereby permit these commands without tainting the kernel.
>
> * Example 1.1: Interfacing with common diagnostic features which may have
> wide-ranging implementations and capabilities across different vendors. E.g.
> Signal integrity diagnostics, debug counters.

Signal integrity and debug counters seem to be in the realm of pre
production environments where a debug kernel is expected.

>
> * Example 1.2: Retrieve diagnostics data from proprietary value-add features.

The Vendor Debug Log is an example of allowing for vendor defined
payload to travel within open transports. It's easier to say, yes it's
likely safe to allow opaque blobs to be dumped to the kernel vs an
open transport.

>
> Case 2: Vendor defined is needed, but the operations impact the normal
> operation of the device. CEL contents can be used to determine the impact and
> taint the kernel.
>
> * Example 2.1: Vendor-specific diagnostics features that are intended to stress
> the device or memory for qualification and validation purposes.

Qualification and validation seem to be places where the special
kernel build requirement can be satisfied.

>
> Q & A:
> ======
>
> Q: Why is it a problem to have CONFIG_CXL_MEM_RAW_COMMANDS disabled by default?
> A: This pushes the problem to the distros. Distros are likely to follow the
> Kconfig recommendation by default, so industry will need to have this
> discussion with each major distro maintainer.

Have a look at something like this directory:

https://kojipkgs.fedoraproject.org/packages/kernel/5.13.0/0.rc5.38.eln111/x86_64/

...and notice that kernel-debug is built by default. I expect it would
be relatively straightforward to ask distros to turn on
CONFIG_CXL_MEM_RAW_COMMANDS in their equivalent of the kernel-debug
package:

https://kojipkgs.fedoraproject.org/packages/kernel/5.13.0/0.rc5.38.eln111/x86_64/kernel-debug-5.13.0-0.rc5.38.eln111.x86_64.rpm

> If we can reach a consensus in
> this forum that we can default this knob to enabled, distros are more likely to
> also enable the functionality.

This seems to misunderstand how the trust relationship flows in the
Linux community. The distros are co-equal members of the Linux
community and they also depend on subject matter experts to vet the
functionality that they enable. Like you saw in the example from
Konrad above there is already discomfort in the community at even
allowing CONFIG_CXL_MEM_RAW_COMMANDS to be enabled at all. So, I am
advocating for a compromise position that allows for vendor defined
functionality, while also constraining its deployment to what the
linux-cxl community can review and develop native kernel interfaces.

>
> Q: Why is it a problem to taint the kernel for all passthrough commands when
> CONFIG_CXL_MEM_RAW_COMMANDS is enabled?
> A: Some functionality is harmless. If the kernel is tainted when executing
> basic read-only diagnostics, it may be harder to get support for real issues as
> the state of the system will be suspect. The driver should be able to use the
> CEL to determine when to taint.

If a problem comes to the list with vendor specific functionality in
the log the linux-cxl community is unlikely to engage.

>
> Q: Why can't we use Vendor Log for this?
> A: Several reasons:
> * There is no ability to control functionality. Many diagnostic features
> require input to configure what is being read.
> * The device may require time to collect the diagnostic information, but must
> respond to Get Log in under 2 seconds (preferably much faster).
> * It is a single blob, which can be accessed at arbitrary offsets. The limited
> set of use cases that could be covered by this model would require driver
> changes to enable the Vendor Log to be dumped with finer granularity. But even
> then, the fact that the host could potentially read anywhere in the Vendor Log
> at any time makes robustly encoding a set of debug forensics that are populated
> on demand (in response to Get Log) from underlying hardware into vendor defined
> offsets in the Vendor Log difficult.
>
> Q: What about security?
> A: While vendor commands have the potential to be a security risk, it should
> not be the driver’s responsibility to solve that problem. The in-band path
> through the driver is just one path into the device. For a device to be secure,
> vendors and their partner customers need to lock out anything that is a
> security risk at the device level for production. Given that this device level
> lockout is necessary, the catch all lockout at the driver level is redundant
> from a device security point of view.
>

In an environment where kernel functionality is being audited against
an untrusted root the driver must not abdicate responsibility for the
portions of security it can control, but there's a maintainability
aspect as well. Open collaboration yields benefits for maintainability
and deployability of system software.

> Q: How about we add commands to UAPI instead?
> A: Vendors should not be adding hooks into a class driver which is supposed to
> be generic. The details of anything that goes in UAPI would need to be fully
> public. Any proposed additions would need to be fully explained as to why they
> are not able to be added to the spec. This limits the viability of UAPI as a
> method for unlocking value-add functionality.

I'm not sure I understand the premise of this question?

I will say that I am fully in favor of innovations that solve problems
for CXL Linux users, but it needs to be done in a way that does not
compromise maintainability (unrestricted vendor differentiation) or
security (confidentiality, integrity, or stability).

> Conclusion:
> ===========
>
> I look forward to a healthy discussion on how we can move this forward, and
> hope that we can reach an answer that is acceptable for the kernel maintainers,
> CXL device vendors, and end users.

Ariel, thank you for putting this together.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-06-12  2:27 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-19 16:45 CXL Vendor Specific Capabilities Ariel.Sibley
2021-05-20  9:53 ` Christoph Hellwig
2021-05-20 11:55   ` Ariel.Sibley
2021-06-12  2:25     ` Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).