linux-pci.vger.kernel.org archive mirror
* [PATCH V8 00/10] CXL: Read CDAT and DSMAS data from the device
@ 2022-04-14 20:32 ira.weiny
  2022-04-14 20:32 ` [PATCH V8 01/10] PCI: Add vendor ID for the PCI SIG ira.weiny
                   ` (9 more replies)
  0 siblings, 10 replies; 42+ messages in thread
From: ira.weiny @ 2022-04-14 20:32 UTC (permalink / raw)
  To: Dan Williams, Bjorn Helgaas, Jonathan Cameron
  Cc: Ira Weiny, Alison Schofield, Vishal Verma, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

Changes from V7:[4]
	Avoid code bloat by making pci-doe.c conditional on CONFIG_PCI_DOE,
		which is selected by the new CXL_PCI_DOE config option
	Minor code clean ups
	Fix bug in pci_doe_supports_prot()
	Rebase to cxl-pending

Changes from V6:[3]
	The big change is the removal of the auxiliary bus code from the PCI
	layer.  The auxiliary bus usage is now in the CXL layer.  The PCI layer
	provides helpers for subsystems to utilize DOE mailboxes by creating a
	pci_doe_mb object which controls a state machine for that mailbox
	capability.  The CXL layer wraps this object in an auxiliary device and
	driver which can then be used to determine whether the kernel controls
	the capability or it is available for use by user space.  Reads from
	user space via lspci are allowed.  Writes are allowed but taint the
	kernel.

	Feedback from Bjorn, Jonathan, and Dan
		Details in each patch

Changes from V5:[0]

	Rework the patch set to split PCI vs CXL changes
		Also make each change a bit more standalone for easier review
	Add cxl_cdat structure
	Put CDAT related data structures in cdat.h
	Clarify some device lifetimes with comments
	Incorporate feedback from Jonathan, Bjorn, and Dan
		The biggest change is placing the DOE scanning code into the
			pci_doe driver (part of the PCI core).
		Validate the CDAT when it is read rather than before DSMAS
			parsing
		Do not report DSMAS failure as an error, report a warning and
			keep going.
	Retry reading the table once.
	Update commit messages and this cover letter



CXL drivers need various data which are provided through generic DOE mailboxes
as defined in the PCIe 6.0 spec.[1]

One such data item is the Coherent Device Attribute Table (CDAT).  CDAT
provides information about the coherent components in the system.  It was
developed because systems no longer have a priori knowledge of all the
coherent devices within a system.  CDAT describes the coherence
characteristics of the components on the CXL bus separately from any
particular system configuration.  The OS can then, for example, use this
information to form correct interleave sets.

To begin reading the CDAT, the OS must be able to access the DOE mailboxes
provided by the CXL devices.

Because DOE is not specific to CXL but is defined within the PCI spec, the
series adds PCI DOE capability library functions.  These functions allow
iterating the DOE capabilities on a device as well as creating pci_doe_mb
structures which control the operation of the DOE state machine.

CXL iterates the DOE capabilities and creates auxiliary bus devices for
them.  These devices are driven by a CXL DOE auxiliary driver which calls
into the PCI DOE library functions as appropriate.
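
As a rough sketch of the intended consumer flow (helper names are those
introduced in patch 3; polling mode, error handling elided, and vid/type
standing in for a protocol of interest):

	u16 off = 0;

	pci_doe_for_each_off(pdev, off) {
		struct pci_doe_mb *doe_mb;

		doe_mb = pci_doe_create_mb(pdev, off, false);
		if (IS_ERR(doe_mb))
			continue;
		if (!pci_doe_supports_prot(doe_mb, vid, type)) {
			pci_doe_destroy_mb(doe_mb);
			continue;
		}
		/* submit pci_doe_task objects via pci_doe_submit_task() */
	}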

The auxiliary bus architecture allows root users to control which DOE
mailboxes are driven by the kernel and which are left available for
unrestricted access by user space.  One such use case is CXL Compliance
Testing (CXL 2.0 14.16.4 Compliance Mode DOE).  By default the kernel
controls all mailboxes found.

After the devices are created and the driver attaches, CDAT data is read
from the device and DSMAS information is parsed from that CDAT blob for
later use.

This work was tested using QEMU with additional patches.[2]


[0] https://lore.kernel.org/linux-cxl/20211105235056.3711389-1-ira.weiny@intel.com/
[1] https://pcisig.com/specifications
[2] https://lore.kernel.org/qemu-devel/20210202005948.241655-1-ben.widawsky@intel.com/
[3] https://lore.kernel.org/linux-cxl/20220201071952.900068-1-ira.weiny@intel.com/
[4] https://lore.kernel.org/linux-cxl/20220330235920.2800929-1-ira.weiny@intel.com/


Ira Weiny (7):
  PCI: Replace magic constant for PCI SIG Vendor ID
  cxl/pci: Create auxiliary devices for each DOE mailbox
  cxl/pci: Create DOE auxiliary driver
  cxl/pci: Find the DOE mailbox which supports CDAT
  cxl/cdat: Introduce cxl_cdat_valid()
  cxl/mem: Retry reading CDAT on failure
  cxl/port: Parse out DSMAS data from CDAT table

Jonathan Cameron (3):
  PCI: Add vendor ID for the PCI SIG
  PCI: Create PCI library functions in support of DOE mailboxes.
  cxl/mem: Read CDAT table

 drivers/cxl/Kconfig           |  14 +
 drivers/cxl/Makefile          |   2 +
 drivers/cxl/cdat.h            | 116 +++++++
 drivers/cxl/core/memdev.c     |  38 +++
 drivers/cxl/cxl.h             |   8 +
 drivers/cxl/cxlmem.h          |  28 ++
 drivers/cxl/cxlpci.h          |  35 ++
 drivers/cxl/doe.c             |  90 +++++
 drivers/cxl/pci.c             | 423 +++++++++++++++++++++++
 drivers/cxl/port.c            |  65 ++++
 drivers/pci/Kconfig           |   3 +
 drivers/pci/Makefile          |   1 +
 drivers/pci/pci-doe.c         | 607 ++++++++++++++++++++++++++++++++++
 drivers/pci/probe.c           |   2 +-
 include/linux/pci-doe.h       | 119 +++++++
 include/linux/pci_ids.h       |   1 +
 include/uapi/linux/pci_regs.h |  30 +-
 17 files changed, 1580 insertions(+), 2 deletions(-)
 create mode 100644 drivers/cxl/cdat.h
 create mode 100644 drivers/cxl/doe.c
 create mode 100644 drivers/pci/pci-doe.c
 create mode 100644 include/linux/pci-doe.h


base-commit: 4f497e0dd3530981f8c897d5d47ee880b62baefb
--
2.35.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH V8 01/10] PCI: Add vendor ID for the PCI SIG
  2022-04-14 20:32 [PATCH V8 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
@ 2022-04-14 20:32 ` ira.weiny
  2022-04-14 20:32 ` [PATCH V8 02/10] PCI: Replace magic constant for PCI SIG Vendor ID ira.weiny
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 42+ messages in thread
From: ira.weiny @ 2022-04-14 20:32 UTC (permalink / raw)
  To: Dan Williams, Bjorn Helgaas, Jonathan Cameron
  Cc: Alison Schofield, Vishal Verma, Ira Weiny, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

This ID is used in DOE headers to identify protocols that are defined
within the PCI Express Base Specification, PCIe r6.0, sec 6.30.1.1 table
6-32.

Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 include/linux/pci_ids.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 0178823ce8c2..8af3b86206b1 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -151,6 +151,7 @@
 #define PCI_CLASS_OTHERS		0xff
 
 /* Vendors and devices.  Sort key: vendor first, device next. */
+#define PCI_VENDOR_ID_PCI_SIG		0x0001
 
 #define PCI_VENDOR_ID_LOONGSON		0x0014
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V8 02/10] PCI: Replace magic constant for PCI SIG Vendor ID
  2022-04-14 20:32 [PATCH V8 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
  2022-04-14 20:32 ` [PATCH V8 01/10] PCI: Add vendor ID for the PCI SIG ira.weiny
@ 2022-04-14 20:32 ` ira.weiny
  2022-04-14 20:32 ` [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes ira.weiny
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 42+ messages in thread
From: ira.weiny @ 2022-04-14 20:32 UTC (permalink / raw)
  To: Dan Williams, Bjorn Helgaas, Jonathan Cameron
  Cc: Ira Weiny, Alison Schofield, Vishal Verma, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

Replace the magic value in pci_bus_crs_vendor_id() with
PCI_VENDOR_ID_PCI_SIG.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V6
	Simplify commit message
---
 drivers/pci/probe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 17a969942d37..6280e780a48c 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2312,7 +2312,7 @@ EXPORT_SYMBOL(pci_alloc_dev);
 
 static bool pci_bus_crs_vendor_id(u32 l)
 {
-	return (l & 0xffff) == 0x0001;
+	return (l & 0xffff) == PCI_VENDOR_ID_PCI_SIG;
 }
 
 static bool pci_bus_wait_crs(struct pci_bus *bus, int devfn, u32 *l,
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-04-14 20:32 [PATCH V8 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
  2022-04-14 20:32 ` [PATCH V8 01/10] PCI: Add vendor ID for the PCI SIG ira.weiny
  2022-04-14 20:32 ` [PATCH V8 02/10] PCI: Replace magic constant for PCI SIG Vendor ID ira.weiny
@ 2022-04-14 20:32 ` ira.weiny
  2022-04-28 21:27   ` Bjorn Helgaas
  2022-05-30 19:06   ` Lukas Wunner
  2022-04-14 20:32 ` [PATCH V8 04/10] cxl/pci: Create auxiliary devices for each DOE mailbox ira.weiny
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 42+ messages in thread
From: ira.weiny @ 2022-04-14 20:32 UTC (permalink / raw)
  To: Dan Williams, Bjorn Helgaas, Jonathan Cameron
  Cc: Christoph Hellwig, Ira Weiny, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Introduced in PCIe r6.0[1], DOE provides a config space based mailbox
with standard protocol discovery.  Each mailbox is accessed through a
DOE Extended Capability.

Each DOE mailbox must support the DOE discovery protocol in addition to
any number of other protocols.

Define core PCI functionality to manage a single PCI DOE mailbox at a
defined config space offset.  Functionality includes creating the new
state objects, querying for supported protocols, submitting tasks, and
destroying the objects.

If interrupts are desired, interrupt vectors should be allocated prior
to creating a mailbox object that uses an IRQ.
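
As a rough sketch, a synchronous exchange through the new API follows the
same pattern as the discovery query implemented below (my_task_complete(),
req, and rsp are hypothetical consumer-side names):

	static void my_task_complete(struct pci_doe_task *task)
	{
		complete(task->private);
	}

	...

	DECLARE_COMPLETION_ONSTACK(c);
	struct pci_doe_task task = {
		.prot.vid = vid,
		.prot.type = type,
		.request_pl = &req,
		.request_pl_sz = sizeof(req),
		.response_pl = &rsp,
		.response_pl_sz = sizeof(rsp),
		.complete = my_task_complete,
		.private = &c,
	};

	rc = pci_doe_submit_task(doe_mb, &task);
	if (!rc)
		wait_for_completion(&c);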

[1] https://members.pcisig.com/wg/PCI-SIG/document/16609

Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V7
	Add a Kconfig for this functionality
	Fix bug in pci_doe_supports_prot()
	Rebased on cxl-pending

Changes from V6
	Clean up signed off by lines
	Make this functionality all PCI library functions
	Clean up header files
	s/pci_doe_irq/pci_doe_irq_handler
	Use pci_{request,free}_irq
		Remove irq_name (maintained by pci_request_irq)
	Fix checks to use an irq
	Consistently use u16 for cap_offset
	Cleanup kdocs and comments
	Create a helper retire_cur_task() to handle locking of the
		current task pointer.
	Remove devm_ calls from PCI layer.
		The devm_ calls do not allow for the pci_doe_mb objects
		to be tied to an auxiliary device.  Leave it to the
		caller to use devm_ if desired.
	From Dan Williams
		s/cb/end_task/; Pass pci_doe_task to end_task
		Clarify exchange/task/request/response.
			Merge pci_doe_task and pci_doe_exchange into
			pci_doe_task which represents a single
			request/response task for the state machine to
			process.
		Simplify submitting work to the mailbox
			Replace pci_doe_exchange_sync() with
			pci_doe_submit_task() Consumers of the mailbox
			are now responsible for setting up callbacks
			within a task object and submitting them to the
			mailbox to be processed.
		Remove WARN_ON when task != NULL and be sure to abort that task.
		Convert abort/dead to atomic flags
		s/state_lock/task_lock to better define what the lock is
			protecting
		Remove all the auxiliary bus code from the PCI layer
			The PCI layer provides helpers to use the DOE
			Mailboxes.  Each subsystem can then use the
			helpers as they see fit.  The CXL layer in this
			series uses aux devices to manage the new
			pci_doe_mb objects.

	From Bjorn
		Clarify the fact that DOE mailboxes are capabilities of
			the device.
		Code clean ups
		Cleanup Makefile
		Update references to PCI SIG spec v6.0
		Move this attribution here:
		This code is based on Jonathan's V4 series here:
		https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/

Changes from V5
	From Bjorn
		s/pci_WARN/pci_warn
			Add timeout period to print
		Trim to 80 chars
		Use Tabs for DOE define spacing
		Use %#x for clarity
	From Jonathan
		Addresses concerns about the order of unwinding stuff
		s/doe/doe_dev in pci_doe_exchange_sync
		Correct kernel Doc comment
		Move pci_doe_task_complete() down in the file.
		Rework pci_doe_irq()
			process STATUS_ERROR first
			Return IRQ_NONE if the irq is not processed
			Use PCI_DOE_STATUS_INT_STATUS explicitly to
				clear the irq
	Clean up goto label s/err_free_irqs/err_free_irq
	use devm_kzalloc for doe struct
	clean up error paths in pci_doe_probe
	s/pci_doe_drv/pci_doe
	remove include mutex.h
	remove device name and define, move it in the next patch which uses it
	use devm_kasprintf() for irq_name
	use devm_request_irq()
	remove pci_doe_unregister()
		[get/put]_device() were unneeded and with the use of
		devm_* this function can be removed completely.
	refactor pci_doe_register and s/pci_doe_register/pci_doe_reg_irq
		make this function just a registration of the irq and
		move pci_doe_abort() into pci_doe_probe()
	use devm_* to allocate the protocol array

Changes from Jonathan's V4
	Move the DOE MB code into the DOE auxiliary driver
	Remove Task List in favor of a wait queue

Changes from Ben
	remove CXL references
	propagate rc from pci functions on error

squash: make PCI_DOE optional, selected only by those who need it.

squash: Create PCI library functions

Fix bug in protocol support check
---
 drivers/pci/Kconfig           |   3 +
 drivers/pci/Makefile          |   1 +
 drivers/pci/pci-doe.c         | 607 ++++++++++++++++++++++++++++++++++
 include/linux/pci-doe.h       | 119 +++++++
 include/uapi/linux/pci_regs.h |  29 +-
 5 files changed, 758 insertions(+), 1 deletion(-)
 create mode 100644 drivers/pci/pci-doe.c
 create mode 100644 include/linux/pci-doe.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 133c73207782..b2f2e588a817 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -121,6 +121,9 @@ config XEN_PCIDEV_FRONTEND
 config PCI_ATS
 	bool
 
+config PCI_DOE
+	bool
+
 config PCI_ECAM
 	bool
 
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 0da6b1ebc694..b609d8ad813d 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -31,6 +31,7 @@ obj-$(CONFIG_PCI_ECAM)		+= ecam.o
 obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
 obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
 obj-$(CONFIG_VGA_ARB)		+= vgaarb.o
+obj-$(CONFIG_PCI_DOE)		+= pci-doe.o
 
 # Endpoint library must be initialized before its users
 obj-$(CONFIG_PCI_ENDPOINT)	+= endpoint/
diff --git a/drivers/pci/pci-doe.c b/drivers/pci/pci-doe.c
new file mode 100644
index 000000000000..ccf936421d2a
--- /dev/null
+++ b/drivers/pci/pci-doe.c
@@ -0,0 +1,607 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Data Object Exchange
+ * https://members.pcisig.com/wg/PCI-SIG/document/16609
+ *
+ * Copyright (C) 2021 Huawei
+ *	Jonathan Cameron <Jonathan.Cameron@huawei.com>
+ *
+ * Copyright (C) 2022 Intel Corporation
+ *	Ira Weiny <ira.weiny@intel.com>
+ */
+
+#include <linux/bitfield.h>
+#include <linux/delay.h>
+#include <linux/jiffies.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci-doe.h>
+#include <linux/workqueue.h>
+
+#define PCI_DOE_PROTOCOL_DISCOVERY 0
+
+#define PCI_DOE_BUSY_MAX_RETRIES 16
+#define PCI_DOE_POLL_INTERVAL (HZ / 128)
+
+/* Timeout of 1 second from 6.30.2 Operation, PCI Spec v6.0 */
+#define PCI_DOE_TIMEOUT HZ
+
+static irqreturn_t pci_doe_irq_handler(int irq, void *data)
+{
+	struct pci_doe_mb *doe_mb = data;
+	struct pci_dev *pdev = doe_mb->pdev;
+	int offset = doe_mb->cap_offset;
+	u32 val;
+
+	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+
+	/* Leave the error case to be handled outside IRQ */
+	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
+		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
+		return IRQ_HANDLED;
+	}
+
+	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
+		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
+					PCI_DOE_STATUS_INT_STATUS);
+		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
+		return IRQ_HANDLED;
+	}
+
+	return IRQ_NONE;
+}
+
+/*
+ * Only called when safe to directly access the DOE from
+ * doe_statemachine_work().  Outside access is not protected.  Users who
+ * perform such access are left with the pieces.
+ */
+static void pci_doe_abort_start(struct pci_doe_mb *doe_mb)
+{
+	struct pci_dev *pdev = doe_mb->pdev;
+	int offset = doe_mb->cap_offset;
+	u32 val;
+
+	val = PCI_DOE_CTRL_ABORT;
+	if (doe_mb->irq >= 0)
+		val |= PCI_DOE_CTRL_INT_EN;
+	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
+
+	doe_mb->timeout_jiffies = jiffies + HZ;
+	schedule_delayed_work(&doe_mb->statemachine, HZ);
+}
+
+static int pci_doe_send_req(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
+{
+	struct pci_dev *pdev = doe_mb->pdev;
+	int offset = doe_mb->cap_offset;
+	u32 val;
+	int i;
+
+	/*
+	 * Check the DOE busy bit is not set. If it is set, this could indicate
+	 * someone other than Linux (e.g. firmware) is using the mailbox. Note
+	 * it is expected that firmware and OS will negotiate access rights via
+	 * an as-yet-to-be-defined method.
+	 */
+	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+	if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
+		return -EBUSY;
+
+	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
+		return -EIO;
+
+	/* Write DOE Header */
+	val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, task->prot.vid) |
+		FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, task->prot.type);
+	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val);
+	/* Length is 2 DW of header + length of payload in DW */
+	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
+			       FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH,
+					  2 + task->request_pl_sz /
+						sizeof(u32)));
+	for (i = 0; i < task->request_pl_sz / sizeof(u32); i++)
+		pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
+				       task->request_pl[i]);
+
+	val = PCI_DOE_CTRL_GO;
+	if (doe_mb->irq >= 0)
+		val |= PCI_DOE_CTRL_INT_EN;
+
+	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
+	/* Request is sent - now wait for poll or IRQ */
+	return 0;
+}
+
+static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
+{
+	struct pci_dev *pdev = doe_mb->pdev;
+	int offset = doe_mb->cap_offset;
+	size_t length;
+	u32 val;
+	int i;
+
+	/* Read the first dword to get the protocol */
+	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+	if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != task->prot.vid) ||
+	    (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != task->prot.type)) {
+		pci_err(pdev,
+			"Expected [VID, Protocol] = [%#x, %#x], got [%#x, %#x]\n",
+			task->prot.vid, task->prot.type,
+			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
+			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
+		return -EIO;
+	}
+
+	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+	/* Read the second dword to get the length */
+	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+
+	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
+	if (length > SZ_1M || length < 2)
+		return -EIO;
+
+	/* First 2 dwords have already been read */
+	length -= 2;
+	/* Read the rest of the response payload */
+	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
+		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
+				      &task->response_pl[i]);
+		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+	}
+
+	/* Flush excess length */
+	for (; i < length; i++) {
+		pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+	}
+	/* Final error check to pick up on any since Data Object Ready */
+	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
+		return -EIO;
+
+	return min(length, task->response_pl_sz / sizeof(u32)) * sizeof(u32);
+}
+
+static void signal_task_complete(struct pci_doe_task *task, int rv)
+{
+	task->rv = rv;
+	task->complete(task);
+}
+
+static void retire_cur_task(struct pci_doe_mb *doe_mb)
+{
+	mutex_lock(&doe_mb->task_lock);
+	doe_mb->cur_task = NULL;
+	mutex_unlock(&doe_mb->task_lock);
+	wake_up_interruptible(&doe_mb->wq);
+}
+
+static void doe_statemachine_work(struct work_struct *work)
+{
+	struct delayed_work *w = to_delayed_work(work);
+	struct pci_doe_mb *doe_mb = container_of(w, struct pci_doe_mb,
+						 statemachine);
+	struct pci_dev *pdev = doe_mb->pdev;
+	int offset = doe_mb->cap_offset;
+	struct pci_doe_task *task;
+	u32 val;
+	int rc;
+
+	mutex_lock(&doe_mb->task_lock);
+	task = doe_mb->cur_task;
+	mutex_unlock(&doe_mb->task_lock);
+
+	if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags)) {
+		/*
+		 * Currently only used during init - care needed if
+		 * pci_doe_abort() is generally exposed as it would impact
+		 * queries in flight.
+		 */
+		if (task)
+			pr_err("Aborting with active task!\n");
+		doe_mb->state = DOE_WAIT_ABORT;
+		pci_doe_abort_start(doe_mb);
+		return;
+	}
+
+	switch (doe_mb->state) {
+	case DOE_IDLE:
+		if (task == NULL)
+			return;
+
+		rc = pci_doe_send_req(doe_mb, task);
+		/*
+		 * The specification does not provide any guidance on how long
+		 * some other entity could keep the DOE busy, so try for 1
+		 * second then fail. Busy handling is best effort only, because
+		 * there is no way of avoiding racing against another user of
+		 * the DOE.
+		 */
+		if (rc == -EBUSY) {
+			doe_mb->busy_retries++;
+			if (doe_mb->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
+				/* Long enough, fail this request */
+				pci_warn(pdev,
+					"DOE busy for too long (> 1 sec)\n");
+				doe_mb->busy_retries = 0;
+				goto err_busy;
+			}
+			schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES);
+			return;
+		}
+		if (rc)
+			goto err_abort;
+		doe_mb->busy_retries = 0;
+
+		doe_mb->state = DOE_WAIT_RESP;
+		doe_mb->timeout_jiffies = jiffies + HZ;
+		/* Now poll or wait for IRQ with timeout */
+		if (doe_mb->irq >= 0)
+			schedule_delayed_work(w, PCI_DOE_TIMEOUT);
+		else
+			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
+		return;
+
+	case DOE_WAIT_RESP:
+		/* Not possible to get here with NULL task */
+		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+		if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
+			rc = -EIO;
+			goto err_abort;
+		}
+
+		if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
+			/* If not yet at timeout reschedule otherwise abort */
+			if (time_after(jiffies, doe_mb->timeout_jiffies)) {
+				rc = -ETIMEDOUT;
+				goto err_abort;
+			}
+			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
+			return;
+		}
+
+		rc  = pci_doe_recv_resp(doe_mb, task);
+		if (rc < 0)
+			goto err_abort;
+
+		doe_mb->state = DOE_IDLE;
+
+		retire_cur_task(doe_mb);
+		/* Set the return value to the length of received payload */
+		signal_task_complete(task, rc);
+
+		return;
+
+	case DOE_WAIT_ABORT:
+	case DOE_WAIT_ABORT_ON_ERR:
+		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+
+		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
+		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
+			/* Back to normal state - carry on */
+			retire_cur_task(doe_mb);
+
+			/*
+			 * For deliberately triggered abort, someone is
+			 * waiting.
+			 */
+			if (doe_mb->state == DOE_WAIT_ABORT) {
+				if (task)
+					signal_task_complete(task, -EFAULT);
+				complete(&doe_mb->abort_c);
+			}
+
+			doe_mb->state = DOE_IDLE;
+			return;
+		}
+		if (time_after(jiffies, doe_mb->timeout_jiffies)) {
+			/* Task has timed out and is dead - abort */
+			pci_err(pdev, "DOE ABORT timed out\n");
+			set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
+			retire_cur_task(doe_mb);
+
+			if (doe_mb->state == DOE_WAIT_ABORT) {
+				if (task)
+					signal_task_complete(task, -EFAULT);
+				complete(&doe_mb->abort_c);
+			}
+		}
+		return;
+	}
+
+err_abort:
+	doe_mb->state = DOE_WAIT_ABORT_ON_ERR;
+	pci_doe_abort_start(doe_mb);
+err_busy:
+	signal_task_complete(task, rc);
+	if (doe_mb->state == DOE_IDLE)
+		retire_cur_task(doe_mb);
+}
+
+static void pci_doe_task_complete(struct pci_doe_task *task)
+{
+	complete(task->private);
+}
+
+static int pci_doe_discovery(struct pci_doe_mb *doe_mb, u8 *index, u16 *vid,
+			     u8 *protocol)
+{
+	u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX,
+				    *index);
+	u32 response_pl;
+	DECLARE_COMPLETION_ONSTACK(c);
+	struct pci_doe_task task = {
+		.prot.vid = PCI_VENDOR_ID_PCI_SIG,
+		.prot.type = PCI_DOE_PROTOCOL_DISCOVERY,
+		.request_pl = &request_pl,
+		.request_pl_sz = sizeof(request_pl),
+		.response_pl = &response_pl,
+		.response_pl_sz = sizeof(response_pl),
+		.complete = pci_doe_task_complete,
+		.private = &c,
+	};
+	int ret;
+
+	ret = pci_doe_submit_task(doe_mb, &task);
+	if (ret < 0)
+		return ret;
+
+	wait_for_completion(&c);
+
+	if (task.rv != sizeof(response_pl))
+		return -EIO;
+
+	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
+	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
+			      response_pl);
+	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
+			   response_pl);
+
+	return 0;
+}
+
+static int pci_doe_cache_protocols(struct pci_doe_mb *doe_mb)
+{
+	u8 index = 0;
+	int num_prots;
+	int rc;
+
+	/* Discovery protocol must always be supported and must report itself */
+	num_prots = 1;
+
+	doe_mb->prots = kcalloc(num_prots, sizeof(*doe_mb->prots), GFP_KERNEL);
+	if (!doe_mb->prots)
+		return -ENOMEM;
+
+	/*
+	 * NOTE: doe_mb->prots is freed by pci_doe_free_mb() automatically on
+	 * error if pci_doe_cache_protocols() fails past this point.
+	 */
+	do {
+		struct pci_doe_protocol *prot;
+
+		prot = &doe_mb->prots[num_prots - 1];
+		rc = pci_doe_discovery(doe_mb, &index, &prot->vid, &prot->type);
+		if (rc)
+			return rc;
+
+		if (index) {
+			struct pci_doe_protocol *prot_new;
+
+			num_prots++;
+			prot_new = krealloc(doe_mb->prots,
+					    sizeof(*doe_mb->prots) * num_prots,
+					    GFP_KERNEL);
+			if (!prot_new)
+				return -ENOMEM;
+
+			doe_mb->prots = prot_new;
+		}
+	} while (index);
+
+	doe_mb->num_prots = num_prots;
+	return 0;
+}
+
+static int pci_doe_abort(struct pci_doe_mb *doe_mb)
+{
+	reinit_completion(&doe_mb->abort_c);
+	set_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags);
+	schedule_delayed_work(&doe_mb->statemachine, 0);
+	wait_for_completion(&doe_mb->abort_c);
+
+	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags))
+		return -EIO;
+
+	return 0;
+}
+
+static int pci_doe_request_irq(struct pci_doe_mb *doe_mb)
+{
+	struct pci_dev *pdev = doe_mb->pdev;
+	int offset = doe_mb->cap_offset;
+	int doe_irq, rc;
+	u32 val;
+
+	pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
+
+	if (!FIELD_GET(PCI_DOE_CAP_INT, val))
+		return -EOPNOTSUPP;
+
+	doe_irq = FIELD_GET(PCI_DOE_CAP_IRQ, val);
+	rc = pci_request_irq(pdev, doe_irq, pci_doe_irq_handler,
+			     NULL, doe_mb,
+			     "DOE[%d:%s]", doe_irq, pci_name(pdev));
+	if (rc)
+		return rc;
+
+	doe_mb->irq = doe_irq;
+	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
+			       PCI_DOE_CTRL_INT_EN);
+	return 0;
+}
+
+static void pci_doe_free_mb(struct pci_doe_mb *doe_mb)
+{
+	if (doe_mb->irq >= 0)
+		pci_free_irq(doe_mb->pdev, doe_mb->irq, doe_mb);
+	kfree(doe_mb->prots);
+	kfree(doe_mb);
+}
+
+/**
+ * pci_doe_create_mb() - Create a DOE mailbox object
+ *
+ * @pdev: PCI device to create the DOE mailbox for
+ * @cap_offset: Offset of the DOE mailbox
+ * @use_irq: Should the state machine use an irq
+ *
+ * Create a single mailbox object to manage the mailbox protocol at the
+ * cap_offset specified.
+ *
+ * Caller should allocate PCI irq vectors before setting use_irq.
+ *
+ * RETURNS: created mailbox object on success
+ *	    ERR_PTR(-errno) on failure
+ */
+struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
+				     bool use_irq)
+{
+	struct pci_doe_mb *doe_mb;
+	int rc;
+
+	doe_mb = kzalloc(sizeof(*doe_mb), GFP_KERNEL);
+	if (!doe_mb)
+		return ERR_PTR(-ENOMEM);
+
+	doe_mb->pdev = pdev;
+	init_completion(&doe_mb->abort_c);
+	doe_mb->irq = -1;
+	doe_mb->cap_offset = cap_offset;
+
+	init_waitqueue_head(&doe_mb->wq);
+	mutex_init(&doe_mb->task_lock);
+	INIT_DELAYED_WORK(&doe_mb->statemachine, doe_statemachine_work);
+	doe_mb->state = DOE_IDLE;
+
+	if (use_irq) {
+		rc = pci_doe_request_irq(doe_mb);
+		if (rc)
+			pci_err(pdev, "DOE request irq failed for mailbox @ %u : %d\n",
+				cap_offset, rc);
+	}
+
+	/* Reset the mailbox by issuing an abort */
+	rc = pci_doe_abort(doe_mb);
+	if (rc) {
+		pci_err(pdev, "DOE failed to reset the mailbox @ %u : %d\n",
+			cap_offset, rc);
+		pci_doe_free_mb(doe_mb);
+		return ERR_PTR(rc);
+	}
+
+	rc = pci_doe_cache_protocols(doe_mb);
+	if (rc) {
+		pci_err(pdev, "DOE failed to cache protocols for mailbox @ %u : %d\n",
+			cap_offset, rc);
+		pci_doe_free_mb(doe_mb);
+		return ERR_PTR(rc);
+	}
+
+	return doe_mb;
+}
+EXPORT_SYMBOL_GPL(pci_doe_create_mb);
+
+/**
+ * pci_doe_supports_prot() - Return if the DOE instance supports the given
+ *			     protocol
+ * @doe_mb: DOE mailbox capability to query
+ * @vid: Protocol Vendor ID
+ * @type: protocol type
+ *
+ * RETURNS: True if the DOE device supports the protocol specified
+ */
+bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type)
+{
+	int i;
+
+	/* The discovery protocol must always be supported */
+	if (vid == PCI_VENDOR_ID_PCI_SIG && type == PCI_DOE_PROTOCOL_DISCOVERY)
+		return true;
+
+	for (i = 0; i < doe_mb->num_prots; i++)
+		if ((doe_mb->prots[i].vid == vid) &&
+		    (doe_mb->prots[i].type == type))
+			return true;
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
+
+/**
+ * pci_doe_submit_task() - Submit a task to be processed by the state machine
+ *
+ * @doe_mb: DOE mailbox capability to submit to
+ * @task: task to be queued
+ *
+ * Submit a DOE task (request/response) to the DOE mailbox to be processed.
+ * Returns upon queueing the task object.  If the queue is full this function
+ * will sleep until there is room in the queue.
+ *
+ * task->complete will be called when the state machine is done processing this
+ * task.
+ *
+ * Excess data will be discarded.
+ *
+ * RETURNS: 0 when task has been successfully queued, -ERRNO on error
+ */
+int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
+{
+	if (!pci_doe_supports_prot(doe_mb, task->prot.vid, task->prot.type))
+		return -EINVAL;
+
+	/* DOE requests must be a whole number of DW */
+	if (task->request_pl_sz % sizeof(u32))
+		return -EINVAL;
+
+again:
+	mutex_lock(&doe_mb->task_lock);
+	if (doe_mb->cur_task) {
+		mutex_unlock(&doe_mb->task_lock);
+		wait_event_interruptible(doe_mb->wq, doe_mb->cur_task == NULL);
+		goto again;
+	}
+
+	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
+		mutex_unlock(&doe_mb->task_lock);
+		return -EIO;
+	}
+	doe_mb->cur_task = task;
+	mutex_unlock(&doe_mb->task_lock);
+	schedule_delayed_work(&doe_mb->statemachine, 0);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pci_doe_submit_task);
+
+/**
+ * pci_doe_destroy_mb() - Destroy a DOE mailbox object created with
+ * pci_doe_create_mb()
+ *
+ * @doe_mb: DOE mailbox capability structure to destroy
+ *
+ * The mailbox becomes invalid and should not be used after this call.
+ */
+void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb)
+{
+	/* abort any work in progress */
+	pci_doe_abort(doe_mb);
+
+	/* halt the state machine */
+	cancel_delayed_work_sync(&doe_mb->statemachine);
+
+	pci_doe_free_mb(doe_mb);
+}
+EXPORT_SYMBOL_GPL(pci_doe_destroy_mb);
diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
new file mode 100644
index 000000000000..7e6ebaf9930a
--- /dev/null
+++ b/include/linux/pci-doe.h
@@ -0,0 +1,119 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Data Object Exchange
+ * https://members.pcisig.com/wg/PCI-SIG/document/16609
+ *
+ * Copyright (C) 2021 Huawei
+ *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
+ *
+ * Copyright (C) 2022 Intel Corporation
+ *	Ira Weiny <ira.weiny@intel.com>
+ */
+
+#ifndef LINUX_PCI_DOE_H
+#define LINUX_PCI_DOE_H
+
+#include <linux/completion.h>
+
+enum pci_doe_state {
+	DOE_IDLE,
+	DOE_WAIT_RESP,
+	DOE_WAIT_ABORT,
+	DOE_WAIT_ABORT_ON_ERR,
+};
+
+#define PCI_DOE_FLAG_ABORT	0
+#define PCI_DOE_FLAG_DEAD	1
+
+struct pci_doe_protocol {
+	u16 vid;
+	u8 type;
+};
+
+/**
+ * struct pci_doe_task - represents a single query/response
+ *
+ * @prot: DOE Protocol
+ * @request_pl: The request payload
+ * @request_pl_sz: Size of the request payload
+ * @response_pl: The response payload
+ * @response_pl_sz: Size of the response payload
+ * @rv: Return value.  Length of received response or error
+ * @complete: Called when task is complete
+ * @private: Private data for the consumer
+ */
+struct pci_doe_task {
+	struct pci_doe_protocol prot;
+	u32 *request_pl;
+	size_t request_pl_sz;
+	u32 *response_pl;
+	size_t response_pl_sz;
+	int rv;
+	void (*complete)(struct pci_doe_task *task);
+	void *private;
+};
+
+/**
+ * struct pci_doe_mb - State for a single DOE mailbox
+ *
+ * This state is used to manage a single DOE mailbox capability.  All fields
+ * should be considered opaque to the consumers and the structure passed into
+ * the helpers below after being created by pci_doe_create_mb()
+ *
+ * @pdev: PCI device this mailbox belongs to
+ * @abort_c: Completion used for initial abort handling
+ * @irq: Interrupt used for signaling DOE ready or abort
+ * @prots: Array of protocols supported on this DOE
+ * @num_prots: Size of prots array
+ * @cap_offset: Capability offset
+ * @wq: Wait queue to wait on if a query is in progress
+ * @cur_task: Current task the state machine is working on
+ * @task_lock: Protect cur_task
+ * @statemachine: Work item for the DOE state machine
+ * @state: Current state of this DOE
+ * @timeout_jiffies: 1 second after GO set
+ * @busy_retries: Count of retry attempts
+ * @flags: Bit array of PCI_DOE_FLAG_* flags
+ *
+ * Note: prots can't be allocated with struct size because the number of
+ * protocols is not known until after this structure is in use.  However, the
+ * single discovery protocol is always required to query for the number of
+ * protocols.
+ */
+struct pci_doe_mb {
+	struct pci_dev *pdev;
+	struct completion abort_c;
+	int irq;
+	struct pci_doe_protocol *prots;
+	int num_prots;
+	u16 cap_offset;
+
+	wait_queue_head_t wq;
+	struct pci_doe_task *cur_task;
+	struct mutex task_lock;
+	struct delayed_work statemachine;
+	enum pci_doe_state state;
+	unsigned long timeout_jiffies;
+	unsigned int busy_retries;
+	unsigned long flags;
+};
+
+/**
+ * pci_doe_for_each_off - Iterate each DOE capability
+ * @pdev: struct pci_dev to iterate
+ * @off: u16 of config space offset of each mailbox capability found
+ */
+#define pci_doe_for_each_off(pdev, off) \
+	for (off = pci_find_next_ext_capability(pdev, off, \
+					PCI_EXT_CAP_ID_DOE); \
+		off > 0; \
+		off = pci_find_next_ext_capability(pdev, off, \
+					PCI_EXT_CAP_ID_DOE))
+
+struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
+				     bool use_irq);
+void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb);
+bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);
+int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task);
+
+#endif
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index bee1a9ed6e66..4e96b45ee36d 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -736,7 +736,8 @@
 #define PCI_EXT_CAP_ID_DVSEC	0x23	/* Designated Vendor-Specific */
 #define PCI_EXT_CAP_ID_DLF	0x25	/* Data Link Feature */
 #define PCI_EXT_CAP_ID_PL_16GT	0x26	/* Physical Layer 16.0 GT/s */
-#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_PL_16GT
+#define PCI_EXT_CAP_ID_DOE	0x2E	/* Data Object Exchange */
+#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_DOE
 
 #define PCI_EXT_CAP_DSN_SIZEOF	12
 #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
@@ -1102,4 +1103,30 @@
 #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK		0x000000F0
 #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT	4
 
+/* Data Object Exchange */
+#define PCI_DOE_CAP		0x04    /* DOE Capabilities Register */
+#define  PCI_DOE_CAP_INT			0x00000001  /* Interrupt Support */
+#define  PCI_DOE_CAP_IRQ			0x00000ffe  /* Interrupt Message Number */
+#define PCI_DOE_CTRL		0x08    /* DOE Control Register */
+#define  PCI_DOE_CTRL_ABORT			0x00000001  /* DOE Abort */
+#define  PCI_DOE_CTRL_INT_EN			0x00000002  /* DOE Interrupt Enable */
+#define  PCI_DOE_CTRL_GO			0x80000000  /* DOE Go */
+#define PCI_DOE_STATUS		0x0c    /* DOE Status Register */
+#define  PCI_DOE_STATUS_BUSY			0x00000001  /* DOE Busy */
+#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */
+#define  PCI_DOE_STATUS_ERROR			0x00000004  /* DOE Error */
+#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */
+#define PCI_DOE_WRITE		0x10    /* DOE Write Data Mailbox Register */
+#define PCI_DOE_READ		0x14    /* DOE Read Data Mailbox Register */
+
+/* DOE Data Object - note not actually registers */
+#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
+#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
+#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
+
+#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
+
 #endif /* LINUX_PCI_REGS_H */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V8 04/10] cxl/pci: Create auxiliary devices for each DOE mailbox
  2022-04-14 20:32 [PATCH V8 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (2 preceding siblings ...)
  2022-04-14 20:32 ` [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes ira.weiny
@ 2022-04-14 20:32 ` ira.weiny
  2022-04-27 17:19   ` Jonathan Cameron
  2022-04-29 15:55   ` Jonathan Cameron
  2022-04-14 20:32 ` [PATCH V8 05/10] cxl/pci: Create DOE auxiliary driver ira.weiny
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 42+ messages in thread
From: ira.weiny @ 2022-04-14 20:32 UTC (permalink / raw)
  To: Dan Williams, Bjorn Helgaas, Jonathan Cameron
  Cc: Ira Weiny, Alison Schofield, Vishal Verma, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

CXL kernel drivers optionally need to access DOE mailbox capabilities.
Access to mailboxes for things such as CDAT, SPDM, and IDE is needed by
the kernel, while other access is intended for user space usage.  An
example of this is CXL Compliance Testing (see CXL 2.0 14.16.4
Compliance Mode DOE) which offers a mechanism to set different test
modes for a device.

There is no anticipated need for the kernel to share an individual
mailbox with user space.  Thus developing an interface to marshal access
between the kernel and user space for a single mailbox is unnecessary
overhead.  However, having the kernel relinquish some mailboxes to be
controlled by user space is a reasonable compromise to share access to
the device.

The auxiliary bus provides an elegant solution for this.  Each DOE
capability is given its own auxiliary device.  This device is controlled
by a kernel driver by default which restricts access to the mailbox.
Unbinding the driver from a single auxiliary device (DOE mailbox
capability) frees the mailbox for user space access.  This architecture
also gives a clear picture of which mailboxes are kernel controlled and
which are not.

Iterate each DOE mailbox capability and create auxiliary bus devices.
Follow on patches will define a driver for the newly created devices.

The newly created devices are visible in sysfs:

$ ls -l /sys/bus/auxiliary/devices/
total 0
lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.0 -> ../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.1 -> ../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.2 -> ../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.3 -> ../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.4 -> ../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.4
lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.5 -> ../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.6 -> ../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.7 -> ../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V7:
	Minor code clean ups
	Rebased on cxl-pending

Changes from V6:
	Move all the auxiliary device stuff to the CXL layer

Changes from V5:
	Split the CXL specific stuff off from the PCI DOE create
	auxiliary device code.
---
 drivers/cxl/Kconfig  |   1 +
 drivers/cxl/cxlpci.h |  21 +++++++
 drivers/cxl/pci.c    | 127 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 149 insertions(+)

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index f64e3984689f..ac0f5ca95431 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -16,6 +16,7 @@ if CXL_BUS
 config CXL_PCI
 	tristate "PCI manageability"
 	default CXL_BUS
+	select AUXILIARY_BUS
 	help
 	  The CXL specification defines a "CXL memory device" sub-class in the
 	  PCI "memory controller" base class of devices. Device's identified by
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 329e7ea3f36a..2ad8715173ce 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -2,6 +2,7 @@
 /* Copyright(c) 2020 Intel Corporation. All rights reserved. */
 #ifndef __CXL_PCI_H__
 #define __CXL_PCI_H__
+#include <linux/auxiliary_bus.h>
 #include <linux/pci.h>
 #include "cxl.h"
 
@@ -72,4 +73,24 @@ static inline resource_size_t cxl_regmap_to_base(struct pci_dev *pdev,
 }
 
 int devm_cxl_port_enumerate_dports(struct cxl_port *port);
+
+/**
+ * struct cxl_doe_dev - CXL DOE auxiliary bus device
+ *
+ * @adev: Auxiliary bus device
+ * @pdev: PCI device this belongs to
+ * @cap_offset: Capability offset
+ * @use_irq: Set if IRQs are to be used with this mailbox
+ *
+ * This represents a single DOE mailbox device.  CXL devices should create this
+ * device and register it on the Auxiliary bus for the CXL DOE driver to drive.
+ */
+struct cxl_doe_dev {
+	struct auxiliary_device adev;
+	struct pci_dev *pdev;
+	int cap_offset;
+	bool use_irq;
+};
+#define DOE_DEV_NAME "doe"
+
 #endif /* __CXL_PCI_H__ */
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index e7ab9a34d718..41a6f3eb0a5c 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -8,6 +8,7 @@
 #include <linux/mutex.h>
 #include <linux/list.h>
 #include <linux/pci.h>
+#include <linux/pci-doe.h>
 #include <linux/io.h>
 #include "cxlmem.h"
 #include "cxlpci.h"
@@ -564,6 +565,128 @@ static void cxl_dvsec_ranges(struct cxl_dev_state *cxlds)
 	info->ranges = __cxl_dvsec_ranges(cxlds, info);
 }
 
+static void cxl_pci_free_irq_vectors(void *data)
+{
+	pci_free_irq_vectors(data);
+}
+
+static DEFINE_IDA(pci_doe_adev_ida);
+
+static void cxl_pci_doe_dev_release(struct device *dev)
+{
+	struct auxiliary_device *adev = container_of(dev,
+						struct auxiliary_device,
+						dev);
+	struct cxl_doe_dev *doe_dev = container_of(adev, struct cxl_doe_dev,
+						   adev);
+
+	ida_free(&pci_doe_adev_ida, adev->id);
+	kfree(doe_dev);
+}
+
+static void cxl_pci_doe_destroy_device(void *ad)
+{
+	auxiliary_device_delete(ad);
+	auxiliary_device_uninit(ad);
+}
+
+/**
+ * cxl_pci_create_doe_devices - Create auxiliary bus DOE devices for all DOE
+ *				mailboxes found
+ *
+ * @pdev: The PCI device to scan for DOE mailboxes
+ *
+ * There is no corresponding destroy of these devices.  This function associates
+ * the DOE auxiliary devices created with the pci_dev passed in.  That
+ * association is device managed (devm_*) such that the DOE auxiliary device
+ * lifetime is always less than or equal to the lifetime of the pci_dev.
+ *
+ * RETURNS: 0 on success -ERRNO on failure.
+ */
+static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
+{
+	struct device *dev = &pdev->dev;
+	bool use_irq = true;
+	int irqs = 0;
+	u16 off = 0;
+	int rc;
+
+	pci_doe_for_each_off(pdev, off)
+		irqs++;
+	pci_info(pdev, "Found %d DOE mailboxes\n", irqs);
+
+	/*
+	 * Allocate enough vectors for the DOEs
+	 */
+	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSI |
+						     PCI_IRQ_MSIX);
+	if (rc != irqs) {
+		pci_err(pdev,
+			"Not enough interrupts for all the DOEs; use polling\n");
+		use_irq = false;
+		/* Some got allocated; clean them up */
+		if (rc > 0)
+			cxl_pci_free_irq_vectors(pdev);
+	} else {
+		/*
+		 * Enabling bus mastering is required for MSI/MSI-X.  It could
+		 * be done later within the DOE initialization, but as it
+		 * potentially has other impacts keep it here when setting up
+		 * the IRQs.
+		 */
+		pci_set_master(pdev);
+		rc = devm_add_action_or_reset(dev,
+					      cxl_pci_free_irq_vectors,
+					      pdev);
+		if (rc)
+			return rc;
+	}
+
+	pci_doe_for_each_off(pdev, off) {
+		struct auxiliary_device *adev;
+		struct cxl_doe_dev *new_dev;
+		int id;
+
+		new_dev = kzalloc(sizeof(*new_dev), GFP_KERNEL);
+		if (!new_dev)
+			return -ENOMEM;
+
+		new_dev->pdev = pdev;
+		new_dev->cap_offset = off;
+		new_dev->use_irq = use_irq;
+
+		/* Set up struct auxiliary_device */
+		adev = &new_dev->adev;
+		id = ida_alloc(&pci_doe_adev_ida, GFP_KERNEL);
+		if (id < 0) {
+			kfree(new_dev);
+			return -ENOMEM;
+		}
+
+		adev->id = id;
+		adev->name = DOE_DEV_NAME;
+		adev->dev.release = cxl_pci_doe_dev_release;
+		adev->dev.parent = dev;
+
+		if (auxiliary_device_init(adev)) {
+			cxl_pci_doe_dev_release(&adev->dev);
+			return -EIO;
+		}
+
+		if (auxiliary_device_add(adev)) {
+			auxiliary_device_uninit(adev);
+			return -EIO;
+		}
+
+		rc = devm_add_action_or_reset(dev, cxl_pci_doe_destroy_device,
+					      adev);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
 static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct cxl_register_map map;
@@ -630,6 +753,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
+	rc = cxl_pci_create_doe_devices(pdev);
+	if (rc)
+		return rc;
+
 	cxl_dvsec_ranges(cxlds);
 
 	cxlmd = devm_cxl_add_memdev(cxlds);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V8 05/10] cxl/pci: Create DOE auxiliary driver
  2022-04-14 20:32 [PATCH V8 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (3 preceding siblings ...)
  2022-04-14 20:32 ` [PATCH V8 04/10] cxl/pci: Create auxiliary devices for each DOE mailbox ira.weiny
@ 2022-04-14 20:32 ` ira.weiny
  2022-04-27 17:43   ` Jonathan Cameron
  2022-04-14 20:32 ` [PATCH V8 06/10] cxl/pci: Find the DOE mailbox which supports CDAT ira.weiny
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: ira.weiny @ 2022-04-14 20:32 UTC (permalink / raw)
  To: Dan Williams, Bjorn Helgaas, Jonathan Cameron
  Cc: Ira Weiny, Alison Schofield, Vishal Verma, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

CXL kernel drivers optionally need to access DOE mailbox capabilities.
Access to mailboxes for things such as CDAT, SPDM, and IDE is needed by
the kernel, while other access is intended for user space usage.  An
example of this is CXL Compliance Testing (see CXL 2.0 14.16.4
Compliance Mode DOE) which offers a mechanism to set different test
modes for a device.

There is no anticipated need for the kernel to share an individual
mailbox with user space.  Thus developing an interface to marshal access
between the kernel and user space for a single mailbox is unnecessary
overhead.  However, having the kernel relinquish some mailboxes to be
controlled by user space is a reasonable compromise to share access to
the device.

The auxiliary bus provides an elegant solution for this.  Each DOE
capability is given its own auxiliary device.  This device is controlled
by a kernel driver by default which restricts access to the mailbox.
Unbinding the driver from a single auxiliary device (DOE mailbox
capability) frees the mailbox for user space access.  This architecture
also gives a clear picture of which mailboxes are kernel controlled and
which are not.

Create a driver for the DOE auxiliary devices.  The driver uses the PCI
DOE core to manage the mailbox.

User space must be prevented from unbinding the driver while the DOE
auxiliary driver state is being accessed by the kernel.  Add a read/write
lock to the DOE auxiliary device to protect the driver data portion.

Finally, flag the driver module to be preloaded at device creation to
ensure the driver is attached when iterating the DOE capabilities.

User space access can be obtained by unbinding the driver from that
device.  For example:

$ ls -l /sys/bus/auxiliary/drivers
total 0
drwxr-xr-x 2 root root 0 Mar 24 10:45 cxl_doe.cxl_doe_drv

$ ls -l /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci*
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.0 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.1 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.2 -> ../../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.3 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.4 -> ../../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.4
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.5 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.6 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.7 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7

$ echo "cxl_pci.doe.4" > /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/unbind

$ ls -l /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci*
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.0 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.1 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.2 -> ../../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.3 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.5 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.6 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.7 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V7:
	Now need to select PCI_DOE
	Change MODULE_LICENSE to 'GPL' instead of old 'GPL v2'

Changes from V6:
	The CXL layer now contains the driver for these auxiliary
	devices.

Changes from V5:
	Split the CXL specific stuff off from the PCI DOE create
	auxiliary device code.
---
 drivers/cxl/Kconfig           | 13 +++++
 drivers/cxl/Makefile          |  2 +
 drivers/cxl/cxlpci.h          | 13 +++++
 drivers/cxl/doe.c             | 90 +++++++++++++++++++++++++++++++++++
 drivers/cxl/pci.c             | 20 ++++++++
 include/uapi/linux/pci_regs.h |  1 +
 6 files changed, 139 insertions(+)
 create mode 100644 drivers/cxl/doe.c

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index ac0f5ca95431..82f3908fa5cc 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -103,4 +103,17 @@ config CXL_SUSPEND
 	def_bool y
 	depends on SUSPEND && CXL_MEM
 
+config CXL_PCI_DOE
+	tristate "CXL PCI Data Object Exchange (DOE) support"
+	depends on CXL_PCI
+	default CXL_BUS
+	select PCI_DOE
+	help
+	  Driver for DOE auxiliary devices.
+
+	  The DOE capability provides a simple mailbox in PCI config space that
+	  is used for a number of different protocols useful to CXL.  The CXL
+	  PCI subsystem creates auxiliary devices for each DOE mailbox
+	  capability found.  This driver is required for the kernel to use
+	  these devices.
+
 endif
diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
index a78270794150..c71b7a6345fb 100644
--- a/drivers/cxl/Makefile
+++ b/drivers/cxl/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-y += core/
 obj-$(CONFIG_CXL_PCI) += cxl_pci.o
+obj-$(CONFIG_CXL_PCI_DOE) += cxl_doe.o
 obj-$(CONFIG_CXL_MEM) += cxl_mem.o
 obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
 obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
@@ -8,6 +9,7 @@ obj-$(CONFIG_CXL_PORT) += cxl_port.o
 
 cxl_mem-y := mem.o
 cxl_pci-y := pci.o
+cxl_doe-y := doe.o
 cxl_acpi-y := acpi.o
 cxl_pmem-y := pmem.o
 cxl_port-y := port.o
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 2ad8715173ce..821fe05e8289 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -79,6 +79,7 @@ int devm_cxl_port_enumerate_dports(struct cxl_port *port);
  *
  * @adev: Auxiliary bus device
  * @pdev: PCI device this belongs to
+ * @driver_access: Lock the driver during access
  * @cap_offset: Capability offset
  * @use_irq: Set if IRQs are to be used with this mailbox
  *
@@ -88,9 +89,21 @@ int devm_cxl_port_enumerate_dports(struct cxl_port *port);
 struct cxl_doe_dev {
 	struct auxiliary_device adev;
 	struct pci_dev *pdev;
+	struct rw_semaphore driver_access;
 	int cap_offset;
 	bool use_irq;
 };
 #define DOE_DEV_NAME "doe"
 
+/**
+ * struct cxl_doe_drv_state - state of the DOE Aux driver
+ *
+ * @doe_dev: The Auxiliary DOE device
+ * @doe_mb: PCI DOE mailbox state
+ */
+struct cxl_doe_drv_state {
+	struct cxl_doe_dev *doe_dev;
+	struct pci_doe_mb *doe_mb;
+};
+
 #endif /* __CXL_PCI_H__ */
diff --git a/drivers/cxl/doe.c b/drivers/cxl/doe.c
new file mode 100644
index 000000000000..1d3a24a77002
--- /dev/null
+++ b/drivers/cxl/doe.c
@@ -0,0 +1,90 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2022 Intel Corporation. All rights reserved. */
+
+#include <linux/pci.h>
+#include <linux/pci-doe.h>
+
+#include "cxlpci.h"
+
+static void doe_destroy_mb(void *ds)
+{
+	struct cxl_doe_drv_state *doe_ds = ds;
+
+	pci_doe_destroy_mb(doe_ds->doe_mb);
+}
+
+static int cxl_pci_doe_probe(struct auxiliary_device *aux_dev,
+			     const struct auxiliary_device_id *id)
+{
+	struct cxl_doe_dev *doe_dev = container_of(aux_dev, struct cxl_doe_dev,
+						   adev);
+	struct device *dev = &aux_dev->dev;
+	struct cxl_doe_drv_state *doe_ds;
+	struct pci_doe_mb *doe_mb;
+
+	doe_ds = devm_kzalloc(dev, sizeof(*doe_ds), GFP_KERNEL);
+	if (!doe_ds)
+		return -ENOMEM;
+
+	doe_mb = pci_doe_create_mb(doe_dev->pdev, doe_dev->cap_offset,
+				   doe_dev->use_irq);
+	if (IS_ERR(doe_mb)) {
+		dev_err(dev, "Failed to create the DOE mailbox state machine\n");
+		return PTR_ERR(doe_mb);
+	}
+
+	doe_ds->doe_mb = doe_mb;
+	devm_add_action_or_reset(dev, doe_destroy_mb, doe_ds);
+
+	down_write(&doe_dev->driver_access);
+	auxiliary_set_drvdata(aux_dev, doe_ds);
+	up_write(&doe_dev->driver_access);
+
+	return 0;
+}
+
+static void cxl_pci_doe_remove(struct auxiliary_device *aux_dev)
+{
+	struct cxl_doe_dev *doe_dev = container_of(aux_dev, struct cxl_doe_dev,
+						   adev);
+
+	down_write(&doe_dev->driver_access);
+	auxiliary_set_drvdata(aux_dev, NULL);
+	up_write(&doe_dev->driver_access);
+}
+
+static const struct auxiliary_device_id cxl_pci_doe_auxiliary_id_table[] = {
+	{.name = "cxl_pci." DOE_DEV_NAME, },
+	{},
+};
+
+MODULE_DEVICE_TABLE(auxiliary, cxl_pci_doe_auxiliary_id_table);
+
+struct auxiliary_driver cxl_pci_doe_auxiliary_drv = {
+	.name = "cxl_doe_drv",
+	.id_table = cxl_pci_doe_auxiliary_id_table,
+	.probe = cxl_pci_doe_probe,
+	.remove = cxl_pci_doe_remove,
+};
+
+static int __init cxl_pci_doe_init_module(void)
+{
+	int ret;
+
+	ret = auxiliary_driver_register(&cxl_pci_doe_auxiliary_drv);
+	if (ret) {
+		pr_err("Failed cxl_pci_doe auxiliary_driver_register() ret=%d\n",
+		       ret);
+	}
+
+	return ret;
+}
+
+static void __exit cxl_pci_doe_exit_module(void)
+{
+	auxiliary_driver_unregister(&cxl_pci_doe_auxiliary_drv);
+}
+
+module_init(cxl_pci_doe_init_module);
+module_exit(cxl_pci_doe_exit_module);
+MODULE_LICENSE("GPL");
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 41a6f3eb0a5c..0dec1f1a3f38 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -590,6 +590,17 @@ static void cxl_pci_doe_destroy_device(void *ad)
 	auxiliary_device_uninit(ad);
 }
 
+static struct cxl_doe_drv_state *cxl_pci_doe_get_drv(struct cxl_doe_dev *doe_dev)
+{
+	down_read(&doe_dev->driver_access);
+	return auxiliary_get_drvdata(&doe_dev->adev);
+}
+
+static void cxl_pci_doe_put_drv(struct cxl_doe_dev *doe_dev)
+{
+	up_read(&doe_dev->driver_access);
+}
+
 /**
  * cxl_pci_create_doe_devices - Create auxiliary bus DOE devices for all DOE
  *				mailboxes found
@@ -652,6 +663,7 @@ static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
 			return -ENOMEM;
 
 		new_dev->pdev = pdev;
+		init_rwsem(&new_dev->driver_access);
 		new_dev->cap_offset = off;
 		new_dev->use_irq = use_irq;
 
@@ -682,6 +694,13 @@ static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
 					      adev);
 		if (rc)
 			return rc;
+
+		if (device_attach(&adev->dev) != 1) {
+			dev_err(&adev->dev,
+				"Failed to attach a driver to DOE device %d\n",
+				adev->id);
+			return -ENODEV;
+		}
 	}
 
 	return 0;
@@ -785,6 +804,7 @@ static struct pci_driver cxl_pci_driver = {
 	},
 };
 
+MODULE_SOFTDEP("pre: cxl_doe");
 MODULE_LICENSE("GPL v2");
 module_pci_driver(cxl_pci_driver);
 MODULE_IMPORT_NS(CXL);
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 4e96b45ee36d..c268df088dd4 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1118,6 +1118,7 @@
 #define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */
 #define PCI_DOE_WRITE		0x10    /* DOE Write Data Mailbox Register */
 #define PCI_DOE_READ		0x14    /* DOE Read Data Mailbox Register */
+#define PCI_DOE_CAP_SIZE	(0x14 + 4)	/* Size of this register block */
 
 /* DOE Data Object - note not actually registers */
 #define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
-- 
2.35.1



* [PATCH V8 06/10] cxl/pci: Find the DOE mailbox which supports CDAT
  2022-04-14 20:32 [PATCH V8 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (4 preceding siblings ...)
  2022-04-14 20:32 ` [PATCH V8 05/10] cxl/pci: Create DOE auxiliary driver ira.weiny
@ 2022-04-14 20:32 ` ira.weiny
  2022-04-27 17:49   ` Jonathan Cameron
  2022-04-14 20:32 ` [PATCH V8 07/10] cxl/mem: Read CDAT table ira.weiny
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: ira.weiny @ 2022-04-14 20:32 UTC (permalink / raw)
  To: Dan Williams, Bjorn Helgaas, Jonathan Cameron
  Cc: Ira Weiny, Alison Schofield, Vishal Verma, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

Each CXL device may have multiple DOE mailbox capabilities and each
mailbox may support multiple protocols.

Search the DOE auxiliary devices for one which supports the CDAT
protocol.  Cache that device to be used for future queries.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V7
	Minor code clean ups

Changes from V6
	Adjust for aux devices being a CXL only concept
	Update commit msg.
	Ensure devices iterated by auxiliary_find_device() are checked
		to be DOE devices prior to checking for the CDAT
		protocol
	From Ben
		Ensure reference from auxiliary_find_device() is dropped
---
 drivers/cxl/cxl.h    |  2 ++
 drivers/cxl/cxlmem.h |  2 ++
 drivers/cxl/pci.c    | 76 +++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 79 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 990b6670222e..50817fd2c912 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -90,6 +90,8 @@ static inline int cxl_hdm_decoder_count(u32 cap_hdr)
 #define CXLDEV_MBOX_BG_CMD_STATUS_OFFSET 0x18
 #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
 
+#define CXL_DOE_PROTOCOL_TABLE_ACCESS 2
+
 /*
  * Using struct_group() allows for per register-block-type helper routines,
  * without requiring block-type agnostic code to include the prefix.
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 7235d2f976e5..0dc6988afb30 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -168,6 +168,7 @@ struct cxl_endpoint_dvsec_info {
  * Currently only memory devices are represented.
  *
  * @dev: The device associated with this CXL state
+ * @cdat_doe: Auxiliary DOE device capable of reading CDAT
  * @regs: Parsed register blocks
  * @cxl_dvsec: Offset to the PCIe device DVSEC
  * @payload_size: Size of space for payload
@@ -200,6 +201,7 @@ struct cxl_endpoint_dvsec_info {
 struct cxl_dev_state {
 	struct device *dev;
 
+	struct cxl_doe_dev *cdat_doe;
 	struct cxl_regs regs;
 	int cxl_dvsec;
 
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 0dec1f1a3f38..622cac925262 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -706,6 +706,80 @@ static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
 	return 0;
 }
 
+static bool cxl_doe_dev_supports_prot(struct cxl_doe_dev *doe_dev, u16 vid, u16 pid)
+{
+	struct cxl_doe_drv_state *doe_ds;
+	bool ret;
+
+	doe_ds = cxl_pci_doe_get_drv(doe_dev);
+	if (!doe_ds) {
+		cxl_pci_doe_put_drv(doe_dev);
+		return false;
+	}
+	ret = pci_doe_supports_prot(doe_ds->doe_mb, vid, pid);
+	cxl_pci_doe_put_drv(doe_dev);
+	return ret;
+}
+
+static int cxl_match_cdat_doe_device(struct device *dev, const void *data)
+{
+	const struct cxl_dev_state *cxlds = data;
+	struct auxiliary_device *adev = to_auxiliary_dev(dev);
+	struct cxl_doe_dev *doe_dev;
+
+	/* Ensure this is a DOE device */
+	if (strcmp(DOE_DEV_NAME, adev->name))
+		return 0;
+
+	/* Determine if this auxiliary device belongs to the cxlds */
+	if (cxlds->dev != dev->parent)
+		return 0;
+
+	doe_dev = container_of(adev, struct cxl_doe_dev, adev);
+
+	/* If it is one of ours check for the CDAT protocol */
+	if (!cxl_doe_dev_supports_prot(doe_dev, PCI_DVSEC_VENDOR_ID_CXL,
+				   CXL_DOE_PROTOCOL_TABLE_ACCESS))
+		return 0;
+
+	return 1;
+}
+
+static void drop_cdat_doe_ref(void *data)
+{
+	struct cxl_doe_dev *cdat_doe = data;
+
+	put_device(&cdat_doe->adev.dev);
+}
+
+static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
+{
+	struct device *dev = cxlds->dev;
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct auxiliary_device *adev;
+	int rc;
+
+	rc = cxl_pci_create_doe_devices(pdev);
+	if (rc)
+		return rc;
+
+	adev = auxiliary_find_device(NULL, cxlds, &cxl_match_cdat_doe_device);
+
+	if (adev) {
+		struct cxl_doe_dev *doe_dev = container_of(adev,
+							   struct cxl_doe_dev,
+							   adev);
+
+		/* auxiliary_find_device() takes the reference */
+		rc = devm_add_action_or_reset(dev, drop_cdat_doe_ref, doe_dev);
+		if (rc)
+			return rc;
+		cxlds->cdat_doe = doe_dev;
+	}
+
+	return 0;
+}
+
 static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct cxl_register_map map;
@@ -772,7 +846,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
-	rc = cxl_pci_create_doe_devices(pdev);
+	rc = cxl_setup_doe_devices(cxlds);
 	if (rc)
 		return rc;
 
-- 
2.35.1



* [PATCH V8 07/10] cxl/mem: Read CDAT table
  2022-04-14 20:32 [PATCH V8 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (5 preceding siblings ...)
  2022-04-14 20:32 ` [PATCH V8 06/10] cxl/pci: Find the DOE mailbox which supports CDAT ira.weiny
@ 2022-04-14 20:32 ` ira.weiny
  2022-04-27 17:55   ` Jonathan Cameron
  2022-04-14 20:32 ` [PATCH V8 08/10] cxl/cdat: Introduce cxl_cdat_valid() ira.weiny
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: ira.weiny @ 2022-04-14 20:32 UTC (permalink / raw)
  To: Dan Williams, Bjorn Helgaas, Jonathan Cameron
  Cc: Ira Weiny, Alison Schofield, Vishal Verma, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

The OS will need CDAT data from CXL devices to properly set up
interleave sets.  Currently this is supported through a DOE mailbox
which supports CDAT.  But any cxl_mem object can provide this data later
if need be, for example for testing.

Cache the CDAT data for later parsing.  Provide a sysfs binary attribute
to allow dumping of the CDAT.

Binary dumping is modeled on /sys/firmware/ACPI/tables/

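Once the attribute is in place, the cached table can be dumped from user
space, for example (illustrative path, assuming the memory device
enumerates as mem0):

$ hexdump -C /sys/bus/cxl/devices/mem0/CDAT | head
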
The ability to dump this table will be very useful for emulation of real
devices once they become available, as QEMU CXL type 3 device emulation
will be able to load this file in.

This does not support table updates at runtime. It will always provide
whatever was there when first cached. Handling of table updates can be
implemented later.

Finally create a complete list of DOE defines within cdat.h for code
wishing to decode the CDAT table.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V6:
	Fix issue with devm use
		Move cached cdat data to cxl_dev_state
	Use new pci_doe_submit_task()
	Ensure the aux driver is locked while processing tasks
	Rebased on cxl-pending

Changes from V5:
	Add proper guards around cdat.h
	Split out finding the CDAT DOE mailbox
	Use cxl_cdat to group CDAT data together
	Adjust to use auxiliary_find_device() to find the DOE device
		which supplies the CDAT protocol.
	Rebased to latest
	Remove dev_dbg(length)
	Remove unneeded DOE Table access defines
	Move CXL_DOE_PROTOCOL_TABLE_ACCESS define into this patch where
		it is used

Changes from V4:
	Split this into its own patch
	Rearchitect this such that the memdev driver calls into the DOE
	driver via the cxl_mem state object.  This allows CDAT data to
	come from any type of cxl_mem object not just PCI DOE.
	Rebase on new struct cxl_dev_state
---
 drivers/cxl/cdat.h        |  97 ++++++++++++++++++++++++
 drivers/cxl/core/memdev.c |  38 ++++++++++
 drivers/cxl/cxlmem.h      |  26 +++++++
 drivers/cxl/cxlpci.h      |   1 +
 drivers/cxl/pci.c         | 153 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 315 insertions(+)
 create mode 100644 drivers/cxl/cdat.h

diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
new file mode 100644
index 000000000000..4722b6bbbaf0
--- /dev/null
+++ b/drivers/cxl/cdat.h
@@ -0,0 +1,97 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __CXL_CDAT_H__
+#define __CXL_CDAT_H__
+
+/*
+ * Coherent Device Attribute table (CDAT)
+ *
+ * Specification available from UEFI.org
+ *
+ * Whilst CDAT is defined as a single table, the access via DOE mailboxes is
+ * done one entry at a time, where the first entry is the header.
+ */
+
+#define CXL_DOE_TABLE_ACCESS_REQ_CODE		0x000000ff
+#define   CXL_DOE_TABLE_ACCESS_REQ_CODE_READ	0
+#define CXL_DOE_TABLE_ACCESS_TABLE_TYPE		0x0000ff00
+#define   CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA	0
+#define CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE	0xffff0000
+
+/*
+ * CDAT entries are little endian and are read from PCI config space which
+ * is also little endian.
+ * As such, on a big endian system these will have been reversed.
+ * This prevents us from making easy use of packed structures.
+ * Style from pci_regs.h
+ */
+
+#define CDAT_HEADER_LENGTH_DW 4
+#define CDAT_HEADER_LENGTH_BYTES (CDAT_HEADER_LENGTH_DW * sizeof(u32))
+#define CDAT_HEADER_DW0_LENGTH		0xffffffff
+#define CDAT_HEADER_DW1_REVISION	0x000000ff
+#define CDAT_HEADER_DW1_CHECKSUM	0x0000ff00
+/* CDAT_HEADER_DW2_RESERVED	*/
+#define CDAT_HEADER_DW3_SEQUENCE	0xffffffff
+
+/* All structures have a common first DW */
+#define CDAT_STRUCTURE_DW0_TYPE		0x000000ff
+#define   CDAT_STRUCTURE_DW0_TYPE_DSMAS 0
+#define   CDAT_STRUCTURE_DW0_TYPE_DSLBIS 1
+#define   CDAT_STRUCTURE_DW0_TYPE_DSMSCIS 2
+#define   CDAT_STRUCTURE_DW0_TYPE_DSIS 3
+#define   CDAT_STRUCTURE_DW0_TYPE_DSEMTS 4
+#define   CDAT_STRUCTURE_DW0_TYPE_SSLBIS 5
+
+#define CDAT_STRUCTURE_DW0_LENGTH	0xffff0000
+
+/* Device Scoped Memory Affinity Structure */
+#define CDAT_DSMAS_DW1_DSMAD_HANDLE	0x000000ff
+#define CDAT_DSMAS_DW1_FLAGS		0x0000ff00
+#define CDAT_DSMAS_DPA_OFFSET(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSMAS_DPA_LEN(entry) ((u64)((entry)[5]) << 32 | (entry)[4])
+#define CDAT_DSMAS_NON_VOLATILE(flags)  ((flags & 0x04) >> 2)
+
+/* Device Scoped Latency and Bandwidth Information Structure */
+#define CDAT_DSLBIS_DW1_HANDLE		0x000000ff
+#define CDAT_DSLBIS_DW1_FLAGS		0x0000ff00
+#define CDAT_DSLBIS_DW1_DATA_TYPE	0x00ff0000
+#define CDAT_DSLBIS_BASE_UNIT(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSLBIS_DW4_ENTRY_0		0x0000ffff
+#define CDAT_DSLBIS_DW4_ENTRY_1		0xffff0000
+#define CDAT_DSLBIS_DW5_ENTRY_2		0x0000ffff
+
+/* Device Scoped Memory Side Cache Information Structure */
+#define CDAT_DSMSCIS_DW1_HANDLE		0x000000ff
+#define CDAT_DSMSCIS_MEMORY_SIDE_CACHE_SIZE(entry) \
+	((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSMSCIS_DW4_MEMORY_SIDE_CACHE_ATTRS 0xffffffff
+
+/* Device Scoped Initiator Structure */
+#define CDAT_DSIS_DW1_FLAGS		0x000000ff
+#define CDAT_DSIS_DW1_HANDLE		0x0000ff00
+
+/* Device Scoped EFI Memory Type Structure */
+#define CDAT_DSEMTS_DW1_HANDLE		0x000000ff
+#define CDAT_DSEMTS_DW1_EFI_MEMORY_TYPE_ATTR	0x0000ff00
+#define CDAT_DSEMTS_DPA_OFFSET(entry)	((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSEMTS_DPA_LENGTH(entry)	((u64)((entry)[5]) << 32 | (entry)[4])
+
+/* Switch Scoped Latency and Bandwidth Information Structure */
+#define CDAT_SSLBIS_DW1_DATA_TYPE	0x000000ff
+#define CDAT_SSLBIS_BASE_UNIT(entry)	((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_SSLBIS_ENTRY_PORT_X(entry, i) ((entry)[4 + (i) * 2] & 0x0000ffff)
+#define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
+#define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
+
+/**
+ * struct cxl_cdat - CXL CDAT data
+ *
+ * @table: cache of CDAT table
+ * @length: length of cached CDAT table
+ */
+struct cxl_cdat {
+	void *table;
+	size_t length;
+};
+
+#endif /* !__CXL_CDAT_H__ */
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 1f76b28f9826..9ca5f2f2d308 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -86,6 +86,37 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
 	return sysfs_emit(buf, "%#llx\n", len);
 }
 
+static ssize_t CDAT_read(struct file *filp, struct kobject *kobj,
+			 struct bin_attribute *bin_attr, char *buf,
+			 loff_t offset, size_t count)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+	if (!cxlds->cdat.table)
+		return 0;
+
+	return memory_read_from_buffer(buf, count, &offset,
+				       cxlds->cdat.table,
+				       cxlds->cdat.length);
+}
+
+static BIN_ATTR_RO(CDAT, 0);
+
+static umode_t cxl_memdev_bin_attr_is_visible(struct kobject *kobj,
+					      struct bin_attribute *attr, int i)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+	if ((attr == &bin_attr_CDAT) && cxlds->cdat.table)
+		return 0400;
+
+	return 0;
+}
+
 static struct device_attribute dev_attr_pmem_size =
 	__ATTR(size, 0444, pmem_size_show, NULL);
 
@@ -115,6 +146,11 @@ static struct attribute *cxl_memdev_attributes[] = {
 	NULL,
 };
 
+static struct bin_attribute *cxl_memdev_bin_attributes[] = {
+	&bin_attr_CDAT,
+	NULL,
+};
+
 static struct attribute *cxl_memdev_pmem_attributes[] = {
 	&dev_attr_pmem_size.attr,
 	NULL,
@@ -136,6 +172,8 @@ static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
 static struct attribute_group cxl_memdev_attribute_group = {
 	.attrs = cxl_memdev_attributes,
 	.is_visible = cxl_memdev_visible,
+	.bin_attrs = cxl_memdev_bin_attributes,
+	.is_bin_visible = cxl_memdev_bin_attr_is_visible,
 };
 
 static struct attribute_group cxl_memdev_ram_attribute_group = {
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 0dc6988afb30..8bf058c8d31f 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -5,6 +5,7 @@
 #include <uapi/linux/cxl_mem.h>
 #include <linux/cdev.h>
 #include "cxl.h"
+#include "cdat.h"
 
 /* CXL 2.0 8.2.8.5.1.1 Memory Device Status Register */
 #define CXLMDEV_STATUS_OFFSET 0x0
@@ -169,6 +170,7 @@ struct cxl_endpoint_dvsec_info {
  *
  * @dev: The device associated with this CXL state
+ * @cdat_doe: Auxiliary DOE device capable of reading CDAT
+ * @cdat: Cached CDAT data
  * @regs: Parsed register blocks
  * @cxl_dvsec: Offset to the PCIe device DVSEC
  * @payload_size: Size of space for payload
@@ -194,6 +196,10 @@ struct cxl_endpoint_dvsec_info {
  * @serial: PCIe Device Serial Number
  * @mbox_send: @dev specific transport for transmitting mailbox commands
  * @wait_media_ready: @dev specific method to await media ready
+ * @cdat_get_length: @dev specific function for reading the CDAT table length
+ *                   returns -errno if CDAT not supported on this device
+ * @cdat_read_table: @dev specific function for reading the table
+ *                   returns -errno if CDAT not supported on this device
  *
  * See section 8.2.9.5.2 Capacity Configuration and Label Storage for
  * details on capacity parameters.
@@ -202,6 +208,7 @@ struct cxl_dev_state {
 	struct device *dev;
 
 	struct cxl_doe_dev *cdat_doe;
+	struct cxl_cdat cdat;
 	struct cxl_regs regs;
 	int cxl_dvsec;
 
@@ -230,6 +237,9 @@ struct cxl_dev_state {
 
 	int (*mbox_send)(struct cxl_dev_state *cxlds, struct cxl_mbox_cmd *cmd);
 	int (*wait_media_ready)(struct cxl_dev_state *cxlds);
+	int (*cdat_get_length)(struct cxl_dev_state *cxlds, size_t *length);
+	int (*cdat_read_table)(struct cxl_dev_state *cxlds,
+			       struct cxl_cdat *cdat);
 };
 
 enum cxl_opcode {
@@ -374,4 +384,20 @@ struct cxl_hdm {
 	unsigned int interleave_mask;
 	struct cxl_port *port;
 };
+
+static inline int cxl_mem_cdat_get_length(struct cxl_dev_state *cxlds, size_t *length)
+{
+	if (cxlds->cdat_get_length)
+		return cxlds->cdat_get_length(cxlds, length);
+	return -EOPNOTSUPP;
+}
+
+static inline int cxl_mem_cdat_read_table(struct cxl_dev_state *cxlds,
+					  struct cxl_cdat *cdat)
+{
+	if (cxlds->cdat_read_table)
+		return cxlds->cdat_read_table(cxlds, cdat);
+	return -EOPNOTSUPP;
+}
+
 #endif /* __CXL_MEM_H__ */
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 821fe05e8289..aa8c6ef93000 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -4,6 +4,7 @@
 #define __CXL_PCI_H__
 #include <linux/auxiliary_bus.h>
 #include <linux/pci.h>
+#include <linux/pci-doe.h>
 #include "cxl.h"
 
 #define CXL_MEMORY_PROGIF	0x10
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 622cac925262..aecb327911a0 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -13,6 +13,7 @@
 #include "cxlmem.h"
 #include "cxlpci.h"
 #include "cxl.h"
+#include "cdat.h"
 
 /**
  * DOC: cxl pci
@@ -780,6 +781,151 @@ static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
 	return 0;
 }
 
+#define CDAT_DOE_REQ(entry_handle)					\
+	(FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE,			\
+		    CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) |		\
+	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_TABLE_TYPE,			\
+		    CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA) |		\
+	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE, (entry_handle)))
+
+static void cxl_doe_task_complete(struct pci_doe_task *task)
+{
+	complete(task->private);
+}
+
+static int cxl_cdat_get_length(struct cxl_dev_state *cxlds, size_t *length)
+{
+	struct cxl_doe_dev *doe_dev = cxlds->cdat_doe;
+	struct cxl_doe_drv_state *doe_ds;
+	u32 cdat_request_pl = CDAT_DOE_REQ(0);
+	u32 cdat_response_pl[32];
+	DECLARE_COMPLETION_ONSTACK(c);
+	struct pci_doe_task task = {
+		.prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
+		.prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
+		.request_pl = &cdat_request_pl,
+		.request_pl_sz = sizeof(cdat_request_pl),
+		.response_pl = cdat_response_pl,
+		.response_pl_sz = sizeof(cdat_response_pl),
+		.complete = cxl_doe_task_complete,
+		.private = &c,
+	};
+	int rc = 0;
+
+	doe_ds = cxl_pci_doe_get_drv(doe_dev);
+	if (!doe_ds) {
+		rc = -EIO;
+		goto release_driver;
+	}
+
+	rc = pci_doe_submit_task(doe_ds->doe_mb, &task);
+	if (rc < 0) {
+		dev_err(cxlds->dev, "DOE submit failed: %d", rc);
+		goto release_driver;
+	}
+	wait_for_completion(&c);
+
+	if (task.rv < 1) {
+		rc = -EIO;
+		goto release_driver;
+	}
+
+	*length = cdat_response_pl[1];
+	dev_dbg(cxlds->dev, "CDAT length %zu\n", *length);
+
+release_driver:
+	cxl_pci_doe_put_drv(doe_dev);
+	return rc;
+}
+
+static int cxl_cdat_read_table(struct cxl_dev_state *cxlds,
+			       struct cxl_cdat *cdat)
+{
+	struct cxl_doe_dev *doe_dev = cxlds->cdat_doe;
+	struct cxl_doe_drv_state *doe_ds;
+	size_t length = cdat->length;
+	u32 *data = cdat->table;
+	int entry_handle = 0;
+	int rc = 0;
+
+	doe_ds = cxl_pci_doe_get_drv(doe_dev);
+	if (!doe_ds) {
+		rc = -EIO;
+		goto release_driver;
+	}
+
+	do {
+		u32 cdat_request_pl = CDAT_DOE_REQ(entry_handle);
+		u32 cdat_response_pl[32];
+		DECLARE_COMPLETION_ONSTACK(c);
+		struct pci_doe_task task = {
+			.prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
+			.prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
+			.request_pl = &cdat_request_pl,
+			.request_pl_sz = sizeof(cdat_request_pl),
+			.response_pl = cdat_response_pl,
+			.response_pl_sz = sizeof(cdat_response_pl),
+			.complete = cxl_doe_task_complete,
+			.private = &c,
+		};
+		size_t entry_dw;
+		u32 *entry;
+
+		rc = pci_doe_submit_task(doe_ds->doe_mb, &task);
+		if (rc < 0) {
+			dev_err(cxlds->dev, "DOE submit failed: %d", rc);
+			goto release_driver;
+		}
+		wait_for_completion(&c);
+
+		entry = cdat_response_pl + 1;
+		entry_dw = task.rv / sizeof(u32);
+		/* Skip Header */
+		entry_dw -= 1;
+		entry_dw = min(length / 4, entry_dw);
+		memcpy(data, entry, entry_dw * sizeof(u32));
+		length -= entry_dw * sizeof(u32);
+		data += entry_dw;
+		entry_handle = FIELD_GET(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE, cdat_response_pl[0]);
+
+	} while (entry_handle != 0xFFFF);
+
+release_driver:
+	cxl_pci_doe_put_drv(doe_dev);
+	return rc;
+}
+
+static void cxl_initialize_cdat_callbacks(struct cxl_dev_state *cxlds)
+{
+	if (!cxlds->cdat_doe)
+		return;
+
+	cxlds->cdat_get_length = cxl_cdat_get_length;
+	cxlds->cdat_read_table = cxl_cdat_read_table;
+}
+
+static int read_cdat_data(struct cxl_dev_state *cxlds)
+{
+	struct device *dev = cxlds->dev;
+	size_t cdat_length;
+	int ret;
+
+	if (cxl_mem_cdat_get_length(cxlds, &cdat_length))
+		return 0;
+
+	cxlds->cdat.table = devm_kzalloc(dev, cdat_length, GFP_KERNEL);
+	if (!cxlds->cdat.table)
+		return -ENOMEM;
+	cxlds->cdat.length = cdat_length;
+	ret = cxl_mem_cdat_read_table(cxlds, &cxlds->cdat);
+	if (ret) {
+		devm_kfree(dev, cxlds->cdat.table);
+		cxlds->cdat.table = NULL;
+		cxlds->cdat.length = 0;
+	}
+	return ret;
+}
+
 static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct cxl_register_map map;
@@ -850,6 +996,13 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
+	cxl_initialize_cdat_callbacks(cxlds);
+
+	/* Cache the data early to ensure is_visible() works */
+	rc = read_cdat_data(cxlds);
+	if (rc)
+		dev_err(&pdev->dev, "CDAT data read error (%d)\n", rc);
+
 	cxl_dvsec_ranges(cxlds);
 
 	cxlmd = devm_cxl_add_memdev(cxlds);
-- 
2.35.1



* [PATCH V8 08/10] cxl/cdat: Introduce cxl_cdat_valid()
  2022-04-14 20:32 [PATCH V8 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (6 preceding siblings ...)
  2022-04-14 20:32 ` [PATCH V8 07/10] cxl/mem: Read CDAT table ira.weiny
@ 2022-04-14 20:32 ` ira.weiny
  2022-04-27 17:56   ` Jonathan Cameron
  2022-04-14 20:32 ` [PATCH V8 09/10] cxl/mem: Retry reading CDAT on failure ira.weiny
  2022-04-14 20:32 ` [PATCH V8 10/10] cxl/port: Parse out DSMAS data from CDAT table ira.weiny
  9 siblings, 1 reply; 42+ messages in thread
From: ira.weiny @ 2022-04-14 20:32 UTC (permalink / raw)
  To: Dan Williams, Bjorn Helgaas, Jonathan Cameron
  Cc: Ira Weiny, Alison Schofield, Vishal Verma, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

The CDAT data is protected by a checksum and should be the proper
length.

Introduce cxl_cdat_valid() to validate the data.  While at it check and
store the sequence number.
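
For reference, the checksum follows the usual ACPI-style rule: the 8-bit
sum of every byte in the table, including the checksum byte itself, must
be zero.  A minimal standalone sketch of the same rule (untested, mirrors
the check in cxl_cdat_valid() below):

	static bool cdat_checksum_ok(const u8 *data, size_t length)
	{
		u8 sum = 0;
		size_t i;

		/* 8-bit sum over the whole table must wrap to zero */
		for (i = 0; i < length; i++)
			sum += data[i];

		return sum == 0;
	}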

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V6
	Change name to cxl_cdat_valid() as this validates all the CDAT
		data not just the header
	Add error and debug prints

Changes from V5
	New patch, split out
	Update cdat_hdr_valid()
		Remove revision and cs field parsing
			There is no point in these
		Add seq check and debug print.
---
 drivers/cxl/cdat.h |  2 ++
 drivers/cxl/pci.c  | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
index 4722b6bbbaf0..a7725d26f2d2 100644
--- a/drivers/cxl/cdat.h
+++ b/drivers/cxl/cdat.h
@@ -88,10 +88,12 @@
  *
  * @table: cache of CDAT table
  * @length: length of cached CDAT table
+ * @seq: Last read Sequence number of the CDAT table
  */
 struct cxl_cdat {
 	void *table;
 	size_t length;
+	u32 seq;
 };
 
 #endif /* !__CXL_CDAT_H__ */
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index aecb327911a0..d7952156dd02 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -781,6 +781,40 @@ static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
 	return 0;
 }
 
+static bool cxl_cdat_valid(struct device *dev, struct cxl_cdat *cdat)
+{
+	u32 *table = cdat->table;
+	u8 *data8 = cdat->table;
+	u32 length, seq;
+	u8 check;
+	int i;
+
+	length = FIELD_GET(CDAT_HEADER_DW0_LENGTH, table[0]);
+	if ((length < CDAT_HEADER_LENGTH_BYTES) || (length > cdat->length)) {
+		dev_err(dev, "Invalid length %u (%zu-%zu)\n", length,
+			CDAT_HEADER_LENGTH_BYTES, cdat->length);
+		return false;
+	}
+
+	for (check = 0, i = 0; i < length; i++)
+		check += data8[i];
+
+	dev_dbg(dev, "CDAT length %u CS %u\n", length, check);
+	if (check != 0) {
+		dev_err(dev, "Invalid checksum %u\n", check);
+		return false;
+	}
+
+	seq = FIELD_GET(CDAT_HEADER_DW3_SEQUENCE, table[3]);
+	/* Store the sequence for now. */
+	if (cdat->seq != seq) {
+		dev_info(dev, "CDAT seq change %x -> %x\n", cdat->seq, seq);
+		cdat->seq = seq;
+	}
+
+	return true;
+}
+
 #define CDAT_DOE_REQ(entry_handle)					\
 	(FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE,			\
 		    CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) |		\
@@ -892,6 +926,8 @@ static int cxl_cdat_read_table(struct cxl_dev_state *cxlds,
 
 release_driver:
 	cxl_pci_doe_put_drv(doe_dev);
+	if (!rc && !cxl_cdat_valid(cxlds->dev, cdat))
+		return -EIO;
 	return rc;
 }
 
-- 
2.35.1



* [PATCH V8 09/10] cxl/mem: Retry reading CDAT on failure
  2022-04-14 20:32 [PATCH V8 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (7 preceding siblings ...)
  2022-04-14 20:32 ` [PATCH V8 08/10] cxl/cdat: Introduce cxl_cdat_valid() ira.weiny
@ 2022-04-14 20:32 ` ira.weiny
  2022-04-27 17:57   ` Jonathan Cameron
  2022-04-14 20:32 ` [PATCH V8 10/10] cxl/port: Parse out DSMAS data from CDAT table ira.weiny
  9 siblings, 1 reply; 42+ messages in thread
From: ira.weiny @ 2022-04-14 20:32 UTC (permalink / raw)
  To: Dan Williams, Bjorn Helgaas, Jonathan Cameron
  Cc: Ira Weiny, Alison Schofield, Vishal Verma, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

The CDAT read may fail for a number of reasons; in particular, if the
table changes while it is being read, the result can mix parts of two
different valid tables.  The checksum in the CDAT table protects against
this.

Now that the CDAT data is validated, retry the read when it fails.  For
now 5 retries are implemented.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V6
	Move to pci.c
	Fix retries count
	Change to 5 retries

Changes from V5:
	New patch -- easy to push off or drop.
---
 drivers/cxl/pci.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index d7952156dd02..43cbc297079d 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -940,7 +940,7 @@ static void cxl_initialize_cdat_callbacks(struct cxl_dev_state *cxlds)
 	cxlds->cdat_read_table = cxl_cdat_read_table;
 }
 
-static int read_cdat_data(struct cxl_dev_state *cxlds)
+static int __read_cdat_data(struct cxl_dev_state *cxlds)
 {
 	struct device *dev = cxlds->dev;
 	size_t cdat_length;
@@ -962,6 +962,21 @@ static int read_cdat_data(struct cxl_dev_state *cxlds)
 	return ret;
 }
 
+static void read_cdat_data(struct cxl_dev_state *cxlds)
+{
+	int retries = 5;
+	int rc;
+
+	while (retries--) {
+		rc = __read_cdat_data(cxlds);
+		if (!rc)
+			break;
+		dev_err(cxlds->dev,
+			"CDAT data read error rc=%d (retries %d)\n",
+			rc, retries);
+	}
+}
+
 static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct cxl_register_map map;
@@ -1035,9 +1050,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	cxl_initialize_cdat_callbacks(cxlds);
 
 	/* Cache the data early to ensure is_visible() works */
-	rc = read_cdat_data(cxlds);
-	if (rc)
-		dev_err(&pdev->dev, "CDAT data read error (%d)\n", rc);
+	read_cdat_data(cxlds);
 
 	cxl_dvsec_ranges(cxlds);
 
-- 
2.35.1



* [PATCH V8 10/10] cxl/port: Parse out DSMAS data from CDAT table
  2022-04-14 20:32 [PATCH V8 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (8 preceding siblings ...)
  2022-04-14 20:32 ` [PATCH V8 09/10] cxl/mem: Retry reading CDAT on failure ira.weiny
@ 2022-04-14 20:32 ` ira.weiny
  2022-04-27 18:01   ` Jonathan Cameron
  9 siblings, 1 reply; 42+ messages in thread
From: ira.weiny @ 2022-04-14 20:32 UTC (permalink / raw)
  To: Dan Williams, Bjorn Helgaas, Jonathan Cameron
  Cc: Ira Weiny, Alison Schofield, Vishal Verma, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

CXL Ports with memory devices attached need the information from the
Device Scoped Memory Affinity Structure (DSMAS).  This information is
contained within the CDAT table buffer which is previously read and
cached in the device state.

If CDAT data is available, parse and cache DSMAS data from the table.
Store this data in unmarshaled struct cxl_dsmas data structures for ease
of use.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V7
	Rebased on cxl-pending

Changes from V6
	Move to port.c
	It is not an error if no DSMAS data is found

Changes from V5
	Fix up sparse warnings
	Split out cdat_hdr_valid()
	Update cdat_hdr_valid()
		Remove revision and cs field parsing
			There is no point in these
		Add seq check and debug print.
	From Jonathan
		Add spaces around '+' and '/'
		use devm_krealloc() for dmas_ary
---
 drivers/cxl/cdat.h | 17 ++++++++++++
 drivers/cxl/cxl.h  |  6 +++++
 drivers/cxl/port.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 88 insertions(+)

diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
index a7725d26f2d2..66706b238bc9 100644
--- a/drivers/cxl/cdat.h
+++ b/drivers/cxl/cdat.h
@@ -83,6 +83,23 @@
 #define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
 #define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
 
+/**
+ * struct cxl_dsmas - host unmarshaled version of DSMAS data
+ *
+ * As defined in the Coherent Device Attribute Table (CDAT) specification this
+ * represents a single DSMAS entry in that table.
+ *
+ * @dpa_base: The lowest Device Physical Address associated with this DSMAD
+ * @length: Length in bytes of this DSMAD
+ * @non_volatile: If set, the memory region represents Non-Volatile memory
+ */
+struct cxl_dsmas {
+	u64 dpa_base;
+	u64 length;
+	/* Flags */
+	u8 non_volatile:1;
+};
+
 /**
  * struct cxl_cdat - CXL CDAT data
  *
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 50817fd2c912..80fd39769a65 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -9,6 +9,8 @@
 #include <linux/bitops.h>
 #include <linux/io.h>
 
+#include "cdat.h"
+
 /**
  * DOC: cxl objects
  *
@@ -269,6 +271,8 @@ struct cxl_nvdimm {
  * @component_reg_phys: component register capability base address (optional)
  * @dead: last ep has been removed, force port re-creation
  * @depth: How deep this port is relative to the root. depth 0 is the root.
+ * @dsmas_ary: Array of DSMAS entries as parsed from the CDAT table
+ * @nr_dsmas: Number of entries in dsmas_ary
  */
 struct cxl_port {
 	struct device dev;
@@ -280,6 +284,8 @@ struct cxl_port {
 	resource_size_t component_reg_phys;
 	bool dead;
 	unsigned int depth;
+	struct cxl_dsmas *dsmas_ary;
+	int nr_dsmas;
 };
 
 /**
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index d420da5fc39c..2288432a27cd 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -30,6 +30,70 @@ static void schedule_detach(void *cxlmd)
 	schedule_cxl_memdev_detach(cxlmd);
 }
 
+static void parse_dsmas(struct cxl_port *port, struct cxl_dev_state *cxlds)
+{
+	struct device *dev = &port->dev;
+	struct cxl_dsmas *dsmas_ary = NULL;
+	u32 *data = cxlds->cdat.table;
+	int bytes_left = cxlds->cdat.length;
+	int nr_dsmas = 0;
+
+	if (!data) {
+		dev_info(dev, "No CDAT data available for DSMAS\n");
+		return;
+	}
+
+	/* Skip header */
+	data += CDAT_HEADER_LENGTH_DW;
+	bytes_left -= CDAT_HEADER_LENGTH_BYTES;
+
+	while (bytes_left > 0) {
+		u32 *cur_rec = data;
+		u8 type = FIELD_GET(CDAT_STRUCTURE_DW0_TYPE, cur_rec[0]);
+		u16 length = FIELD_GET(CDAT_STRUCTURE_DW0_LENGTH, cur_rec[0]);
+
+		if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {
+			struct cxl_dsmas *new_ary;
+			u8 flags;
+
+			new_ary = devm_krealloc(dev, dsmas_ary,
+					   sizeof(*dsmas_ary) * (nr_dsmas + 1),
+					   GFP_KERNEL);
+			if (!new_ary) {
+				dev_err(dev,
+					"Failed to allocate memory for DSMAS data (nr_dsmas %d)\n",
+					nr_dsmas);
+				return;
+			}
+			dsmas_ary = new_ary;
+
+			flags = FIELD_GET(CDAT_DSMAS_DW1_FLAGS, cur_rec[1]);
+
+			dsmas_ary[nr_dsmas].dpa_base = CDAT_DSMAS_DPA_OFFSET(cur_rec);
+			dsmas_ary[nr_dsmas].length = CDAT_DSMAS_DPA_LEN(cur_rec);
+			dsmas_ary[nr_dsmas].non_volatile = CDAT_DSMAS_NON_VOLATILE(flags);
+
+			dev_dbg(dev, "DSMAS %d: %llx:%llx %s\n",
+				nr_dsmas,
+				dsmas_ary[nr_dsmas].dpa_base,
+				dsmas_ary[nr_dsmas].dpa_base +
+					dsmas_ary[nr_dsmas].length,
+				(dsmas_ary[nr_dsmas].non_volatile ?
+					"Persistent" : "Volatile")
+				);
+
+			nr_dsmas++;
+		}
+
+		data += (length / sizeof(u32));
+		bytes_left -= length;
+	}
+
+	dev_dbg(dev, "Found %d DSMAS entries\n", nr_dsmas);
+	port->dsmas_ary = dsmas_ary;
+	port->nr_dsmas = nr_dsmas;
+}
+
 static int cxl_port_probe(struct device *dev)
 {
 	struct cxl_port *port = to_cxl_port(dev);
@@ -43,6 +107,7 @@ static int cxl_port_probe(struct device *dev)
 		rc = devm_add_action_or_reset(dev, schedule_detach, cxlmd);
 		if (rc)
 			return rc;
+		parse_dsmas(port, cxlmd->cxlds);
 	} else {
 		rc = devm_cxl_port_enumerate_dports(port);
 		if (rc < 0)
-- 
2.35.1



* Re: [PATCH V8 04/10] cxl/pci: Create auxiliary devices for each DOE mailbox
  2022-04-14 20:32 ` [PATCH V8 04/10] cxl/pci: Create auxiliary devices for each DOE mailbox ira.weiny
@ 2022-04-27 17:19   ` Jonathan Cameron
  2022-04-28 21:09     ` ira.weiny
  2022-04-29 15:55   ` Jonathan Cameron
  1 sibling, 1 reply; 42+ messages in thread
From: Jonathan Cameron @ 2022-04-27 17:19 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, 14 Apr 2022 13:32:31 -0700
ira.weiny@intel.com wrote:

> From: Ira Weiny <ira.weiny@intel.com>
> 
> CXL kernel drivers optionally need to access DOE mailbox capabilities.
> Access to mailboxes for things such as CDAT, SPDM, and IDE are needed by
> the kernel while other access is designed towards user space usage.  An
> example of this is for CXL Compliance Testing (see CXL 2.0 14.16.4
> Compliance Mode DOE) which offers a mechanism to set different test
> modes for a device.
> 
> There is no anticipated need for the kernel to share an individual
> mailbox with user space.  Thus developing an interface to marshal access
> between the kernel and user space for a single mailbox is unnecessary
> overhead.  However, having the kernel relinquish some mailboxes to be
> controlled by user space is a reasonable compromise to share access to
> the device.
> 
> The auxiliary bus provides an elegant solution for this.  Each DOE
> capability is given its own auxiliary device.  This device is controlled
> by a kernel driver by default which restricts access to the mailbox.
> Unbinding the driver from a single auxiliary device (DOE mailbox
> capability) frees the mailbox for user space access.  This architecture
> also allows a clear picture on which mailboxes are kernel controlled vs
> not.
> 
> Iterate each DOE mailbox capability and create auxiliary bus devices.
> Follow on patches will define a driver for the newly created devices.
> 
> sysfs shows the devices.
> 
> $ ls -l /sys/bus/auxiliary/devices/
> total 0
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.0 -> ../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.1 -> ../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.2 -> ../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.3 -> ../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.4 -> ../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.4
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.5 -> ../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.6 -> ../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.7 -> ../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

I'm not 100% happy with effectively having one solution for CXL
and probably a different one for DOEs on switch ports
(which I just hacked into a port service driver to convince
myself there was at least one plausible way of doing that) but if
this effectively separates the two discussions then I guess I can
live with it for now ;)

Once this is merged we can start the discussion about how to
handle switch ports with DOEs both for CDAT and SPDM.

I'll send out an RFC that is so hideous it will get people to
suggest how to do it better!  Currently it starts and
stops the mailbox 3 times in the registration path and I think
it's more luck than judgement that is landing me with the right
MSI.

Anyhow, this looks good to me.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


> 
> ---
> Changes from V7:
> 	Minor code clean ups
> 	Rebased on cxl-pending
> 
> Changes from V6:
> 	Move all the auxiliary device stuff to the CXL layer
> 
> Changes from V5:
> 	Split the CXL specific stuff off from the PCI DOE create
> 	auxiliary device code.
> ---
>  drivers/cxl/Kconfig  |   1 +
>  drivers/cxl/cxlpci.h |  21 +++++++
>  drivers/cxl/pci.c    | 127 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 149 insertions(+)
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index f64e3984689f..ac0f5ca95431 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -16,6 +16,7 @@ if CXL_BUS
>  config CXL_PCI
>  	tristate "PCI manageability"
>  	default CXL_BUS
> +	select AUXILIARY_BUS
>  	help
>  	  The CXL specification defines a "CXL memory device" sub-class in the
>  	  PCI "memory controller" base class of devices. Device's identified by
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 329e7ea3f36a..2ad8715173ce 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -2,6 +2,7 @@
>  /* Copyright(c) 2020 Intel Corporation. All rights reserved. */
>  #ifndef __CXL_PCI_H__
>  #define __CXL_PCI_H__
> +#include <linux/auxiliary_bus.h>
>  #include <linux/pci.h>
>  #include "cxl.h"
>  
> @@ -72,4 +73,24 @@ static inline resource_size_t cxl_regmap_to_base(struct pci_dev *pdev,
>  }
>  
>  int devm_cxl_port_enumerate_dports(struct cxl_port *port);
> +
> +/**
> + * struct cxl_doe_dev - CXL DOE auxiliary bus device
> + *
> + * @adev: Auxiliary bus device
> + * @pdev: PCI device this belongs to
> + * @cap_offset: Capability offset
> + * @use_irq: Set if IRQs are to be used with this mailbox
> + *
> + * This represents a single DOE mailbox device.  CXL devices should create this
> + * device and register it on the Auxiliary bus for the CXL DOE driver to drive.
> + */
> +struct cxl_doe_dev {
> +	struct auxiliary_device adev;
> +	struct pci_dev *pdev;
> +	int cap_offset;
> +	bool use_irq;
> +};
> +#define DOE_DEV_NAME "doe"
> +
>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index e7ab9a34d718..41a6f3eb0a5c 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -8,6 +8,7 @@
>  #include <linux/mutex.h>
>  #include <linux/list.h>
>  #include <linux/pci.h>
> +#include <linux/pci-doe.h>
>  #include <linux/io.h>
>  #include "cxlmem.h"
>  #include "cxlpci.h"
> @@ -564,6 +565,128 @@ static void cxl_dvsec_ranges(struct cxl_dev_state *cxlds)
>  	info->ranges = __cxl_dvsec_ranges(cxlds, info);
>  }
>  
> +static void cxl_pci_free_irq_vectors(void *data)
> +{
> +	pci_free_irq_vectors(data);
> +}
> +
> +static DEFINE_IDA(pci_doe_adev_ida);
> +
> +static void cxl_pci_doe_dev_release(struct device *dev)
> +{
> +	struct auxiliary_device *adev = container_of(dev,
> +						struct auxiliary_device,
> +						dev);
> +	struct cxl_doe_dev *doe_dev = container_of(adev, struct cxl_doe_dev,
> +						   adev);
> +
> +	ida_free(&pci_doe_adev_ida, adev->id);
> +	kfree(doe_dev);
> +}
> +
> +static void cxl_pci_doe_destroy_device(void *ad)
> +{
> +	auxiliary_device_delete(ad);
> +	auxiliary_device_uninit(ad);
> +}
> +
> +/**
> + * cxl_pci_create_doe_devices - Create auxiliary bus DOE devices for all DOE
> + *				mailboxes found
> + *
> + * @pci_dev: The PCI device to scan for DOE mailboxes
> + *
> + * There is no corresponding destroy of these devices.  This function associates
> + * the DOE auxiliary devices created with the pci_dev passed in.  That
> + * association is device managed (devm_*) such that the DOE auxiliary device
> + * lifetime is always less than or equal to the lifetime of the pci_dev.
> + *
> + * RETURNS: 0 on success -ERRNO on failure.
> + */
> +static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
> +{
> +	struct device *dev = &pdev->dev;
> +	bool use_irq = true;
> +	int irqs = 0;
> +	u16 off = 0;
> +	int rc;
> +
> +	pci_doe_for_each_off(pdev, off)
> +		irqs++;
> +	pci_info(pdev, "Found %d DOE mailboxes\n", irqs);
> +
> +	/*
> +	 * Allocate enough vectors for the DOE's
> +	 */
> +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSI |
> +						     PCI_IRQ_MSIX);
> +	if (rc != irqs) {
> +		pci_err(pdev,
> +			"Not enough interrupts for all the DOEs; use polling\n");
> +		use_irq = false;
> +		/* Some got allocated; clean them up */
> +		if (rc > 0)
> +			cxl_pci_free_irq_vectors(pdev);
> +	} else {
> +		/*
> +		 * Enabling bus mastering is required for MSI/MSI-X.  It could be
> +		 * done later within the DOE initialization, but as it
> +		 * potentially has other impacts keep it here when setting up
> +		 * the IRQs.
> +		 */
> +		pci_set_master(pdev);
> +		rc = devm_add_action_or_reset(dev,
> +					      cxl_pci_free_irq_vectors,
> +					      pdev);
> +		if (rc)
> +			return rc;
> +	}
> +
> +	pci_doe_for_each_off(pdev, off) {
> +		struct auxiliary_device *adev;
> +		struct cxl_doe_dev *new_dev;
> +		int id;
> +
> +		new_dev = kzalloc(sizeof(*new_dev), GFP_KERNEL);
> +		if (!new_dev)
> +			return -ENOMEM;
> +
> +		new_dev->pdev = pdev;
> +		new_dev->cap_offset = off;
> +		new_dev->use_irq = use_irq;
> +
> +		/* Set up struct auxiliary_device */
> +		adev = &new_dev->adev;
> +		id = ida_alloc(&pci_doe_adev_ida, GFP_KERNEL);
> +		if (id < 0) {
> +			kfree(new_dev);
> +			return -ENOMEM;
> +		}
> +
> +		adev->id = id;
> +		adev->name = DOE_DEV_NAME;
> +		adev->dev.release = cxl_pci_doe_dev_release;
> +		adev->dev.parent = dev;
> +
> +		if (auxiliary_device_init(adev)) {
> +			cxl_pci_doe_dev_release(&adev->dev);
> +			return -EIO;
> +		}
> +
> +		if (auxiliary_device_add(adev)) {
> +			auxiliary_device_uninit(adev);
> +			return -EIO;
> +		}
> +
> +		rc = devm_add_action_or_reset(dev, cxl_pci_doe_destroy_device,
> +					      adev);
> +		if (rc)
> +			return rc;
> +	}
> +
> +	return 0;
> +}
> +
>  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  {
>  	struct cxl_register_map map;
> @@ -630,6 +753,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>  
> +	rc = cxl_pci_create_doe_devices(pdev);
> +	if (rc)
> +		return rc;
> +
>  	cxl_dvsec_ranges(cxlds);
>  
>  	cxlmd = devm_cxl_add_memdev(cxlds);



* Re: [PATCH V8 05/10] cxl/pci: Create DOE auxiliary driver
  2022-04-14 20:32 ` [PATCH V8 05/10] cxl/pci: Create DOE auxiliary driver ira.weiny
@ 2022-04-27 17:43   ` Jonathan Cameron
  2022-04-28 14:48     ` ira.weiny
  0 siblings, 1 reply; 42+ messages in thread
From: Jonathan Cameron @ 2022-04-27 17:43 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, 14 Apr 2022 13:32:32 -0700
ira.weiny@intel.com wrote:

> From: Ira Weiny <ira.weiny@intel.com>
> 
> CXL kernel drivers optionally need to access DOE mailbox capabilities.
> Access to mailboxes for things such as CDAT, SPDM, and IDE are needed by
> the kernel while other access is designed towards user space usage.  An
> example of this is for CXL Compliance Testing (see CXL 2.0 14.16.4
> Compliance Mode DOE) which offers a mechanism to set different test
> modes for a device.
> 
> There is no anticipated need for the kernel to share an individual
> mailbox with user space.  Thus developing an interface to marshal access
> between the kernel and user space for a single mailbox is unnecessary
> overhead.  However, having the kernel relinquish some mailboxes to be
> controlled by user space is a reasonable compromise to share access to
> the device.
> 
> The auxiliary bus provides an elegant solution for this.  Each DOE
> capability is given its own auxiliary device.  This device is controlled
> by a kernel driver by default which restricts access to the mailbox.
> Unbinding the driver from a single auxiliary device (DOE mailbox
> capability) frees the mailbox for user space access.  This architecture
> also allows a clear picture on which mailboxes are kernel controlled vs
> not.
> 
> Create a driver for the DOE auxiliary devices.  The driver uses the PCI
> DOE core to manage the mailbox.
> 
> User space must be prevented from unbinding the driver state when the
> DOE auxiliary driver is being accessed by the kernel.  Add a read write
> lock to the DOE auxiliary device to protect the driver data portion.
> 
> Finally, flag the driver module to be preloaded by device creation to
> ensure the driver is attached when iterating the DOE capabilities.
> 
> User space access can be obtained by unbinding the driver from that
> device.  For example:
> 
> $ ls -l /sys/bus/auxiliary/drivers
> total 0
> drwxr-xr-x 2 root root 0 Mar 24 10:45 cxl_doe.cxl_doe_drv
> 
> $ ls -l /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci*
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.0 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.1 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.2 -> ../../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.3 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.4 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.4
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.5 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.6 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.7 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7
> 
> $ echo "cxl_pci.doe.4" > /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/unbind
> 
> $ ls -l /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci*
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.0 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.1 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.2 -> ../../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.3 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.5 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.6 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
> lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.7 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Hi Ira,

A few minor comments inline.
With those cleaned up

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> 
> ---
> Changes from V7:
> 	Now need to select PCI_DOE
> 	Change MODULE_LICENSE to 'GPL' instead of old 'GPL v2'
> 
> Changes from V6:
> 	The CXL layer now contains the driver for these auxiliary
> 	devices.
> 
> Changes from V5:
> 	Split the CXL specific stuff off from the PCI DOE create
> 	auxiliary device code.
> ---
>  drivers/cxl/Kconfig           | 13 +++++
>  drivers/cxl/Makefile          |  2 +
>  drivers/cxl/cxlpci.h          | 13 +++++
>  drivers/cxl/doe.c             | 90 +++++++++++++++++++++++++++++++++++
>  drivers/cxl/pci.c             | 20 ++++++++
>  include/uapi/linux/pci_regs.h |  1 +
>  6 files changed, 139 insertions(+)
>  create mode 100644 drivers/cxl/doe.c
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index ac0f5ca95431..82f3908fa5cc 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -103,4 +103,17 @@ config CXL_SUSPEND
>  	def_bool y
>  	depends on SUSPEND && CXL_MEM
>  
> +config CXL_PCI_DOE
> +      tristate "CXL PCI Data Object Exchange (DOE) support"
> +      depends on CXL_PCI

Should be leading tabs not spaces.

> +      default CXL_BUS
> +      select PCI_DOE
> +      help
> +        Driver for DOE auxiliary devices.
> +
> +	The DOE capability provides a simple mailbox in PCI config space that
> +	is used for a number of different protocols useful to CXL.  The CXL PCI
> +	subsystem creates auxiliary devices for each DOE mailbox capability
> +	found.  This driver is required for the kernel to use these devices.
> +
>  endif
> diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
> index a78270794150..c71b7a6345fb 100644
> --- a/drivers/cxl/Makefile
> +++ b/drivers/cxl/Makefile
> @@ -1,6 +1,7 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-y += core/
>  obj-$(CONFIG_CXL_PCI) += cxl_pci.o
> +obj-$(CONFIG_CXL_PCI_DOE) += cxl_doe.o
>  obj-$(CONFIG_CXL_MEM) += cxl_mem.o
>  obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
>  obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
> @@ -8,6 +9,7 @@ obj-$(CONFIG_CXL_PORT) += cxl_port.o
>  
>  cxl_mem-y := mem.o
>  cxl_pci-y := pci.o
> +cxl_doe-y := doe.o
>  cxl_acpi-y := acpi.o
>  cxl_pmem-y := pmem.o
>  cxl_port-y := port.o
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 2ad8715173ce..821fe05e8289 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -79,6 +79,7 @@ int devm_cxl_port_enumerate_dports(struct cxl_port *port);
>   *
>   * @adev: Auxiliary bus device
>   * @pdev: PCI device this belongs to
> + * @driver_access: Lock the driver during access
>   * @cap_offset: Capability offset
>   * @use_irq: Set if IRQs are to be used with this mailbox
>   *
> @@ -88,9 +89,21 @@ int devm_cxl_port_enumerate_dports(struct cxl_port *port);
>  struct cxl_doe_dev {
>  	struct auxiliary_device adev;
>  	struct pci_dev *pdev;
> +	struct rw_semaphore driver_access;
>  	int cap_offset;
>  	bool use_irq;
>  };
>  #define DOE_DEV_NAME "doe"
>  
> +/**
> + * struct cxl_doe_drv_state - state of the DOE Aux driver
> + *
> + * @doe_dev: The Auxiliary DOE device

As far as I can tell no one actually uses the doe_dev from here for anything
so do we need it at all?

> + * @doe_mb: PCI DOE mailbox state
> + */
> +struct cxl_doe_drv_state {
> +	struct cxl_doe_dev *doe_dev;
> +	struct pci_doe_mb *doe_mb;
> +};
> +
>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/cxl/doe.c b/drivers/cxl/doe.c
> new file mode 100644
> index 000000000000..1d3a24a77002
> --- /dev/null
> +++ b/drivers/cxl/doe.c
> @@ -0,0 +1,90 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2022 Intel Corporation. All rights reserved. */
> +
> +#include <linux/pci.h>
> +#include <linux/pci-doe.h>

Rather sparse include list given what is going on in here.
We shouldn't rely too heavily on what comes indirectly except
where it's really standard headers.

Probably mod_devicetable.h and some headers for auxiliary devices
and rwsem.h at least?  
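
e.g. something like (untested):

	#include <linux/auxiliary_bus.h>
	#include <linux/mod_devicetable.h>
	#include <linux/module.h>
	#include <linux/pci.h>
	#include <linux/pci-doe.h>
	#include <linux/rwsem.h>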

> +
> +#include "cxlpci.h"
> +
> +static void doe_destroy_mb(void *ds)
> +{
> +	struct cxl_doe_drv_state *doe_ds = ds;
> +
> +	pci_doe_destroy_mb(doe_ds->doe_mb);
> +}
> +
> +static int cxl_pci_doe_probe(struct auxiliary_device *aux_dev,
> +			     const struct auxiliary_device_id *id)
> +{
> +	struct cxl_doe_dev *doe_dev = container_of(aux_dev, struct cxl_doe_dev,
> +						   adev);
> +	struct device *dev = &aux_dev->dev;
> +	struct cxl_doe_drv_state *doe_ds;
> +	struct pci_doe_mb *doe_mb;
> +
> +	doe_ds = devm_kzalloc(dev, sizeof(*doe_ds), GFP_KERNEL);
> +	if (!doe_ds)
> +		return -ENOMEM;
> +
> +	doe_mb = pci_doe_create_mb(doe_dev->pdev, doe_dev->cap_offset,
> +				   doe_dev->use_irq);
> +	if (IS_ERR(doe_mb)) {
> +		dev_err(dev, "Failed to create the DOE mailbox state machine\n");

You could use dev_err_probe() here to tidy this up a little bit.
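Untested, but something along these lines folds the message and the
return into one statement:

	if (IS_ERR(doe_mb))
		return dev_err_probe(dev, PTR_ERR(doe_mb),
				     "Failed to create the DOE mailbox state machine\n");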

> +		return PTR_ERR(doe_mb);

> +	}
> +
> +	doe_ds->doe_mb = doe_mb;
> +	devm_add_action_or_reset(dev, doe_destroy_mb, doe_ds);
> +
> +	down_write(&doe_dev->driver_access);
> +	auxiliary_set_drvdata(aux_dev, doe_ds);
> +	up_write(&doe_dev->driver_access);
> +
> +	return 0;
> +}
> +
> +static void cxl_pci_doe_remove(struct auxiliary_device *aux_dev)
> +{
> +	struct cxl_doe_dev *doe_dev = container_of(aux_dev, struct cxl_doe_dev,
> +						   adev);
> +
> +	down_write(&doe_dev->driver_access);
> +	auxiliary_set_drvdata(aux_dev, NULL);

This confused me for a bit.  I 'think' you are doing this to be able to use
it as a flag for whether the driver is still bound. If so, a comment would
be useful.
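Something like:

	/*
	 * Clearing drvdata acts as the 'driver unbound' flag;
	 * cxl_pci_doe_get_drv() relies on seeing NULL here after
	 * remove.
	 */
	auxiliary_set_drvdata(aux_dev, NULL);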

> +	up_write(&doe_dev->driver_access);
> +}
> +
> +static const struct auxiliary_device_id cxl_pci_doe_auxiliary_id_table[] = {
> +	{.name = "cxl_pci." DOE_DEV_NAME, },
> +	{},
> +};
> +
> +MODULE_DEVICE_TABLE(auxiliary, cxl_pci_doe_auxiliary_id_table);
> +
> +struct auxiliary_driver cxl_pci_doe_auxiliary_drv = {
> +	.name = "cxl_doe_drv",
> +	.id_table = cxl_pci_doe_auxiliary_id_table,
> +	.probe = cxl_pci_doe_probe,
> +	.remove = cxl_pci_doe_remove,
> +};
> +
> +static int __init cxl_pci_doe_init_module(void)
> +{
> +	int ret;
> +
> +	ret = auxiliary_driver_register(&cxl_pci_doe_auxiliary_drv);
> +	if (ret) {
> +		pr_err("Failed cxl_pci_doe auxiliary_driver_register() ret=%d\n",
> +		       ret);
> +	}
> +
> +	return ret;
> +}
> +
> +static void __exit cxl_pci_doe_exit_module(void)
> +{
> +	auxiliary_driver_unregister(&cxl_pci_doe_auxiliary_drv);
> +}
> +
> +module_init(cxl_pci_doe_init_module);
> +module_exit(cxl_pci_doe_exit_module);
> +MODULE_LICENSE("GPL");
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 41a6f3eb0a5c..0dec1f1a3f38 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -590,6 +590,17 @@ static void cxl_pci_doe_destroy_device(void *ad)
>  	auxiliary_device_uninit(ad);
>  }
>  
> +static struct cxl_doe_drv_state *cxl_pci_doe_get_drv(struct cxl_doe_dev *doe_dev)
> +{
> +	down_read(&doe_dev->driver_access);
> +	return auxiliary_get_drvdata(&doe_dev->adev);
> +}
> +
> +static void cxl_pci_doe_put_drv(struct cxl_doe_dev *doe_dev)
> +{
> +	up_read(&doe_dev->driver_access);
> +}
> +
>  /**
>   * cxl_pci_create_doe_devices - Create auxiliary bus DOE devices for all DOE
>   *				mailboxes found
> @@ -652,6 +663,7 @@ static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
>  			return -ENOMEM;
>  
>  		new_dev->pdev = pdev;
> +		init_rwsem(&new_dev->driver_access);
>  		new_dev->cap_offset = off;
>  		new_dev->use_irq = use_irq;
>  
> @@ -682,6 +694,13 @@ static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
>  					      adev);
>  		if (rc)
>  			return rc;
> +
> +		if (device_attach(&adev->dev) != 1) {
> +			dev_err(&adev->dev,
> +				"Failed to attach a driver to DOE device %d\n",
> +				adev->id);
> +			return -ENODEV;
> +		}

Can you add a comment on why this has to be the case at this point.  Why can't
the driver come along later?
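Even just something like:

	/*
	 * cxl_pci_probe() goes on to look up the CDAT mailbox via the
	 * aux driver's state, so the driver must already be bound here.
	 */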

>  	}
>  
>  	return 0;
> @@ -785,6 +804,7 @@ static struct pci_driver cxl_pci_driver = {
>  	},
>  };

...


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 06/10] cxl/pci: Find the DOE mailbox which supports CDAT
  2022-04-14 20:32 ` [PATCH V8 06/10] cxl/pci: Find the DOE mailbox which supports CDAT ira.weiny
@ 2022-04-27 17:49   ` Jonathan Cameron
  2022-05-09 21:25     ` Ira Weiny
  0 siblings, 1 reply; 42+ messages in thread
From: Jonathan Cameron @ 2022-04-27 17:49 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, 14 Apr 2022 13:32:33 -0700
ira.weiny@intel.com wrote:

> From: Ira Weiny <ira.weiny@intel.com>
> 
> Each CXL device may have multiple DOE mailbox capabilities and each
> mailbox may support multiple protocols.
> 
> Search the DOE auxiliary devices for one which supports the CDAT
> protocol.  Cache that device to be used for future queries.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

One question inline.  With that addressed (or convincing argument why not)
the rest looks good.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> 
> ---
> Changes from V7
> 	Minor code clean ups
> 
> Changes from V6
> 	Adjust for aux devices being a CXL only concept
> 	Update commit msg.
> 	Ensure devices iterated by auxiliary_find_device() are checked
> 		to be DOE devices prior to checking for the CDAT
> 		protocol
> 	From Ben
> 		Ensure reference from auxiliary_find_device() is dropped
> ---
>  drivers/cxl/cxl.h    |  2 ++
>  drivers/cxl/cxlmem.h |  2 ++
>  drivers/cxl/pci.c    | 76 +++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 79 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 990b6670222e..50817fd2c912 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -90,6 +90,8 @@ static inline int cxl_hdm_decoder_count(u32 cap_hdr)
>  #define CXLDEV_MBOX_BG_CMD_STATUS_OFFSET 0x18
>  #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
>  
> +#define CXL_DOE_PROTOCOL_TABLE_ACCESS 2
> +
>  /*
>   * Using struct_group() allows for per register-block-type helper routines,
>   * without requiring block-type agnostic code to include the prefix.
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 7235d2f976e5..0dc6988afb30 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -168,6 +168,7 @@ struct cxl_endpoint_dvsec_info {
>   * Currently only memory devices are represented.
>   *
>   * @dev: The device associated with this CXL state
> + * @cdat_doe: Auxiliary DOE device capable of reading CDAT
>   * @regs: Parsed register blocks
>   * @cxl_dvsec: Offset to the PCIe device DVSEC
>   * @payload_size: Size of space for payload
> @@ -200,6 +201,7 @@ struct cxl_endpoint_dvsec_info {
>  struct cxl_dev_state {
>  	struct device *dev;
>  
> +	struct cxl_doe_dev *cdat_doe;
>  	struct cxl_regs regs;
>  	int cxl_dvsec;
>  
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 0dec1f1a3f38..622cac925262 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -706,6 +706,80 @@ static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
>  	return 0;
>  }
>  
> +static bool cxl_doe_dev_supports_prot(struct cxl_doe_dev *doe_dev, u16 vid, u16 pid)
> +{
> +	struct cxl_doe_drv_state *doe_ds;
> +	bool ret;
> +
> +	doe_ds = cxl_pci_doe_get_drv(doe_dev);
> +	if (!doe_ds) {
> +		cxl_pci_doe_put_drv(doe_dev);

That means the get_drv has a rather unexpected side effect.
Could you move the NULL check into the function so that we don't retain
a reference if it fails?
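i.e. something like (untested):

	static struct cxl_doe_drv_state *
	cxl_pci_doe_get_drv(struct cxl_doe_dev *doe_dev)
	{
		struct cxl_doe_drv_state *doe_ds;

		down_read(&doe_dev->driver_access);
		doe_ds = auxiliary_get_drvdata(&doe_dev->adev);
		if (!doe_ds)
			up_read(&doe_dev->driver_access);
		return doe_ds;
	}

so that callers only pair a successful get with a put.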

> +		return false;
> +	}
> +	ret = pci_doe_supports_prot(doe_ds->doe_mb, vid, pid);
> +	cxl_pci_doe_put_drv(doe_dev);
> +	return ret;
> +}
> +
> +static int cxl_match_cdat_doe_device(struct device *dev, const void *data)
> +{
> +	const struct cxl_dev_state *cxlds = data;
> +	struct auxiliary_device *adev = to_auxiliary_dev(dev);
> +	struct cxl_doe_dev *doe_dev;
> +
> +	/* Ensure this is a DOE device */
> +	if (strcmp(DOE_DEV_NAME, adev->name))
> +		return 0;
> +
> +	/* Determine if this auxiliary device belongs to the cxlds */
> +	if (cxlds->dev != dev->parent)
> +		return 0;
> +
> +	doe_dev = container_of(adev, struct cxl_doe_dev, adev);
> +
> +	/* If it is one of ours check for the CDAT protocol */
> +	if (!cxl_doe_dev_supports_prot(doe_dev, PCI_DVSEC_VENDOR_ID_CXL,
> +				   CXL_DOE_PROTOCOL_TABLE_ACCESS))
> +		return 0;
> +
> +	return 1;
> +}
> +
> +static void drop_cdat_doe_ref(void *data)
> +{
> +	struct cxl_doe_dev *cdat_doe = data;
> +
> +	put_device(&cdat_doe->adev.dev);
> +}
> +
> +static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
> +{
> +	struct device *dev = cxlds->dev;
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct auxiliary_device *adev;
> +	int rc;
> +
> +	rc = cxl_pci_create_doe_devices(pdev);
> +	if (rc)
> +		return rc;
> +
> +	adev = auxiliary_find_device(NULL, cxlds, &cxl_match_cdat_doe_device);
> +
> +	if (adev) {
> +		struct cxl_doe_dev *doe_dev = container_of(adev,
> +							   struct cxl_doe_dev,
> +							   adev);
> +
> +		/* auxiliary_find_device() takes the reference */
> +		rc = devm_add_action_or_reset(dev, drop_cdat_doe_ref, doe_dev);
> +		if (rc)
> +			return rc;
> +		cxlds->cdat_doe = doe_dev;
> +	}
> +
> +	return 0;
> +}
> +
>  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  {
>  	struct cxl_register_map map;
> @@ -772,7 +846,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>  
> -	rc = cxl_pci_create_doe_devices(pdev);
> +	rc = cxl_setup_doe_devices(cxlds);
>  	if (rc)
>  		return rc;
>  


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 07/10] cxl/mem: Read CDAT table
  2022-04-14 20:32 ` [PATCH V8 07/10] cxl/mem: Read CDAT table ira.weiny
@ 2022-04-27 17:55   ` Jonathan Cameron
  0 siblings, 0 replies; 42+ messages in thread
From: Jonathan Cameron @ 2022-04-27 17:55 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, 14 Apr 2022 13:32:34 -0700
ira.weiny@intel.com wrote:

> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> The OS will need CDAT data from CXL devices to properly set up
> interleave sets.  Currently this is supported through a DOE mailbox
> which supports CDAT.  But any cxl_mem object can provide this data later
> if need be, for example for testing.
> 
> Cache the CDAT data for later parsing.  Provide a sysfs binary attribute
> to allow dumping of the CDAT.
> 
> Binary dumping is modeled on /sys/firmware/ACPI/tables/
> 
> The ability to dump this table will be very useful for emulation of real
> devices once they become available, as QEMU CXL type 3 device emulation will
> be able to load this file in.
> 
> This does not support table updates at runtime. It will always provide
> whatever was there when first cached. Handling of table updates can be
> implemented later.
> 
> Finally create a complete list of DOE defines within cdat.h for code
> wishing to decode the CDAT table.
> 
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
I'd have left the introduction of callbacks until they were actually needed,
but meh. That's minor.

One trivial comment below.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

...


> +
> +static int read_cdat_data(struct cxl_dev_state *cxlds)
> +{
> +	struct device *dev = cxlds->dev;
> +	size_t cdat_length;
> +	int ret;
> +
> +	if (cxl_mem_cdat_get_length(cxlds, &cdat_length))
> +		return 0;
> +
> +	cxlds->cdat.table = devm_kzalloc(dev, cdat_length, GFP_KERNEL);
> +	if (!cxlds->cdat.table)
> +		return -ENOMEM;

Trivial but blank line here.
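i.e.

	if (!cxlds->cdat.table)
		return -ENOMEM;

	cxlds->cdat.length = cdat_length;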

> +	cxlds->cdat.length = cdat_length;
> +	ret = cxl_mem_cdat_read_table(cxlds, &cxlds->cdat);
> +	if (ret) {
> +		devm_kfree(dev, cxlds->cdat.table);
> +		cxlds->cdat.table = NULL;
> +		cxlds->cdat.length = 0;
> +	}
> +	return ret;
> +}


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 08/10] cxl/cdat: Introduce cxl_cdat_valid()
  2022-04-14 20:32 ` [PATCH V8 08/10] cxl/cdat: Introduce cxl_cdat_valid() ira.weiny
@ 2022-04-27 17:56   ` Jonathan Cameron
  0 siblings, 0 replies; 42+ messages in thread
From: Jonathan Cameron @ 2022-04-27 17:56 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, 14 Apr 2022 13:32:35 -0700
ira.weiny@intel.com wrote:

> From: Ira Weiny <ira.weiny@intel.com>
> 
> The CDAT data is protected by a checksum and should be the proper
> length.
> 
> Introduce cxl_cdat_valid() to validate the data.  While at it check and
> store the sequence number.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
> Changes from V6
> 	Change name to cxl_cdat_valid() as this validates all the CDAT
> 		data not just the header
> 	Add error and debug prints
> 
> Changes from V5
> 	New patch, split out
> 	Update cdat_hdr_valid()
> 		Remove revision and cs field parsing
> 			There is no point in these
> 		Add seq check and debug print.
> ---
>  drivers/cxl/cdat.h |  2 ++
>  drivers/cxl/pci.c  | 36 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 38 insertions(+)
> 
> diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
> index 4722b6bbbaf0..a7725d26f2d2 100644
> --- a/drivers/cxl/cdat.h
> +++ b/drivers/cxl/cdat.h
> @@ -88,10 +88,12 @@
>   *
>   * @table: cache of CDAT table
>   * @length: length of cached CDAT table
> + * @seq: Last read Sequence number of the CDAT table
>   */
>  struct cxl_cdat {
>  	void *table;
>  	size_t length;
> +	u32 seq;
>  };
>  
>  #endif /* !__CXL_CDAT_H__ */
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index aecb327911a0..d7952156dd02 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -781,6 +781,40 @@ static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
>  	return 0;
>  }
>  
> +static bool cxl_cdat_valid(struct device *dev, struct cxl_cdat *cdat)
> +{
> +	u32 *table = cdat->table;
> +	u8 *data8 = cdat->table;
> +	u32 length, seq;
> +	u8 check;
> +	int i;
> +
> +	length = FIELD_GET(CDAT_HEADER_DW0_LENGTH, table[0]);
> +	if ((length < CDAT_HEADER_LENGTH_BYTES) || (length > cdat->length)) {
> +		dev_err(dev, "Invalid length %u (%lu-%lu)\n", length,
> +			CDAT_HEADER_LENGTH_BYTES, cdat->length);
> +		return false;
> +	}
> +
> +	for (check = 0, i = 0; i < length; i++)
> +		check += data8[i];
> +
> +	dev_dbg(dev, "CDAT length %u CS %u\n", length, check);
> +	if (check != 0) {
> +		dev_err(dev, "Invalid checksum %u\n", check);
> +		return false;
> +	}
> +
> +	seq = FIELD_GET(CDAT_HEADER_DW3_SEQUENCE, table[3]);
> +	/* Store the sequence for now. */
> +	if (cdat->seq != seq) {
> +		dev_info(dev, "CDAT seq change %x -> %x\n", cdat->seq, seq);
> +		cdat->seq = seq;
> +	}
> +
> +	return true;
> +}
> +
>  #define CDAT_DOE_REQ(entry_handle)					\
>  	(FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE,			\
>  		    CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) |		\
> @@ -892,6 +926,8 @@ static int cxl_cdat_read_table(struct cxl_dev_state *cxlds,
>  
>  release_driver:
>  	cxl_pci_doe_put_drv(doe_dev);
> +	if (!rc && !cxl_cdat_valid(cxlds->dev, cdat))
> +		return -EIO;
>  	return rc;
>  }
>  


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 09/10] cxl/mem: Retry reading CDAT on failure
  2022-04-14 20:32 ` [PATCH V8 09/10] cxl/mem: Retry reading CDAT on failure ira.weiny
@ 2022-04-27 17:57   ` Jonathan Cameron
  0 siblings, 0 replies; 42+ messages in thread
From: Jonathan Cameron @ 2022-04-27 17:57 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, 14 Apr 2022 13:32:36 -0700
ira.weiny@intel.com wrote:

> From: Ira Weiny <ira.weiny@intel.com>
> 
> The CDAT read may fail for a number of reasons but mainly it is possible
> to get different parts of a valid state.  The checksum in the CDAT table
> protects against this.
> 
> Now that the cdat data is validated issue a retries if the CDAT read

validated, retry if the CDAT read fails.

> fails.  For now 5 retries are implemented.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> 
> ---
> Changes from V6
> 	Move to pci.c
> 	Fix retries count
> 	Change to 5 retries
> 
> Changes from V5:
> 	New patch -- easy to push off or drop.
> ---
>  drivers/cxl/pci.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index d7952156dd02..43cbc297079d 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -940,7 +940,7 @@ static void cxl_initialize_cdat_callbacks(struct cxl_dev_state *cxlds)
>  	cxlds->cdat_read_table = cxl_cdat_read_table;
>  }
>  
> -static int read_cdat_data(struct cxl_dev_state *cxlds)
> +static int __read_cdat_data(struct cxl_dev_state *cxlds)
>  {
>  	struct device *dev = cxlds->dev;
>  	size_t cdat_length;
> @@ -962,6 +962,21 @@ static int read_cdat_data(struct cxl_dev_state *cxlds)
>  	return ret;
>  }
>  
> +static void read_cdat_data(struct cxl_dev_state *cxlds)
> +{
> +	int retries = 5;
> +	int rc;
> +
> +	while (retries--) {
> +		rc = __read_cdat_data(cxlds);
> +		if (!rc)
> +			break;
> +		dev_err(cxlds->dev,
> +			"CDAT data read error rc=%d (retries %d)\n",
> +			rc, retries);
> +	}
> +}
> +
>  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  {
>  	struct cxl_register_map map;
> @@ -1035,9 +1050,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	cxl_initialize_cdat_callbacks(cxlds);
>  
>  	/* Cache the data early to ensure is_visible() works */
> -	rc = read_cdat_data(cxlds);
> -	if (rc)
> -		dev_err(&pdev->dev, "CDAT data read error (%d)\n", rc);
> +	read_cdat_data(cxlds);
>  
>  	cxl_dvsec_ranges(cxlds);
>  


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 10/10] cxl/port: Parse out DSMAS data from CDAT table
  2022-04-14 20:32 ` [PATCH V8 10/10] cxl/port: Parse out DSMAS data from CDAT table ira.weiny
@ 2022-04-27 18:01   ` Jonathan Cameron
  0 siblings, 0 replies; 42+ messages in thread
From: Jonathan Cameron @ 2022-04-27 18:01 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, 14 Apr 2022 13:32:37 -0700
ira.weiny@intel.com wrote:

> From: Ira Weiny <ira.weiny@intel.com>
> 
> CXL Ports with memory devices attached need the information from the
> Device Scoped Memory Affinity Structure (DSMAS).  This information is
> contained within the CDAT table buffer which is previously read and
> cached in the device state.
> 
> If CDAT data is available, parse and cache DSMAS data from the table.
> Store this data in unmarshaled struct dsmas data structures for ease of
> use.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

You could hold off on this patch and have it in the series that uses
the data.

Patch itself looks fine - it's just a bit random to parse one particular
record and do nothing with it beyond some debug prints :)

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> 
> ---
> Changes from V7
> 	Rebased on cxl-pending
> 
> Changes from V6
> 	Move to port.c
> 	It is not an error if no DSMAS data is found
> 
> Changes from V5
> 	Fix up sparse warnings
> 	Split out cdat_hdr_valid()
> 	Update cdat_hdr_valid()
> 		Remove revision and cs field parsing
> 			There is no point in these
> 		Add seq check and debug print.
> 	From Jonathan
> 		Add spaces around '+' and '/'
> 		use devm_krealloc() for dmas_ary
> ---
>  drivers/cxl/cdat.h | 17 ++++++++++++
>  drivers/cxl/cxl.h  |  6 +++++
>  drivers/cxl/port.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 88 insertions(+)
> 
> diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
> index a7725d26f2d2..66706b238bc9 100644
> --- a/drivers/cxl/cdat.h
> +++ b/drivers/cxl/cdat.h
> @@ -83,6 +83,23 @@
>  #define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
>  #define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
>  
> +/**
> + * struct cxl_dsmas - host unmarshaled version of DSMAS data
> + *
> + * As defined in the Coherent Device Attribute Table (CDAT) specification this
> + * represents a single DSMAS entry in that table.
> + *
> + * @dpa_base: The lowest Device Physical Address associated with this DSMAS
> + * @length: Length in bytes of this DSMAS
> + * @non_volatile: If set, the memory region represents Non-Volatile memory
> + */
> +struct cxl_dsmas {
> +	u64 dpa_base;
> +	u64 length;
> +	/* Flags */
> +	u8 non_volatile:1;
> +};
> +
>  /**
>   * struct cxl_cdat - CXL CDAT data
>   *
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 50817fd2c912..80fd39769a65 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -9,6 +9,8 @@
>  #include <linux/bitops.h>
>  #include <linux/io.h>
>  
> +#include "cdat.h"
> +
>  /**
>   * DOC: cxl objects
>   *
> @@ -269,6 +271,8 @@ struct cxl_nvdimm {
>   * @component_reg_phys: component register capability base address (optional)
>   * @dead: last ep has been removed, force port re-creation
>   * @depth: How deep this port is relative to the root. depth 0 is the root.
> + * @dsmas_ary: Array of DSMAS entries as parsed from the CDAT table
> + * @nr_dsmas: Number of entries in dsmas_ary
>   */
>  struct cxl_port {
>  	struct device dev;
> @@ -280,6 +284,8 @@ struct cxl_port {
>  	resource_size_t component_reg_phys;
>  	bool dead;
>  	unsigned int depth;
> +	struct cxl_dsmas *dsmas_ary;
> +	int nr_dsmas;
>  };
>  
>  /**
> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> index d420da5fc39c..2288432a27cd 100644
> --- a/drivers/cxl/port.c
> +++ b/drivers/cxl/port.c
> @@ -30,6 +30,70 @@ static void schedule_detach(void *cxlmd)
>  	schedule_cxl_memdev_detach(cxlmd);
>  }
>  
> +static void parse_dsmas(struct cxl_port *port, struct cxl_dev_state *cxlds)
> +{
> +	struct device *dev = &port->dev;
> +	struct cxl_dsmas *dsmas_ary = NULL;
> +	u32 *data = cxlds->cdat.table;
> +	int bytes_left = cxlds->cdat.length;
> +	int nr_dsmas = 0;
> +
> +	if (!data) {
> +		dev_info(dev, "No CDAT data available for DSMAS\n");
> +		return;
> +	}
> +
> +	/* Skip header */
> +	data += CDAT_HEADER_LENGTH_DW;
> +	bytes_left -= CDAT_HEADER_LENGTH_BYTES;
> +
> +	while (bytes_left > 0) {
> +		u32 *cur_rec = data;
> +		u8 type = FIELD_GET(CDAT_STRUCTURE_DW0_TYPE, cur_rec[0]);
> +		u16 length = FIELD_GET(CDAT_STRUCTURE_DW0_LENGTH, cur_rec[0]);
> +
> +		if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {
> +			struct cxl_dsmas *new_ary;
> +			u8 flags;
> +
> +			new_ary = devm_krealloc(dev, dsmas_ary,
> +					   sizeof(*dsmas_ary) * (nr_dsmas + 1),
> +					   GFP_KERNEL);
> +			if (!new_ary) {
> +				dev_err(dev,
> +					"Failed to allocate memory for DSMAS data (nr_dsmas %d)\n",
> +					nr_dsmas);
> +				return;
> +			}
> +			dsmas_ary = new_ary;
> +
> +			flags = FIELD_GET(CDAT_DSMAS_DW1_FLAGS, cur_rec[1]);
> +
> +			dsmas_ary[nr_dsmas].dpa_base = CDAT_DSMAS_DPA_OFFSET(cur_rec);
> +			dsmas_ary[nr_dsmas].length = CDAT_DSMAS_DPA_LEN(cur_rec);
> +			dsmas_ary[nr_dsmas].non_volatile = CDAT_DSMAS_NON_VOLATILE(flags);
> +
> +			dev_dbg(dev, "DSMAS %d: %llx:%llx %s\n",
> +				nr_dsmas,
> +				dsmas_ary[nr_dsmas].dpa_base,
> +				dsmas_ary[nr_dsmas].dpa_base +
> +					dsmas_ary[nr_dsmas].length,
> +				(dsmas_ary[nr_dsmas].non_volatile ?
> +					"Persistent" : "Volatile")
> +				);
> +
> +			nr_dsmas++;
> +		}
> +
> +		data += (length / sizeof(u32));
> +		bytes_left -= length;
> +	}
> +
> +	dev_dbg(dev, "Found %d DSMAS entries\n", nr_dsmas);
> +	port->dsmas_ary = dsmas_ary;
> +	port->nr_dsmas = nr_dsmas;
> +}
> +


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 05/10] cxl/pci: Create DOE auxiliary driver
  2022-04-27 17:43   ` Jonathan Cameron
@ 2022-04-28 14:48     ` ira.weiny
  2022-04-28 15:17       ` Jonathan Cameron
  0 siblings, 1 reply; 42+ messages in thread
From: ira.weiny @ 2022-04-28 14:48 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Wed, Apr 27, 2022 at 06:43:45PM +0100, Jonathan Cameron wrote:
> On Thu, 14 Apr 2022 13:32:32 -0700
> ira.weiny@intel.com wrote:
> 
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > CXL kernel drivers optionally need to access DOE mailbox capabilities.
> > Access to mailboxes for things such as CDAT, SPDM, and IDE are needed by
> > the kernel while other access is designed towards user space usage.  An
> > example of this is for CXL Compliance Testing (see CXL 2.0 14.16.4
> > Compliance Mode DOE) which offers a mechanism to set different test
> > modes for a device.
> > 
> > There is no anticipated need for the kernel to share an individual
> > mailbox with user space.  Thus developing an interface to marshal access
> > between the kernel and user space for a single mailbox is unnecessary
> > overhead.  However, having the kernel relinquish some mailboxes to be
> > controlled by user space is a reasonable compromise to share access to
> > the device.
> > 
> > The auxiliary bus provides an elegant solution for this.  Each DOE
> > capability is given its own auxiliary device.  This device is controlled
> > by a kernel driver by default which restricts access to the mailbox.
> > Unbinding the driver from a single auxiliary device (DOE mailbox
> > capability) frees the mailbox for user space access.  This architecture
> > also allows a clear picture on which mailboxes are kernel controlled vs
> > not.
> > 
> > Create a driver for the DOE auxiliary devices.  The driver uses the PCI
> > DOE core to manage the mailbox.
> > 
> > User space must be prevented from unbinding the driver state when the
> > DOE auxiliary driver is being accessed by the kernel.  Add a read write
> > lock to the DOE auxiliary device to protect the driver data portion.
> > 
> > Finally, flag the driver module to be preloaded by device creation to
> > ensure the driver is attached when iterating the DOE capabilities.
> > 
> > User space access can be obtained by unbinding the driver from that
> > device.  For example:
> > 
> > $ ls -l /sys/bus/auxiliary/drivers
> > total 0
> > drwxr-xr-x 2 root root 0 Mar 24 10:45 cxl_doe.cxl_doe_drv
> > 
> > $ ls -l /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci*
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.0 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.1 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.2 -> ../../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.3 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.3 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.4
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.5 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.6 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.7 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7
> > 
> > $ echo "cxl_pci.doe.4" > /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/unbind
> > 
> > $ ls -l /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci*
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.0 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.1 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.2 -> ../../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.3 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.5 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.6 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
> > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.7 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Hi Ira,
> 
> A few minor comments inline.
> With those cleaned up

Thanks!

> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> > 
> > ---
> > Changes from V7:
> > 	Now need to select PCI_DOE
> > 	Change MODULE_LICENSE to 'GPL' instead of old 'GPL v2'
> > 
> > Changes from V6:
> > 	The CXL layer now contains the driver for these auxiliary
> > 	devices.
> > 
> > Changes from V5:
> > 	Split the CXL specific stuff off from the PCI DOE create
> > 	auxiliary device code.
> > ---
> >  drivers/cxl/Kconfig           | 13 +++++
> >  drivers/cxl/Makefile          |  2 +
> >  drivers/cxl/cxlpci.h          | 13 +++++
> >  drivers/cxl/doe.c             | 90 +++++++++++++++++++++++++++++++++++
> >  drivers/cxl/pci.c             | 20 ++++++++
> >  include/uapi/linux/pci_regs.h |  1 +
> >  6 files changed, 139 insertions(+)
> >  create mode 100644 drivers/cxl/doe.c
> > 
> > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > index ac0f5ca95431..82f3908fa5cc 100644
> > --- a/drivers/cxl/Kconfig
> > +++ b/drivers/cxl/Kconfig
> > @@ -103,4 +103,17 @@ config CXL_SUSPEND
> >  	def_bool y
> >  	depends on SUSPEND && CXL_MEM
> >  
> > +config CXL_PCI_DOE
> > +      tristate "CXL PCI Data Object Exchange (DOE) support"
> > +      depends on CXL_PCI
> 
> Should be leading tabs not spaces.

Oops, must have been a bad copy/paste.  Sorry.

> 
> > +      default CXL_BUS
> > +      select PCI_DOE
> > +      help
> > +        Driver for DOE auxiliary devices.
> > +
> > +	The DOE capabilities provides a simple mailbox in PCI config space that
> > +	is used for a number of different protocols useful to CXL.  The CXL PCI
> > +	subsystem creates auxiliary devices for each DOE mailbox capability
> > +	found.  This driver is required for the kernel to use these devices.
> > +
> >  endif
> > diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
> > index a78270794150..c71b7a6345fb 100644
> > --- a/drivers/cxl/Makefile
> > +++ b/drivers/cxl/Makefile
> > @@ -1,6 +1,7 @@
> >  # SPDX-License-Identifier: GPL-2.0
> >  obj-y += core/
> >  obj-$(CONFIG_CXL_PCI) += cxl_pci.o
> > +obj-$(CONFIG_CXL_PCI_DOE) += cxl_doe.o
> >  obj-$(CONFIG_CXL_MEM) += cxl_mem.o
> >  obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
> >  obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
> > @@ -8,6 +9,7 @@ obj-$(CONFIG_CXL_PORT) += cxl_port.o
> >  
> >  cxl_mem-y := mem.o
> >  cxl_pci-y := pci.o
> > +cxl_doe-y := doe.o
> >  cxl_acpi-y := acpi.o
> >  cxl_pmem-y := pmem.o
> >  cxl_port-y := port.o
> > diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> > index 2ad8715173ce..821fe05e8289 100644
> > --- a/drivers/cxl/cxlpci.h
> > +++ b/drivers/cxl/cxlpci.h
> > @@ -79,6 +79,7 @@ int devm_cxl_port_enumerate_dports(struct cxl_port *port);
> >   *
> >   * @adev: Auxiliary bus device
> >   * @pdev: PCI device this belongs to
> > + * @driver_access: Lock the driver during access
> >   * @cap_offset: Capability offset
> >   * @use_irq: Set if IRQs are to be used with this mailbox
> >   *
> > @@ -88,9 +89,21 @@ int devm_cxl_port_enumerate_dports(struct cxl_port *port);
> >  struct cxl_doe_dev {
> >  	struct auxiliary_device adev;
> >  	struct pci_dev *pdev;
> > +	struct rw_semaphore driver_access;
> >  	int cap_offset;
> >  	bool use_irq;
> >  };
> >  #define DOE_DEV_NAME "doe"
> >  
> > +/**
> > + * struct cxl_doe_drv_state - state of the DOE Aux driver
> > + *
> > + * @doe_dev: The Auxiliary DOE device
> 
> As far as I can tell no one actually uses the doe_dev from here for anything
> so do we need it at all?

Oh wow!  Great catch.  Worse yet I never even set it.  :-(

It must have been left over cruft from development which I missed in final
review.  I think the logic was that the device goes along with the driver
state...

Yes I will remove it.

But that unfortunately begs the question does cxl_doe_drv_state even need to
exist?

I don't like returning the pci_doe_mb directly but having a struct contain a
struct is worse IMO.

So cxl_pci_doe_get_drv() and cxl_pci_doe_put_drv() are going to be
get_mb/put_mb respectively.
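Roughly (sketch only, with the drvdata now holding the pci_doe_mb
directly):

	static struct pci_doe_mb *cxl_pci_doe_get_mb(struct cxl_doe_dev *doe_dev)
	{
		down_read(&doe_dev->driver_access);
		return auxiliary_get_drvdata(&doe_dev->adev);
	}

	static void cxl_pci_doe_put_mb(struct cxl_doe_dev *doe_dev)
	{
		up_read(&doe_dev->driver_access);
	}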

Can I make that change with your Reviewed-by?

> 
> > + * @doe_mb: PCI DOE mailbox state
> > + */
> > +struct cxl_doe_drv_state {
> > +	struct cxl_doe_dev *doe_dev;
> > +	struct pci_doe_mb *doe_mb;
> > +};
> > +
> >  #endif /* __CXL_PCI_H__ */
> > diff --git a/drivers/cxl/doe.c b/drivers/cxl/doe.c
> > new file mode 100644
> > index 000000000000..1d3a24a77002
> > --- /dev/null
> > +++ b/drivers/cxl/doe.c
> > @@ -0,0 +1,90 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright(c) 2022 Intel Corporation. All rights reserved. */
> > +
> > +#include <linux/pci.h>
> > +#include <linux/pci-doe.h>
> 
> Rather sparse include list given what is going on in here.
> We shouldn't rely too heavily on what comes indirectly except
> where it's really standard headers.
> 
> Probably mod_devicetable.h and some headers for auxiliary devices
> and rwsem.h at least?  

Yea I'll add to this list.

> 
> > +
> > +#include "cxlpci.h"
> > +
> > +static void doe_destroy_mb(void *ds)
> > +{
> > +	struct cxl_doe_drv_state *doe_ds = ds;
> > +
> > +	pci_doe_destroy_mb(doe_ds->doe_mb);
> > +}
> > +
> > +static int cxl_pci_doe_probe(struct auxiliary_device *aux_dev,
> > +			     const struct auxiliary_device_id *id)
> > +{
> > +	struct cxl_doe_dev *doe_dev = container_of(aux_dev, struct cxl_doe_dev,
> > +						   adev);
> > +	struct device *dev = &aux_dev->dev;
> > +	struct cxl_doe_drv_state *doe_ds;
> > +	struct pci_doe_mb *doe_mb;
> > +
> > +	doe_ds = devm_kzalloc(dev, sizeof(*doe_ds), GFP_KERNEL);
> > +	if (!doe_ds)
> > +		return -ENOMEM;
> > +
> > +	doe_mb = pci_doe_create_mb(doe_dev->pdev, doe_dev->cap_offset,
> > +				   doe_dev->use_irq);
> > +	if (IS_ERR(doe_mb)) {
> > +		dev_err(dev, "Failed to create the DOE mailbox state machine\n");
> 
> You could use dev_err_probe() here to tidy this up a little bit.

Sure!

> 
> > +		return PTR_ERR(doe_mb);
> 
> > +	}
> > +
> > +	doe_ds->doe_mb = doe_mb;
> > +	devm_add_action_or_reset(dev, doe_destroy_mb, doe_ds);
> > +
> > +	down_write(&doe_dev->driver_access);
> > +	auxiliary_set_drvdata(aux_dev, doe_ds);
> > +	up_write(&doe_dev->driver_access);
> > +
> > +	return 0;
> > +}
> > +
> > +static void cxl_pci_doe_remove(struct auxiliary_device *aux_dev)
> > +{
> > +	struct cxl_doe_dev *doe_dev = container_of(aux_dev, struct cxl_doe_dev,
> > +						   adev);
> > +
> > +	down_write(&doe_dev->driver_access);
> > +	auxiliary_set_drvdata(aux_dev, NULL);
> 
> This confused me for a bit.  I 'think' you are doing this to be able to use
> it as a flag for whether the driver is still bound. If so, a comment would
> be useful.

Yes.

I'll add this from the commit message:

	User space must be prevented from unbinding the driver state when the
	DOE auxiliary driver is being accessed by the kernel.

> 
> > +	up_write(&doe_dev->driver_access);
> > +}
> > +
> > +static const struct auxiliary_device_id cxl_pci_doe_auxiliary_id_table[] = {
> > +	{.name = "cxl_pci." DOE_DEV_NAME, },
> > +	{},
> > +};
> > +
> > +MODULE_DEVICE_TABLE(auxiliary, cxl_pci_doe_auxiliary_id_table);
> > +
> > +struct auxiliary_driver cxl_pci_doe_auxiliary_drv = {
> > +	.name = "cxl_doe_drv",
> > +	.id_table = cxl_pci_doe_auxiliary_id_table,
> > +	.probe = cxl_pci_doe_probe,
> > +	.remove = cxl_pci_doe_remove,
> > +};
> > +
> > +static int __init cxl_pci_doe_init_module(void)
> > +{
> > +	int ret;
> > +
> > +	ret = auxiliary_driver_register(&cxl_pci_doe_auxiliary_drv);
> > +	if (ret) {
> > +		pr_err("Failed cxl_pci_doe auxiliary_driver_register() ret=%d\n",
> > +		       ret);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static void __exit cxl_pci_doe_exit_module(void)
> > +{
> > +	auxiliary_driver_unregister(&cxl_pci_doe_auxiliary_drv);
> > +}
> > +
> > +module_init(cxl_pci_doe_init_module);
> > +module_exit(cxl_pci_doe_exit_module);
> > +MODULE_LICENSE("GPL");
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index 41a6f3eb0a5c..0dec1f1a3f38 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -590,6 +590,17 @@ static void cxl_pci_doe_destroy_device(void *ad)
> >  	auxiliary_device_uninit(ad);
> >  }
> >  
> > +static struct cxl_doe_drv_state *cxl_pci_doe_get_drv(struct cxl_doe_dev *doe_dev)
> > +{
> > +	down_read(&doe_dev->driver_access);
> > +	return auxiliary_get_drvdata(&doe_dev->adev);
> > +}
> > +
> > +static void cxl_pci_doe_put_drv(struct cxl_doe_dev *doe_dev)
> > +{
> > +	up_read(&doe_dev->driver_access);
> > +}
> > +
> >  /**
> >   * cxl_pci_create_doe_devices - Create auxiliary bus DOE devices for all DOE
> >   *				mailboxes found
> > @@ -652,6 +663,7 @@ static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
> >  			return -ENOMEM;
> >  
> >  		new_dev->pdev = pdev;
> > +		init_rwsem(&new_dev->driver_access);
> >  		new_dev->cap_offset = off;
> >  		new_dev->use_irq = use_irq;
> >  
> > @@ -682,6 +694,13 @@ static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
> >  					      adev);
> >  		if (rc)
> >  			return rc;
> > +
> > +		if (device_attach(&adev->dev) != 1) {
> > +			dev_err(&adev->dev,
> > +				"Failed to attach a driver to DOE device %d\n",
> > +				adev->id);
> > +			return -ENODEV;
> > +		}
> 
> Can you add a comment on why this has to be the case at this point.  Why can't
> the driver come along later?

Yes I will.

I've not really liked this aspect of the auxiliary device arch.  But I've not
convinced myself that it will work ok to leave this till later.

Putting the CDAT retry on a timer might be a better option but it would also
mean searching for the proper DOE would need to be delayed as well...

For now I'll put in a comment.

Thanks for the review!
Ira

> 
> >  	}
> >  
> >  	return 0;
> > @@ -785,6 +804,7 @@ static struct pci_driver cxl_pci_driver = {
> >  	},
> >  };
> 
> ...
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 05/10] cxl/pci: Create DOE auxiliary driver
  2022-04-28 14:48     ` ira.weiny
@ 2022-04-28 15:17       ` Jonathan Cameron
  0 siblings, 0 replies; 42+ messages in thread
From: Jonathan Cameron @ 2022-04-28 15:17 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, 28 Apr 2022 07:48:00 -0700
ira.weiny@intel.com wrote:

> On Wed, Apr 27, 2022 at 06:43:45PM +0100, Jonathan Cameron wrote:
> > On Thu, 14 Apr 2022 13:32:32 -0700
> > ira.weiny@intel.com wrote:
> >   
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > CXL kernel drivers optionally need to access DOE mailbox capabilities.
> > > Access to mailboxes for things such as CDAT, SPDM, and IDE are needed by
> > > the kernel while other access is designed towards user space usage.  An
> > > example of this is for CXL Compliance Testing (see CXL 2.0 14.16.4
> > > Compliance Mode DOE) which offers a mechanism to set different test
> > > modes for a device.
> > > 
> > > There is no anticipated need for the kernel to share an individual
> > > mailbox with user space.  Thus developing an interface to marshal access
> > > between the kernel and user space for a single mailbox is unnecessary
> > > overhead.  However, having the kernel relinquish some mailboxes to be
> > > controlled by user space is a reasonable compromise to share access to
> > > the device.
> > > 
> > > The auxiliary bus provides an elegant solution for this.  Each DOE
> > > capability is given its own auxiliary device.  This device is controlled
> > > by a kernel driver by default which restricts access to the mailbox.
> > > Unbinding the driver from a single auxiliary device (DOE mailbox
> > > capability) frees the mailbox for user space access.  This architecture
> > > also allows a clear picture on which mailboxes are kernel controlled vs
> > > not.
> > > 
> > > Create a driver for the DOE auxiliary devices.  The driver uses the PCI
> > > DOE core to manage the mailbox.
> > > 
> > > User space must be prevented from unbinding the driver state when the
> > > DOE auxiliary driver is being accessed by the kernel.  Add a read write
> > > lock to the DOE auxiliary device to protect the driver data portion.
> > > 
> > > Finally, flag the driver module to be preloaded by device creation to
> > > ensure the driver is attached when iterating the DOE capabilities.
> > > 
> > > User space access can be obtained by unbinding the driver from that
> > > device.  For example:
> > > 
> > > $ ls -l /sys/bus/auxiliary/drivers
> > > total 0
> > > drwxr-xr-x 2 root root 0 Mar 24 10:45 cxl_doe.cxl_doe_drv
> > > 
> > > $ ls -l /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci*
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.0 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.1 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.2 -> ../../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.3 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.3 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.4
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.5 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.6 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.7 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7
> > > 
> > > $ echo "cxl_pci.doe.4" > /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/unbind
> > > 
> > > $ ls -l /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci*
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.0 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.1 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.2 -> ../../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.3 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.5 -> ../../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.6 -> ../../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:53 /sys/bus/auxiliary/drivers/cxl_doe.cxl_doe_drv/cxl_pci.doe.7 -> ../../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7
> > > 
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>  
> > 
> > Hi Ira,
> > 
> > A few minor comments inline.
> > With those cleaned up  
> 
> Thanks!
> 
> > 
> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >   
> > > 
> > > ---
> > > Changes from V7:
> > > 	Now need to select PCI_DOE
> > > 	Change MODULE_LICENSE to 'GPL' instead of old 'GPL v2'
> > > 
> > > Changes from V6:
> > > 	The CXL layer now contains the driver for these auxiliary
> > > 	devices.
> > > 
> > > Changes from V5:
> > > 	Split the CXL specific stuff off from the PCI DOE create
> > > 	auxiliary device code.
> > > ---
> > >  drivers/cxl/Kconfig           | 13 +++++
> > >  drivers/cxl/Makefile          |  2 +
> > >  drivers/cxl/cxlpci.h          | 13 +++++
> > >  drivers/cxl/doe.c             | 90 +++++++++++++++++++++++++++++++++++
> > >  drivers/cxl/pci.c             | 20 ++++++++
> > >  include/uapi/linux/pci_regs.h |  1 +
> > >  6 files changed, 139 insertions(+)
> > >  create mode 100644 drivers/cxl/doe.c
> > > 

...


> > > diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> > > index 2ad8715173ce..821fe05e8289 100644
> > > --- a/drivers/cxl/cxlpci.h
> > > +++ b/drivers/cxl/cxlpci.h
> > > @@ -79,6 +79,7 @@ int devm_cxl_port_enumerate_dports(struct cxl_port *port);
> > >   *
> > >   * @adev: Auxiliary bus device
> > >   * @pdev: PCI device this belongs to
> > > + * @driver_access: Lock the driver during access
> > >   * @cap_offset: Capability offset
> > >   * @use_irq: Set if IRQs are to be used with this mailbox
> > >   *
> > > @@ -88,9 +89,21 @@ int devm_cxl_port_enumerate_dports(struct cxl_port *port);
> > >  struct cxl_doe_dev {
> > >  	struct auxiliary_device adev;
> > >  	struct pci_dev *pdev;
> > > +	struct rw_semaphore driver_access;
> > >  	int cap_offset;
> > >  	bool use_irq;
> > >  };
> > >  #define DOE_DEV_NAME "doe"
> > >  
> > > +/**
> > > + * struct cxl_doe_drv_state - state of the DOE Aux driver
> > > + *
> > > + * @doe_dev: The Auxiliary DOE device  
> > 
> > As far as I can tell no one actually uses the doe_dev from here for anything
> > so do we need it at all?  
> 
> Oh wow!  Great catch.  Worse yet I never even set it.  :-(
> 
> It must have been left over cruft from development which I missed in final
> review.  I think the logic was that the device goes along with the driver
> state...
> 
> Yes I will remove it.
> 
> But that unfortunately begs the question does cxl_doe_drv_state even need to
> exist?
> 
> I don't like returning the pci_doe_mb directly but having a struct contain a
> struct is worse IMO.
> 
> So cxl_pci_doe_get_drv() and cxl_pci_doe_put_drv() are going to be
> get_mb/put_mb respectively.
> 
> Can I make that change with your Reviewed-by?

Yes, that's fine.

> 
> >   
> > > + * @doe_mb: PCI DOE mailbox state
> > > + */
> > > +struct cxl_doe_drv_state {
> > > +	struct cxl_doe_dev *doe_dev;
> > > +	struct pci_doe_mb *doe_mb;
> > > +};
> > > +
> > >  #endif /* __CXL_PCI_H__ */
> > > diff --git a/drivers/cxl/doe.c b/drivers/cxl/doe.c
> > > new file mode 100644
> > > index 000000000000..1d3a24a77002
> > > --- /dev/null
> > > +++ b/drivers/cxl/doe.c

...

> >   
> > > +		return PTR_ERR(doe_mb);  
> >   
> > > +	}
> > > +
> > > +	doe_ds->doe_mb = doe_mb;
> > > +	devm_add_action_or_reset(dev, doe_destroy_mb, doe_ds);
> > > +
> > > +	down_write(&doe_dev->driver_access);
> > > +	auxiliary_set_drvdata(aux_dev, doe_ds);
> > > +	up_write(&doe_dev->driver_access);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void cxl_pci_doe_remove(struct auxiliary_device *aux_dev)
> > > +{
> > > +	struct cxl_doe_dev *doe_dev = container_of(aux_dev, struct cxl_doe_dev,
> > > +						   adev);
> > > +
> > > +	down_write(&doe_dev->driver_access);
> > > +	auxiliary_set_drvdata(aux_dev, NULL);  
> > 
> > This confused me for a bit.  I 'think' you are doing this to be able to use
> > it as a flag for whether the driver is still bound. If so, a comment would
> > be useful.  
> 
> Yes.
> 
> I'll add this from the commit message:
> 
> 	User space must be prevented from unbinding the driver state when the
> 	DOE auxiliary driver is being accessed by the kernel.

Maybe a comment here as well, since setting drvdata to NULL is rather rare;
I'm fairly sure the driver core does it for you these days (but too late
for this particular use).

> 
> >   
> > > +	up_write(&doe_dev->driver_access);
> > > +}
> > > +
> > > +static const struct auxiliary_device_id cxl_pci_doe_auxiliary_id_table[] = {
> > > +	{.name = "cxl_pci." DOE_DEV_NAME, },
> > > +	{},
> > > +};
> > > +
> > > +MODULE_DEVICE_TABLE(auxiliary, cxl_pci_doe_auxiliary_id_table);
> > > +
> > > +struct auxiliary_driver cxl_pci_doe_auxiliary_drv = {
> > > +	.name = "cxl_doe_drv",
> > > +	.id_table = cxl_pci_doe_auxiliary_id_table,
> > > +	.probe = cxl_pci_doe_probe,
> > > +	.remove = cxl_pci_doe_remove,
> > > +};
> > > +
> > > +static int __init cxl_pci_doe_init_module(void)
> > > +{
> > > +	int ret;
> > > +
> > > +	ret = auxiliary_driver_register(&cxl_pci_doe_auxiliary_drv);
> > > +	if (ret) {
> > > +		pr_err("Failed cxl_pci_doe auxiliary_driver_register() ret=%d\n",
> > > +		       ret);
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static void __exit cxl_pci_doe_exit_module(void)
> > > +{
> > > +	auxiliary_driver_unregister(&cxl_pci_doe_auxiliary_drv);
> > > +}
> > > +
> > > +module_init(cxl_pci_doe_init_module);
> > > +module_exit(cxl_pci_doe_exit_module);
> > > +MODULE_LICENSE("GPL");
> > > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > > index 41a6f3eb0a5c..0dec1f1a3f38 100644
> > > --- a/drivers/cxl/pci.c
> > > +++ b/drivers/cxl/pci.c
> > > @@ -590,6 +590,17 @@ static void cxl_pci_doe_destroy_device(void *ad)
> > >  	auxiliary_device_uninit(ad);
> > >  }
> > >  
> > > +static struct cxl_doe_drv_state *cxl_pci_doe_get_drv(struct cxl_doe_dev *doe_dev)
> > > +{
> > > +	down_read(&doe_dev->driver_access);
> > > +	return auxiliary_get_drvdata(&doe_dev->adev);
> > > +}
> > > +
> > > +static void cxl_pci_doe_put_drv(struct cxl_doe_dev *doe_dev)
> > > +{
> > > +	up_read(&doe_dev->driver_access);
> > > +}
> > > +
> > >  /**
> > >   * cxl_pci_create_doe_devices - Create auxiliary bus DOE devices for all DOE
> > >   *				mailboxes found
> > > @@ -652,6 +663,7 @@ static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
> > >  			return -ENOMEM;
> > >  
> > >  		new_dev->pdev = pdev;
> > > +		init_rwsem(&new_dev->driver_access);
> > >  		new_dev->cap_offset = off;
> > >  		new_dev->use_irq = use_irq;
> > >  
> > > @@ -682,6 +694,13 @@ static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
> > >  					      adev);
> > >  		if (rc)
> > >  			return rc;
> > > +
> > > +		if (device_attach(&adev->dev) != 1) {
> > > +			dev_err(&adev->dev,
> > > +				"Failed to attach a driver to DOE device %d\n",
> > > +				adev->id);
> > > +			return -ENODEV;
> > > +		}  
> > 
> > Can you add a comment on why this has to be the case at this point.  Why can't
> > the driver come along later?  
> 
> Yes I will.
> 
> I've not really liked this aspect of the auxiliary device arch.  But I've not
> convinced myself that it will work ok to leave this till later.
> 
> Putting the CDAT retry on a timer might be a better option but it would also
> mean searching for the proper DOE would need to be delayed as well...

Ultimately we may not need to know the CDAT info at the time of driver
load, but only on either a userspace read or setup of a region etc.
I guess in the majority of cases that will all be automated though (based
on LSA contents etc).

> 
> For now I'll put in a comment.
Great,

> 
> Thanks for the review!

You are welcome.

Thanks for driving this forwards.

Jonathan

> Ira
> 
> >   
> > >  	}
> > >  
> > >  	return 0;
> > > @@ -785,6 +804,7 @@ static struct pci_driver cxl_pci_driver = {
> > >  	},
> > >  };  
> > 
> > ...
> >   


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 04/10] cxl/pci: Create auxiliary devices for each DOE mailbox
  2022-04-27 17:19   ` Jonathan Cameron
@ 2022-04-28 21:09     ` ira.weiny
  2022-04-29 16:38       ` Jonathan Cameron
  0 siblings, 1 reply; 42+ messages in thread
From: ira.weiny @ 2022-04-28 21:09 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Wed, Apr 27, 2022 at 06:19:42PM +0100, Jonathan Cameron wrote:
> On Thu, 14 Apr 2022 13:32:31 -0700
> ira.weiny@intel.com wrote:
> 
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > CXL kernel drivers optionally need to access DOE mailbox capabilities.
> > Access to mailboxes for things such as CDAT, SPDM, and IDE are needed by
> > the kernel while other access is designed towards user space usage.  An
> > example of this is for CXL Compliance Testing (see CXL 2.0 14.16.4
> > Compliance Mode DOE) which offers a mechanism to set different test
> > modes for a device.
> > 
> > There is no anticipated need for the kernel to share an individual
> > mailbox with user space.  Thus developing an interface to marshal access
> > between the kernel and user space for a single mailbox is unnecessary
> > overhead.  However, having the kernel relinquish some mailboxes to be
> > controlled by user space is a reasonable compromise to share access to
> > the device.
> > 
> > The auxiliary bus provides an elegant solution for this.  Each DOE
> > capability is given its own auxiliary device.  This device is controlled
> > by a kernel driver by default which restricts access to the mailbox.
> > Unbinding the driver from a single auxiliary device (DOE mailbox
> > capability) frees the mailbox for user space access.  This architecture
> > also allows a clear picture on which mailboxes are kernel controlled vs
> > not.
> > 
> > Iterate each DOE mailbox capability and create auxiliary bus devices.
> > Follow on patches will define a driver for the newly created devices.
> > 
> > sysfs shows the devices.
> > 
> > $ ls -l /sys/bus/auxiliary/devices/
> > total 0
> > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.0 -> ../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
> > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.1 -> ../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
> > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.2 -> ../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
> > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.3 -> ../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
> > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.4 -> ../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.4
> > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.5 -> ../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
> > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.6 -> ../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
> > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.7 -> ../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> I'm not 100% happy with effectively having one solution for CXL
> and probably a different one for DOEs on switch ports
> (which I just hacked into a port service driver to convince
> myself there was at least one plausible way of doing that) but if
> this effectively separates the two discussions then I guess I can
> live with it for now ;)

I took some time this morning to mull this over and talk to Dan...

:-(

Truthfully the aux driver does very little except provide a way for admins to
trigger the driver to stop/start accessing the mailbox.

I suppose a simple sysfs interface could be done to do the same?
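Purely hypothetical sketch (the attribute name and the kernel_owned
field are made up here; nothing in the series implements this):

	static ssize_t kernel_owned_store(struct device *dev,
					  struct device_attribute *attr,
					  const char *buf, size_t count)
	{
		struct cxl_doe_dev *doe_dev =
			container_of(dev, struct cxl_doe_dev, adev.dev);
		bool own;
		int rc;

		rc = kstrtobool(buf, &own);
		if (rc)
			return rc;

		/* hand the mailbox to/from the kernel under the rwsem */
		down_write(&doe_dev->driver_access);
		doe_dev->kernel_owned = own;	/* hypothetical field */
		up_write(&doe_dev->driver_access);
		return count;
	}
	static DEVICE_ATTR_WO(kernel_owned);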

I'll let Dan weigh in here.

> 
> Once this is merged we can start the discussion about how to
> handle switch ports with DOEs both for CDAT and SPDM.

I'm ok with that too.  However, I was thinking that this was not a user ABI.
But it really is.  If user space starts writing scripts to unbind drivers and
then we drop the aux driver support it will break them...

> 
> I'll send out an RFC that is so hideous it will get people to
> suggestion how to do it better!

I think I'd like to see that.

> Currently it starts and
> stops the mailbox 3 times in the registration path and I think
> it's more luck than judgement that is landing me with the right
> MSI.
> 
> Anyhow, this looks good to me.
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Thanks!

The good news is that the main support for the algorithm is now just part of
the PCI core as library functions.  I think this discussion is a pretty
light lift: it is mostly a matter of moving the calls to those functions
around...
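
For illustration, some other consumer (say a switch port driver) could drive
a mailbox with just those helpers, roughly like this (sketch only, untested;
assumes pdev, vid, and type are in hand):

	u16 off = 0;

	pci_doe_for_each_off(pdev, off) {
		struct pci_doe_mb *doe_mb;

		/* false == polling; pass true after allocating IRQ vectors */
		doe_mb = pci_doe_create_mb(pdev, off, false);
		if (IS_ERR(doe_mb))
			continue;

		if (!pci_doe_supports_prot(doe_mb, vid, type)) {
			pci_doe_destroy_mb(doe_mb);
			continue;
		}
		/* ... build a struct pci_doe_task, pci_doe_submit_task() ... */
	}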

Ira

> 
> 
> > 
> > ---
> > Changes from V7:
> > 	Minor code clean ups
> > 	Rebased on cxl-pending
> > 
> > Changes from V6:
> > 	Move all the auxiliary device stuff to the CXL layer
> > 
> > Changes from V5:
> > 	Split the CXL specific stuff off from the PCI DOE create
> > 	auxiliary device code.
> > ---
> >  drivers/cxl/Kconfig  |   1 +
> >  drivers/cxl/cxlpci.h |  21 +++++++
> >  drivers/cxl/pci.c    | 127 +++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 149 insertions(+)
> > 
> > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > index f64e3984689f..ac0f5ca95431 100644
> > --- a/drivers/cxl/Kconfig
> > +++ b/drivers/cxl/Kconfig
> > @@ -16,6 +16,7 @@ if CXL_BUS
> >  config CXL_PCI
> >  	tristate "PCI manageability"
> >  	default CXL_BUS
> > +	select AUXILIARY_BUS
> >  	help
> >  	  The CXL specification defines a "CXL memory device" sub-class in the
> >  	  PCI "memory controller" base class of devices. Device's identified by
> > diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> > index 329e7ea3f36a..2ad8715173ce 100644
> > --- a/drivers/cxl/cxlpci.h
> > +++ b/drivers/cxl/cxlpci.h
> > @@ -2,6 +2,7 @@
> >  /* Copyright(c) 2020 Intel Corporation. All rights reserved. */
> >  #ifndef __CXL_PCI_H__
> >  #define __CXL_PCI_H__
> > +#include <linux/auxiliary_bus.h>
> >  #include <linux/pci.h>
> >  #include "cxl.h"
> >  
> > @@ -72,4 +73,24 @@ static inline resource_size_t cxl_regmap_to_base(struct pci_dev *pdev,
> >  }
> >  
> >  int devm_cxl_port_enumerate_dports(struct cxl_port *port);
> > +
> > +/**
> > + * struct cxl_doe_dev - CXL DOE auxiliary bus device
> > + *
> > + * @adev: Auxiliary bus device
> > + * @pdev: PCI device this belongs to
> > + * @cap_offset: Capability offset
> > + * @use_irq: Set if IRQs are to be used with this mailbox
> > + *
> > + * This represents a single DOE mailbox device.  CXL devices should create this
> > + * device and register it on the Auxiliary bus for the CXL DOE driver to drive.
> > + */
> > +struct cxl_doe_dev {
> > +	struct auxiliary_device adev;
> > +	struct pci_dev *pdev;
> > +	int cap_offset;
> > +	bool use_irq;
> > +};
> > +#define DOE_DEV_NAME "doe"
> > +
> >  #endif /* __CXL_PCI_H__ */
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index e7ab9a34d718..41a6f3eb0a5c 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -8,6 +8,7 @@
> >  #include <linux/mutex.h>
> >  #include <linux/list.h>
> >  #include <linux/pci.h>
> > +#include <linux/pci-doe.h>
> >  #include <linux/io.h>
> >  #include "cxlmem.h"
> >  #include "cxlpci.h"
> > @@ -564,6 +565,128 @@ static void cxl_dvsec_ranges(struct cxl_dev_state *cxlds)
> >  	info->ranges = __cxl_dvsec_ranges(cxlds, info);
> >  }
> >  
> > +static void cxl_pci_free_irq_vectors(void *data)
> > +{
> > +	pci_free_irq_vectors(data);
> > +}
> > +
> > +static DEFINE_IDA(pci_doe_adev_ida);
> > +
> > +static void cxl_pci_doe_dev_release(struct device *dev)
> > +{
> > +	struct auxiliary_device *adev = container_of(dev,
> > +						struct auxiliary_device,
> > +						dev);
> > +	struct cxl_doe_dev *doe_dev = container_of(adev, struct cxl_doe_dev,
> > +						   adev);
> > +
> > +	ida_free(&pci_doe_adev_ida, adev->id);
> > +	kfree(doe_dev);
> > +}
> > +
> > +static void cxl_pci_doe_destroy_device(void *ad)
> > +{
> > +	auxiliary_device_delete(ad);
> > +	auxiliary_device_uninit(ad);
> > +}
> > +
> > +/**
> > + * cxl_pci_create_doe_devices - Create auxiliary bus DOE devices for all DOE
> > + *				mailboxes found
> > + *
> > + * @pci_dev: The PCI device to scan for DOE mailboxes
> > + *
> > + * There is no corresponding destroy of these devices.  This function associates
> > + * the DOE auxiliary devices created with the pci_dev passed in.  That
> > + * association is device managed (devm_*) such that the DOE auxiliary device
> > + * lifetime is always less than or equal to the lifetime of the pci_dev.
> > + *
> > + * RETURNS: 0 on success -ERRNO on failure.
> > + */
> > +static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
> > +{
> > +	struct device *dev = &pdev->dev;
> > +	bool use_irq = true;
> > +	int irqs = 0;
> > +	u16 off = 0;
> > +	int rc;
> > +
> > +	pci_doe_for_each_off(pdev, off)
> > +		irqs++;
> > +	pci_info(pdev, "Found %d DOE mailboxes\n", irqs);
> > +
> > +	/*
> > +	 * Allocate enough vectors for the DOEs
> > +	 */
> > +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSI |
> > +						     PCI_IRQ_MSIX);
> > +	if (rc != irqs) {
> > +		pci_err(pdev,
> > +			"Not enough interrupts for all the DOEs; use polling\n");
> > +		use_irq = false;
> > +		/* Some got allocated; clean them up */
> > +		if (rc > 0)
> > +			cxl_pci_free_irq_vectors(pdev);
> > +	} else {
> > +		/*
> > +		 * Enabling bus mastering is required for MSI/MSI-X.  It could be
> > +		 * done later within the DOE initialization, but as it
> > +		 * potentially has other impacts, keep it here when setting up
> > +		 * the IRQs.
> > +		 */
> > +		pci_set_master(pdev);
> > +		rc = devm_add_action_or_reset(dev,
> > +					      cxl_pci_free_irq_vectors,
> > +					      pdev);
> > +		if (rc)
> > +			return rc;
> > +	}
> > +
> > +	pci_doe_for_each_off(pdev, off) {
> > +		struct auxiliary_device *adev;
> > +		struct cxl_doe_dev *new_dev;
> > +		int id;
> > +
> > +		new_dev = kzalloc(sizeof(*new_dev), GFP_KERNEL);
> > +		if (!new_dev)
> > +			return -ENOMEM;
> > +
> > +		new_dev->pdev = pdev;
> > +		new_dev->cap_offset = off;
> > +		new_dev->use_irq = use_irq;
> > +
> > +		/* Set up struct auxiliary_device */
> > +		adev = &new_dev->adev;
> > +		id = ida_alloc(&pci_doe_adev_ida, GFP_KERNEL);
> > +		if (id < 0) {
> > +			kfree(new_dev);
> > +			return -ENOMEM;
> > +		}
> > +
> > +		adev->id = id;
> > +		adev->name = DOE_DEV_NAME;
> > +		adev->dev.release = cxl_pci_doe_dev_release;
> > +		adev->dev.parent = dev;
> > +
> > +		if (auxiliary_device_init(adev)) {
> > +			cxl_pci_doe_dev_release(&adev->dev);
> > +			return -EIO;
> > +		}
> > +
> > +		if (auxiliary_device_add(adev)) {
> > +			auxiliary_device_uninit(adev);
> > +			return -EIO;
> > +		}
> > +
> > +		rc = devm_add_action_or_reset(dev, cxl_pci_doe_destroy_device,
> > +					      adev);
> > +		if (rc)
> > +			return rc;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  {
> >  	struct cxl_register_map map;
> > @@ -630,6 +753,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  	if (rc)
> >  		return rc;
> >  
> > +	rc = cxl_pci_create_doe_devices(pdev);
> > +	if (rc)
> > +		return rc;
> > +
> >  	cxl_dvsec_ranges(cxlds);
> >  
> >  	cxlmd = devm_cxl_add_memdev(cxlds);
> 


* Re: [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-04-14 20:32 ` [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes ira.weiny
@ 2022-04-28 21:27   ` Bjorn Helgaas
  2022-05-02  5:36     ` ira.weiny
  2022-05-30 19:06   ` Lukas Wunner
  1 sibling, 1 reply; 42+ messages in thread
From: Bjorn Helgaas @ 2022-04-28 21:27 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Bjorn Helgaas, Jonathan Cameron, Christoph Hellwig,
	Alison Schofield, Vishal Verma, Ben Widawsky, linux-kernel,
	linux-cxl, linux-pci

On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Introduced in a PCI v6.0[1], DOE provides a config space based mailbox
 
  Introduced in PCIe r6.0, sec 6.30, DOE ...

> with standard protocol discovery.  Each mailbox is accessed through a
> DOE Extended Capability.
> 
> Each DOE mailbox must support the DOE discovery protocol in addition to
> any number of additional protocols.
> 
> Define core PCI functionality to manage a single PCI DOE mailbox at a
> defined config space offset.  Functionality includes creating, querying
> supported protocols of, submitting tasks to, and destroying the new state
> objects.
> 
> If interrupts are desired, interrupt vectors should be allocated prior
> to asking for irq's when creating a mailbox object.
                IRQs

> [1] https://members.pcisig.com/wg/PCI-SIG/document/16609

The link is only useful for PCI-SIG members, and only as long as the
SIG maintains this structure.  I think "PCIe r6.0" is sufficient.

> Cc: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Minor comments below.

> ---
> Changes from V7
> 	Add a Kconfig for this functionality
> 	Fix bug in pci_doe_supports_prot()
> 	Rebased on cxl-pending
> 
> Changes from V6
> 	Clean up signed off by lines
> 	Make this functionality all PCI library functions
> 	Clean up header files
> 	s/pci_doe_irq/pci_doe_irq_handler
> 	Use pci_{request,free}_irq
> 		Remove irq_name (maintained by pci_request_irq)
> 	Fix checks to use an irq
> 	Consistently use u16 for cap_offset
> 	Cleanup kdocs and comments
> 	Create a helper retire_cur_task() to handle locking of the
> 		current task pointer.
> 	Remove devm_ calls from PCI layer.
> 		The devm_ calls do not allow for the pci_doe_mb objects
> 		to be tied to an auxiliary device.  Leave it to the
> 		caller to use devm_ if desired.
> 	From Dan Williams
> 		s/cb/end_task/; Pass pci_doe_task to end_task
> 		Clarify exchange/task/request/response.
> 			Merge pci_doe_task and pci_doe_exchange into
> 			pci_doe_task which represents a single
> 			request/response task for the state machine to
> 			process.
> 		Simplify submitting work to the mailbox
> 			Replace pci_doe_exchange_sync() with
> 			pci_doe_submit_task() Consumers of the mailbox
> 			are now responsible for setting up callbacks
> 			within a task object and submitting them to the
> 			mailbox to be processed.
> 		Remove WARN_ON when task != NULL and be sure to abort that task.
> 		Convert abort/dead to atomic flags
> 		s/state_lock/task_lock to better define what the lock is
> 			protecting
> 		Remove all the auxiliary bus code from the PCI layer
> 			The PCI layer provides helpers to use the DOE
> 			Mailboxes.  Each subsystem can then use the
> 			helpers as they see fit.  The CXL layer in this
> 			series uses aux devices to manage the new
> 			pci_doe_mb objects.
> 
> 	From Bjorn
> 		Clarify the fact that DOE mailboxes are capabilities of
> 			the device.
> 		Code clean ups
> 		Cleanup Makefile
> 		Update references to PCI SIG spec v6.0
> 		Move this attribution here:
> 		This code is based on Jonathan's V4 series here:
> 		https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/
> 
> Changes from V5
> 	From Bjorn
> 		s/pci_WARN/pci_warn
> 			Add timeout period to print
> 		Trim to 80 chars
> 		Use Tabs for DOE define spacing
> 		Use %#x for clarity
> 	From Jonathan
> 		Addresses concerns about the order of unwinding stuff
> 		s/doe/doe_dev in pci_doe_exchange_sync
> 		Correct kernel Doc comment
> 		Move pci_doe_task_complete() down in the file.
> 		Rework pci_doe_irq()
> 			process STATUS_ERROR first
> 			Return IRQ_NONE if the irq is not processed
> 			Use PCI_DOE_STATUS_INT_STATUS explicitly to
> 				clear the irq
> 	Clean up goto label s/err_free_irqs/err_free_irq
> 	use devm_kzalloc for doe struct
> 	clean up error paths in pci_doe_probe
> 	s/pci_doe_drv/pci_doe
> 	remove include mutex.h
> 	remove device name and define, move it in the next patch which uses it
> 	use devm_kasprintf() for irq_name
> 	use devm_request_irq()
> 	remove pci_doe_unregister()
> 		[get/put]_device() were unneeded and with the use of
> 		devm_* this function can be removed completely.
> 	refactor pci_doe_register and s/pci_doe_register/pci_doe_reg_irq
> 		make this function just a registration of the irq and
> 		move pci_doe_abort() into pci_doe_probe()
> 	use devm_* to allocate the protocol array
> 
> Changes from Jonathan's V4
> 	Move the DOE MB code into the DOE auxiliary driver
> 	Remove Task List in favor of a wait queue
> 
> Changes from Ben
> 	remove CXL references
> 	propagate rc from pci functions on error
> 
> squash: make PCI_DOE optional based on selecting by those who need it.
> 
> squash: Create PCI library functions
> 
> Fix bug in protocol support check
> ---
>  drivers/pci/Kconfig           |   3 +
>  drivers/pci/Makefile          |   1 +
>  drivers/pci/pci-doe.c         | 607 ++++++++++++++++++++++++++++++++++
>  include/linux/pci-doe.h       | 119 +++++++
>  include/uapi/linux/pci_regs.h |  29 +-
>  5 files changed, 758 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/pci/pci-doe.c
>  create mode 100644 include/linux/pci-doe.h
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 133c73207782..b2f2e588a817 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -121,6 +121,9 @@ config XEN_PCIDEV_FRONTEND
>  config PCI_ATS
>  	bool
>  
> +config PCI_DOE
> +	bool
> +
>  config PCI_ECAM
>  	bool
>  
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 0da6b1ebc694..b609d8ad813d 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -31,6 +31,7 @@ obj-$(CONFIG_PCI_ECAM)		+= ecam.o
>  obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
>  obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
>  obj-$(CONFIG_VGA_ARB)		+= vgaarb.o
> +obj-$(CONFIG_PCI_DOE)		+= pci-doe.o

Is there value in the "pci-" prefix?  There are a few other files with
the prefix, but I can't remember why.  Most things are of the form
"doe.c".

>  # Endpoint library must be initialized before its users
>  obj-$(CONFIG_PCI_ENDPOINT)	+= endpoint/
> diff --git a/drivers/pci/pci-doe.c b/drivers/pci/pci-doe.c
> new file mode 100644
> index 000000000000..ccf936421d2a
> --- /dev/null
> +++ b/drivers/pci/pci-doe.c
> @@ -0,0 +1,607 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Data Object Exchange
> + * https://members.pcisig.com/wg/PCI-SIG/document/16609

I think "PCIe r6.0, sec 6.30" is more generally useful.

> + * Copyright (C) 2021 Huawei
> + *	Jonathan Cameron <Jonathan.Cameron@huawei.com>
> + *
> + * Copyright (C) 2022 Intel Corporation
> + *	Ira Weiny <ira.weiny@intel.com>
> + */
> +
> +#include <linux/bitfield.h>
> +#include <linux/delay.h>
> +#include <linux/jiffies.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pci-doe.h>
> +#include <linux/workqueue.h>
> +
> +#define PCI_DOE_PROTOCOL_DISCOVERY 0
> +
> +#define PCI_DOE_BUSY_MAX_RETRIES 16
> +#define PCI_DOE_POLL_INTERVAL (HZ / 128)
> +
> +/* Timeout of 1 second from 6.30.2 Operation, PCI Spec v6.0 */

Pedantic: the PCI specs use "r6.0", not "v6.0"; it's the reverse for ACPI,
of course.  Even worse, PCI uses "Revision" for the major, "Version" for
the minor.

> +#define PCI_DOE_TIMEOUT HZ
> +
> +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> +{
> +	struct pci_doe_mb *doe_mb = data;
> +	struct pci_dev *pdev = doe_mb->pdev;
> +	int offset = doe_mb->cap_offset;
> +	u32 val;
> +
> +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +
> +	/* Leave the error case to be handled outside IRQ */

s/irq/IRQ/ for similar comment and log message uses below.

> +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> +		return IRQ_HANDLED;
> +	}
> +
> +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> +					PCI_DOE_STATUS_INT_STATUS);
> +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> +		return IRQ_HANDLED;
> +	}
> +
> +	return IRQ_NONE;
> +}
> +
> +/*
> + * Only called when safe to directly access the DOE from
> + * doe_statemachine_work().  Outside access is not protected.  Users who
> + * perform such access are left with the pieces.
> + */
> +static void pci_doe_abort_start(struct pci_doe_mb *doe_mb)
> +{
> +	struct pci_dev *pdev = doe_mb->pdev;
> +	int offset = doe_mb->cap_offset;
> +	u32 val;
> +
> +	val = PCI_DOE_CTRL_ABORT;
> +	if (doe_mb->irq >= 0)
> +		val |= PCI_DOE_CTRL_INT_EN;
> +	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> +
> +	doe_mb->timeout_jiffies = jiffies + HZ;
> +	schedule_delayed_work(&doe_mb->statemachine, HZ);
> +}
> +
> +static int pci_doe_send_req(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)

Wrap to fit in 80 columns, like the rest of drivers/pci/*
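
i.e.:

  static int pci_doe_send_req(struct pci_doe_mb *doe_mb,
			      struct pci_doe_task *task)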

> +{
> +	struct pci_dev *pdev = doe_mb->pdev;
> +	int offset = doe_mb->cap_offset;
> +	u32 val;
> +	int i;
> +
> +	/*
> +	 * Check the DOE busy bit is not set. If it is set, this could indicate
> +	 * someone other than Linux (e.g. firmware) is using the mailbox. Note
> +	 * it is expected that firmware and OS will negotiate access rights via
> +	 * an as yet to be defined method.
> +	 */
> +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +	if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
> +		return -EBUSY;
> +
> +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> +		return -EIO;
> +
> +	/* Write DOE Header */
> +	val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, task->prot.vid) |
> +		FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, task->prot.type);
> +	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val);
> +	/* Length is 2 DW of header + length of payload in DW */
> +	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> +			       FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH,
> +					  2 + task->request_pl_sz /
> +						sizeof(u32)));
> +	for (i = 0; i < task->request_pl_sz / sizeof(u32); i++)
> +		pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> +				       task->request_pl[i]);
> +
> +	val = PCI_DOE_CTRL_GO;
> +	if (doe_mb->irq >= 0)
> +		val |= PCI_DOE_CTRL_INT_EN;
> +
> +	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> +	/* Request is sent - now wait for poll or IRQ */
> +	return 0;
> +}
> +
> +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
> +{
> +	struct pci_dev *pdev = doe_mb->pdev;
> +	int offset = doe_mb->cap_offset;
> +	size_t length;
> +	u32 val;
> +	int i;
> +
> +	/* Read the first dword to get the protocol */
> +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> +	if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != task->prot.vid) ||
> +	    (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != task->prot.type)) {
> +		pci_err(pdev,
> +			"Expected [VID, Protocol] = [%#x, %#x], got [%#x, %#x]\n",

Needs a "DOE" or similar hint for where to look when you see this
error.

"%04x" is the typical way we print a PCI vendor ID.  I guess a leading
"0x" isn't a disaster, but I do think it should be 4 hex digits so it
doesn't look like a length or something.

It looks like you anticipate multiple DOE capabilities.  Maybe
messages should include the capability offset to be specific?
I see you print with %u in pci_doe_create_mb().  I'm pretty sure
"lspci -vv" prints them in hex, and it's probably useful to match.

> +			task->prot.vid, task->prot.type,
> +			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
> +			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
> +		return -EIO;
> +	}
> +
> +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +	/* Read the second dword to get the length */
> +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +
> +	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> +	if (length > SZ_1M || length < 2)
> +		return -EIO;
> +
> +	/* First 2 dwords have already been read */
> +	length -= 2;
> +	/* Read the rest of the response payload */
> +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> +				      &task->response_pl[i]);
> +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +	}
> +
> +	/* Flush excess length */
> +	for (; i < length; i++) {
> +		pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +	}
> +	/* Final error check to pick up on any error since Data Object Ready */
> +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> +		return -EIO;
> +
> +	return min(length, task->response_pl_sz / sizeof(u32)) * sizeof(u32);
> +}
> +
> +static void signal_task_complete(struct pci_doe_task *task, int rv)
> +{
> +	task->rv = rv;
> +	task->complete(task);
> +}
> +
> +static void retire_cur_task(struct pci_doe_mb *doe_mb)
> +{
> +	mutex_lock(&doe_mb->task_lock);
> +	doe_mb->cur_task = NULL;
> +	mutex_unlock(&doe_mb->task_lock);
> +	wake_up_interruptible(&doe_mb->wq);
> +}
> +
> +static void doe_statemachine_work(struct work_struct *work)
> +{
> +	struct delayed_work *w = to_delayed_work(work);
> +	struct pci_doe_mb *doe_mb = container_of(w, struct pci_doe_mb,
> +						 statemachine);
> +	struct pci_dev *pdev = doe_mb->pdev;
> +	int offset = doe_mb->cap_offset;
> +	struct pci_doe_task *task;
> +	u32 val;
> +	int rc;
> +
> +	mutex_lock(&doe_mb->task_lock);
> +	task = doe_mb->cur_task;
> +	mutex_unlock(&doe_mb->task_lock);
> +
> +	if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags)) {
> +		/*
> +		 * Currently only used during init - care needed if
> +		 * pci_doe_abort() is generally exposed as it would impact
> +		 * queries in flight.
> +		 */
> +		if (task)
> +			pr_err("Aborting with active task!\n");

This should be a dev_err() and probably include "DOE" or some kind of
hint about what this applies to.
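
e.g.:

	pci_err(pdev, "DOE [%x]: aborting with active task\n",
		doe_mb->cap_offset);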

> +		doe_mb->state = DOE_WAIT_ABORT;
> +		pci_doe_abort_start(doe_mb);
> +		return;
> +	}
> +
> +	switch (doe_mb->state) {
> +	case DOE_IDLE:
> +		if (task == NULL)
> +			return;
> +
> +		rc = pci_doe_send_req(doe_mb, task);

Add blank line before block comment.

> +		/*
> +		 * The specification does not provide any guidance on how long
> +		 * some other entity could keep the DOE busy, so try for 1
> +		 * second then fail. Busy handling is best effort only, because
> +		 * there is no way of avoiding racing against another user of
> +		 * the DOE.
> +		 */
> +		if (rc == -EBUSY) {
> +			doe_mb->busy_retries++;
> +			if (doe_mb->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
> +				/* Long enough, fail this request */
> +				pci_warn(pdev,
> +					"DOE busy for too long (> 1 sec)\n");
> +				doe_mb->busy_retries = 0;
> +				goto err_busy;
> +			}
> +			schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES);
> +			return;
> +		}
> +		if (rc)
> +			goto err_abort;
> +		doe_mb->busy_retries = 0;
> +
> +		doe_mb->state = DOE_WAIT_RESP;
> +		doe_mb->timeout_jiffies = jiffies + HZ;
> +		/* Now poll or wait for IRQ with timeout */
> +		if (doe_mb->irq >= 0)
> +			schedule_delayed_work(w, PCI_DOE_TIMEOUT);
> +		else
> +			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> +		return;
> +
> +	case DOE_WAIT_RESP:
> +		/* Not possible to get here with NULL task */
> +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +		if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> +			rc = -EIO;
> +			goto err_abort;
> +		}
> +
> +		if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
> +			/* If not yet at timeout reschedule otherwise abort */
> +			if (time_after(jiffies, doe_mb->timeout_jiffies)) {
> +				rc = -ETIMEDOUT;
> +				goto err_abort;
> +			}
> +			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> +			return;
> +		}
> +
> +		rc  = pci_doe_recv_resp(doe_mb, task);
> +		if (rc < 0)
> +			goto err_abort;
> +
> +		doe_mb->state = DOE_IDLE;
> +
> +		retire_cur_task(doe_mb);
> +		/* Set the return value to the length of received payload */
> +		signal_task_complete(task, rc);
> +
> +		return;
> +
> +	case DOE_WAIT_ABORT:
> +	case DOE_WAIT_ABORT_ON_ERR:
> +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +
> +		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> +		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> +			/* Back to normal state - carry on */
> +			retire_cur_task(doe_mb);
> +
> +			/*
> +			 * For deliberately triggered abort, someone is
> +			 * waiting.
> +			 */
> +			if (doe_mb->state == DOE_WAIT_ABORT) {
> +				if (task)
> +					signal_task_complete(task, -EFAULT);
> +				complete(&doe_mb->abort_c);
> +			}
> +
> +			doe_mb->state = DOE_IDLE;
> +			return;
> +		}
> +		if (time_after(jiffies, doe_mb->timeout_jiffies)) {
> +			/* Task has timed out and is dead - abort */
> +			pci_err(pdev, "DOE ABORT timed out\n");
> +			set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
> +			retire_cur_task(doe_mb);
> +
> +			if (doe_mb->state == DOE_WAIT_ABORT) {
> +				if (task)
> +					signal_task_complete(task, -EFAULT);
> +				complete(&doe_mb->abort_c);
> +			}
> +		}
> +		return;
> +	}
> +
> +err_abort:
> +	doe_mb->state = DOE_WAIT_ABORT_ON_ERR;
> +	pci_doe_abort_start(doe_mb);
> +err_busy:
> +	signal_task_complete(task, rc);
> +	if (doe_mb->state == DOE_IDLE)
> +		retire_cur_task(doe_mb);
> +}
> +
> +static void pci_doe_task_complete(struct pci_doe_task *task)
> +{
> +	complete(task->private);
> +}
> +
> +static int pci_doe_discovery(struct pci_doe_mb *doe_mb, u8 *index, u16 *vid,
> +			     u8 *protocol)
> +{
> +	u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX,
> +				    *index);
> +	u32 response_pl;
> +	DECLARE_COMPLETION_ONSTACK(c);
> +	struct pci_doe_task task = {
> +		.prot.vid = PCI_VENDOR_ID_PCI_SIG,
> +		.prot.type = PCI_DOE_PROTOCOL_DISCOVERY,
> +		.request_pl = &request_pl,
> +		.request_pl_sz = sizeof(request_pl),
> +		.response_pl = &response_pl,
> +		.response_pl_sz = sizeof(response_pl),
> +		.complete = pci_doe_task_complete,
> +		.private = &c,
> +	};
> +	int ret;
> +
> +	ret = pci_doe_submit_task(doe_mb, &task);
> +	if (ret < 0)
> +		return ret;
> +
> +	wait_for_completion(&c);
> +
> +	if (task.rv != sizeof(response_pl))
> +		return -EIO;
> +
> +	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> +	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> +			      response_pl);
> +	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> +			   response_pl);
> +
> +	return 0;
> +}
> +
> +static int pci_doe_cache_protocols(struct pci_doe_mb *doe_mb)
> +{
> +	u8 index = 0;
> +	int num_prots;
> +	int rc;
> +
> +	/* Discovery protocol must always be supported and must report itself */
> +	num_prots = 1;
> +
> +	doe_mb->prots = kcalloc(num_prots, sizeof(*doe_mb->prots), GFP_KERNEL);
> +	if (!doe_mb->prots)
> +		return -ENOMEM;
> +
> +	/*
> +	 * NOTE: doe_mb_prots is freed by pci_doe_free_mb automatically on

"pci_doe_free_mb()" to match other function name references.

> +	 * error if pci_doe_cache_protocols() fails past this point.
> +	 */
> +	do {
> +		struct pci_doe_protocol *prot;
> +
> +		prot = &doe_mb->prots[num_prots - 1];
> +		rc = pci_doe_discovery(doe_mb, &index, &prot->vid, &prot->type);
> +		if (rc)
> +			return rc;
> +
> +		if (index) {
> +			struct pci_doe_protocol *prot_new;
> +
> +			num_prots++;
> +			prot_new = krealloc(doe_mb->prots,
> +					    sizeof(*doe_mb->prots) * num_prots,
> +					    GFP_KERNEL);
> +			if (!prot_new)
> +				return -ENOMEM;
> +
> +			doe_mb->prots = prot_new;
> +		}
> +	} while (index);
> +
> +	doe_mb->num_prots = num_prots;
> +	return 0;
> +}
> +
> +static int pci_doe_abort(struct pci_doe_mb *doe_mb)
> +{
> +	reinit_completion(&doe_mb->abort_c);
> +	set_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags);
> +	schedule_delayed_work(&doe_mb->statemachine, 0);
> +	wait_for_completion(&doe_mb->abort_c);
> +
> +	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags))
> +		return -EIO;
> +
> +	return 0;
> +}
> +
> +static int pci_doe_request_irq(struct pci_doe_mb *doe_mb)
> +{
> +	struct pci_dev *pdev = doe_mb->pdev;
> +	int offset = doe_mb->cap_offset;
> +	int doe_irq, rc;
> +	u32 val;
> +
> +	pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
> +
> +	if (!FIELD_GET(PCI_DOE_CAP_INT, val))
> +		return -EOPNOTSUPP;
> +
> +	doe_irq = FIELD_GET(PCI_DOE_CAP_IRQ, val);
> +	rc = pci_request_irq(pdev, doe_irq, pci_doe_irq_handler,
> +			     NULL, doe_mb,
> +			     "DOE[%d:%s]", doe_irq, pci_name(pdev));
> +	if (rc)
> +		return rc;
> +
> +	doe_mb->irq = doe_irq;
> +	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
> +			       PCI_DOE_CTRL_INT_EN);
> +	return 0;
> +}
> +
> +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb)
> +{
> +	if (doe_mb->irq >= 0)
> +		pci_free_irq(doe_mb->pdev, doe_mb->irq, doe_mb);
> +	kfree(doe_mb->prots);
> +	kfree(doe_mb);
> +}
> +
> +/**
> + * pci_doe_create_mb() - Create a DOE mailbox object
> + *
> + * @pdev: PCI device to create the DOE mailbox for
> + * @cap_offset: Offset of the DOE mailbox
> + * @use_irq: Should the state machine use an irq
> + *
> + * Create a single mailbox object to manage the mailbox protocol at the
> + * cap_offset specified.
> + *
> + * Caller should allocate PCI irq vectors before setting use_irq.
> + *
> + * RETURNS: created mailbox object on success
> + *	    ERR_PTR(-errno) on failure
> + */
> +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> +				     bool use_irq)
> +{
> +	struct pci_doe_mb *doe_mb;
> +	int rc;
> +
> +	doe_mb = kzalloc(sizeof(*doe_mb), GFP_KERNEL);
> +	if (!doe_mb)
> +		return ERR_PTR(-ENOMEM);
> +
> +	doe_mb->pdev = pdev;
> +	init_completion(&doe_mb->abort_c);
> +	doe_mb->irq = -1;
> +	doe_mb->cap_offset = cap_offset;
> +
> +	init_waitqueue_head(&doe_mb->wq);
> +	mutex_init(&doe_mb->task_lock);
> +	INIT_DELAYED_WORK(&doe_mb->statemachine, doe_statemachine_work);
> +	doe_mb->state = DOE_IDLE;
> +
> +	if (use_irq) {
> +		rc = pci_doe_request_irq(doe_mb);
> +		if (rc)
> +			pci_err(pdev, "DOE request irq failed for mailbox @ %u : %d\n",
> +				cap_offset, rc);
> +	}
> +
> +	/* Reset the mailbox by issuing an abort */
> +	rc = pci_doe_abort(doe_mb);
> +	if (rc) {
> +		pci_err(pdev, "DOE failed to reset the mailbox @ %u : %d\n",
> +			cap_offset, rc);
> +		pci_doe_free_mb(doe_mb);
> +		return ERR_PTR(rc);
> +	}
> +
> +	rc = pci_doe_cache_protocols(doe_mb);
> +	if (rc) {
> +		pci_err(pdev, "DOE failed to cache protocols for mailbox @ %u : %d\n",
> +			cap_offset, rc);
> +		pci_doe_free_mb(doe_mb);
> +		return ERR_PTR(rc);
> +	}
> +
> +	return doe_mb;
> +}
> +EXPORT_SYMBOL_GPL(pci_doe_create_mb);
> +
> +/**
> + * pci_doe_supports_prot() - Return if the DOE instance supports the given
> + *			     protocol
> + * @doe_mb: DOE mailbox capability to query
> + * @vid: Protocol Vendor ID
> + * @type: protocol type
             Protocol
> + *
> + * RETURNS: True if the DOE device supports the protocol specified

I think you typically use "DOE mailbox", not "DOE device".

> + */
> +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type)
> +{
> +	int i;
> +
> +	/* The discovery protocol must always be supported */
> +	if (vid == PCI_VENDOR_ID_PCI_SIG && type == PCI_DOE_PROTOCOL_DISCOVERY)
> +		return true;
> +
> +	for (i = 0; i < doe_mb->num_prots; i++)
> +		if ((doe_mb->prots[i].vid == vid) &&
> +		    (doe_mb->prots[i].type == type))
> +			return true;
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
> +
> +/**
> + * pci_doe_submit_task() - Submit a task to be processed by the state machine
> + *
> + * @doe_mb: DOE mailbox capability to submit to
> + * @task: task to be queued
> + *
> + * Submit a DOE task (request/response) to the DOE mailbox to be processed.
> + * Returns upon queueing the task object.  If the queue is full this function
> + * will sleep until there is room in the queue.
> + *
> + * task->complete will be called when the state machine is done processing this
> + * task.
> + *
> + * Excess data will be discarded.
> + *
> + * RETURNS: 0 when task has been successfully queued, -ERRNO on error
> + */
> +int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
> +{
> +	if (!pci_doe_supports_prot(doe_mb, task->prot.vid, task->prot.type))
> +		return -EINVAL;
> +
> +	/* DOE requests must be a whole number of DW */
> +	if (task->request_pl_sz % sizeof(u32))
> +		return -EINVAL;
> +
> +again:
> +	mutex_lock(&doe_mb->task_lock);
> +	if (doe_mb->cur_task) {
> +		mutex_unlock(&doe_mb->task_lock);
> +		wait_event_interruptible(doe_mb->wq, doe_mb->cur_task == NULL);
> +		goto again;
> +	}
> +
> +	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
> +		mutex_unlock(&doe_mb->task_lock);
> +		return -EIO;
> +	}
> +	doe_mb->cur_task = task;
> +	mutex_unlock(&doe_mb->task_lock);
> +	schedule_delayed_work(&doe_mb->statemachine, 0);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_doe_submit_task);
> +
> +/**
> + * pci_doe_destroy_mb() - Destroy a DOE mailbox object created with
> + * pci_doe_create_mb()
> + *
> + * @doe_mb: DOE mailbox capability structure to destroy
> + *
> + * The mailbox becomes invalid and should not be used after this call.
> + */
> +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb)
> +{
> +	/* abort any work in progress */
> +	pci_doe_abort(doe_mb);
> +
> +	/* halt the state machine */
> +	cancel_delayed_work_sync(&doe_mb->statemachine);
> +
> +	pci_doe_free_mb(doe_mb);
> +}
> +EXPORT_SYMBOL_GPL(pci_doe_destroy_mb);
> diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
> new file mode 100644
> index 000000000000..7e6ebaf9930a
> --- /dev/null
> +++ b/include/linux/pci-doe.h
> @@ -0,0 +1,119 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Data Object Exchange
> + * https://members.pcisig.com/wg/PCI-SIG/document/16609

Ditto.

> + *
> + * Copyright (C) 2021 Huawei
> + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> + *
> + * Copyright (C) 2022 Intel Corporation
> + *	Ira Weiny <ira.weiny@intel.com>
> + */
> +
> +#ifndef LINUX_PCI_DOE_H
> +#define LINUX_PCI_DOE_H
> +
> +#include <linux/completion.h>
> +
> +enum pci_doe_state {
> +	DOE_IDLE,
> +	DOE_WAIT_RESP,
> +	DOE_WAIT_ABORT,
> +	DOE_WAIT_ABORT_ON_ERR,
> +};
> +
> +#define PCI_DOE_FLAG_ABORT	0
> +#define PCI_DOE_FLAG_DEAD	1
> +
> +struct pci_doe_protocol {
> +	u16 vid;
> +	u8 type;
> +};
> +
> +/**
> + * struct pci_doe_task - represents a single query/response
> + *
> + * @prot: DOE Protocol
> + * @request_pl: The request payload
> + * @request_pl_sz: Size of the request payload
> + * @response_pl: The response payload
> + * @response_pl_sz: Size of the response payload
> + * @rv: Return value.  Length of received response or error
> + * @complete: Called when task is complete
> + * @private: Private data for the consumer
> + */
> +struct pci_doe_task {
> +	struct pci_doe_protocol prot;
> +	u32 *request_pl;
> +	size_t request_pl_sz;
> +	u32 *response_pl;
> +	size_t response_pl_sz;
> +	int rv;
> +	void (*complete)(struct pci_doe_task *task);
> +	void *private;
> +};
> +
> +/**
> + * struct pci_doe_mb - State for a single DOE mailbox
> + *
> + * This state is used to manage a single DOE mailbox capability.  All fields
> + * should be considered opaque to the consumers; the structure is passed into
> + * the helpers below after being created by pci_doe_create_mb().
> + *
> + * @pdev: PCI device this mailbox belongs to
> + * @abort_c: Completion used for initial abort handling
> + * @irq: Interrupt used for signaling DOE ready or abort
> + * @prots: Array of protocols supported on this DOE
> + * @num_prots: Size of prots array
> + * @cap_offset: Capability offset
> + * @wq: Wait queue to wait on if a query is in progress
> + * @cur_task: Current task the state machine is working on
> + * @task_lock: Protect cur_task
> + * @statemachine: Work item for the DOE state machine
> + * @state: Current state of this DOE
> + * @timeout_jiffies: 1 second after GO set
> + * @busy_retries: Count of retry attempts
> + * @flags: Bit array of PCI_DOE_FLAG_* flags
> + *
> + * Note: prots can't be allocated with struct size because the number of
> + * protocols is not known until after this structure is in use.  However, the
> + * single discovery protocol is always required to query for the number of
> + * protocols.
> + */
> +struct pci_doe_mb {
> +	struct pci_dev *pdev;
> +	struct completion abort_c;
> +	int irq;
> +	struct pci_doe_protocol *prots;
> +	int num_prots;
> +	u16 cap_offset;
> +
> +	wait_queue_head_t wq;
> +	struct pci_doe_task *cur_task;
> +	struct mutex task_lock;
> +	struct delayed_work statemachine;
> +	enum pci_doe_state state;
> +	unsigned long timeout_jiffies;
> +	unsigned int busy_retries;
> +	unsigned long flags;
> +};
> +
> +/**
> + * pci_doe_for_each_off - Iterate each DOE capability
> + * @pdev: struct pci_dev to iterate
> + * @off: u16 of config space offset of each mailbox capability found
> + */
> +#define pci_doe_for_each_off(pdev, off) \
> +	for (off = pci_find_next_ext_capability(pdev, off, \
> +					PCI_EXT_CAP_ID_DOE); \
> +		off > 0; \
> +		off = pci_find_next_ext_capability(pdev, off, \
> +					PCI_EXT_CAP_ID_DOE))
> +
> +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> +				     bool use_irq);
> +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb);
> +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);
> +int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task);
> +
> +#endif
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index bee1a9ed6e66..4e96b45ee36d 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -736,7 +736,8 @@
>  #define PCI_EXT_CAP_ID_DVSEC	0x23	/* Designated Vendor-Specific */
>  #define PCI_EXT_CAP_ID_DLF	0x25	/* Data Link Feature */
>  #define PCI_EXT_CAP_ID_PL_16GT	0x26	/* Physical Layer 16.0 GT/s */
> -#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_PL_16GT
> +#define PCI_EXT_CAP_ID_DOE	0x2E	/* Data Object Exchange */
> +#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_DOE
>  
>  #define PCI_EXT_CAP_DSN_SIZEOF	12
>  #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
> @@ -1102,4 +1103,30 @@
>  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK		0x000000F0
>  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT	4
>  
> +/* Data Object Exchange */
> +#define PCI_DOE_CAP		0x04    /* DOE Capabilities Register */
> +#define  PCI_DOE_CAP_INT			0x00000001  /* Interrupt Support */
> +#define  PCI_DOE_CAP_IRQ			0x00000ffe  /* Interrupt Message Number */
> +#define PCI_DOE_CTRL		0x08    /* DOE Control Register */
> +#define  PCI_DOE_CTRL_ABORT			0x00000001  /* DOE Abort */
> +#define  PCI_DOE_CTRL_INT_EN			0x00000002  /* DOE Interrupt Enable */
> +#define  PCI_DOE_CTRL_GO			0x80000000  /* DOE Go */
> +#define PCI_DOE_STATUS		0x0c    /* DOE Status Register */
> +#define  PCI_DOE_STATUS_BUSY			0x00000001  /* DOE Busy */
> +#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */
> +#define  PCI_DOE_STATUS_ERROR			0x00000004  /* DOE Error */
> +#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */
> +#define PCI_DOE_WRITE		0x10    /* DOE Write Data Mailbox Register */
> +#define PCI_DOE_READ		0x14    /* DOE Read Data Mailbox Register */
> +
> +/* DOE Data Object - note not actually registers */
> +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
> +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
> +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
> +
> +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
> +
>  #endif /* LINUX_PCI_REGS_H */
> -- 
> 2.35.1
> 


* Re: [PATCH V8 04/10] cxl/pci: Create auxiliary devices for each DOE mailbox
  2022-04-14 20:32 ` [PATCH V8 04/10] cxl/pci: Create auxiliary devices for each DOE mailbox ira.weiny
  2022-04-27 17:19   ` Jonathan Cameron
@ 2022-04-29 15:55   ` Jonathan Cameron
  2022-04-29 17:20     ` Ira Weiny
  1 sibling, 1 reply; 42+ messages in thread
From: Jonathan Cameron @ 2022-04-29 15:55 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, 14 Apr 2022 13:32:31 -0700
ira.weiny@intel.com wrote:

> From: Ira Weiny <ira.weiny@intel.com>
> 
> CXL kernel drivers optionally need to access DOE mailbox capabilities.
> Access to mailboxes for things such as CDAT, SPDM, and IDE are needed by
> the kernel while other access is designed towards user space usage.  An
> example of this is for CXL Compliance Testing (see CXL 2.0 14.16.4
> Compliance Mode DOE) which offers a mechanism to set different test
> modes for a device.
> 
> There is no anticipated need for the kernel to share an individual
> mailbox with user space.  Thus developing an interface to marshal access
> between the kernel and user space for a single mailbox is unnecessary
> overhead.  However, having the kernel relinquish some mailboxes to be
> controlled by user space is a reasonable compromise to share access to
> the device.
> 
> The auxiliary bus provides an elegant solution for this.  Each DOE
> capability is given its own auxiliary device.  This device is controlled
> by a kernel driver by default which restricts access to the mailbox.
> Unbinding the driver from a single auxiliary device (DOE mailbox
> capability) frees the mailbox for user space access.  This architecture
> also allows a clear picture of which mailboxes are kernel controlled and
> which are not.
> 
> Iterate each DOE mailbox capability and create auxiliary bus devices.
> Follow on patches will define a driver for the newly created devices.
> 
> sysfs shows the devices.
> 
> $ ls -l /sys/bus/auxiliary/devices/
> total 0
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.0 -> ../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.1 -> ../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.2 -> ../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.3 -> ../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.4 -> ../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.4
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.5 -> ../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.6 -> ../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
> lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.7 -> ../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Oops. I just replied to v7. The point still stands though.
Repeated briefly below.

> 
> ---
> Changes from V7:
> 	Minor code clean ups
> 	Rebased on cxl-pending
> 
> Changes from V6:
> 	Move all the auxiliary device stuff to the CXL layer
> 
> Changes from V5:
> 	Split the CXL specific stuff off from the PCI DOE create
> 	auxiliary device code.
> ---
>  drivers/cxl/Kconfig  |   1 +
>  drivers/cxl/cxlpci.h |  21 +++++++
>  drivers/cxl/pci.c    | 127 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 149 insertions(+)
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index f64e3984689f..ac0f5ca95431 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -16,6 +16,7 @@ if CXL_BUS
>  config CXL_PCI
>  	tristate "PCI manageability"
>  	default CXL_BUS
> +	select AUXILIARY_BUS
>  	help
>  	  The CXL specification defines a "CXL memory device" sub-class in the
>  	  PCI "memory controller" base class of devices. Device's identified by
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 329e7ea3f36a..2ad8715173ce 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -2,6 +2,7 @@
>  /* Copyright(c) 2020 Intel Corporation. All rights reserved. */
>  #ifndef __CXL_PCI_H__
>  #define __CXL_PCI_H__
> +#include <linux/auxiliary_bus.h>
>  #include <linux/pci.h>
>  #include "cxl.h"
>  
> @@ -72,4 +73,24 @@ static inline resource_size_t cxl_regmap_to_base(struct pci_dev *pdev,
>  }
>  
>  int devm_cxl_port_enumerate_dports(struct cxl_port *port);
> +
> +/**
> + * struct cxl_doe_dev - CXL DOE auxiliary bus device
> + *
> + * @adev: Auxiliary bus device
> + * @pdev: PCI device this belongs to
> + * @cap_offset: Capability offset
> + * @use_irq: Set if IRQs are to be used with this mailbox
> + *
> + * This represents a single DOE mailbox device.  CXL devices should create this
> + * device and register it on the Auxiliary bus for the CXL DOE driver to drive.
> + */
> +struct cxl_doe_dev {
> +	struct auxiliary_device adev;
> +	struct pci_dev *pdev;
> +	int cap_offset;
> +	bool use_irq;
> +};
> +#define DOE_DEV_NAME "doe"
> +
>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index e7ab9a34d718..41a6f3eb0a5c 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -8,6 +8,7 @@
>  #include <linux/mutex.h>
>  #include <linux/list.h>
>  #include <linux/pci.h>
> +#include <linux/pci-doe.h>
>  #include <linux/io.h>
>  #include "cxlmem.h"
>  #include "cxlpci.h"
> @@ -564,6 +565,128 @@ static void cxl_dvsec_ranges(struct cxl_dev_state *cxlds)
>  	info->ranges = __cxl_dvsec_ranges(cxlds, info);
>  }
>  
> +static void cxl_pci_free_irq_vectors(void *data)
> +{
> +	pci_free_irq_vectors(data);
> +}
> +
> +static DEFINE_IDA(pci_doe_adev_ida);
> +
> +static void cxl_pci_doe_dev_release(struct device *dev)
> +{
> +	struct auxiliary_device *adev = container_of(dev,
> +						struct auxiliary_device,
> +						dev);
> +	struct cxl_doe_dev *doe_dev = container_of(adev, struct cxl_doe_dev,
> +						   adev);
> +
> +	ida_free(&pci_doe_adev_ida, adev->id);
> +	kfree(doe_dev);
> +}
> +
> +static void cxl_pci_doe_destroy_device(void *ad)
> +{
> +	auxiliary_device_delete(ad);
> +	auxiliary_device_uninit(ad);
> +}
> +
> +/**
> + * cxl_pci_create_doe_devices - Create auxiliary bus DOE devices for all DOE
> + *				mailboxes found
> + *
> + * @pci_dev: The PCI device to scan for DOE mailboxes
> + *
> + * There is no corresponding destroy of these devices.  This function associates
> + * the DOE auxiliary devices created with the pci_dev passed in.  That
> + * association is device managed (devm_*) such that the DOE auxiliary device
> + * lifetime is always less than or equal to the lifetime of the pci_dev.
> + *
> + * RETURNS: 0 on success -ERRNO on failure.
> + */
> +static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
> +{
> +	struct device *dev = &pdev->dev;
> +	bool use_irq = true;
> +	int irqs = 0;
> +	u16 off = 0;
> +	int rc;
> +
> +	pci_doe_for_each_off(pdev, off)
> +		irqs++;

I believe this is insufficient because there may be other IRQs in use
on the device.  In a similar fashion to that done in pcie/portdrv_core.c,
I think we need to instead find the maximum MSI/MSI-X vector number by
reading the config space.  Then we request one more vector than that max
value to ensure the vector we need has been requested.
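
Roughly this (sketch only, untested, and the function name is invented):

	/*
	 * Find the largest interrupt message number any DOE capability
	 * on this device reports so that the caller can allocate
	 * max_irq + 1 vectors and know the reported vector exists.
	 */
	static int cxl_pci_doe_max_irq(struct pci_dev *pdev)
	{
		int max_irq = -1;
		u16 off = 0;
		u32 val;

		pci_doe_for_each_off(pdev, off) {
			pci_read_config_dword(pdev, off + PCI_DOE_CAP, &val);
			if (FIELD_GET(PCI_DOE_CAP_INT, val))
				max_irq = max(max_irq,
					      (int)FIELD_GET(PCI_DOE_CAP_IRQ, val));
		}

		return max_irq;
	}

(and the real thing would also need to account for any non-DOE vectors the
driver itself uses, like portdrv does).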

Jonathan

> +	pci_info(pdev, "Found %d DOE mailboxes\n", irqs);
> +
> +	/*
> +	 * Allocate enough vectors for the DOEs
> +	 */
> +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSI |
> +						     PCI_IRQ_MSIX);
> +	if (rc != irqs) {
> +		pci_err(pdev,
> +			"Not enough interrupts for all the DOEs; use polling\n");
> +		use_irq = false;
> +		/* Some got allocated; clean them up */
> +		if (rc > 0)
> +			cxl_pci_free_irq_vectors(pdev);
> +	} else {
> +		/*
> +		 * Enabling bus mastering is required for MSI/MSI-X.  It could be
> +		 * done later within the DOE initialization, but as it
> +		 * potentially has other impacts, keep it here when setting up
> +		 * the IRQs.
> +		 */
> +		pci_set_master(pdev);
> +		rc = devm_add_action_or_reset(dev,
> +					      cxl_pci_free_irq_vectors,
> +					      pdev);
> +		if (rc)
> +			return rc;
> +	}
> +
> +	pci_doe_for_each_off(pdev, off) {
> +		struct auxiliary_device *adev;
> +		struct cxl_doe_dev *new_dev;
> +		int id;
> +
> +		new_dev = kzalloc(sizeof(*new_dev), GFP_KERNEL);
> +		if (!new_dev)
> +			return -ENOMEM;
> +
> +		new_dev->pdev = pdev;
> +		new_dev->cap_offset = off;
> +		new_dev->use_irq = use_irq;
> +
> +		/* Set up struct auxiliary_device */
> +		adev = &new_dev->adev;
> +		id = ida_alloc(&pci_doe_adev_ida, GFP_KERNEL);
> +		if (id < 0) {
> +			kfree(new_dev);
> +			return -ENOMEM;
> +		}
> +
> +		adev->id = id;
> +		adev->name = DOE_DEV_NAME;
> +		adev->dev.release = cxl_pci_doe_dev_release;
> +		adev->dev.parent = dev;
> +
> +		if (auxiliary_device_init(adev)) {
> +			cxl_pci_doe_dev_release(&adev->dev);
> +			return -EIO;
> +		}
> +
> +		if (auxiliary_device_add(adev)) {
> +			auxiliary_device_uninit(adev);
> +			return -EIO;
> +		}
> +
> +		rc = devm_add_action_or_reset(dev, cxl_pci_doe_destroy_device,
> +					      adev);
> +		if (rc)
> +			return rc;
> +	}
> +
> +	return 0;
> +}
> +
>  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  {
>  	struct cxl_register_map map;
> @@ -630,6 +753,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>  
> +	rc = cxl_pci_create_doe_devices(pdev);
> +	if (rc)
> +		return rc;
> +
>  	cxl_dvsec_ranges(cxlds);
>  
>  	cxlmd = devm_cxl_add_memdev(cxlds);



* Re: [PATCH V8 04/10] cxl/pci: Create auxiliary devices for each DOE mailbox
  2022-04-28 21:09     ` ira.weiny
@ 2022-04-29 16:38       ` Jonathan Cameron
  2022-04-29 17:01         ` Dan Williams
  0 siblings, 1 reply; 42+ messages in thread
From: Jonathan Cameron @ 2022-04-29 16:38 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, 28 Apr 2022 14:09:38 -0700
ira.weiny@intel.com wrote:

> On Wed, Apr 27, 2022 at 06:19:42PM +0100, Jonathan Cameron wrote:
> > On Thu, 14 Apr 2022 13:32:31 -0700
> > ira.weiny@intel.com wrote:
> >   
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > CXL kernel drivers optionally need to access DOE mailbox capabilities.
> > > Access to mailboxes for things such as CDAT, SPDM, and IDE are needed by
> > > the kernel while other access is designed towards user space usage.  An
> > > example of this is for CXL Compliance Testing (see CXL 2.0 14.16.4
> > > Compliance Mode DOE) which offers a mechanism to set different test
> > > modes for a device.
> > > 
> > > There is no anticipated need for the kernel to share an individual
> > > mailbox with user space.  Thus developing an interface to marshal access
> > > between the kernel and user space for a single mailbox is unnecessary
> > > overhead.  However, having the kernel relinquish some mailboxes to be
> > > controlled by user space is a reasonable compromise to share access to
> > > the device.
> > > 
> > > The auxiliary bus provides an elegant solution for this.  Each DOE
> > > capability is given its own auxiliary device.  This device is controlled
> > > by a kernel driver by default which restricts access to the mailbox.
> > > Unbinding the driver from a single auxiliary device (DOE mailbox
> > > capability) frees the mailbox for user space access.  This architecture
> > > also allows a clear picture of which mailboxes are kernel controlled and
> > > which are not.
> > > 
> > > Iterate each DOE mailbox capability and create auxiliary bus devices.
> > > Follow on patches will define a driver for the newly created devices.
> > > 
> > > sysfs shows the devices.
> > > 
> > > $ ls -l /sys/bus/auxiliary/devices/
> > > total 0
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.0 -> ../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.1 -> ../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.2 -> ../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.3 -> ../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.4 -> ../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.4
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.5 -> ../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.6 -> ../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
> > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.7 -> ../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7
> > > 
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>  
> > 
> > I'm not 100% happy with effectively having one solution for CXL
> > and probably a different one for DOEs on switch ports
> > (which I just hacked into a port service driver to convince
> > myself there was at least one plausible way of doing that) but if
> > this effectively separates the two discussions then I guess I can
> > live with it for now ;)  
> 
> I took some time this morning to mull this over and talk to Dan...
> 
> :-(
> 
> Truthfully the aux driver does very little except provide a way for admins to
> trigger the driver to stop/start accessing the Mailbox.
> 
> I suppose a simple sysfs interface could be done to do the same?
> 
> I'll let Dan weigh in here.

I wonder if the best short term option is to not provide a means of
removing it at all (separate from the PCI driver, that is).
Then we can take our time to decide on the interface if we ever
get much demand for one.

> 
> > 
> > Once this is merged we can start the discussion about how to
> > handle switch ports with DOEs both for CDAT and SPDM.  
> 
> I'm ok with that too.  However, I was thinking that this was not a user ABI.
> But it really is.  If user space starts writing scripts to unbind drivers and
> then we drop the aux driver support, it will break them...
> 
> > 
> > I'll send out an RFC that is so hideous it will get people to
> > suggest how to do it better!
> 
> I think I'd like to see that.

Fair enough. It may muddy the waters a bit :( I'll send an RFC
next week.  I've not looked at how the CXL region code etc would
actually get to the latency / bandwidth info from the driver yet; it
just goes as far as reading a CDAT length. I also want to actually
hook up some decent switch CDAT emulation in the QEMU code
(right now it's giving the same default table as for a type 3 device).

I just hope we don't bikeshed around the RFC in a fashion that slows
this series moving forwards.

> 
> > Currently it starts and
> > stops the mailbox 3 times in the registration path and I think
> > it's more luck than judgement that is landing me with the right
> > MSI.
> > 
> > Anyhow, this looks good to me.
> > 
> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>  
> 
> Thanks!
> 
> The good news is that the main support for the algorithm is now just part of
> the PCI core as library functions.  I think this discussion is a pretty
> light lift: it is mostly a matter of moving the calls to those functions
> around...

Absolutely - one nasty bit of code that seems to be almost agreed on -
definite progress!

Jonathan

> 
> Ira
> 
> > 
> >   
> > > 
> > > ---
> > > Changes from V7:
> > > 	Minor code clean ups
> > > 	Rebased on cxl-pending
> > > 
> > > Changes from V6:
> > > 	Move all the auxiliary device stuff to the CXL layer
> > > 
> > > Changes from V5:
> > > 	Split the CXL specific stuff off from the PCI DOE create
> > > 	auxiliary device code.
> > > ---
> > >  drivers/cxl/Kconfig  |   1 +
> > >  drivers/cxl/cxlpci.h |  21 +++++++
> > >  drivers/cxl/pci.c    | 127 +++++++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 149 insertions(+)
> > > 
> > > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > > index f64e3984689f..ac0f5ca95431 100644
> > > --- a/drivers/cxl/Kconfig
> > > +++ b/drivers/cxl/Kconfig
> > > @@ -16,6 +16,7 @@ if CXL_BUS
> > >  config CXL_PCI
> > >  	tristate "PCI manageability"
> > >  	default CXL_BUS
> > > +	select AUXILIARY_BUS
> > >  	help
> > >  	  The CXL specification defines a "CXL memory device" sub-class in the
> > >  	  PCI "memory controller" base class of devices. Device's identified by
> > > diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> > > index 329e7ea3f36a..2ad8715173ce 100644
> > > --- a/drivers/cxl/cxlpci.h
> > > +++ b/drivers/cxl/cxlpci.h
> > > @@ -2,6 +2,7 @@
> > >  /* Copyright(c) 2020 Intel Corporation. All rights reserved. */
> > >  #ifndef __CXL_PCI_H__
> > >  #define __CXL_PCI_H__
> > > +#include <linux/auxiliary_bus.h>
> > >  #include <linux/pci.h>
> > >  #include "cxl.h"
> > >  
> > > @@ -72,4 +73,24 @@ static inline resource_size_t cxl_regmap_to_base(struct pci_dev *pdev,
> > >  }
> > >  
> > >  int devm_cxl_port_enumerate_dports(struct cxl_port *port);
> > > +
> > > +/**
> > > + * struct cxl_doe_dev - CXL DOE auxiliary bus device
> > > + *
> > > + * @adev: Auxiliary bus device
> > > + * @pdev: PCI device this belongs to
> > > + * @cap_offset: Capability offset
> > > + * @use_irq: Set if IRQs are to be used with this mailbox
> > > + *
> > > + * This represents a single DOE mailbox device.  CXL devices should create this
> > > + * device and register it on the Auxiliary bus for the CXL DOE driver to drive.
> > > + */
> > > +struct cxl_doe_dev {
> > > +	struct auxiliary_device adev;
> > > +	struct pci_dev *pdev;
> > > +	int cap_offset;
> > > +	bool use_irq;
> > > +};
> > > +#define DOE_DEV_NAME "doe"
> > > +
> > >  #endif /* __CXL_PCI_H__ */
> > > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > > index e7ab9a34d718..41a6f3eb0a5c 100644
> > > --- a/drivers/cxl/pci.c
> > > +++ b/drivers/cxl/pci.c
> > > @@ -8,6 +8,7 @@
> > >  #include <linux/mutex.h>
> > >  #include <linux/list.h>
> > >  #include <linux/pci.h>
> > > +#include <linux/pci-doe.h>
> > >  #include <linux/io.h>
> > >  #include "cxlmem.h"
> > >  #include "cxlpci.h"
> > > @@ -564,6 +565,128 @@ static void cxl_dvsec_ranges(struct cxl_dev_state *cxlds)
> > >  	info->ranges = __cxl_dvsec_ranges(cxlds, info);
> > >  }
> > >  
> > > +static void cxl_pci_free_irq_vectors(void *data)
> > > +{
> > > +	pci_free_irq_vectors(data);
> > > +}
> > > +
> > > +static DEFINE_IDA(pci_doe_adev_ida);
> > > +
> > > +static void cxl_pci_doe_dev_release(struct device *dev)
> > > +{
> > > +	struct auxiliary_device *adev = container_of(dev,
> > > +						struct auxiliary_device,
> > > +						dev);
> > > +	struct cxl_doe_dev *doe_dev = container_of(adev, struct cxl_doe_dev,
> > > +						   adev);
> > > +
> > > +	ida_free(&pci_doe_adev_ida, adev->id);
> > > +	kfree(doe_dev);
> > > +}
> > > +
> > > +static void cxl_pci_doe_destroy_device(void *ad)
> > > +{
> > > +	auxiliary_device_delete(ad);
> > > +	auxiliary_device_uninit(ad);
> > > +}
> > > +
> > > +/**
> > > + * cxl_pci_create_doe_devices - Create auxiliary bus DOE devices for all DOE
> > > + *				mailboxes found
> > > + *
> > > + * @pdev: The PCI device to scan for DOE mailboxes
> > > + *
> > > + * There is no corresponding destroy of these devices.  This function associates
> > > + * the DOE auxiliary devices created with the pci_dev passed in.  That
> > > + * association is device managed (devm_*) such that the DOE auxiliary device
> > > + * lifetime is always less than or equal to the lifetime of the pci_dev.
> > > + *
> > > + * RETURNS: 0 on success, -ERRNO on failure.
> > > + */
> > > +static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
> > > +{
> > > +	struct device *dev = &pdev->dev;
> > > +	bool use_irq = true;
> > > +	int irqs = 0;
> > > +	u16 off = 0;
> > > +	int rc;
> > > +
> > > +	pci_doe_for_each_off(pdev, off)
> > > +		irqs++;
> > > +	pci_info(pdev, "Found %d DOE mailboxes\n", irqs);
> > > +
> > > +	/*
> > > +	 * Allocate enough vectors for the DOE's
> > > +	 */
> > > +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSI |
> > > +						     PCI_IRQ_MSIX);
> > > +	if (rc != irqs) {
> > > +		pci_err(pdev,
> > > +			"Not enough interrupts for all the DOEs; use polling\n");
> > > +		use_irq = false;
> > > +		/* Some got allocated; clean them up */
> > > +		if (rc > 0)
> > > +			cxl_pci_free_irq_vectors(pdev);
> > > +	} else {
> > > +		/*
> > > +	 * Enabling bus mastering is required for MSI/MSIx.  It could be
> > > +		 * done later within the DOE initialization, but as it
> > > +		 * potentially has other impacts keep it here when setting up
> > > +		 * the IRQ's.
> > > +		 */
> > > +		pci_set_master(pdev);
> > > +		rc = devm_add_action_or_reset(dev,
> > > +					      cxl_pci_free_irq_vectors,
> > > +					      pdev);
> > > +		if (rc)
> > > +			return rc;
> > > +	}
> > > +
> > > +	pci_doe_for_each_off(pdev, off) {
> > > +		struct auxiliary_device *adev;
> > > +		struct cxl_doe_dev *new_dev;
> > > +		int id;
> > > +
> > > +		new_dev = kzalloc(sizeof(*new_dev), GFP_KERNEL);
> > > +		if (!new_dev)
> > > +			return -ENOMEM;
> > > +
> > > +		new_dev->pdev = pdev;
> > > +		new_dev->cap_offset = off;
> > > +		new_dev->use_irq = use_irq;
> > > +
> > > +		/* Set up struct auxiliary_device */
> > > +		adev = &new_dev->adev;
> > > +		id = ida_alloc(&pci_doe_adev_ida, GFP_KERNEL);
> > > +		if (id < 0) {
> > > +			kfree(new_dev);
> > > +			return -ENOMEM;
> > > +		}
> > > +
> > > +		adev->id = id;
> > > +		adev->name = DOE_DEV_NAME;
> > > +		adev->dev.release = cxl_pci_doe_dev_release;
> > > +		adev->dev.parent = dev;
> > > +
> > > +		if (auxiliary_device_init(adev)) {
> > > +			cxl_pci_doe_dev_release(&adev->dev);
> > > +			return -EIO;
> > > +		}
> > > +
> > > +		if (auxiliary_device_add(adev)) {
> > > +			auxiliary_device_uninit(adev);
> > > +			return -EIO;
> > > +		}
> > > +
> > > +		rc = devm_add_action_or_reset(dev, cxl_pci_doe_destroy_device,
> > > +					      adev);
> > > +		if (rc)
> > > +			return rc;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > >  {
> > >  	struct cxl_register_map map;
> > > @@ -630,6 +753,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > >  	if (rc)
> > >  		return rc;
> > >  
> > > +	rc = cxl_pci_create_doe_devices(pdev);
> > > +	if (rc)
> > > +		return rc;
> > > +
> > >  	cxl_dvsec_ranges(cxlds);
> > >  
> > >  	cxlmd = devm_cxl_add_memdev(cxlds);  
> >   


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 04/10] cxl/pci: Create auxiliary devices for each DOE mailbox
  2022-04-29 16:38       ` Jonathan Cameron
@ 2022-04-29 17:01         ` Dan Williams
  2022-05-03 16:14           ` Jonathan Cameron
  0 siblings, 1 reply; 42+ messages in thread
From: Dan Williams @ 2022-04-29 17:01 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Weiny, Ira, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, Linux Kernel Mailing List, linux-cxl, Linux PCI

On Fri, Apr 29, 2022 at 9:39 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Thu, 28 Apr 2022 14:09:38 -0700
> ira.weiny@intel.com wrote:
>
> > On Wed, Apr 27, 2022 at 06:19:42PM +0100, Jonathan Cameron wrote:
> > > On Thu, 14 Apr 2022 13:32:31 -0700
> > > ira.weiny@intel.com wrote:
> > >
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > >
> > > > CXL kernel drivers optionally need to access DOE mailbox capabilities.
> > > > Access to mailboxes for things such as CDAT, SPDM, and IDE are needed by
> > > > the kernel while other access is designed towards user space usage.  An
> > > > example of this is for CXL Compliance Testing (see CXL 2.0 14.16.4
> > > > Compliance Mode DOE) which offers a mechanism to set different test
> > > > modes for a device.
> > > >
> > > > There is no anticipated need for the kernel to share an individual
> > > > mailbox with user space.  Thus developing an interface to marshal access
> > > > between the kernel and user space for a single mailbox is unnecessary
> > > > overhead.  However, having the kernel relinquish some mailboxes to be
> > > > controlled by user space is a reasonable compromise to share access to
> > > > the device.
> > > >
> > > > The auxiliary bus provides an elegant solution for this.  Each DOE
> > > > capability is given its own auxiliary device.  This device is controlled
> > > > by a kernel driver by default which restricts access to the mailbox.
> > > > Unbinding the driver from a single auxiliary device (DOE mailbox
> > > > capability) frees the mailbox for user space access.  This architecture
> > > > also allows a clear picture of which mailboxes are kernel controlled vs
> > > > not.
> > > >
> > > > Iterate each DOE mailbox capability and create auxiliary bus devices.
> > > > Follow on patches will define a driver for the newly created devices.
> > > >
> > > > sysfs shows the devices.
> > > >
> > > > $ ls -l /sys/bus/auxiliary/devices/
> > > > total 0
> > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.0 -> ../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
> > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.1 -> ../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
> > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.2 -> ../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
> > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.3 -> ../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
> > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.4 -> ../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.4
> > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.5 -> ../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
> > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.6 -> ../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
> > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.7 -> ../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7
> > > >
> > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > >
> > > I'm not 100% happy with effectively having one solution for CXL
> > > and probably a different one for DOEs on switch ports
> > > (which I just hacked into a port service driver to convince
> > > myself there was at least one plausible way of doing that) but if
> > > this effectively separates the two discussions then I guess I can
> > > live with it for now ;)
> >
> > I took some time this morning to mull this over and talk to Dan...
> >
> > :-(
> >
> > Truthfully the aux driver does very little except provide a way for admins to
> > trigger the driver to stop/start accessing the Mailbox.
> >
> > I suppose a simple sysfs interface could be done to do the same?
> >
> > I'll let Dan weigh in here.
>
> I wonder if the best short term option is to not provide a means of
> removing it at all (separate from the PCI driver, that is).
> Then we can take our time to decide on the interface if we ever
> get much demand for one.
>
> >
> > >
> > > Once this is merged we can start the discussion about how to
> > > handle switch ports with DOEs both for CDAT and SPDM.
> >
> > I'm ok with that too.  However, I was thinking that this was not a user ABI.
> > But it really is.  If user space starts writing scripts to unbind drivers and
> > then we drop the aux driver support it will break them...
> >
> > >
> > > I'll send out an RFC that is so hideous it will get people to
> > > suggest how to do it better!
> >
> > I think I'd like to see that.
>
> Fair enough. It may muddy the waters a bit :( I'll send an RFC
> next week.  I've not looked at how the CXL region code etc would
> actually get to the latency / bandwidth info from the driver yet;
> it just goes as far as reading a CDAT length.  I also want to actually
> hook up some decent switch CDAT emulation in the QEMU code
> (right now it's giving the same default table as for a type 3 device).
>
> I just hope we don't bikeshed around the RFC in a fashion that slows
> this series moving forwards.

I think we have time in the sense that the worst that happens is that
tooling picks the wrong CFMWS to dynamically create a region and the
performance ends up being sub-optimal. That's tolerable to work around
in userspace in the near term. I want to get some wider confidence in
the DOE ABI with respect to the known protocols and what to do about
the vendor-specific protocols that may conflict and will be driven
from userspace-issued config cycles.  That likely means that no DOE ABI
is the best ABI to start with, which means not moving forward with
aux-devices, so scripts do not become attached to something that is not
fully committed to being carried forward.

I still want to refresh the request_config_region() support so that, at
a minimum, the kernel warns on conflicting userspace configuration
writes to config areas claimed by a driver.
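
As a rough sketch of what I mean (the function name and signature are
hypothetical placeholders, not a committed interface):

/*
 * Hypothetical: a driver claims a config space range it owns so the
 * PCI core can warn when userspace config writes land inside it.
 */
int request_config_region(struct pci_dev *pdev, u16 offset, u16 size,
			  const char *name);

/* e.g. the DOE code could claim its mailbox registers up front: */
static void pci_doe_claim_config(struct pci_doe_mb *doe_mb)
{
	struct pci_dev *pdev = doe_mb->pdev;

	if (request_config_region(pdev, doe_mb->cap_offset,
				  PCI_DOE_READ + sizeof(u32), "DOE"))
		pci_warn(pdev, "DOE [%x]: config region already claimed\n",
			 doe_mb->cap_offset);
}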

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 04/10] cxl/pci: Create auxiliary devices for each DOE mailbox
  2022-04-29 15:55   ` Jonathan Cameron
@ 2022-04-29 17:20     ` Ira Weiny
  2022-05-03 15:32       ` Jonathan Cameron
  0 siblings, 1 reply; 42+ messages in thread
From: Ira Weiny @ 2022-04-29 17:20 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Fri, Apr 29, 2022 at 04:55:02PM +0100, Jonathan Cameron wrote:
> On Thu, 14 Apr 2022 13:32:31 -0700
> ira.weiny@intel.com wrote:
> 
> > From: Ira Weiny <ira.weiny@intel.com>
> > 

[snip]

> > +
> > +/**
> > + * cxl_pci_create_doe_devices - Create auxiliary bus DOE devices for all DOE
> > + *				mailboxes found
> > + *
> > + * @pdev: The PCI device to scan for DOE mailboxes
> > + *
> > + * There is no corresponding destroy of these devices.  This function associates
> > + * the DOE auxiliary devices created with the pci_dev passed in.  That
> > + * association is device managed (devm_*) such that the DOE auxiliary device
> > + * lifetime is always less than or equal to the lifetime of the pci_dev.
> > + *
> > + * RETURNS: 0 on success, -ERRNO on failure.
> > + */
> > +static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
> > +{
> > +	struct device *dev = &pdev->dev;
> > +	bool use_irq = true;
> > +	int irqs = 0;
> > +	u16 off = 0;
> > +	int rc;
> > +
> > +	pci_doe_for_each_off(pdev, off)
> > +		irqs++;
> I believe this is insufficient because there may be other irqs in use
> on the device.

I did not think that was true for any current CXL device.  From what I could
tell, all CXL devices would be covered by this simple calculation.  I left it
to whoever adds a new CXL device that needs other IRQs to lift this somewhere
that covers those allocations.  I probably should have made some comment.
Sorry.

> In a similar fashion to that done in pcie/portdrv_core.c
> I think we need to instead find the maximum msi/msix vector number
> by reading the config space.

I was not aware I could do that...

> Then we request one more vector
> than that max value to ensure the vector we need has been requested.

Yea at some point I figured this would be lifted out of here as part of a
larger 'allocate all the vectors for the device' function.

But for now this is the only place that needs irqs so I punted.  I can lift
this into something like

cxl_pci_alloc_irq_vectors(...) and then pass use_irq here.

But to move this series forward I would propose that
cxl_pci_alloc_irq_vectors() do what I'm doing here for now.
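
Something like this minimal sketch (hypothetical; it just wraps the
allocation logic already in this patch and tells the caller whether
IRQs can be used):

static bool cxl_pci_alloc_irq_vectors(struct pci_dev *pdev)
{
	int irqs = 0;
	u16 off = 0;
	int rc;

	/* One vector per DOE mailbox, as the current code assumes */
	pci_doe_for_each_off(pdev, off)
		irqs++;

	rc = pci_alloc_irq_vectors(pdev, irqs, irqs,
				   PCI_IRQ_MSI | PCI_IRQ_MSIX);
	if (rc != irqs) {
		/* Mirror the current patch: free any partial allocation */
		if (rc > 0)
			pci_free_irq_vectors(pdev);
		return false;	/* caller falls back to polling */
	}

	pci_set_master(pdev);	/* required for MSI/MSI-X */
	return true;
}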

Ira

> 
> Jonathan
> 
> > +	pci_info(pdev, "Found %d DOE mailboxes\n", irqs);
> > +
> > +	/*
> > +	 * Allocate enough vectors for the DOE's
> > +	 */
> > +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSI |
> > +						     PCI_IRQ_MSIX);
> > +	if (rc != irqs) {
> > +		pci_err(pdev,
> > +			"Not enough interrupts for all the DOEs; use polling\n");
> > +		use_irq = false;
> > +		/* Some got allocated; clean them up */
> > +		if (rc > 0)
> > +			cxl_pci_free_irq_vectors(pdev);
> > +	} else {
> > +		/*
> > +	 * Enabling bus mastering is required for MSI/MSIx.  It could be
> > +		 * done later within the DOE initialization, but as it
> > +		 * potentially has other impacts keep it here when setting up
> > +		 * the IRQ's.
> > +		 */
> > +		pci_set_master(pdev);
> > +		rc = devm_add_action_or_reset(dev,
> > +					      cxl_pci_free_irq_vectors,
> > +					      pdev);
> > +		if (rc)
> > +			return rc;
> > +	}
> > +
> > +	pci_doe_for_each_off(pdev, off) {
> > +		struct auxiliary_device *adev;
> > +		struct cxl_doe_dev *new_dev;
> > +		int id;
> > +
> > +		new_dev = kzalloc(sizeof(*new_dev), GFP_KERNEL);
> > +		if (!new_dev)
> > +			return -ENOMEM;
> > +
> > +		new_dev->pdev = pdev;
> > +		new_dev->cap_offset = off;
> > +		new_dev->use_irq = use_irq;
> > +
> > +		/* Set up struct auxiliary_device */
> > +		adev = &new_dev->adev;
> > +		id = ida_alloc(&pci_doe_adev_ida, GFP_KERNEL);
> > +		if (id < 0) {
> > +			kfree(new_dev);
> > +			return -ENOMEM;
> > +		}
> > +
> > +		adev->id = id;
> > +		adev->name = DOE_DEV_NAME;
> > +		adev->dev.release = cxl_pci_doe_dev_release;
> > +		adev->dev.parent = dev;
> > +
> > +		if (auxiliary_device_init(adev)) {
> > +			cxl_pci_doe_dev_release(&adev->dev);
> > +			return -EIO;
> > +		}
> > +
> > +		if (auxiliary_device_add(adev)) {
> > +			auxiliary_device_uninit(adev);
> > +			return -EIO;
> > +		}
> > +
> > +		rc = devm_add_action_or_reset(dev, cxl_pci_doe_destroy_device,
> > +					      adev);
> > +		if (rc)
> > +			return rc;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  {
> >  	struct cxl_register_map map;
> > @@ -630,6 +753,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  	if (rc)
> >  		return rc;
> >  
> > +	rc = cxl_pci_create_doe_devices(pdev);
> > +	if (rc)
> > +		return rc;
> > +
> >  	cxl_dvsec_ranges(cxlds);
> >  
> >  	cxlmd = devm_cxl_add_memdev(cxlds);
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-04-28 21:27   ` Bjorn Helgaas
@ 2022-05-02  5:36     ` ira.weiny
  0 siblings, 0 replies; 42+ messages in thread
From: ira.weiny @ 2022-05-02  5:36 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Dan Williams, Bjorn Helgaas, Jonathan Cameron, Christoph Hellwig,
	Alison Schofield, Vishal Verma, Ben Widawsky, linux-kernel,
	linux-cxl, linux-pci

On Thu, Apr 28, 2022 at 04:27:23PM -0500, Bjorn Helgaas wrote:
> On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 
> > Introduced in a PCI v6.0[1], DOE provides a config space based mailbox
>  
>   Introduced in PCIe r6.0, sec 6.30, DOE ...

Fixed.  Sorry I missed the fact that PCI uses 'r'.

> 
> > with standard protocol discovery.  Each mailbox is accessed through a
> > DOE Extended Capability.
> > 
> > Each DOE mailbox must support the DOE discovery protocol in addition to
> > any number of additional protocols.
> > 
> > Define core PCI functionality to manage a single PCI DOE mailbox at a
> > defined config space offset.  Functionality includes creating the new state
> > objects, querying their supported protocols, submitting tasks to them, and
> > destroying them.
> > 
> > If interrupts are desired, interrupts vectors should be allocated prior
> > to asking for irq's when creating a mailbox object.
>                 IRQs

Fixed.

> 
> > [1] https://members.pcisig.com/wg/PCI-SIG/document/16609
> 
> The link is only useful for PCI-SIG members, and only as long as the
> SIG maintains this structure.  I think "PCIe r6.0" is sufficient.

Removed.  thanks.

> 
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>

thanks!

> 
> Minor comments below.
> 

[snip]

> > ---
> >  drivers/pci/Kconfig           |   3 +
> >  drivers/pci/Makefile          |   1 +
> >  drivers/pci/pci-doe.c         | 607 ++++++++++++++++++++++++++++++++++
> >  include/linux/pci-doe.h       | 119 +++++++
> >  include/uapi/linux/pci_regs.h |  29 +-
> >  5 files changed, 758 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/pci/pci-doe.c
> >  create mode 100644 include/linux/pci-doe.h
> > 
> > diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> > index 133c73207782..b2f2e588a817 100644
> > --- a/drivers/pci/Kconfig
> > +++ b/drivers/pci/Kconfig
> > @@ -121,6 +121,9 @@ config XEN_PCIDEV_FRONTEND
> >  config PCI_ATS
> >  	bool
> >  
> > +config PCI_DOE
> > +	bool
> > +
> >  config PCI_ECAM
> >  	bool
> >  
> > diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> > index 0da6b1ebc694..b609d8ad813d 100644
> > --- a/drivers/pci/Makefile
> > +++ b/drivers/pci/Makefile
> > @@ -31,6 +31,7 @@ obj-$(CONFIG_PCI_ECAM)		+= ecam.o
> >  obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
> >  obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
> >  obj-$(CONFIG_VGA_ARB)		+= vgaarb.o
> > +obj-$(CONFIG_PCI_DOE)		+= pci-doe.o
> 
> Is there value in the "pci-" prefix?

Not really.

> There are a few other files with
> the prefix, but I can't remember why.  Most things are of the form
> "doe.c".

Because of the mixed bag I was not sure which was preferred.  I chose poorly.
I'll remove the prefix.

> 
> >  # Endpoint library must be initialized before its users
> >  obj-$(CONFIG_PCI_ENDPOINT)	+= endpoint/
> > diff --git a/drivers/pci/pci-doe.c b/drivers/pci/pci-doe.c
> > new file mode 100644
> > index 000000000000..ccf936421d2a
> > --- /dev/null
> > +++ b/drivers/pci/pci-doe.c
> > @@ -0,0 +1,607 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Data Object Exchange
> > + * https://members.pcisig.com/wg/PCI-SIG/document/16609
> 
> I think "PCIe r6.0, sec 6.30" is more generally useful.

Done.

> 
> > + * Copyright (C) 2021 Huawei
> > + *	Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > + *
> > + * Copyright (C) 2022 Intel Corporation
> > + *	Ira Weiny <ira.weiny@intel.com>
> > + */
> > +
> > +#include <linux/bitfield.h>
> > +#include <linux/delay.h>
> > +#include <linux/jiffies.h>
> > +#include <linux/mutex.h>
> > +#include <linux/pci.h>
> > +#include <linux/pci-doe.h>
> > +#include <linux/workqueue.h>
> > +
> > +#define PCI_DOE_PROTOCOL_DISCOVERY 0
> > +
> > +#define PCI_DOE_BUSY_MAX_RETRIES 16
> > +#define PCI_DOE_POLL_INTERVAL (HZ / 128)
> > +
> > +/* Timeout of 1 second from 6.30.2 Operation, PCI Spec v6.0 */
> 
> Pedantic: the PCI specs use "r6.0", not "v6.0".  Reverse for ACPI.  Of
> course.  Even worse, PCI uses "Revision" for the major, "Version" for
> the minor.

Thanks for letting me know.

> 
> > +#define PCI_DOE_TIMEOUT HZ
> > +
> > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > +{
> > +	struct pci_doe_mb *doe_mb = data;
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	u32 val;
> > +
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +	/* Leave the error case to be handled outside IRQ */
> 
> s/irq/IRQ/ for similar comment and log message uses below.

Done.

> 
> > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > +					PCI_DOE_STATUS_INT_STATUS);
> > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	return IRQ_NONE;
> > +}
> > +
> > +/*
> > + * Only called when safe to directly access the DOE from
> > + * doe_statemachine_work().  Outside access is not protected.  Users who
> > + * perform such access are left with the pieces.
> > + */
> > +static void pci_doe_abort_start(struct pci_doe_mb *doe_mb)
> > +{
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	u32 val;
> > +
> > +	val = PCI_DOE_CTRL_ABORT;
> > +	if (doe_mb->irq >= 0)
> > +		val |= PCI_DOE_CTRL_INT_EN;
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> > +
> > +	doe_mb->timeout_jiffies = jiffies + HZ;
> > +	schedule_delayed_work(&doe_mb->statemachine, HZ);
> > +}
> > +
> > +static int pci_doe_send_req(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
> 
> Wrap to fit in 80 columns, like the rest of drivers/pci/*

Done.

> 
> > +{
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	u32 val;
> > +	int i;
> > +
> > +	/*
> > +	 * Check the DOE busy bit is not set. If it is set, this could indicate
> > +	 * someone other than Linux (e.g. firmware) is using the mailbox. Note
> > +	 * it is expected that firmware and OS will negotiate access rights via
> > +	 * an as-yet-to-be-defined method.
> > +	 */
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +	if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
> > +		return -EBUSY;
> > +
> > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> > +		return -EIO;
> > +
> > +	/* Write DOE Header */
> > +	val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, task->prot.vid) |
> > +		FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, task->prot.type);
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val);
> > +	/* Length is 2 DW of header + length of payload in DW */
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> > +			       FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH,
> > +					  2 + task->request_pl_sz /
> > +						sizeof(u32)));
> > +	for (i = 0; i < task->request_pl_sz / sizeof(u32); i++)
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> > +				       task->request_pl[i]);
> > +
> > +	val = PCI_DOE_CTRL_GO;
> > +	if (doe_mb->irq >= 0)
> > +		val |= PCI_DOE_CTRL_INT_EN;
> > +
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> > +	/* Request is sent - now wait for poll or IRQ */
> > +	return 0;
> > +}
> > +
> > +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
> > +{
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	size_t length;
> > +	u32 val;
> > +	int i;
> > +
> > +	/* Read the first dword to get the protocol */
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +	if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != task->prot.vid) ||
> > +	    (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != task->prot.type)) {
> > +		pci_err(pdev,
> > +			"Expected [VID, Protocol] = [%#x, %#x], got [%#x, %#x]\n",
> 
> Needs a "DOE" or similar hint for where to look when you see this
> error.
> 
> "%04x" is the typical way we print a PCI vendor ID.  I guess a leading
> "0x" isn't a disaster, but I do think it should be 4 hex digits so it
> doesn't look like a length or something.

I'll match it up.  Using '%04x' for vendor ID and '%02x' for type.

> 
> It looks like you anticipate multiple DOE capabilities.  Maybe
> messages should include the capability offset to be specific?

Great idea, I've added 'DOE [%x]' as a prefix to all messages.

> I see you print with %u in pci_doe_create_mb().  I'm pretty sure
> "lspci -vv" prints them in hex, and it's probably useful to match.

Yes matching is good.  lspci does print in hex and without the 0x.
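
So the mismatch error above becomes something like this (illustrative,
using the identifiers already in pci_doe_recv_resp()):

	pci_err(pdev,
		"DOE [%x]: expected [VID, Protocol] = [%04x, %02x], got [%04x, %02x]\n",
		offset, task->prot.vid, task->prot.type,
		FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
		FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));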

> 
> > +			task->prot.vid, task->prot.type,
> > +			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
> > +			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
> > +		return -EIO;
> > +	}
> > +
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +	/* Read the second dword to get the length */
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +
> > +	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> > +	if (length > SZ_1M || length < 2)
> > +		return -EIO;
> > +
> > +	/* First 2 dwords have already been read */
> > +	length -= 2;
> > +	/* Read the rest of the response payload */
> > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > +				      &task->response_pl[i]);
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +	}
> > +
> > +	/* Flush excess length */
> > +	for (; i < length; i++) {
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +	}
> > +	/* Final error check to pick up on any since Data Object Ready */
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> > +		return -EIO;
> > +
> > +	return min(length, task->response_pl_sz / sizeof(u32)) * sizeof(u32);
> > +}
> > +
> > +static void signal_task_complete(struct pci_doe_task *task, int rv)
> > +{
> > +	task->rv = rv;
> > +	task->complete(task);
> > +}
> > +
> > +static void retire_cur_task(struct pci_doe_mb *doe_mb)
> > +{
> > +	mutex_lock(&doe_mb->task_lock);
> > +	doe_mb->cur_task = NULL;
> > +	mutex_unlock(&doe_mb->task_lock);
> > +	wake_up_interruptible(&doe_mb->wq);
> > +}
> > +
> > +static void doe_statemachine_work(struct work_struct *work)
> > +{
> > +	struct delayed_work *w = to_delayed_work(work);
> > +	struct pci_doe_mb *doe_mb = container_of(w, struct pci_doe_mb,
> > +						 statemachine);
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	struct pci_doe_task *task;
> > +	u32 val;
> > +	int rc;
> > +
> > +	mutex_lock(&doe_mb->task_lock);
> > +	task = doe_mb->cur_task;
> > +	mutex_unlock(&doe_mb->task_lock);
> > +
> > +	if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags)) {
> > +		/*
> > +		 * Currently only used during init - care needed if
> > +		 * pci_doe_abort() is generally exposed as it would impact
> > +		 * queries in flight.
> > +		 */
> > +		if (task)
> > +			pr_err("Aborting with active task!\n");
> 
> This should be a dev_err() and probably include "DOE" or some kind of
> hint about what this applies to.

I think you mean pci_err()?

I've updated all the pci_*() messages with 'DOE @ [offset]:' prefix.

> 
> > +		doe_mb->state = DOE_WAIT_ABORT;
> > +		pci_doe_abort_start(doe_mb);
> > +		return;
> > +	}
> > +
> > +	switch (doe_mb->state) {
> > +	case DOE_IDLE:
> > +		if (task == NULL)
> > +			return;
> > +
> > +		rc = pci_doe_send_req(doe_mb, task);
> 
> Add blank line before block comment.

done

> 
> > +		/*
> > +		 * The specification does not provide any guidance on how long
> > +		 * some other entity could keep the DOE busy, so try for 1
> > +		 * second then fail. Busy handling is best effort only, because
> > +		 * there is no way of avoiding racing against another user of
> > +		 * the DOE.
> > +		 */
> > +		if (rc == -EBUSY) {
> > +			doe_mb->busy_retries++;
> > +			if (doe_mb->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
> > +				/* Long enough, fail this request */
> > +				pci_warn(pdev,
> > +					"DOE busy for too long (> 1 sec)\n");
> > +				doe_mb->busy_retries = 0;
> > +				goto err_busy;
> > +			}
> > +			schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES);
> > +			return;
> > +		}
> > +		if (rc)
> > +			goto err_abort;
> > +		doe_mb->busy_retries = 0;
> > +
> > +		doe_mb->state = DOE_WAIT_RESP;
> > +		doe_mb->timeout_jiffies = jiffies + HZ;
> > +		/* Now poll or wait for IRQ with timeout */
> > +		if (doe_mb->irq >= 0)
> > +			schedule_delayed_work(w, PCI_DOE_TIMEOUT);
> > +		else
> > +			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> > +		return;
> > +
> > +	case DOE_WAIT_RESP:
> > +		/* Not possible to get here with NULL task */
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +		if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > +			rc = -EIO;
> > +			goto err_abort;
> > +		}
> > +
> > +		if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
> > +			/* If not yet at timeout reschedule otherwise abort */
> > +			if (time_after(jiffies, doe_mb->timeout_jiffies)) {
> > +				rc = -ETIMEDOUT;
> > +				goto err_abort;
> > +			}
> > +			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> > +			return;
> > +		}
> > +
> > +		rc  = pci_doe_recv_resp(doe_mb, task);
> > +		if (rc < 0)
> > +			goto err_abort;
> > +
> > +		doe_mb->state = DOE_IDLE;
> > +
> > +		retire_cur_task(doe_mb);
> > +		/* Set the return value to the length of received payload */
> > +		signal_task_complete(task, rc);
> > +
> > +		return;
> > +
> > +	case DOE_WAIT_ABORT:
> > +	case DOE_WAIT_ABORT_ON_ERR:
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> > +		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> > +			/* Back to normal state - carry on */
> > +			retire_cur_task(doe_mb);
> > +
> > +			/*
> > +			 * For deliberately triggered abort, someone is
> > +			 * waiting.
> > +			 */
> > +			if (doe_mb->state == DOE_WAIT_ABORT) {
> > +				if (task)
> > +					signal_task_complete(task, -EFAULT);
> > +				complete(&doe_mb->abort_c);
> > +			}
> > +
> > +			doe_mb->state = DOE_IDLE;
> > +			return;
> > +		}
> > +		if (time_after(jiffies, doe_mb->timeout_jiffies)) {
> > +			/* Task has timed out and is dead - abort */
> > +			pci_err(pdev, "DOE ABORT timed out\n");
> > +			set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
> > +			retire_cur_task(doe_mb);
> > +
> > +			if (doe_mb->state == DOE_WAIT_ABORT) {
> > +				if (task)
> > +					signal_task_complete(task, -EFAULT);
> > +				complete(&doe_mb->abort_c);
> > +			}
> > +		}
> > +		return;
> > +	}
> > +
> > +err_abort:
> > +	doe_mb->state = DOE_WAIT_ABORT_ON_ERR;
> > +	pci_doe_abort_start(doe_mb);
> > +err_busy:
> > +	signal_task_complete(task, rc);
> > +	if (doe_mb->state == DOE_IDLE)
> > +		retire_cur_task(doe_mb);
> > +}
> > +
> > +static void pci_doe_task_complete(struct pci_doe_task *task)
> > +{
> > +	complete(task->private);
> > +}
> > +
> > +static int pci_doe_discovery(struct pci_doe_mb *doe_mb, u8 *index, u16 *vid,
> > +			     u8 *protocol)
> > +{
> > +	u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX,
> > +				    *index);
> > +	u32 response_pl;
> > +	DECLARE_COMPLETION_ONSTACK(c);
> > +	struct pci_doe_task task = {
> > +		.prot.vid = PCI_VENDOR_ID_PCI_SIG,
> > +		.prot.type = PCI_DOE_PROTOCOL_DISCOVERY,
> > +		.request_pl = &request_pl,
> > +		.request_pl_sz = sizeof(request_pl),
> > +		.response_pl = &response_pl,
> > +		.response_pl_sz = sizeof(response_pl),
> > +		.complete = pci_doe_task_complete,
> > +		.private = &c,
> > +	};
> > +	int ret;
> > +
> > +	ret = pci_doe_submit_task(doe_mb, &task);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	wait_for_completion(&c);
> > +
> > +	if (task.rv != sizeof(response_pl))
> > +		return -EIO;
> > +
> > +	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> > +	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> > +			      response_pl);
> > +	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> > +			   response_pl);
> > +
> > +	return 0;
> > +}
> > +
> > +static int pci_doe_cache_protocols(struct pci_doe_mb *doe_mb)
> > +{
> > +	u8 index = 0;
> > +	int num_prots;
> > +	int rc;
> > +
> > +	/* Discovery protocol must always be supported and must report itself */
> > +	num_prots = 1;
> > +
> > +	doe_mb->prots = kcalloc(num_prots, sizeof(*doe_mb->prots), GFP_KERNEL);
> > +	if (!doe_mb->prots)
> > +		return -ENOMEM;
> > +
> > +	/*
> > +	 * NOTE: doe_mb_prots is freed by pci_doe_free_mb automatically on
> 
> "pci_doe_free_mb()" to match other function name references.

oops...  done.

> 
> > +	 * error if pci_doe_cache_protocols() fails past this point.
> > +	 */
> > +	do {
> > +		struct pci_doe_protocol *prot;
> > +
> > +		prot = &doe_mb->prots[num_prots - 1];
> > +		rc = pci_doe_discovery(doe_mb, &index, &prot->vid, &prot->type);
> > +		if (rc)
> > +			return rc;
> > +
> > +		if (index) {
> > +			struct pci_doe_protocol *prot_new;
> > +
> > +			num_prots++;
> > +			prot_new = krealloc(doe_mb->prots,
> > +					    sizeof(*doe_mb->prots) * num_prots,
> > +					    GFP_KERNEL);
> > +			if (!prot_new)
> > +				return -ENOMEM;
> > +
> > +			doe_mb->prots = prot_new;
> > +		}
> > +	} while (index);
> > +
> > +	doe_mb->num_prots = num_prots;
> > +	return 0;
> > +}
> > +
> > +static int pci_doe_abort(struct pci_doe_mb *doe_mb)
> > +{
> > +	reinit_completion(&doe_mb->abort_c);
> > +	set_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags);
> > +	schedule_delayed_work(&doe_mb->statemachine, 0);
> > +	wait_for_completion(&doe_mb->abort_c);
> > +
> > +	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags))
> > +		return -EIO;
> > +
> > +	return 0;
> > +}
> > +
> > +static int pci_doe_request_irq(struct pci_doe_mb *doe_mb)
> > +{
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	int doe_irq, rc;
> > +	u32 val;
> > +
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
> > +
> > +	if (!FIELD_GET(PCI_DOE_CAP_INT, val))
> > +		return -EOPNOTSUPP;
> > +
> > +	doe_irq = FIELD_GET(PCI_DOE_CAP_IRQ, val);
> > +	rc = pci_request_irq(pdev, doe_irq, pci_doe_irq_handler,
> > +			     NULL, doe_mb,
> > +			     "DOE[%d:%s]", doe_irq, pci_name(pdev));
> > +	if (rc)
> > +		return rc;
> > +
> > +	doe_mb->irq = doe_irq;
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
> > +			       PCI_DOE_CTRL_INT_EN);
> > +	return 0;
> > +}
> > +
> > +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb)
> > +{
> > +	if (doe_mb->irq >= 0)
> > +		pci_free_irq(doe_mb->pdev, doe_mb->irq, doe_mb);
> > +	kfree(doe_mb->prots);
> > +	kfree(doe_mb);
> > +}
> > +
> > +/**
> > + * pci_doe_create_mb() - Create a DOE mailbox object
> > + *
> > + * @pdev: PCI device to create the DOE mailbox for
> > + * @cap_offset: Offset of the DOE mailbox
> > + * @use_irq: Should the state machine use an irq
> > + *
> > + * Create a single mailbox object to manage the mailbox protocol at the
> > + * cap_offset specified.
> > + *
> > + * Caller should allocate PCI irq vectors before setting use_irq.
> > + *
> > + * RETURNS: created mailbox object on success
> > + *	    ERR_PTR(-errno) on failure
> > + */
> > +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> > +				     bool use_irq)
> > +{
> > +	struct pci_doe_mb *doe_mb;
> > +	int rc;
> > +
> > +	doe_mb = kzalloc(sizeof(*doe_mb), GFP_KERNEL);
> > +	if (!doe_mb)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	doe_mb->pdev = pdev;
> > +	init_completion(&doe_mb->abort_c);
> > +	doe_mb->irq = -1;
> > +	doe_mb->cap_offset = cap_offset;
> > +
> > +	init_waitqueue_head(&doe_mb->wq);
> > +	mutex_init(&doe_mb->task_lock);
> > +	INIT_DELAYED_WORK(&doe_mb->statemachine, doe_statemachine_work);
> > +	doe_mb->state = DOE_IDLE;
> > +
> > +	if (use_irq) {
> > +		rc = pci_doe_request_irq(doe_mb);
> > +		if (rc)
> > +			pci_err(pdev, "DOE request irq failed for mailbox @ %u : %d\n",

This and below are changed to print:

"DOE [%x] ..." rather than printing the offset at the end.

This makes these consistent with the other errors per your feedback.

It also made these messages more concise.  :-D

> > +				cap_offset, rc);
> > +	}
> > +
> > +	/* Reset the mailbox by issuing an abort */
> > +	rc = pci_doe_abort(doe_mb);
> > +	if (rc) {
> > +		pci_err(pdev, "DOE failed to reset the mailbox @ %u : %d\n",
> > +			cap_offset, rc);
> > +		pci_doe_free_mb(doe_mb);
> > +		return ERR_PTR(rc);
> > +	}
> > +
> > +	rc = pci_doe_cache_protocols(doe_mb);
> > +	if (rc) {
> > +		pci_err(pdev, "DOE failed to cache protocols for mailbox @ %u : %d\n",
> > +			cap_offset, rc);
> > +		pci_doe_free_mb(doe_mb);
> > +		return ERR_PTR(rc);
> > +	}
> > +
> > +	return doe_mb;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_doe_create_mb);
> > +
> > +/**
> > + * pci_doe_supports_prot() - Return if the DOE instance supports the given
> > + *			     protocol
> > + * @doe_mb: DOE mailbox capability to query
> > + * @vid: Protocol Vendor ID
> > + * @type: protocol type
>              Protocol

Done.

> > + *
> > + * RETURNS: True if the DOE device supports the protocol specified
> 
> I think you typically use "DOE mailbox", not "DOE device".

Yep.  That was left over from the Auxiliary device stuff.  Fixed.

> 
> > + */
> > +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type)
> > +{
> > +	int i;
> > +
> > +	/* The discovery protocol must always be supported */
> > +	if (vid == PCI_VENDOR_ID_PCI_SIG && type == PCI_DOE_PROTOCOL_DISCOVERY)
> > +		return true;
> > +
> > +	for (i = 0; i < doe_mb->num_prots; i++)
> > +		if ((doe_mb->prots[i].vid == vid) &&
> > +		    (doe_mb->prots[i].type == type))
> > +			return true;
> > +
> > +	return false;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
> > +
> > +/**
> > + * pci_doe_submit_task() - Submit a task to be processed by the state machine
> > + *
> > + * @doe_mb: DOE mailbox capability to submit to
> > + * @task: task to be queued
> > + *
> > + * Submit a DOE task (request/response) to the DOE mailbox to be processed.
> > + * Returns upon queueing the task object.  If the queue is full this function
> > + * will sleep until there is room in the queue.
> > + *
> > + * task->complete will be called when the state machine is done processing this
> > + * task.
> > + *
> > + * Excess data will be discarded.
> > + *
> > + * RETURNS: 0 when task has been successfully queued, -ERRNO on error
> > + */
> > +int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
> > +{
> > +	if (!pci_doe_supports_prot(doe_mb, task->prot.vid, task->prot.type))
> > +		return -EINVAL;
> > +
> > +	/* DOE requests must be a whole number of DW */
> > +	if (task->request_pl_sz % sizeof(u32))
> > +		return -EINVAL;
> > +
> > +again:
> > +	mutex_lock(&doe_mb->task_lock);
> > +	if (doe_mb->cur_task) {
> > +		mutex_unlock(&doe_mb->task_lock);
> > +		wait_event_interruptible(doe_mb->wq, doe_mb->cur_task == NULL);
> > +		goto again;
> > +	}
> > +
> > +	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
> > +		mutex_unlock(&doe_mb->task_lock);
> > +		return -EIO;
> > +	}
> > +	doe_mb->cur_task = task;
> > +	mutex_unlock(&doe_mb->task_lock);
> > +	schedule_delayed_work(&doe_mb->statemachine, 0);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_doe_submit_task);
> > +
> > +/**
> > + * pci_doe_destroy_mb() - Destroy a DOE mailbox object created with
> > + * pci_doe_create_mb()
> > + *
> > + * @doe_mb: DOE mailbox capability structure to destroy
> > + *
> > + * The mailbox becomes invalid and should not be used after this call.
> > + */
> > +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb)
> > +{
> > +	/* abort any work in progress */
> > +	pci_doe_abort(doe_mb);
> > +
> > +	/* halt the state machine */
> > +	cancel_delayed_work_sync(&doe_mb->statemachine);
> > +
> > +	pci_doe_free_mb(doe_mb);
> > +}
> > +EXPORT_SYMBOL_GPL(pci_doe_destroy_mb);
> > diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
> > new file mode 100644
> > index 000000000000..7e6ebaf9930a
> > --- /dev/null
> > +++ b/include/linux/pci-doe.h
> > @@ -0,0 +1,119 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Data Object Exchange
> > + * https://members.pcisig.com/wg/PCI-SIG/document/16609
> 
> Ditto.

Done.

Thanks for the in depth review,
Ira

> 
> > + *
> > + * Copyright (C) 2021 Huawei
> > + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > + *
> > + * Copyright (C) 2022 Intel Corporation
> > + *	Ira Weiny <ira.weiny@intel.com>
> > + */
> > +
> > +#ifndef LINUX_PCI_DOE_H
> > +#define LINUX_PCI_DOE_H
> > +
> > +#include <linux/completion.h>
> > +
> > +enum pci_doe_state {
> > +	DOE_IDLE,
> > +	DOE_WAIT_RESP,
> > +	DOE_WAIT_ABORT,
> > +	DOE_WAIT_ABORT_ON_ERR,
> > +};
> > +
> > +#define PCI_DOE_FLAG_ABORT	0
> > +#define PCI_DOE_FLAG_DEAD	1
> > +
> > +struct pci_doe_protocol {
> > +	u16 vid;
> > +	u8 type;
> > +};
> > +
> > +/**
> > + * struct pci_doe_task - represents a single query/response
> > + *
> > + * @prot: DOE Protocol
> > + * @request_pl: The request payload
> > + * @request_pl_sz: Size of the request payload
> > + * @response_pl: The response payload
> > + * @response_pl_sz: Size of the response payload
> > + * @rv: Return value.  Length of received response or error
> > + * @complete: Called when task is complete
> > + * @private: Private data for the consumer
> > + */
> > +struct pci_doe_task {
> > +	struct pci_doe_protocol prot;
> > +	u32 *request_pl;
> > +	size_t request_pl_sz;
> > +	u32 *response_pl;
> > +	size_t response_pl_sz;
> > +	int rv;
> > +	void (*complete)(struct pci_doe_task *task);
> > +	void *private;
> > +};
> > +
> > +/**
> > + * struct pci_doe_mb - State for a single DOE mailbox
> > + *
> > + * This state is used to manage a single DOE mailbox capability.  All fields
> > + * should be considered opaque to the consumers and the structure passed into
> > + * the helpers below after being created by pci_doe_create_mb().
> > + *
> > + * @pdev: PCI device this mailbox belongs to
> > + * @abort_c: Completion used for initial abort handling
> > + * @irq: Interrupt used for signaling DOE ready or abort
> > + * @prots: Array of protocols supported on this DOE
> > + * @num_prots: Size of prots array
> > + * @cap_offset: Capability offset
> > + * @wq: Wait queue to wait on if a query is in progress
> > + * @cur_task: Current task the state machine is working on
> > + * @task_lock: Protect cur_task
> > + * @statemachine: Work item for the DOE state machine
> > + * @state: Current state of this DOE
> > + * @timeout_jiffies: 1 second after GO set
> > + * @busy_retries: Count of retry attempts
> > + * @flags: Bit array of PCI_DOE_FLAG_* flags
> > + *
> > + * Note: prots can't be allocated with struct size because the number of
> > + * protocols is not known until after this structure is in use.  However, the
> > + * single discovery protocol is always required to query for the number of
> > + * protocols.
> > + */
> > +struct pci_doe_mb {
> > +	struct pci_dev *pdev;
> > +	struct completion abort_c;
> > +	int irq;
> > +	struct pci_doe_protocol *prots;
> > +	int num_prots;
> > +	u16 cap_offset;
> > +
> > +	wait_queue_head_t wq;
> > +	struct pci_doe_task *cur_task;
> > +	struct mutex task_lock;
> > +	struct delayed_work statemachine;
> > +	enum pci_doe_state state;
> > +	unsigned long timeout_jiffies;
> > +	unsigned int busy_retries;
> > +	unsigned long flags;
> > +};
> > +
> > +/**
> > + * pci_doe_for_each_off - Iterate each DOE capability
> > + * @pdev: struct pci_dev to iterate
> > + * @off: u16 of config space offset of each mailbox capability found
> > + */
> > +#define pci_doe_for_each_off(pdev, off) \
> > +	for (off = pci_find_next_ext_capability(pdev, off, \
> > +					PCI_EXT_CAP_ID_DOE); \
> > +		off > 0; \
> > +		off = pci_find_next_ext_capability(pdev, off, \
> > +					PCI_EXT_CAP_ID_DOE))
> > +
> > +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> > +				     bool use_irq);
> > +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb);
> > +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);
> > +int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task);
> > +
> > +#endif
> > diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> > index bee1a9ed6e66..4e96b45ee36d 100644
> > --- a/include/uapi/linux/pci_regs.h
> > +++ b/include/uapi/linux/pci_regs.h
> > @@ -736,7 +736,8 @@
> >  #define PCI_EXT_CAP_ID_DVSEC	0x23	/* Designated Vendor-Specific */
> >  #define PCI_EXT_CAP_ID_DLF	0x25	/* Data Link Feature */
> >  #define PCI_EXT_CAP_ID_PL_16GT	0x26	/* Physical Layer 16.0 GT/s */
> > -#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_PL_16GT
> > +#define PCI_EXT_CAP_ID_DOE	0x2E	/* Data Object Exchange */
> > +#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_DOE
> >  
> >  #define PCI_EXT_CAP_DSN_SIZEOF	12
> >  #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
> > @@ -1102,4 +1103,30 @@
> >  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK		0x000000F0
> >  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT	4
> >  
> > +/* Data Object Exchange */
> > +#define PCI_DOE_CAP		0x04    /* DOE Capabilities Register */
> > +#define  PCI_DOE_CAP_INT			0x00000001  /* Interrupt Support */
> > +#define  PCI_DOE_CAP_IRQ			0x00000ffe  /* Interrupt Message Number */
> > +#define PCI_DOE_CTRL		0x08    /* DOE Control Register */
> > +#define  PCI_DOE_CTRL_ABORT			0x00000001  /* DOE Abort */
> > +#define  PCI_DOE_CTRL_INT_EN			0x00000002  /* DOE Interrupt Enable */
> > +#define  PCI_DOE_CTRL_GO			0x80000000  /* DOE Go */
> > +#define PCI_DOE_STATUS		0x0c    /* DOE Status Register */
> > +#define  PCI_DOE_STATUS_BUSY			0x00000001  /* DOE Busy */
> > +#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */
> > +#define  PCI_DOE_STATUS_ERROR			0x00000004  /* DOE Error */
> > +#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */
> > +#define PCI_DOE_WRITE		0x10    /* DOE Write Data Mailbox Register */
> > +#define PCI_DOE_READ		0x14    /* DOE Read Data Mailbox Register */
> > +
> > +/* DOE Data Object - note not actually registers */
> > +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
> > +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
> > +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
> > +
> > +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
> > +
> >  #endif /* LINUX_PCI_REGS_H */
> > -- 
> > 2.35.1
> > 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 04/10] cxl/pci: Create auxiliary devices for each DOE mailbox
  2022-04-29 17:20     ` Ira Weiny
@ 2022-05-03 15:32       ` Jonathan Cameron
  0 siblings, 0 replies; 42+ messages in thread
From: Jonathan Cameron @ 2022-05-03 15:32 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Fri, 29 Apr 2022 10:20:54 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> On Fri, Apr 29, 2022 at 04:55:02PM +0100, Jonathan Cameron wrote:
> > On Thu, 14 Apr 2022 13:32:31 -0700
> > ira.weiny@intel.com wrote:
> >   
> > > From: Ira Weiny <ira.weiny@intel.com>
> > >   
> 
> [snip]
> 
> > > +
> > > +/**
> > > + * cxl_pci_create_doe_devices - Create auxiliary bus DOE devices for all DOE
> > > + *				mailboxes found
> > > + *
> > > + * @pdev: The PCI device to scan for DOE mailboxes
> > > + *
> > > + * There is no corresponding destroy of these devices.  This function associates
> > > + * the DOE auxiliary devices created with the pci_dev passed in.  That
> > > + * association is device managed (devm_*) such that the DOE auxiliary device
> > > + * lifetime is always less than or equal to the lifetime of the pci_dev.
> > > + *
> > > + * RETURNS: 0 on success, -ERRNO on failure.
> > > + */
> > > +static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
> > > +{
> > > +	struct device *dev = &pdev->dev;
> > > +	bool use_irq = true;
> > > +	int irqs = 0;
> > > +	u16 off = 0;
> > > +	int rc;
> > > +
> > > +	pci_doe_for_each_off(pdev, off)
> > > +		irqs++;  
> > I believe this is insufficient because there may be other irqs in use
> > on the device.  
> 
> I did not think that was true for any current CXL device.  From what I could
> tell, all CXL devices would be covered by this simple calculation.  I left it
> to whoever adds a new CXL device that needs other IRQs to lift this somewhere
> that covers those allocations.  I probably should have made some comment.
> Sorry.
> 
> > In a similar fashion to that done in pcie/portdrv_core.c
> > I think we need to instead find the maximum msi/msix vector number
> > by reading the config space.  
> 
> I was not aware I could do that...
> 
> > Then we request one more vector
> > than that max value to ensure the vector we need has been requested.  
> 
> Yea at some point I figured this would be lifted out of here as part of a
> larger 'allocate all the vectors for the device' function.
> 
> But for now this is the only place that needs irqs so I punted.  I can lift
> this into something like
> 
> cxl_pci_alloc_irq_vectors(...) and then pass use_irq here.
> 
> But to move this series forward I would propose that
> cxl_pci_alloc_irq_vectors() do what I'm doing here for now.

Handling this right is pretty simple code e.g. equivalent of
https://elixir.bootlin.com/linux/latest/source/drivers/pci/pcie/portdrv_core.c#L62
So my inclination would be to fix it now rather than leaving the chance
of some odd failures later.
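
For the DOE case that would look roughly like this (untested sketch;
portdrv also accounts for the port's other MSI sources, whereas here
only the DOE capabilities are scanned, and the caller would allocate
cxl_pci_doe_max_irq() + 1 vectors):

static int cxl_pci_doe_max_irq(struct pci_dev *pdev)
{
	int max_irq = -1;
	u16 off = 0;
	u32 val;

	/* Highest interrupt message number any DOE mailbox reports */
	pci_doe_for_each_off(pdev, off) {
		pci_read_config_dword(pdev, off + PCI_DOE_CAP, &val);
		if (FIELD_GET(PCI_DOE_CAP_INT, val))
			max_irq = max(max_irq,
				      (int)FIELD_GET(PCI_DOE_CAP_IRQ, val));
	}

	return max_irq;	/* -1 if no DOE mailbox supports interrupts */
}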

Jonathan

> 
> Ira
> 
> > 
> > Jonathan
> >   
> > > +	pci_info(pdev, "Found %d DOE mailboxes\n", irqs);
> > > +
> > > +	/*
> > > +	 * Allocate enough vectors for the DOE's
> > > +	 */
> > > +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSI |
> > > +						     PCI_IRQ_MSIX);
> > > +	if (rc != irqs) {
> > > +		pci_err(pdev,
> > > +			"Not enough interrupts for all the DOEs; use polling\n");
> > > +		use_irq = false;
> > > +		/* Some got allocated; clean them up */
> > > +		if (rc > 0)
> > > +			cxl_pci_free_irq_vectors(pdev);
> > > +	} else {
> > > +		/*
> > > +	 * Enabling bus mastering is required for MSI/MSIx.  It could be
> > > +		 * done later within the DOE initialization, but as it
> > > +		 * potentially has other impacts keep it here when setting up
> > > +		 * the IRQ's.
> > > +		 */
> > > +		pci_set_master(pdev);
> > > +		rc = devm_add_action_or_reset(dev,
> > > +					      cxl_pci_free_irq_vectors,
> > > +					      pdev);
> > > +		if (rc)
> > > +			return rc;
> > > +	}
> > > +
> > > +	pci_doe_for_each_off(pdev, off) {
> > > +		struct auxiliary_device *adev;
> > > +		struct cxl_doe_dev *new_dev;
> > > +		int id;
> > > +
> > > +		new_dev = kzalloc(sizeof(*new_dev), GFP_KERNEL);
> > > +		if (!new_dev)
> > > +			return -ENOMEM;
> > > +
> > > +		new_dev->pdev = pdev;
> > > +		new_dev->cap_offset = off;
> > > +		new_dev->use_irq = use_irq;
> > > +
> > > +		/* Set up struct auxiliary_device */
> > > +		adev = &new_dev->adev;
> > > +		id = ida_alloc(&pci_doe_adev_ida, GFP_KERNEL);
> > > +		if (id < 0) {
> > > +			kfree(new_dev);
> > > +			return -ENOMEM;
> > > +		}
> > > +
> > > +		adev->id = id;
> > > +		adev->name = DOE_DEV_NAME;
> > > +		adev->dev.release = cxl_pci_doe_dev_release;
> > > +		adev->dev.parent = dev;
> > > +
> > > +		if (auxiliary_device_init(adev)) {
> > > +			cxl_pci_doe_dev_release(&adev->dev);
> > > +			return -EIO;
> > > +		}
> > > +
> > > +		if (auxiliary_device_add(adev)) {
> > > +			auxiliary_device_uninit(adev);
> > > +			return -EIO;
> > > +		}
> > > +
> > > +		rc = devm_add_action_or_reset(dev, cxl_pci_doe_destroy_device,
> > > +					      adev);
> > > +		if (rc)
> > > +			return rc;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > >  {
> > >  	struct cxl_register_map map;
> > > @@ -630,6 +753,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > >  	if (rc)
> > >  		return rc;
> > >  
> > > +	rc = cxl_pci_create_doe_devices(pdev);
> > > +	if (rc)
> > > +		return rc;
> > > +
> > >  	cxl_dvsec_ranges(cxlds);
> > >  
> > >  	cxlmd = devm_cxl_add_memdev(cxlds);  
> >   


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 04/10] cxl/pci: Create auxiliary devices for each DOE mailbox
  2022-04-29 17:01         ` Dan Williams
@ 2022-05-03 16:14           ` Jonathan Cameron
  0 siblings, 0 replies; 42+ messages in thread
From: Jonathan Cameron @ 2022-05-03 16:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Weiny, Ira, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, Linux Kernel Mailing List, linux-cxl, Linux PCI

On Fri, 29 Apr 2022 10:01:09 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> On Fri, Apr 29, 2022 at 9:39 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Thu, 28 Apr 2022 14:09:38 -0700
> > ira.weiny@intel.com wrote:
> >  
> > > On Wed, Apr 27, 2022 at 06:19:42PM +0100, Jonathan Cameron wrote:  
> > > > On Thu, 14 Apr 2022 13:32:31 -0700
> > > > ira.weiny@intel.com wrote:
> > > >  
> > > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > >
> > > > > CXL kernel drivers optionally need to access DOE mailbox capabilities.
> > > > > Access to mailboxes for things such as CDAT, SPDM, and IDE are needed by
> > > > > the kernel while other access is designed towards user space usage.  An
> > > > > example of this is for CXL Compliance Testing (see CXL 2.0 14.16.4
> > > > > Compliance Mode DOE) which offers a mechanism to set different test
> > > > > modes for a device.
> > > > >
> > > > > There is no anticipated need for the kernel to share an individual
> > > > > mailbox with user space.  Thus developing an interface to marshal access
> > > > > between the kernel and user space for a single mailbox is unnecessary
> > > > > overhead.  However, having the kernel relinquish some mailboxes to be
> > > > > controlled by user space is a reasonable compromise to share access to
> > > > > the device.
> > > > >
> > > > > The auxiliary bus provides an elegant solution for this.  Each DOE
> > > > > capability is given its own auxiliary device.  This device is controlled
> > > > > by a kernel driver by default which restricts access to the mailbox.
> > > > > Unbinding the driver from a single auxiliary device (DOE mailbox
> > > > > capability) frees the mailbox for user space access.  This architecture
> > > > > also allows a clear picture of which mailboxes are kernel controlled vs
> > > > > not.
> > > > >
> > > > > Iterate each DOE mailbox capability and create auxiliary bus devices.
> > > > > Follow on patches will define a driver for the newly created devices.
> > > > >
> > > > > sysfs shows the devices.
> > > > >
> > > > > $ ls -l /sys/bus/auxiliary/devices/
> > > > > total 0
> > > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.0 -> ../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.0
> > > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.1 -> ../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.1
> > > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.2 -> ../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.2
> > > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.3 -> ../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.3
> > > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.4 -> ../../../devices/pci0000:35/0000:35:00.0/0000:36:00.0/cxl_pci.doe.4
> > > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.5 -> ../../../devices/pci0000:bf/0000:bf:00.0/0000:c0:00.0/cxl_pci.doe.5
> > > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.6 -> ../../../devices/pci0000:35/0000:35:01.0/0000:37:00.0/cxl_pci.doe.6
> > > > > lrwxrwxrwx 1 root root 0 Mar 24 10:47 cxl_pci.doe.7 -> ../../../devices/pci0000:bf/0000:bf:01.0/0000:c1:00.0/cxl_pci.doe.7
> > > > >
> > > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>  
> > > >
> > > > I'm not 100% happy with effectively having one solution for CXL
> > > > and probably a different one for DOEs on switch ports
> > > > (which I just hacked into a port service driver to convince
> > > > myself there was at least one plausible way of doing that) but if
> > > > this effectively separates the two discussions then I guess I can
> > > > live with it for now ;)  
> > >
> > > I took some time this morning to mull this over and talk to Dan...
> > >
> > > :-(
> > >
> > > Truthfully the aux driver does very little except provide a way for admins to
> > > trigger the driver to stop/start accessing the Mailbox.
> > >
> > > I suppose a simple sysfs interface could be done to do the same?
> > >
> > > I'll let Dan weigh in here.  
> >
> > I wonder if the best short-term option is to not provide a means of
> > removing it at all (separate from the PCI driver, that is).
> > Then we can take our time to decide on the interface if we ever
> > get much demand for one.
> >  
> > >  
> > > >
> > > > Once this is merged we can start the discussion about how to
> > > > handle switch ports with DOEs both for CDAT and SPDM.  
> > >
> > > I'm ok with that too.  However, I was thinking that this was not a user ABI.
> > > But it really is.  If user space starts writing scripts to unbind drivers and
> > > then we drop the aux driver support, it will break them...
> > >  
> > > >
> > > > I'll send out an RFC that is so hideous it will get people to
> > > > suggestion how to do it better!  
> > >
> > > I think I'd like to see that.  
> >
> > Fair enough. It may muddy the waters a bit :( I'll send an RFC
> > next week.  I've not looked at how the CXL region code etc would
> > actually get to the latency / bandwidth info from the driver yet;
> > it just goes as far as reading a CDAT length. I also want to actually
> > hook up some decent switch CDAT emulation in the QEMU code
> > (right now it's giving the same default table as for a type 3 device).
> >
> > I just hope we don't bikeshed around the RFC in a fashion that slows
> > this series moving forwards.  
> 
> I think we have time in the sense that the worst that happens is that
> tooling picks the wrong CFMWS to dynamically create a region and the
> performance ends up being sub-optimal. That's tolerable to work around
> in userspace in the near term. I want to get some wider confidence in
> the DOE ABI with respect to the known protocols and what to do about
> the vendor-specific protocols that may conflict and will be driven
> from userspace-issued config cycles.
Hi Dan,

Ok, though I would like to try to push the conversation forwards beyond where
we got to last year. IIRC the general conclusion was that all protocols
sharing a DOE instance (and more generally because they all share
discovery) must be mediated by the kernel and that we would not be
supporting a generic access interface (there was one in a much earlier
version of this patch set).  Whilst it's far from the only thing that needs
resolving, this DOE question is also a blocker on getting anywhere with
CMA/SPDM.  Unbinding a driver as the means to stop the kernel accessing
a DOE is reliant on the host driver not deciding to do any more protocol
discovery - we can probably rely on that but it's not pretty.

> That likely means that no DOE ABI
> is the best ABI to start which means not moving forward with
> aux-devices so scripts do not become attached to something that is not
> fully committed to being carried forward.

This isn't really a DOE ABI we are discussing here; it's a CXL subsystem ABI.
If we decided to only expose the DOE as an internal implementation
detail (calls made directly from the appropriate existing CXL driver) then
there wouldn't be an ABI question. We would be limiting DOE access
for other protocols but personally I don't see that as a problem
in the short to medium term.

Things may be more 'interesting' for the PCIe port services driver though
(see the RFC just sent out):
https://lore.kernel.org/linux-cxl/20220503153449.4088-1-Jonathan.Cameron@huawei.com/T/#t
Perhaps if we can resolve what that should look like, the CXL EP side
of things will be much easier to figure out?

> 
> I still want to refresh the request_config_region() support for at
> least having the kernel warn on conflicting userspace configuration
> writes to config areas claimed by a driver.

Great, that seems like a sensible step to do in parallel.

Jonathan




* Re: [PATCH V8 06/10] cxl/pci: Find the DOE mailbox which supports CDAT
  2022-04-27 17:49   ` Jonathan Cameron
@ 2022-05-09 21:25     ` Ira Weiny
  0 siblings, 0 replies; 42+ messages in thread
From: Ira Weiny @ 2022-05-09 21:25 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Wed, Apr 27, 2022 at 06:49:56PM +0100, Jonathan Cameron wrote:
> On Thu, 14 Apr 2022 13:32:33 -0700
> ira.weiny@intel.com wrote:
> 
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Each CXL device may have multiple DOE mailbox capabilities and each
> > mailbox may support multiple protocols.
> > 
> > Search the DOE auxiliary devices for one which supports the CDAT
> > protocol.  Cache that device to be used for future queries.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> One question inline.  With that addressed (or convincing argument why not)
> the rest looks good.
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Thanks!

> 
> > 
> > ---
> > Changes from V7
> > 	Minor code clean ups
> > 
> > Changes from V6
> > 	Adjust for aux devices being a CXL only concept
> > 	Update commit msg.
> > 	Ensure devices iterated by auxiliary_find_device() are checked
> > 		to be DOE devices prior to checking for the CDAT
> > 		protocol
> > 	From Ben
> > 		Ensure reference from auxiliary_find_device() is dropped
> > ---
> >  drivers/cxl/cxl.h    |  2 ++
> >  drivers/cxl/cxlmem.h |  2 ++
> >  drivers/cxl/pci.c    | 76 +++++++++++++++++++++++++++++++++++++++++++-
> >  3 files changed, 79 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index 990b6670222e..50817fd2c912 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -90,6 +90,8 @@ static inline int cxl_hdm_decoder_count(u32 cap_hdr)
> >  #define CXLDEV_MBOX_BG_CMD_STATUS_OFFSET 0x18
> >  #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
> >  
> > +#define CXL_DOE_PROTOCOL_TABLE_ACCESS 2
> > +
> >  /*
> >   * Using struct_group() allows for per register-block-type helper routines,
> >   * without requiring block-type agnostic code to include the prefix.
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 7235d2f976e5..0dc6988afb30 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -168,6 +168,7 @@ struct cxl_endpoint_dvsec_info {
> >   * Currently only memory devices are represented.
> >   *
> >   * @dev: The device associated with this CXL state
> > > + * @cdat_doe: Auxiliary DOE device capable of reading CDAT
> >   * @regs: Parsed register blocks
> >   * @cxl_dvsec: Offset to the PCIe device DVSEC
> >   * @payload_size: Size of space for payload
> > @@ -200,6 +201,7 @@ struct cxl_endpoint_dvsec_info {
> >  struct cxl_dev_state {
> >  	struct device *dev;
> >  
> > +	struct cxl_doe_dev *cdat_doe;
> >  	struct cxl_regs regs;
> >  	int cxl_dvsec;
> >  
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index 0dec1f1a3f38..622cac925262 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -706,6 +706,80 @@ static int cxl_pci_create_doe_devices(struct pci_dev *pdev)
> >  	return 0;
> >  }
> >  
> > +static bool cxl_doe_dev_supports_prot(struct cxl_doe_dev *doe_dev, u16 vid, u16 pid)
> > +{
> > +	struct cxl_doe_drv_state *doe_ds;
> > +	bool ret;
> > +
> > +	doe_ds = cxl_pci_doe_get_drv(doe_dev);
> > +	if (!doe_ds) {
> > +		cxl_pci_doe_put_drv(doe_dev);
> 
> That means the get_drv has a rather unexpected side effect.
> Could you move the NULL check into the function so that we don't retain
> a reference if it fails?

Indeed.  I struggled a bit with this because get_* usually takes a
reference on the subject of the get, but in this case we are really getting a
reference on the device, not the driver...

I originally open coded this and the functions were 'get_device'...

:-/

I think this also may have been why the device was originally part of the
driver state.

I think this is the better way to code these access functions:

static struct cxl_doe_drv_state *cxl_pci_doe_get_drv(struct cxl_doe_dev *doe_dev)
{
        struct cxl_doe_drv_state *cxl_ds;

        /* On success, return with driver_access held for read */
        down_read(&doe_dev->driver_access);
        cxl_ds = auxiliary_get_drvdata(&doe_dev->adev);
        if (!cxl_ds)
                up_read(&doe_dev->driver_access);
        return cxl_ds;
}

static void cxl_pci_doe_put_drv(struct cxl_doe_drv_state *cxl_ds)
{
        /* Tolerate NULL so callers need no separate check */
        if (!cxl_ds)
                return;
        up_read(&cxl_ds->doe_dev->driver_access);
}
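
For illustration, a caller such as cxl_doe_dev_supports_prot() would then
pair up naturally (a sketch only, based on the helpers above and the
existing pci_doe_supports_prot()):

static bool cxl_doe_dev_supports_prot(struct cxl_doe_dev *doe_dev, u16 vid, u16 pid)
{
        struct cxl_doe_drv_state *doe_ds;
        bool ret;

        /* get_drv() only holds driver_access on success */
        doe_ds = cxl_pci_doe_get_drv(doe_dev);
        if (!doe_ds)
                return false;
        ret = pci_doe_supports_prot(doe_ds->doe_mb, vid, pid);
        cxl_pci_doe_put_drv(doe_ds);
        return ret;
}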

Ira

> 
> > +		return false;
> > +	}
> > +	ret = pci_doe_supports_prot(doe_ds->doe_mb, vid, pid);
> > +	cxl_pci_doe_put_drv(doe_dev);
> > +	return ret;
> > +}
> > +
> > +static int cxl_match_cdat_doe_device(struct device *dev, const void *data)
> > +{
> > +	const struct cxl_dev_state *cxlds = data;
> > +	struct auxiliary_device *adev = to_auxiliary_dev(dev);
> > +	struct cxl_doe_dev *doe_dev;
> > +
> > +	/* Ensure this is a DOE device */
> > +	if (strcmp(DOE_DEV_NAME, adev->name))
> > +		return 0;
> > +
> > +	/* Determine if this auxiliary device belongs to the cxlds */
> > +	if (cxlds->dev != dev->parent)
> > +		return 0;
> > +
> > +	doe_dev = container_of(adev, struct cxl_doe_dev, adev);
> > +
> > +	/* If it is one of ours check for the CDAT protocol */
> > +	if (!cxl_doe_dev_supports_prot(doe_dev, PCI_DVSEC_VENDOR_ID_CXL,
> > +				   CXL_DOE_PROTOCOL_TABLE_ACCESS))
> > +		return 0;
> > +
> > +	return 1;
> > +}
> > +
> > +static void drop_cdat_doe_ref(void *data)
> > +{
> > +	struct cxl_doe_dev *cdat_doe = data;
> > +
> > +	put_device(&cdat_doe->adev.dev);
> > +}
> > +
> > +static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
> > +{
> > +	struct device *dev = cxlds->dev;
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +	struct auxiliary_device *adev;
> > +	int rc;
> > +
> > +	rc = cxl_pci_create_doe_devices(pdev);
> > +	if (rc)
> > +		return rc;
> > +
> > +	adev = auxiliary_find_device(NULL, cxlds, &cxl_match_cdat_doe_device);
> > +
> > +	if (adev) {
> > +		struct cxl_doe_dev *doe_dev = container_of(adev,
> > +							   struct cxl_doe_dev,
> > +							   adev);
> > +
> > +		/* auxiliary_find_device() takes the reference */
> > +		rc = devm_add_action_or_reset(dev, drop_cdat_doe_ref, doe_dev);
> > +		if (rc)
> > +			return rc;
> > +		cxlds->cdat_doe = doe_dev;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  {
> >  	struct cxl_register_map map;
> > @@ -772,7 +846,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  	if (rc)
> >  		return rc;
> >  
> > -	rc = cxl_pci_create_doe_devices(pdev);
> > +	rc = cxl_setup_doe_devices(cxlds);
> >  	if (rc)
> >  		return rc;
> >  
> 


* Re: [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-04-14 20:32 ` [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes ira.weiny
  2022-04-28 21:27   ` Bjorn Helgaas
@ 2022-05-30 19:06   ` Lukas Wunner
  2022-05-31 10:33     ` Jonathan Cameron
  2022-05-31 23:43     ` Ira Weiny
  1 sibling, 2 replies; 42+ messages in thread
From: Lukas Wunner @ 2022-05-30 19:06 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Bjorn Helgaas, Jonathan Cameron, Christoph Hellwig,
	Alison Schofield, Vishal Verma, Ben Widawsky, linux-kernel,
	linux-cxl, linux-pci

On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
[...]
> +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +	/* Read the second dword to get the length */
> +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +
> +	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> +	if (length > SZ_1M || length < 2)
> +		return -EIO;
> +

This API requires consumers to know the size of the response in advance.
That's not always the case, as exemplified by SPDM VERSION responses.
Jonathan uses a kludge in his SPDM patch which submits a first request
with a fixed size of 2 versions and, if that turns out to be too small,
submits a second request.

It would be nice if consumers are allowed to set request_pl to NULL.
Then a payload can be allocated here in pci_doe_recv_resp() with the
size retrieved above.

A flag may be necessary to indicate that the response is heap-allocated
and needs to be freed upon destruction of the pci_doe_task.
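
A rough sketch of the idea, sitting after the header dwords have been
consumed and length reduced accordingly (the PCI_DOE_FLAG_RESP_ALLOC flag
and a task->flags field are made-up names for the sake of illustration):

	if (!task->response_pl) {
		/* Size the buffer from the length field just read */
		task->response_pl = kzalloc(length * sizeof(u32), GFP_KERNEL);
		if (!task->response_pl)
			return -ENOMEM;
		task->response_pl_sz = length * sizeof(u32);
		/* Mark the response for kfree() on task destruction */
		task->flags |= PCI_DOE_FLAG_RESP_ALLOC;
	}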


> +	/* First 2 dwords have already been read */
> +	length -= 2;
> +	/* Read the rest of the response payload */
> +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> +				      &task->response_pl[i]);
> +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +	}

You need to check the Data Object Ready bit.  The device may clear the
bit prematurely (e.g. as a result of a concurrent FLR or Conventional
Reset).  You'll continue reading zero dwords from the mailbox and
pretend success to the caller even though the response is truncated.

If you're concerned about performance when checking the bit on every
loop iteration, checking it only on the last but one iteration should
be sufficient to detect truncation.
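
Concretely, the loop could check the bit like this (a sketch reusing the
names from this patch):

	/* Read the rest of the response payload */
	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
		/* Detect the object being withdrawn mid-read, e.g. by FLR */
		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
		if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val))
			return -EIO;
		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
				      &task->response_pl[i]);
		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
	}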


> +	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> +	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> +			      response_pl);
> +	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> +			   response_pl);

The fact that you need line breaks here is an indication that the
macros are too long.


> +/* DOE Data Object - note not actually registers */
> +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
> +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
> +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
> +
> +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000

I'd get rid of "DATA_OBJECT_" everywhere, it's redundant with the
"Data Object" in "DOE".


> +#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */

Another redundancy, I would get rid of the second "_STATUS".


> +#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */

I would shorten to PCI_DOE_STATUS_READY.


> 		Simplify submitting work to the mailbox
> 			Replace pci_doe_exchange_sync() with
> 			pci_doe_submit_task() Consumers of the mailbox
> 			are now responsible for setting up callbacks
> 			within a task object and submitting them to the
> 			mailbox to be processed.

I honestly think that's a mistake.  All consumers both in the CDAT
as well as in the SPDM series just want to wait for completion of
the task.  They have no need for an arbitrary callback and shouldn't
be burdened with providing one.  It just unnecessarily complicates
the API.

So only providing pci_doe_exchange_sync() and doing away with
pci_doe_submit_task() would seem like a more appropriate approach.


> +/**
> + * pci_doe_for_each_off - Iterate each DOE capability
> + * @pdev: struct pci_dev to iterate
> + * @off: u16 of config space offset of each mailbox capability found
> + */
> +#define pci_doe_for_each_off(pdev, off) \
> +	for (off = pci_find_next_ext_capability(pdev, off, \
> +					PCI_EXT_CAP_ID_DOE); \
> +		off > 0; \
> +		off = pci_find_next_ext_capability(pdev, off, \
> +					PCI_EXT_CAP_ID_DOE))
> +
> +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> +				     bool use_irq);
> +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb);
> +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);

Drivers should not be concerned with the intricacies of DOE
capabilities and mailboxes.

Moreover, the above API doesn't allow different drivers to access
the same DOE mailbox concurrently, e.g. if that mailbox supports
multiple protocols.  There's no locking to serialize access to the
mailbox by the drivers.

This should be moved to the PCI core instead:  In pci_init_capabilities(),
add a new call pci_doe_init() which enumerates all DOE capabilities.
Add a list_head to struct pci_dev and add each DOE instance found
to that list.  Destroy the list elements in pci_destroy_dev().
No locking needed for the list_head, you only ever modify the list
on device enumeration and destruction.

Then provide a pci_doe_find_mailbox() library function which drivers
call to retrieve the pci_doe_mb for a given pci_dev/vid/type tuple.
That avoids the need to traverse the list time and again.
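
Roughly (a sketch; the doe_mbs list_head in struct pci_dev and the list
member in struct pci_doe_mb are the additions proposed above):

struct pci_doe_mb *pci_doe_find_mailbox(struct pci_dev *pdev, u16 vid, u8 type)
{
	struct pci_doe_mb *doe_mb;

	/* The list is only modified at enumeration/destruction time */
	list_for_each_entry(doe_mb, &pdev->doe_mbs, list)
		if (pci_doe_supports_prot(doe_mb, vid, type))
			return doe_mb;

	return NULL;
}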


> +/**
> + * struct pci_doe_mb - State for a single DOE mailbox

We generally use the same terms as the spec to make it easier for
readers to connect the language in the spec to the implementation.

The spec uniformly refers to "DOE instance".  I guess "mailbox"
is slightly more concise, so keep that, but please at least mention
the term "instance" in the struct's kernel-doc.

This implementation uses the term "task" for one request/response.
That term is not mentioned in the spec at all.  The spec refers to
"exchange" and "transfer" on several occasions, so I would have chosen
either one of those instead of the somewhat unfortunate "task".


> + * This state is used to manage a single DOE mailbox capability.  All fields
> + * should be considered opaque to the consumers and the structure passed into
> + * the helpers below after being created by devm_pci_doe_create()

If the struct is considered opaque, why is it exposed in a public
header file?  Just use a forward declaration in the header
so that consumers can pass around pointers to the struct,
and hide the declaration proper in doe.c.


> + * @pdev: PCI device this belongs to mailbox belongs to
                             ^^^^^^^^^^
Typo.

> + * @prots: Array of protocols supported on this DOE
> + * @num_prots: Size of prots array

Use @prots instead of prots everywhere in the kernel-doc.


> +	/*
> +	 * NOTE: doe_mb_prots is freed by pci_doe_free_mb automatically on
> +	 * error if pci_doe_cache_protocols() fails past this point.
> +	 */

s/doe_mb_prots/doe_mb->prots/
s/pci_doe_free_mb/pci_doe_free_mb()/


> +	/* DOE requests must be a whole number of DW */
> +	if (task->request_pl_sz % sizeof(u32))
> +		return -EINVAL;

It would be nice if this restriction could be lifted.  SPDM uses
requests which are not padded to dword-length.  It can run over other
transports which may not impose such restrictions.  The SPDM layer
should not need to worry about quirks of the transport layer.


> +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> +{
> +	struct pci_doe_mb *doe_mb = data;
> +	struct pci_dev *pdev = doe_mb->pdev;
> +	int offset = doe_mb->cap_offset;
> +	u32 val;
> +
> +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +
> +	/* Leave the error case to be handled outside IRQ */
> +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> +		return IRQ_HANDLED;
> +	}
> +
> +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> +					PCI_DOE_STATUS_INT_STATUS);
> +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> +		return IRQ_HANDLED;
> +	}
> +
> +	return IRQ_NONE;
> +}

PCIe 6.0, table 7-316 says that an interrupt is also raised when
"the DOE Busy bit has been Cleared", yet such an interrupt is
not handled here.  It is incorrectly treated as a spurious
interrupt by returning IRQ_NONE.  The right thing to do
is probably to wake the state machine in case it's polling
for the Busy flag to clear.


> +enum pci_doe_state {
> +	DOE_IDLE,
> +	DOE_WAIT_RESP,
> +	DOE_WAIT_ABORT,
> +	DOE_WAIT_ABORT_ON_ERR,
> +};
> +
> +#define PCI_DOE_FLAG_ABORT	0
> +#define PCI_DOE_FLAG_DEAD	1

That's all internal and should live in doe.c, not the header file.

Thanks,

Lukas


* Re: [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-05-30 19:06   ` Lukas Wunner
@ 2022-05-31 10:33     ` Jonathan Cameron
  2022-06-01  2:59       ` Ira Weiny
  2022-05-31 23:43     ` Ira Weiny
  1 sibling, 1 reply; 42+ messages in thread
From: Jonathan Cameron @ 2022-05-31 10:33 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: ira.weiny, Dan Williams, Bjorn Helgaas, Christoph Hellwig,
	Alison Schofield, Vishal Verma, Ben Widawsky, linux-kernel,
	linux-cxl, linux-pci

On Mon, 30 May 2022 21:06:57 +0200
Lukas Wunner <lukas@wunner.de> wrote:

> On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> > +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)  
> [...]
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +	/* Read the second dword to get the length */
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +
> > +	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> > +	if (length > SZ_1M || length < 2)
> > +		return -EIO;
> > +  
> 
> This API requires consumers to know the size of the response in advance.
> That's not always the case, as exemplified by SPDM VERSION responses.
> Jonathan uses a kludge in his SPDM patch which submits a first request
> with a fixed size of 2 versions and, if that turns out to be too small,
> submits a second request.
> 
> It would be nice if consumers are allowed to set request_pl to NULL.
> Then a payload can be allocated here in pci_doe_recv_resp() with the
> size retrieved above.
> 
> A flag may be necessary to indicate that the response is heap-allocated
> and needs to be freed upon destruction of the pci_doe_task.

If possible I'd make that the caller's problem. It should know it provided
NULL and expected a response.

I'd suggest making this a 'future' feature just to keep this initial
version simple.  Won't be hard to add later.

> 
> 
> > +	/* First 2 dwords have already been read */
> > +	length -= 2;
> > +	/* Read the rest of the response payload */
> > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > +				      &task->response_pl[i]);
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +	}  
> 
> You need to check the Data Object Ready bit.  The device may clear the
> bit prematurely (e.g. as a result of a concurrent FLR or Conventional
> Reset).  You'll continue reading zero dwords from the mailbox and
> pretend success to the caller even though the response is truncated.
> 
> If you're concerned about performance when checking the bit on every
> loop iteration, checking it only on the last but one iteration should
> be sufficient to detect truncation.

Good catch - I hate corner cases.  Thankfully this one is trivial to
check for.

> 
> 
> > +	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> > +	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> > +			      response_pl);
> > +	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> > +			   response_pl);  
> 
> The fact that you need line breaks here is an indication that the
> macros are too long.
> 
> 
> > +/* DOE Data Object - note not actually registers */
> > +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
> > +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
> > +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
> > +
> > +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000  
> 
> I'd get rid of "DATA_OBJECT_" everywhere, it's redundant with the
> "Data Object" in "DOE".

We need something to make it clear these aren't DVSEC headers for
example, but definitely could be shorter.

PCI_DOE_OBJ_... 

perhaps?

> 
> 
> > +#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */  
> 
> Another redundancy, I would get rid of the second "_STATUS".

Hmm. I'll leave this one to Ira's discretion but I'm not keen on cropping
too much of the naming, given the loss of alignment with the spec.

> 
> 
> > +#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */  
> 
> I would shorten to PCI_DOE_STATUS_READY.

Seems reasonable, though there is always some burden in moving
away from full spec names.

> 
> 
> > 		Simplify submitting work to the mailbox
> > 			Replace pci_doe_exchange_sync() with
> > 			pci_doe_submit_task() Consumers of the mailbox
> > 			are now responsible for setting up callbacks
> > 			within a task object and submitting them to the
> > 			mailbox to be processed.  
> 
> I honestly think that's a mistake.  All consumers both in the CDAT
> as well as in the SPDM series just want to wait for completion of
> the task.  They have no need for an arbitrary callback and shouldn't
> be burdened with providing one.  It just unnecessarily complicates
> the API.
> 
> So only providing pci_doe_exchange_sync() and doing away with
> pci_doe_submit_task() would seem like a more appropriate approach.

We've gone around the houses with this.  At this stage I don't
care strongly either way and it's all in kernel code so we
can refine it later once we have lots of examples.

> 
> 
> > +/**
> > + * pci_doe_for_each_off - Iterate each DOE capability
> > + * @pdev: struct pci_dev to iterate
> > + * @off: u16 of config space offset of each mailbox capability found
> > + */
> > +#define pci_doe_for_each_off(pdev, off) \
> > +	for (off = pci_find_next_ext_capability(pdev, off, \
> > +					PCI_EXT_CAP_ID_DOE); \
> > +		off > 0; \
> > +		off = pci_find_next_ext_capability(pdev, off, \
> > +					PCI_EXT_CAP_ID_DOE))
> > +
> > +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> > +				     bool use_irq);
> > +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb);
> > +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);  
> 
> Drivers should not be concerned with the intricacies of DOE
> capabilities and mailboxes.
> 
> Moreover, the above API doesn't allow different drivers to access
> the same DOE mailbox concurrently, e.g. if that mailbox supports
> multiple protocols.  There's no locking to serialize access to the
> mailbox by the drivers.
> 
> This should be moved to the PCI core instead:  In pci_init_capabilities(),
> add a new call pci_doe_init() which enumerates all DOE capabilities.
> Add a list_head to struct pci_dev and add each DOE instance found
> to that list.  Destroy the list elements in pci_destroy_dev().
> No locking needed for the list_head, you only ever modify the list
> on device enumeration and destruction.
> 
> Then provide a pci_doe_find_mailbox() library function which drivers
> call to retrieve the pci_doe_mb for a given pci_dev/vid/type tuple.
> That avoids the need to traverse the list time and again.

I've lost track a bit as this thread has been running a long time, but
I think we are moving back to something that looks more like this
(and like the much earlier versions of the set, which did more or less what
you describe, though they had their own issues, since resolved).

There is an exciting corner around interrupts, though, that complicates
things.  Currently I think I'm right in saying the PCI core never
sets up interrupts; that is always a job for the drivers.

My first thought is to do something similar to the hack I did for
the switch ports, and use polled mode to query supported protocols.
Then a later call from the driver would attempt to enable interrupts
if desired for a given DOE instance.
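
Very roughly, that two-phase split might look like this (pci_doe_init(),
the doe_mbs list and pci_doe_enable_irq() are all invented names, and
error handling is elided):

/* From pci_init_capabilities(): enumerate mailboxes in polled mode only */
void pci_doe_init(struct pci_dev *pdev)
{
	struct pci_doe_mb *doe_mb;
	u16 off = 0;

	while ((off = pci_find_next_ext_capability(pdev, off,
						   PCI_EXT_CAP_ID_DOE))) {
		doe_mb = pci_doe_create_mb(pdev, off, false);
		if (doe_mb)
			list_add_tail(&doe_mb->list, &pdev->doe_mbs);
	}
}

/* Later, called by a driver once it has set up its MSI/MSI-X vectors */
int pci_doe_enable_irq(struct pci_doe_mb *doe_mb, int irq);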


> 
> 
> > +/**
> > + * struct pci_doe_mb - State for a single DOE mailbox  
> 
> We generally use the same terms as the spec to make it easier for
> readers to connect the language in the spec to the implementation.
> 
> The spec uniformly refers to "DOE instance".  I guess "mailbox"
> is slightly more concise, so keep that, but please at least mention
> the term "instance" in the struct's kernel-doc.
> 
> This implementation uses the term "task" for one request/response.
> That term is not mentioned in the spec at all.  The spec refers to
> "exchange" and "transfer" on several occasions, so I would have chosen
> either one of those instead of the somewhat unfortunate "task".
> 
> 
> > + * This state is used to manage a single DOE mailbox capability.  All fields
> > + * should be considered opaque to the consumers and the structure passed into
> > + * the helpers below after being created by devm_pci_doe_create()  
> 
> If the struct is considered opaque, why is it exposed in a public
> header file?  Just use a forward declaration in the header
> so that consumers can pass around pointers to the struct,
> and hide the declaration proper in doe.c.
> 
> 
> > + * @pdev: PCI device this belongs to mailbox belongs to  
>                              ^^^^^^^^^^
> Typo.
> 
> > + * @prots: Array of protocols supported on this DOE
> > + * @num_prots: Size of prots array  
> 
> Use @prots instead of prots everywhere in the kernel-doc.
> 
> 
> > +	/*
> > +	 * NOTE: doe_mb_prots is freed by pci_doe_free_mb automatically on
> > +	 * error if pci_doe_cache_protocols() fails past this point.
> > +	 */  
> 
> s/doe_mb_prots/doe_mb->prots/
> s/pci_doe_free_mb/pci_doe_free_mb()/
> 
> 
> > +	/* DOE requests must be a whole number of DW */
> > +	if (task->request_pl_sz % sizeof(u32))
> > +		return -EINVAL;  
> 
> It would be nice if this restriction could be lifted.  SPDM uses
> requests which are not padded to dword-length.  It can run over other
> transports which may not impose such restrictions.  The SPDM layer
> should not need to worry about quirks of the transport layer.

This is a pain.  DOE absolutely requires 32-bit padding, and
reads need to be 32-bit.  We can obviously do something nasty
to deal with the tail via a bounce buffer though.

I'd pencil that in as a future feature though rather than worry
about it before we have any support at all in place.
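
For the record, the bounce buffer in the send path could look something
like this (a sketch; PCI_DOE_WRITE names the DOE Write Data Mailbox
register by analogy with PCI_DOE_READ, and the header's dword count would
be rounded up to cover the padded tail):

	size_t whole = task->request_pl_sz / sizeof(u32);
	size_t rem = task->request_pl_sz % sizeof(u32);
	u32 tail = 0;

	for (i = 0; i < whole; i++)
		pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
				       task->request_pl[i]);
	if (rem) {
		/* Zero-pad the trailing bytes through a bounce dword */
		memcpy(&tail, (u8 *)task->request_pl + whole * sizeof(u32), rem);
		pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, tail);
	}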

> 
> 
> > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > +{
> > +	struct pci_doe_mb *doe_mb = data;
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	u32 val;
> > +
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +	/* Leave the error case to be handled outside IRQ */
> > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > +					PCI_DOE_STATUS_INT_STATUS);
> > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	return IRQ_NONE;
> > +}  
> 
> PCIe 6.0, table 7-316 says that an interrupt is also raised when
> "the DOE Busy bit has been Cleared", yet such an interrupt is
> not handled here.  It is incorrectly treated as a spurious
> interrupt by returning IRQ_NONE.  The right thing to do
> is probably to wake the state machine in case it's polling
> for the Busy flag to clear.

Ah. I remember testing this via a lot of hacking on the QEMU code
to inject the various races that can occur (it was really ugly to do).

Guess we lost the handling at some point.  I think your fix
is the right one.

> 
> 
> > +enum pci_doe_state {
> > +	DOE_IDLE,
> > +	DOE_WAIT_RESP,
> > +	DOE_WAIT_ABORT,
> > +	DOE_WAIT_ABORT_ON_ERR,
> > +};
> > +
> > +#define PCI_DOE_FLAG_ABORT	0
> > +#define PCI_DOE_FLAG_DEAD	1  
> 
> That's all internal and should live in doe.c, not the header file.
> 
> Thanks,
> 
> Lukas
> 

Thanks for the review and thanks again to Ira for taking this forwards.

Jonathan




* Re: [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-05-30 19:06   ` Lukas Wunner
  2022-05-31 10:33     ` Jonathan Cameron
@ 2022-05-31 23:43     ` Ira Weiny
  1 sibling, 0 replies; 42+ messages in thread
From: Ira Weiny @ 2022-05-31 23:43 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Dan Williams, Bjorn Helgaas, Jonathan Cameron, Christoph Hellwig,
	Alison Schofield, Vishal Verma, Ben Widawsky, linux-kernel,
	linux-cxl, linux-pci

On Mon, May 30, 2022 at 09:06:57PM +0200, Lukas Wunner wrote:
> On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> > +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
> [...]
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +	/* Read the second dword to get the length */
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +
> > +	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> > +	if (length > SZ_1M || length < 2)
> > +		return -EIO;
> > +
> 
> This API requires consumers to know the size of the response in advance.
> That's not always the case, as exemplified by SPDM VERSION responses.
> Jonathan uses a kludge in his SPDM patch which submits a first request
> with a fixed size of 2 versions and, if that turns out to be too small,
> submits a second request.
> 
> It would be nice if consumers are allowed to set request_pl to NULL.
> Then a payload can be allocated here in pci_doe_recv_resp() with the
> size retrieved above.
> 
> A flag may be necessary to indicate that the response is heap-allocated
> and needs to be freed upon destruction of the pci_doe_task.

I'm not comfortable with the PCI layer allocating memory, especially
optionally allocating memory.  And I'm not sure this really saves the extra
queries; the PCI layer will have to query first to determine the response
length.

If the community does want PCI to allocate memory, then I think the interface
should always allocate it.

> 
> 
> > +	/* First 2 dwords have already been read */
> > +	length -= 2;
> > +	/* Read the rest of the response payload */
> > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > +				      &task->response_pl[i]);
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +	}
> 
> You need to check the Data Object Ready bit.  The device may clear the
> bit prematurely (e.g. as a result of a concurrent FLR or Conventional
> Reset).  You'll continue reading zero dwords from the mailbox and
> pretend success to the caller even though the response is truncated.

If an FLR happens then truncation does not seem like a problem.

> 
> If you're concerned about performance when checking the bit on every
> loop iteration, checking it only on the last but one iteration should
> be sufficient to detect truncation.

I'll review.  I don't think this is a fast path.

> 
> 
> > +	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> > +	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> > +			      response_pl);
> > +	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> > +			   response_pl);
> 
> The fact that you need line breaks here is an indication that the
> macros are too long.
> 
> 
> > +/* DOE Data Object - note not actually registers */
> > +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
> > +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
> > +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
> > +
> > +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
> 
> I'd get rid of "DATA_OBJECT_" everywhere, it's redundant with the
> "Data Object" in "DOE".

I'm ok with renaming.  Jonathan chose the names and no one has complained yet.

> 
> 
> > +#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */
> 
> Another redundancy, I would get rid of the second "_STATUS".
> 
> 
> > +#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */
> 
> I would shorten to PCI_DOE_STATUS_READY.
> 
> 
> > 		Simplify submitting work to the mailbox
> > 			Replace pci_doe_exchange_sync() with
> > 			pci_doe_submit_task() Consumers of the mailbox
> > 			are now responsible for setting up callbacks
> > 			within a task object and submitting them to the
> > 			mailbox to be processed.
> 
> I honestly think that's a mistake.  All consumers both in the CDAT
> as well as in the SPDM series just want to wait for completion of
> the task.  They have no need for an arbitrary callback and shouldn't
> be burdened with providing one.  It just unnecessarily complicates
> the API.
> 
> So only providing pci_doe_exchange_sync() and doing away with
> pci_doe_submit_task() would seem like a more appropriate approach.

This was discussed a couple of times and it was decided to keep the
synchronicity out of the library functions and allow users to decide how to
handle it.

> 
> 
> > +/**
> > + * pci_doe_for_each_off - Iterate each DOE capability
> > + * @pdev: struct pci_dev to iterate
> > + * @off: u16 of config space offset of each mailbox capability found
> > + */
> > +#define pci_doe_for_each_off(pdev, off) \
> > +	for (off = pci_find_next_ext_capability(pdev, off, \
> > +					PCI_EXT_CAP_ID_DOE); \
> > +		off > 0; \
> > +		off = pci_find_next_ext_capability(pdev, off, \
> > +					PCI_EXT_CAP_ID_DOE))
> > +
> > +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> > +				     bool use_irq);
> > +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb);
> > +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);
> 
> Drivers should not be concerned with the intricacies of DOE
> capabilities and mailboxes.

I think it depends on which driver one is considering.

> 
> Moreover, the above API doesn't allow different drivers to access
> the same DOE mailbox concurrently, e.g. if that mailbox supports
> multiple protocols.  There's no locking to serialize access to the
> mailbox by the drivers.

Correct, it is up to the higher-level code to do this.

> 
> This should be moved to the PCI core instead:  In pci_init_capabilities(),
> add a new call pci_doe_init() which enumerates all DOE capabilities.
> Add a list_head to struct pci_dev and add each DOE instance found
> to that list.  Destroy the list elements in pci_destroy_dev().
> No locking needed for the list_head, you only ever modify the list
> on device enumeration and destruction.
> 
> Then provide a pci_doe_find_mailbox() library function which drivers
> call to retrieve the pci_doe_mb for a given pci_dev/vid/type tuple.
> That avoids the need to traverse the list time and again.

All that could be done above these calls.  I've abandoned the aux bus stuff
after discussions with Dan.  So perhaps burring this away from the CXL drivers
is a good idea.  But I still think this can serve as a base within the PCIe
code.

> 
> 
> > +/**
> > + * struct pci_doe_mb - State for a single DOE mailbox
> 
> We generally use the same terms as the spec to make it easier for
> readers to connect the language in the spec to the implementation.
> 
> The spec uniformly refers to "DOE instance".  I guess "mailbox"
> is slightly more concise, so keep that, but please at least mention
> the term "instance" in the struct's kernel-doc.
> 
> This implementation uses the term "task" for one request/response.
> That term is not mentioned in the spec at all.  The spec refers to
> "exchange" and "transfer" on several occasions, so I would have chosen
> either one of those instead of the somewhat unfortunate "task".

Again this was discussed and we settled on task.  I'm not married to the term
and s/task/exchange/ is fine with me.  But IIRC Dan was not happy with
exchange.

> 
> 
> > + * This state is used to manage a single DOE mailbox capability.  All fields
> > + * should be considered opaque to the consumers and the structure passed into
> > + * the helpers below after being created by devm_pci_doe_create()
> 
> If the struct is considered opaque, why is it exposed in a public
> header file?  Just use a forward declaration in the header
> so that consumers can pass around pointers to the struct,
> and hide the declaration proper in doe.c.

I could do that.  It just did not seem important enough to make it that strict.

> 
> 
> > + * @pdev: PCI device this belongs to mailbox belongs to
>                              ^^^^^^^^^^
> Typo.

oops... yeah, it slipped by in v9 as well.

>
> 
> > + * @prots: Array of protocols supported on this DOE
> > + * @num_prots: Size of prots array
> 
> Use @prots instead of prots everywhere in the kernel-doc.

Ok.

> 
> 
> > +	/*
> > +	 * NOTE: doe_mb_prots is freed by pci_doe_free_mb automatically on
> > +	 * error if pci_doe_cache_protocols() fails past this point.
> > +	 */
> 
> s/doe_mb_prots/doe_mb->prots/
> s/pci_doe_free_mb/pci_doe_free_mb()/

This note has changed.

> 
> 
> > +	/* DOE requests must be a whole number of DW */
> > +	if (task->request_pl_sz % sizeof(u32))
> > +		return -EINVAL;
> 
> It would be nice if this restriction could be lifted.  SPDM uses
> requests which are not padded to dword-length.  It can run over other
> transports which may not impose such restrictions.  The SPDM layer
> should not need to worry about quirks of the transport layer.

Are you suggesting that this layer pad the request?  I hear your concern, but
users of this layer will need to account for this when allocating the request
payload anyway.

> 
> 
> > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > +{
> > +	struct pci_doe_mb *doe_mb = data;
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	u32 val;
> > +
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +	/* Leave the error case to be handled outside IRQ */
> > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > +					PCI_DOE_STATUS_INT_STATUS);
> > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	return IRQ_NONE;
> > +}
> 
> PCIe 6.0, table 7-316 says that an interrupt is also raised when
> "the DOE Busy bit has been Cleared", yet such an interrupt is
> not handled here.  It is incorrectly treated as a spurious
> interrupt by returning IRQ_NONE.  The right thing to do
> is probably to wake the state machine in case it's polling
> for the Busy flag to clear.

I thought Jonathan and I had worked through why this was correct.  I'll have to
review those threads.

> 
> 
> > +enum pci_doe_state {
> > +	DOE_IDLE,
> > +	DOE_WAIT_RESP,
> > +	DOE_WAIT_ABORT,
> > +	DOE_WAIT_ABORT_ON_ERR,
> > +};
> > +
> > +#define PCI_DOE_FLAG_ABORT	0
> > +#define PCI_DOE_FLAG_DEAD	1
> 
> That's all internal and should live in doe.c, not the header file.

Sure, if struct pci_doe_mb is moved.

Let's see how V9 looks from other perspectives.  I've incorporated some of this
feedback for V10.

Ira

> Thanks,
> 
> Lukas


* Re: [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-05-31 10:33     ` Jonathan Cameron
@ 2022-06-01  2:59       ` Ira Weiny
  2022-06-01  7:18         ` Lukas Wunner
  0 siblings, 1 reply; 42+ messages in thread
From: Ira Weiny @ 2022-06-01  2:59 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Lukas Wunner, Dan Williams, Bjorn Helgaas, Christoph Hellwig,
	Alison Schofield, Vishal Verma, Ben Widawsky, linux-kernel,
	linux-cxl, linux-pci

On Tue, May 31, 2022 at 11:33:50AM +0100, Jonathan Cameron wrote:
> On Mon, 30 May 2022 21:06:57 +0200
> Lukas Wunner <lukas@wunner.de> wrote:
> 
> > On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> > > +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)  
> > [...]
> > > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > > +	/* Read the second dword to get the length */
> > > +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > > +
> > > +	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> > > +	if (length > SZ_1M || length < 2)
> > > +		return -EIO;
> > > +  
> > 
> > This API requires consumers to know the size of the response in advance.
> > That's not always the case, as exemplified by SPDM VERSION responses.
> > Jonathan uses a kludge in his SPDM patch which submits a first request
> > with a fixed size of 2 versions and, if that turns out to be too small,
> > submits a second request.
> > 
> > It would be nice if consumers are allowed to set request_pl to NULL.
> > Then a payload can be allocated here in pci_doe_recv_resp() with the
> > size retrieved above.
> > 
> > A flag may be necessary to indicate that the response is heap-allocated
> > and needs to be freed upon destruction of the pci_doe_task.
> 
> If possible I'd make that the caller's problem. It should know it provided
> NULL and expected a response.
> 
> I'd suggest making this a 'future' feature just to keep this initial
> version simple.  Won't be hard to add later.

Agreed.

> 
> > 
> > 
> > > +	/* First 2 dwords have already been read */
> > > +	length -= 2;
> > > +	/* Read the rest of the response payload */
> > > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > > +				      &task->response_pl[i]);
> > > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > > +	}  
> > 
> > You need to check the Data Object Ready bit.  The device may clear the
> > bit prematurely (e.g. as a result of a concurrent FLR or Conventional
> > Reset).  You'll continue reading zero dwords from the mailbox and
> > pretend success to the caller even though the response is truncated.
> > 
> > If you're concerned about performance when checking the bit on every
> > loop iteration, checking it only on the last but one iteration should
> > be sufficient to detect truncation.
> 
> Good catch - I hate corner cases.  Thankfully this one is trivial to
> check for.

Ok, looking at the spec: strictly speaking this check needs to happen multiple
times, both in doe_statemachine_work() and inside pci_doe_recv_resp(); not
just in this loop.  :-(

This is because the check in doe_statemachine_work() only covers the 1st dword
read, IIUC.  Therefore, we should put this into the state machine properly,
unless it is agreed to busy-wait for the Data Object Ready bit once the first
one is seen?

I'm not sure I really like that because I'm not sure how hardware is going to
behave.  Will it potentially stall in the middle of a response payload?

I think complete correctness requires pci_doe_recv_resp() to be split up such
that the state machine could be scheduled on any dword read...  :-(  And I
suspect that anything else is a hack.  That is going to take a bit of rework.
But I think it may be the best thing to do.
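
For concreteness, the per-dword check could become a small helper along
these lines (a sketch; whether a cleared Data Object Ready bit should fail
immediately, as here, or bounce back to the state machine is exactly the
open question):

static int pci_doe_read_dword(struct pci_doe_mb *doe_mb, u32 *val)
{
	struct pci_dev *pdev = doe_mb->pdev;
	int offset = doe_mb->cap_offset;
	u32 sts;

	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &sts);
	if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, sts))
		return -EIO;	/* object withdrawn mid-transfer */

	pci_read_config_dword(pdev, offset + PCI_DOE_READ, val);
	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
	return 0;
}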

Before I embark on this rework, tell me I'm not crazy?

> 
> > 
> > 
> > > +	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> > > +	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> > > +			      response_pl);
> > > +	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> > > +			   response_pl);  
> > 
> > The fact that you need line breaks here is an indication that the
> > macros are too long.
> > 
> > 
> > > +/* DOE Data Object - note not actually registers */
> > > +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
> > > +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
> > > +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
> > > +
> > > +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
> > > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
> > > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
> > > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000  
> > 
> > I'd get rid of "DATA_OBJECT_" everywhere, it's redundant with the
> > "Data Object" in "DOE".
> 
> We need something to make it clear these aren't DVSEC headers for
> example, but definitely could be shorter.
> 
> PCI_DOE_OBJ_... 
> 
> perhaps?

FWIW I'm inclined to leave them for now.

> 
> > 
> > 
> > > +#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */  
> > 
> > Another redundancy, I would get rid of the second "_STATUS".
> 
> Hmm. I'll leave this one to Ira's discretion but I'm not keen on cropping
> too much of the naming, given the loss of alignment with the spec.

Again I'm inclined to leave it.

> 
> > 
> > 
> > > +#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */  
> > 
> > I would shorten to PCI_DOE_STATUS_READY.
> 
> Seems reasonable, though there is always some burden in moving
> away from full spec names.

I think this is fine.

> 
> > 
> > 
> > > 		Simplify submitting work to the mailbox
> > > 			Replace pci_doe_exchange_sync() with
> > > 			pci_doe_submit_task() Consumers of the mailbox
> > > 			are now responsible for setting up callbacks
> > > 			within a task object and submitting them to the
> > > 			mailbox to be processed.  
> > 
> > I honestly think that's a mistake.  All consumers both in the CDAT
> > as well as in the SPDM series just want to wait for completion of
> > the task.  They have no need for an arbitrary callback and shouldn't
> > be burdened with providing one.  It just unnecessarily complicates
> > the API.
> > 
> > So only providing pci_doe_exchange_sync() and doing away with
> > pci_doe_submit_task() would seem like a more appropriate approach.
> 
> We've gone around the houses with this.  At this stage I don't
> care strongly either way and it's all in kernel code so we
> can refine it later once we have lots of examples.

Agreed.

> 
> > 
> > 
> > > +/**
> > > + * pci_doe_for_each_off - Iterate each DOE capability
> > > + * @pdev: struct pci_dev to iterate
> > > + * @off: u16 of config space offset of each mailbox capability found
> > > + */
> > > +#define pci_doe_for_each_off(pdev, off) \
> > > +	for (off = pci_find_next_ext_capability(pdev, off, \
> > > +					PCI_EXT_CAP_ID_DOE); \
> > > +		off > 0; \
> > > +		off = pci_find_next_ext_capability(pdev, off, \
> > > +					PCI_EXT_CAP_ID_DOE))
> > > +
> > > +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> > > +				     bool use_irq);
> > > +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb);
> > > +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);  
> > 
> > Drivers should not be concerned with the intricacies of DOE
> > capabilities and mailboxes.
> > 
> > Moreover, the above API doesn't allow different drivers to access
> > the same DOE mailbox concurrently, e.g. if that mailbox supports
> > multiple protocols.  There's no locking to serialize access to the
> > mailbox by the drivers.
> > 
> > This should be moved to the PCI core instead:  In pci_init_capabilities(),
> > add a new call pci_doe_init() which enumerates all DOE capabilities.
> > Add a list_head to struct pci_dev and add each DOE instance found
> > to that list.  Destroy the list elements in pci_destroy_dev().
> > No locking needed for the list_head, you only ever modify the list
> > on device enumeration and destruction.
> > 
> > Then provide a pci_doe_find_mailbox() library function which drivers
> > call to retrieve the pci_doe_mb for a given pci_dev/vid/type tuple.
> > That avoids the need to traverse the list time and again.
> 
> I've lost track a bit as this thread has been running a long time, but
> I think we are moving back to something that looks more like this
> (and like the much earlier versions of the set, which did more or less what
> you describe, though they had their own issues, since resolved).
> 
> There is an exciting corner around interrupts, though, that complicates
> things.  Currently I think I'm right in saying the PCI core never
> sets up interrupts; that is always a job for the drivers.
> 
> My first thought is to do something similar to the hack I did for
> the switch ports, and use polled mode to query supported protocols.
> Then a later call from the driver would attempt to enable interrupts
> if desired for a given DOE instance.
> 

V9 is already getting respun, but let's look there for the move toward this.  I
don't think the CXL code should get any of the DOE mailbox code directly.  But
I'm unclear how to tie it together with the PCI side ATM.

For this series CDAT is only useful for CXL, and with CXL as the only user I
think having the CXL PCI code instantiate the mailboxes is fine for now.  But a
more general home for that will, I think, be needed for other protocols and
users.

> 
> > 
> > 
> > > +/**
> > > + * struct pci_doe_mb - State for a single DOE mailbox  
> > 
> > We generally use the same terms as the spec to make it easier for
> > readers to connect the language in the spec to the implementation.
> > 
> > The spec uniformly refers to "DOE instance".  I guess "mailbox"
> > is slightly more concise, so keep that, but please at least mention
> > the term "instance" in the struct's kernel-doc.
> > 
> > This implementation uses the term "task" for one request/response.
> > That term is not mentioned in the spec at all.  The spec refers to
> > "exchange" and "transfer" on several occasions, so I would have chosen
> > either one of those instead of the somewhat unfortunate "task".
> > 
> > 
> > > + * This state is used to manage a single DOE mailbox capability.  All fields
> > > + * should be considered opaque to the consumers and the structure passed into
> > > + * the helpers below after being created by devm_pci_doe_create()  
> > 
> > If the struct is considered opaque, why is it exposed in a public
> > header file?  Just use a forward declaration in the header
> > so that consumers can pass around pointers to the struct,
> > and hide the declaration proper in doe.c.
> > 
> > 
> > > + * @pdev: PCI device this belongs to mailbox belongs to  
> >                              ^^^^^^^^^^
> > Typo.
> > 
> > > + * @prots: Array of protocols supported on this DOE
> > > + * @num_prots: Size of prots array  
> > 
> > Use @prots instead of prots everywhere in the kernel-doc.
> > 
> > 
> > > +	/*
> > > +	 * NOTE: doe_mb_prots is freed by pci_doe_free_mb automatically on
> > > +	 * error if pci_doe_cache_protocols() fails past this point.
> > > +	 */  
> > 
> > s/doe_mb_prots/doe_mb->prots/
> > s/pci_doe_free_mb/pci_doe_free_mb()/
> > 
> > 
> > > +	/* DOE requests must be a whole number of DW */
> > > +	if (task->request_pl_sz % sizeof(u32))
> > > +		return -EINVAL;  
> > 
> > It would be nice if this restriction could be lifted.  SPDM uses
> > requests which are not padded to dword-length.  It can run over other
> > transports which may not impose such restrictions.  The SPDM layer
> > should not need to worry about quirks of the transport layer.
> 
> This is a pain.  DOE absolutely requires 32 bit padding and
> reads need to be 32 bit.   We can obviously do something nasty
> with dealing with the tail via a bounce buffer though.
> 
> I'd pencil that in as a future feature though rather than worry
> about it before we have any support at all in place.

I think any mismatch between something like SPDM and DOE will need to be
handled by the SPDM-to-DOE glue code, not the DOE layer.
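
If such glue code materializes, the dword padding could plausibly be hidden
there with the bounce buffer Jonathan mentions.  A rough sketch - the helper
and the pci_doe_exchange() call are hypothetical, not part of this series:

	static int spdm_doe_exchange(struct pci_doe_mb *mb,
				     const void *req, size_t req_sz,
				     void *rsp, size_t rsp_sz)
	{
		size_t padded = ALIGN(req_sz, sizeof(u32));
		u32 *buf;
		int rc;

		buf = kzalloc(padded, GFP_KERNEL);
		if (!buf)
			return -ENOMEM;
		memcpy(buf, req, req_sz);	/* tail stays zeroed */
		rc = pci_doe_exchange(mb, buf, padded, rsp, rsp_sz);
		kfree(buf);
		return rc;
	}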

> 
> > 
> > 
> > > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > +{
> > > +	struct pci_doe_mb *doe_mb = data;
> > > +	struct pci_dev *pdev = doe_mb->pdev;
> > > +	int offset = doe_mb->cap_offset;
> > > +	u32 val;
> > > +
> > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > +
> > > +	/* Leave the error case to be handled outside IRQ */
> > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > +		return IRQ_HANDLED;
> > > +	}
> > > +
> > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > +					PCI_DOE_STATUS_INT_STATUS);
> > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > +		return IRQ_HANDLED;
> > > +	}
> > > +
> > > +	return IRQ_NONE;
> > > +}  
> > 
> > PCIe 6.0, table 7-316 says that an interrupt is also raised when
> > "the DOE Busy bit has been Cleared", yet such an interrupt is
> > not handled here.  It is incorrectly treated as a spurious
> > interrupt by returning IRQ_NONE.  The right thing to do
> > is probably to wake the state machine in case it's polling
> > for the Busy flag to clear.
> 
> Ah. I remember testing this via a lot of hacking on the QEMU code
> to inject the various races that can occur (it was really ugly to do).
> 
> Guess we lost the handling at some point.  I think your fix
> is the right one.

Perhaps I am missing something, but digging into this more I disagree that the
handler fails to handle this case.  If I read the spec correctly, DOE Interrupt
Status must be set when an interrupt is generated.  The handler wakes the state
machine in that case.  The state machine then checks for busy if there is work
to be done.

Normally we would not even need to check for status error.  But that is special
cased because clearing that status is left to the state machine.

> 
> > 
> > 
> > > +enum pci_doe_state {
> > > +	DOE_IDLE,
> > > +	DOE_WAIT_RESP,
> > > +	DOE_WAIT_ABORT,
> > > +	DOE_WAIT_ABORT_ON_ERR,
> > > +};
> > > +
> > > +#define PCI_DOE_FLAG_ABORT	0
> > > +#define PCI_DOE_FLAG_DEAD	1  
> > 
> > That's all internal and should live in doe.c, not the header file.
> > 
> > Thanks,
> > 
> > Lukas
> > 
> 
> Thanks for the review and thanks again to Ira for taking this forwards.

Thanks!
Ira

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-06-01  2:59       ` Ira Weiny
@ 2022-06-01  7:18         ` Lukas Wunner
  2022-06-01 14:23           ` Jonathan Cameron
  2022-06-01 17:16           ` Ira Weiny
  0 siblings, 2 replies; 42+ messages in thread
From: Lukas Wunner @ 2022-06-01  7:18 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jonathan Cameron, Dan Williams, Bjorn Helgaas, Christoph Hellwig,
	Alison Schofield, Vishal Verma, Ben Widawsky, linux-kernel,
	linux-cxl, linux-pci

On Tue, May 31, 2022 at 07:59:21PM -0700, Ira Weiny wrote:
> On Tue, May 31, 2022 at 11:33:50AM +0100, Jonathan Cameron wrote:
> > On Mon, 30 May 2022 21:06:57 +0200 Lukas Wunner <lukas@wunner.de> wrote:
> > > On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> > > > +	/* First 2 dwords have already been read */
> > > > +	length -= 2;
> > > > +	/* Read the rest of the response payload */
> > > > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > > > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > > > +				      &task->response_pl[i]);
> > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > > > +	}  
> > > 
> > > You need to check the Data Object Ready bit.  The device may clear the
> > > bit prematurely (e.g. as a result of a concurrent FLR or Conventional
> > > Reset).  You'll continue reading zero dwords from the mailbox and
> > > pretend success to the caller even though the response is truncated.
> > > 
> > > If you're concerned about performance when checking the bit on every
> > > loop iteration, checking it only on the last but one iteration should
> > > be sufficient to detect truncation.
> > 
> > Good catch - I hate corner cases.  Thankfully this one is trivial to
> > check for.
> 
> Ok looking at the spec:  Strictly speaking this needs to happen multiple
> times both in doe_statemachine_work() and inside pci_doe_recv_resp();
> not just in this loop.  :-(
> 
> This is because the check in doe_statemachine_work() only covers the
> 1st dword read, IIUC.

The spec says "this bit indicates the DOE instance has a *data object*
available to be read by system firmware/software".

So, the entire object is available for reading, not just one dword.

You've already got checks in place for the first two dwords which
cover reading an "all zeroes" response.  No need to amend them.

You only need to re-check the Data Object Ready bit on the last-but-one
dword in case the function was reset concurrently.  Per sec. 6.30.2,
"An FLR to a Function must result in the aborting of any DOE transfer
in progress."


> > > > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > > +{
> > > > +	struct pci_doe_mb *doe_mb = data;
> > > > +	struct pci_dev *pdev = doe_mb->pdev;
> > > > +	int offset = doe_mb->cap_offset;
> > > > +	u32 val;
> > > > +
> > > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > +
> > > > +	/* Leave the error case to be handled outside IRQ */
> > > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > +		return IRQ_HANDLED;
> > > > +	}
> > > > +
> > > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > > +					PCI_DOE_STATUS_INT_STATUS);
> > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > +		return IRQ_HANDLED;
> > > > +	}
> > > > +
> > > > +	return IRQ_NONE;
> > > > +}  
> > > 
> > > PCIe 6.0, table 7-316 says that an interrupt is also raised when
> > > "the DOE Busy bit has been Cleared", yet such an interrupt is
> > > not handled here.  It is incorrectly treated as a spurious
> > > interrupt by returning IRQ_NONE.  The right thing to do
> > > is probably to wake the state machine in case it's polling
> > > for the Busy flag to clear.
> > 
> > Ah. I remember testing this via a lot of hacking on the QEMU code
> > to inject the various races that can occur (it was really ugly to do).
> > 
> > Guess we lost the handling at some point.  I think your fix
> > is the right one.
> 
> Perhaps I am missing something, but digging into this more I disagree
> that the handler fails to handle this case.  If I read the spec correctly,
> DOE Interrupt Status must be set when an interrupt is generated.
> The handler wakes the state machine in that case.  The state machine
> then checks for busy if there is work to be done.

Right, I was mistaken, sorry for the noise.


> Normally we would not even need to check for status error.  But that is
> special cased because clearing that status is left to the state machine.

That however looks wrong because the DOE Interrupt Status bit is never
cleared after a DOE Error is signaled.  The state machine performs an
explicit abort upon an error by setting the DOE Abort bit, but that
doesn't seem to clear DOE Interrupt Status:

Per section 6.30.2, "At any time, the system firmware/software is
permitted to set the DOE Abort bit in the DOE Control Register,
and the DOE instance must Clear the Data Object Ready bit,
if not already Clear, and Clear the DOE Error bit, if already Set,
in the DOE Status Register, within 1 second."

No mention of the DOE Interrupt Status bit, so we cannot assume that
it's cleared by a DOE Abort and we must clear it explicitly.
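
One possible fix, sketched against the handler quoted above, is to drop the
error special case and ack DOE Interrupt Status unconditionally before waking
the state machine (a sketch, not the actual patch):

	static irqreturn_t pci_doe_irq_handler(int irq, void *data)
	{
		struct pci_doe_mb *doe_mb = data;
		struct pci_dev *pdev = doe_mb->pdev;
		int offset = doe_mb->cap_offset;
		u32 val;

		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
		if (!FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val))
			return IRQ_NONE;

		/* Ack here; a later DOE Abort is not guaranteed to
		 * clear this bit for us. */
		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
				       PCI_DOE_STATUS_INT_STATUS);
		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
		return IRQ_HANDLED;
	}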

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-06-01  7:18         ` Lukas Wunner
@ 2022-06-01 14:23           ` Jonathan Cameron
  2022-06-01 17:16           ` Ira Weiny
  1 sibling, 0 replies; 42+ messages in thread
From: Jonathan Cameron @ 2022-06-01 14:23 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Ira Weiny, Dan Williams, Bjorn Helgaas, Christoph Hellwig,
	Alison Schofield, Vishal Verma, Ben Widawsky, linux-kernel,
	linux-cxl, linux-pci

On Wed, 1 Jun 2022 09:18:08 +0200
Lukas Wunner <lukas@wunner.de> wrote:

> On Tue, May 31, 2022 at 07:59:21PM -0700, Ira Weiny wrote:
> > On Tue, May 31, 2022 at 11:33:50AM +0100, Jonathan Cameron wrote:  
> > > On Mon, 30 May 2022 21:06:57 +0200 Lukas Wunner <lukas@wunner.de> wrote:  
> > > > On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:  
> > > > > +	/* First 2 dwords have already been read */
> > > > > +	length -= 2;
> > > > > +	/* Read the rest of the response payload */
> > > > > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > > > > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > > > > +				      &task->response_pl[i]);
> > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > > > > +	}    
> > > > 
> > > > You need to check the Data Object Ready bit.  The device may clear the
> > > > bit prematurely (e.g. as a result of a concurrent FLR or Conventional
> > > > Reset).  You'll continue reading zero dwords from the mailbox and
> > > > pretend success to the caller even though the response is truncated.
> > > > 
> > > > If you're concerned about performance when checking the bit on every
> > > > loop iteration, checking it only on the last but one iteration should
> > > > be sufficient to detect truncation.  
> > > 
> > > Good catch - I hate corner cases.  Thankfully this one is trivial to
> > > check for.  
> > 
> > Ok looking at the spec:  Strictly speaking this needs to happen multiple
> > times both in doe_statemachine_work() and inside pci_doe_recv_resp();
> > not just in this loop.  :-(
> > 
> > This is because the check in doe_statemachine_work() only covers the
> > 1st dword read, IIUC.
> 
> The spec says "this bit indicates the DOE instance has a *data object*
> available to be read by system firmware/software".
> 
> So, the entire object is available for reading, not just one dword.

Agreed

> 
> You've already got checks in place for the first two dwords which
> cover reading an "all zeroes" response.  No need to amend them.
> 
> You only need to re-check the Data Object Ready bit on the last-but-one
> dword in case the function was reset concurrently.  Per sec. 6.30.2,
> "An FLR to a Function must result in the aborting of any DOE transfer
> in progress."

Ouch, isn't that racy as you can only check it slightly before reading the
last dword and a reset could occur in between?

> 
> 
> > > > > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > > > +{
> > > > > +	struct pci_doe_mb *doe_mb = data;
> > > > > +	struct pci_dev *pdev = doe_mb->pdev;
> > > > > +	int offset = doe_mb->cap_offset;
> > > > > +	u32 val;
> > > > > +
> > > > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > > +
> > > > > +	/* Leave the error case to be handled outside IRQ */
> > > > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > +		return IRQ_HANDLED;
> > > > > +	}
> > > > > +
> > > > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > > > +					PCI_DOE_STATUS_INT_STATUS);
> > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > +		return IRQ_HANDLED;
> > > > > +	}
> > > > > +
> > > > > +	return IRQ_NONE;
> > > > > +}    
> > > > 
> > > > PCIe 6.0, table 7-316 says that an interrupt is also raised when
> > > > "the DOE Busy bit has been Cleared", yet such an interrupt is
> > > > not handled here.  It is incorrectly treated as a spurious
> > > > interrupt by returning IRQ_NONE.  The right thing to do
> > > > is probably to wake the state machine in case it's polling
> > > > for the Busy flag to clear.  
> > > 
> > > Ah. I remember testing this via a lot of hacking on the QEMU code
> > > to inject the various races that can occur (it was really ugly to do).
> > > 
> > > Guess we lost the handling at some point.  I think your fix
> > > is the right one.  
> > 
> > Perhaps I am missing something, but digging into this more I disagree
> > that the handler fails to handle this case.  If I read the spec correctly,
> > DOE Interrupt Status must be set when an interrupt is generated.
> > The handler wakes the state machine in that case.  The state machine
> > then checks for busy if there is work to be done.  
> 
> Right, I was mistaken, sorry for the noise.

Ah. Makes sense - it managed to confuse me too ;)
I really don't like the absence of a status bit for the 'DOE Busy bit has
cleared' interrupt source, but I'm glad we did handle it.

> 
> 
> > Normally we would not even need to check for status error.  But that is
> > special cased because clearing that status is left to the state machine.  
> 
> That however looks wrong because the DOE Interrupt Status bit is never
> cleared after a DOE Error is signaled.  The state machine performs an
> explicit abort upon an error by setting the DOE Abort bit, but that
> doesn't seem to clear DOE Interrupt Status:
> 
> Per section 6.30.2, "At any time, the system firmware/software is
> permitted to set the DOE Abort bit in the DOE Control Register,
> and the DOE instance must Clear the Data Object Ready bit,
> if not already Clear, and Clear the DOE Error bit, if already Set,
> in the DOE Status Register, within 1 second."
> 
> No mention of the DOE Interrupt Status bit, so we cannot assume that
> it's cleared by a DOE Abort and we must clear it explicitly.

Gah.

Jonathan

> 
> Thanks,
> 
> Lukas
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-06-01  7:18         ` Lukas Wunner
  2022-06-01 14:23           ` Jonathan Cameron
@ 2022-06-01 17:16           ` Ira Weiny
  2022-06-01 17:56             ` Lukas Wunner
  2022-06-06 14:46             ` Jonathan Cameron
  1 sibling, 2 replies; 42+ messages in thread
From: Ira Weiny @ 2022-06-01 17:16 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Jonathan Cameron, Dan Williams, Bjorn Helgaas, Christoph Hellwig,
	Alison Schofield, Vishal Verma, Ben Widawsky, linux-kernel,
	linux-cxl, linux-pci

On Wed, Jun 01, 2022 at 09:18:08AM +0200, Lukas Wunner wrote:
> On Tue, May 31, 2022 at 07:59:21PM -0700, Ira Weiny wrote:
> > On Tue, May 31, 2022 at 11:33:50AM +0100, Jonathan Cameron wrote:
> > > On Mon, 30 May 2022 21:06:57 +0200 Lukas Wunner <lukas@wunner.de> wrote:
> > > > On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> > > > > +	/* First 2 dwords have already been read */
> > > > > +	length -= 2;
> > > > > +	/* Read the rest of the response payload */
> > > > > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > > > > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > > > > +				      &task->response_pl[i]);
> > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > > > > +	}  
> > > > 
> > > > You need to check the Data Object Ready bit.  The device may clear the
> > > > bit prematurely (e.g. as a result of a concurrent FLR or Conventional
> > > > Reset).  You'll continue reading zero dwords from the mailbox and
> > > > pretend success to the caller even though the response is truncated.
> > > > 
> > > > If you're concerned about performance when checking the bit on every
> > > > loop iteration, checking it only on the last but one iteration should
> > > > be sufficient to detect truncation.
> > > 
> > > Good catch - I hate corner cases.  Thankfully this one is trivial to
> > > check for.
> > 
> > Ok looking at the spec:  Strictly speaking this needs to happen multiple
> > times both in doe_statemachine_work() and inside pci_doe_recv_resp();
> > not just in this loop.  :-(
> > 
> > This is because the check in doe_statemachine_work() only covers the
> > 1st dword read, IIUC.
> 
> The spec says "this bit indicates the DOE instance has a *data object*
> available to be read by system firmware/software".

Ok yea.  I got confused by step 6 in sec 6.30.2.

> 
> So, the entire object is available for reading, not just one dword.

Yes cool!

> 
> You've already got checks in place for the first two dwords which
> cover reading an "all zeroes" response.  No need to amend them.
> 
> You only need to re-check the Data Object Ready bit on the last-but-one
> dword in case the function was reset concurrently.  Per sec. 6.30.2,
> "An FLR to a Function must result in the aborting of any DOE transfer
> in progress."

I think I disagree.  Even if we do that and an FLR comes before the last read,
the last read could be 0's.

I think the interpretation of the data needs to happen above this, and if an FLR
happens during this read we are going to have other issues.

But I can add the check if you feel strongly about it.

> 
> 
> > > > > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > > > +{
> > > > > +	struct pci_doe_mb *doe_mb = data;
> > > > > +	struct pci_dev *pdev = doe_mb->pdev;
> > > > > +	int offset = doe_mb->cap_offset;
> > > > > +	u32 val;
> > > > > +
> > > > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > > +
> > > > > +	/* Leave the error case to be handled outside IRQ */
> > > > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > +		return IRQ_HANDLED;
> > > > > +	}
> > > > > +
> > > > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > > > +					PCI_DOE_STATUS_INT_STATUS);
> > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > +		return IRQ_HANDLED;
> > > > > +	}
> > > > > +
> > > > > +	return IRQ_NONE;
> > > > > +}  
> > > > 
> > > > PCIe 6.0, table 7-316 says that an interrupt is also raised when
> > > > "the DOE Busy bit has been Cleared", yet such an interrupt is
> > > > not handled here.  It is incorrectly treated as a spurious
> > > > interrupt by returning IRQ_NONE.  The right thing to do
> > > > is probably to wake the state machine in case it's polling
> > > > for the Busy flag to clear.
> > > 
> > > Ah. I remember testing this via a lot of hacking on the QEMU code
> > > to inject the various races that can occur (it was really ugly to do).
> > > 
> > > Guess we lost the handling at some point.  I think your fix
> > > is the right one.
> > 
> > Perhaps I am missing something, but digging into this more I disagree
> > that the handler fails to handle this case.  If I read the spec correctly,
> > DOE Interrupt Status must be set when an interrupt is generated.
> > The handler wakes the state machine in that case.  The state machine
> > then checks for busy if there is work to be done.
> 
> Right, I was mistaken, sorry for the noise.

NP I'm not always following this either.

> 
> 
> > Normally we would not even need to check for status error.  But that is
> > special cased because clearing that status is left to the state machine.
> 
> That however looks wrong because the DOE Interrupt Status bit is never
> cleared after a DOE Error is signaled.  The state machine performs an
> explicit abort upon an error by setting the DOE Abort bit, but that
> doesn't seem to clear DOE Interrupt Status:
> 
> Per section 6.30.2, "At any time, the system firmware/software is
> permitted to set the DOE Abort bit in the DOE Control Register,
> and the DOE instance must Clear the Data Object Ready bit,
> if not already Clear, and Clear the DOE Error bit, if already Set,
> in the DOE Status Register, within 1 second."

I thought that meant the hardware (the DOE instance) must clear those bits
within 1 second?

> 
> No mention of the DOE Interrupt Status bit, so we cannot assume that
> it's cleared by a DOE Abort and we must clear it explicitly.

Oh...  yea.  Jonathan?  We discussed this before and I was convinced it worked
but I think Lukas is correct here.

Should we drop the special case in pci_doe_irq_handler() and just clear the
status always?  Or should we wait and clear it in pci_doe_abort_start()?

Ira

> 
> Thanks,
> 
> Lukas

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-06-01 17:16           ` Ira Weiny
@ 2022-06-01 17:56             ` Lukas Wunner
  2022-06-01 20:17               ` Ira Weiny
  2022-06-06 14:46             ` Jonathan Cameron
  1 sibling, 1 reply; 42+ messages in thread
From: Lukas Wunner @ 2022-06-01 17:56 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jonathan Cameron, Dan Williams, Bjorn Helgaas, Christoph Hellwig,
	Alison Schofield, Vishal Verma, Ben Widawsky, linux-kernel,
	linux-cxl, linux-pci

On Wed, Jun 01, 2022 at 10:16:15AM -0700, Ira Weiny wrote:
> On Wed, Jun 01, 2022 at 09:18:08AM +0200, Lukas Wunner wrote:
> > You only need to re-check the Data Object Ready bit on the last-but-one
> > dword in case the function was reset concurrently.  Per sec. 6.30.2,
> > "An FLR to a Function must result in the aborting of any DOE transfer
> > in progress."
> 
> > I think I disagree.  Even if we do that and an FLR comes before the last read,
> the last read could be 0's.

PCIe r6.0, Table 7-316 says:

  "If there is no additional data object ready for transfer, the
   DOE instance must clear this bit after the entire data object has been
   transferred, as indicated by software writing to the DOE Read Data
   Mailbox Register after reading the final DW of the data object."

Remember that you *read* a dword from the mailbox and then acknowledge
reception to the mailbox by *writing* a dword to the mailbox.

So you check that the Data Object Ready bit is set before acknowledging
the final dword with a register write.  That's race-free.

(I realize me talking about the "last-but-one dword" above was quite
unclear, sorry about that.)
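
Concretely, the read loop quoted earlier could verify the bit just before
acknowledging the final DW.  A sketch, where payload_dw stands in for the
min() expression in the patch:

	for (i = 0; i < payload_dw; i++) {
		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
				      &task->response_pl[i]);
		/* Before acking the final DW, check the object is still
		 * ready; a concurrent FLR aborts the transfer and clears
		 * the bit. */
		if (i == payload_dw - 1) {
			pci_read_config_dword(pdev, offset + PCI_DOE_STATUS,
					      &val);
			if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val))
				return -EIO;
		}
		/* The write acknowledges reception of this DW */
		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
	}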

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-06-01 17:56             ` Lukas Wunner
@ 2022-06-01 20:17               ` Ira Weiny
  0 siblings, 0 replies; 42+ messages in thread
From: Ira Weiny @ 2022-06-01 20:17 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Jonathan Cameron, Dan Williams, Bjorn Helgaas, Christoph Hellwig,
	Alison Schofield, Vishal Verma, Ben Widawsky, linux-kernel,
	linux-cxl, linux-pci

On Wed, Jun 01, 2022 at 07:56:47PM +0200, Lukas Wunner wrote:
> On Wed, Jun 01, 2022 at 10:16:15AM -0700, Ira Weiny wrote:
> > On Wed, Jun 01, 2022 at 09:18:08AM +0200, Lukas Wunner wrote:
> > > You only need to re-check the Data Object Ready bit on the last-but-one
> > > dword in case the function was reset concurrently.  Per sec. 6.30.2,
> > > "An FLR to a Function must result in the aborting of any DOE transfer
> > > in progress."
> > 
> > I think I disagree.  Even if we do that and an FLR comes before the last read,
> > the last read could be 0's.
> 
> PCIe r6.0, Table 7-316 says:
> 
>   "If there is no additional data object ready for transfer, the
>    DOE instance must clear this bit after the entire data object has been
>    transferred, as indicated by software writing to the DOE Read Data
>    Mailbox Register after reading the final DW of the data object."
> 
> Remember that you *read* a dword from the mailbox and then acknowledge
> reception to the mailbox by *writing* a dword to the mailbox.
> 
> So you check that the Data Object Ready bit is set before acknowledging
> the final dword with a register write.  That's race-free.

Ok.

> 
> (I realize me talking about the "last-but-one dword" above was quite
> unclear, sorry about that.)
> 

Ah yes.  Ok, I'll put in a check before the final write.

Ira

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-06-01 17:16           ` Ira Weiny
  2022-06-01 17:56             ` Lukas Wunner
@ 2022-06-06 14:46             ` Jonathan Cameron
  2022-06-06 19:56               ` Ira Weiny
  1 sibling, 1 reply; 42+ messages in thread
From: Jonathan Cameron @ 2022-06-06 14:46 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Lukas Wunner, Dan Williams, Bjorn Helgaas, Christoph Hellwig,
	Alison Schofield, Vishal Verma, Ben Widawsky, linux-kernel,
	linux-cxl, linux-pci

On Wed, 1 Jun 2022 10:16:15 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> On Wed, Jun 01, 2022 at 09:18:08AM +0200, Lukas Wunner wrote:
> > On Tue, May 31, 2022 at 07:59:21PM -0700, Ira Weiny wrote:  
> > > On Tue, May 31, 2022 at 11:33:50AM +0100, Jonathan Cameron wrote:  
> > > > On Mon, 30 May 2022 21:06:57 +0200 Lukas Wunner <lukas@wunner.de> wrote:  
> > > > > On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:  
> > > > > > +	/* First 2 dwords have already been read */
> > > > > > +	length -= 2;
> > > > > > +	/* Read the rest of the response payload */
> > > > > > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > > > > > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > > > > > +				      &task->response_pl[i]);
> > > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > > > > > +	}    
> > > > > 
> > > > > You need to check the Data Object Ready bit.  The device may clear the
> > > > > bit prematurely (e.g. as a result of a concurrent FLR or Conventional
> > > > > Reset).  You'll continue reading zero dwords from the mailbox and
> > > > > pretend success to the caller even though the response is truncated.
> > > > > 
> > > > > If you're concerned about performance when checking the bit on every
> > > > > loop iteration, checking it only on the last but one iteration should
> > > > > be sufficient to detect truncation.  
> > > > 
> > > > Good catch - I hate corner cases.  Thankfully this one is trivial to
> > > > check for.  
> > > 
> > > Ok looking at the spec:  Strictly speaking this needs to happen multiple
> > > times both in doe_statemachine_work() and inside pci_doe_recv_resp();
> > > not just in this loop.  :-(
> > > 
> > > This is because the check in doe_statemachine_work() only covers the
> > > 1st dword read, IIUC.
> > 
> > The spec says "this bit indicates the DOE instance has a *data object*
> > available to be read by system firmware/software".  
> 
> Ok yea.  I got confused by step 6 in sec 6.30.2.
> 
> > 
> > So, the entire object is available for reading, not just one dword.  
> 
> Yes cool!
> 
> > 
> > You've already got checks in place for the first two dwords which
> > cover reading an "all zeroes" response.  No need to amend them.
> > 
> > You only need to re-check the Data Object Ready bit on the last-but-one
> > dword in case the function was reset concurrently.  Per sec. 6.30.2,
> > "An FLR to a Function must result in the aborting of any DOE transfer
> > in progress."  
> 
> I think I disagree.  Even if we do that and an FLR comes before the last read,
> the last read could be 0's.
> 
> I think the interpretation of the data needs to happen above this, and if an FLR
> happens during this read we are going to have other issues.
> 
> But I can add the check if you feel strongly about it.
> 
> > 
> >   
> > > > > > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > > > > +{
> > > > > > +	struct pci_doe_mb *doe_mb = data;
> > > > > > +	struct pci_dev *pdev = doe_mb->pdev;
> > > > > > +	int offset = doe_mb->cap_offset;
> > > > > > +	u32 val;
> > > > > > +
> > > > > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > > > +
> > > > > > +	/* Leave the error case to be handled outside IRQ */
> > > > > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > > +		return IRQ_HANDLED;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > > > > +					PCI_DOE_STATUS_INT_STATUS);
> > > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > > +		return IRQ_HANDLED;
> > > > > > +	}
> > > > > > +
> > > > > > +	return IRQ_NONE;
> > > > > > +}    
> > > > > 
> > > > > PCIe 6.0, table 7-316 says that an interrupt is also raised when
> > > > > "the DOE Busy bit has been Cleared", yet such an interrupt is
> > > > > not handled here.  It is incorrectly treated as a spurious
> > > > > interrupt by returning IRQ_NONE.  The right thing to do
> > > > > is probably to wake the state machine in case it's polling
> > > > > for the Busy flag to clear.  
> > > > 
> > > > Ah. I remember testing this via a lot of hacking on the QEMU code
> > > > to inject the various races that can occur (it was really ugly to do).
> > > > 
> > > > Guess we lost the handling at some point.  I think your fix
> > > > is the right one.  
> > > 
> > > Perhaps I am missing something, but digging into this more I disagree
> > > that the handler fails to handle this case.  If I read the spec correctly,
> > > DOE Interrupt Status must be set when an interrupt is generated.
> > > The handler wakes the state machine in that case.  The state machine
> > > then checks for busy if there is work to be done.  
> > 
> > Right, I was mistaken, sorry for the noise.  
> 
> NP I'm not always following this either.
> 
> > 
> >   
> > > Normally we would not even need to check for status error.  But that is
> > > special cased because clearing that status is left to the state machine.  
> > 
> > That however looks wrong because the DOE Interrupt Status bit is never
> > cleared after a DOE Error is signaled.  The state machine performs an
> > explicit abort upon an error by setting the DOE Abort bit, but that
> > doesn't seem to clear DOE Interrupt Status:
> > 
> > Per section 6.30.2, "At any time, the system firmware/software is
> > permitted to set the DOE Abort bit in the DOE Control Register,
> > and the DOE instance must Clear the Data Object Ready bit,
> > if not already Clear, and Clear the DOE Error bit, if already Set,
> > in the DOE Status Register, within 1 second."  
> 
> I thought that meant the hardware (the DOE instance) must clear those bits
> within 1 second?
> 
> > 
> > No mention of the DOE Interrupt Status bit, so we cannot assume that
> > it's cleared by a DOE Abort and we must clear it explicitly.  
> 
> Oh...  yea.  Jonathan?  We discussed this before and I was convinced it worked
> but I think Lukas is correct here.

Hmm. I thought we were good as well, but Lukas is correct in saying
the interrupt status bit isn't cleared (which is 'novel' given that the
associated bit telling you what the interrupt means will be cleared).

I'm not sure I want to think around the race conditions that result...

> 
> Should we drop the special case in pci_doe_irq_handler() and just clear the
> status always?  Or should we wait and clear it in pci_doe_abort_start()?

I don't think it matters.  pci_doe_irq_handler() seems a little cleaner.

I've not figured out completely if there are races however...

It is set when not already set and we get transitions of any of the following:
- DOE error bit set (this can't happen until abort so no race here)

- Data Object Ready bit is set: Can this happen with the DOE error set? I don't
  immediately see language saying it can't. However, I don't think it can
  for any of the challenge response protocols yet defined (and there are other
  problems if anyone wants to implement unsolicited messages)

- DOE busy bit has cleared - can definitely happen after an abort (which is
  fine as nothing to do anyway, so we'll handle a pointless interrupt).
  Could it in theory happen when error is set? I think not but only because
  of the statement  "Clear this bit when it is able to receive a new data
  object."

So I think we are fine doing it pre-abort, but wouldn't put it past a hardware
designer to find some path through that which results in a bonus interrupt
and potentially us resetting twice.

If we clear it at the end of abort instead, what happens?
Definitely no interrupts until we clear it. As we are doing query response
protocols only, no new data until state machine moves on, so fine there.

So what about just doing it unconditionally..

+	case DOE_WAIT_ABORT:
+	case DOE_WAIT_ABORT_ON_ERR:
+		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+
+		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
+		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {

	here...

+			/* Back to normal state - carry on */
+			retire_cur_task(doe_mb);

This feels a little bit more 'standard' as we are allowing new interrupts
only after everything is back to a nice state.
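
(Presumably the "here..." above amounts to a one-line ack, something like

	pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
			       PCI_DOE_STATUS_INT_STATUS);

i.e. DOE Interrupt Status is cleared only once the abort has completed and
the mailbox is idle again.)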

> 
> Ira
> 
> > 
> > Thanks,
> > 
> > Lukas  


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-06-06 14:46             ` Jonathan Cameron
@ 2022-06-06 19:56               ` Ira Weiny
  2022-06-07  9:58                 ` Jonathan Cameron
  0 siblings, 1 reply; 42+ messages in thread
From: Ira Weiny @ 2022-06-06 19:56 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Lukas Wunner, Dan Williams, Bjorn Helgaas, Christoph Hellwig,
	Alison Schofield, Vishal Verma, Ben Widawsky, linux-kernel,
	linux-cxl, linux-pci

On Mon, Jun 06, 2022 at 03:46:46PM +0100, Jonathan Cameron wrote:
> On Wed, 1 Jun 2022 10:16:15 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > On Wed, Jun 01, 2022 at 09:18:08AM +0200, Lukas Wunner wrote:
> > > On Tue, May 31, 2022 at 07:59:21PM -0700, Ira Weiny wrote:  
> > > > On Tue, May 31, 2022 at 11:33:50AM +0100, Jonathan Cameron wrote:  
> > > > > On Mon, 30 May 2022 21:06:57 +0200 Lukas Wunner <lukas@wunner.de> wrote:  
> > > > > > On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:  
> > 
> > > 
> > >   
> > > > > > > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > > > > > +{
> > > > > > > +	struct pci_doe_mb *doe_mb = data;
> > > > > > > +	struct pci_dev *pdev = doe_mb->pdev;
> > > > > > > +	int offset = doe_mb->cap_offset;
> > > > > > > +	u32 val;
> > > > > > > +
> > > > > > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > > > > +
> > > > > > > +	/* Leave the error case to be handled outside IRQ */
> > > > > > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > > > +		return IRQ_HANDLED;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > > > > > +					PCI_DOE_STATUS_INT_STATUS);
> > > > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > > > +		return IRQ_HANDLED;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	return IRQ_NONE;
> > > > > > > +}    
> > > > > > 
> > > > > > PCIe 6.0, table 7-316 says that an interrupt is also raised when
> > > > > > "the DOE Busy bit has been Cleared", yet such an interrupt is
> > > > > > not handled here.  It is incorrectly treated as a spurious
> > > > > > interrupt by returning IRQ_NONE.  The right thing to do
> > > > > > is probably to wake the state machine in case it's polling
> > > > > > for the Busy flag to clear.  
> > > > > 
> > > > > Ah. I remember testing this via a lot of hacking on the QEMU code
> > > > > to inject the various races that can occur (it was really ugly to do).
> > > > > 
> > > > > Guess we lost the handling at some point.  I think your fix
> > > > > is the right one.  
> > > > 
> > > > Perhaps I am missing something, but digging into this more I disagree
> > > > that the handler fails to handle this case.  If I read the spec correctly,
> > > > DOE Interrupt Status must be set when an interrupt is generated.
> > > > The handler wakes the state machine in that case.  The state machine
> > > > then checks for busy if there is work to be done.  
> > > 
> > > Right, I was mistaken, sorry for the noise.  
> > 
> > NP I'm not always following this either.
> > 
> > > 
> > >   
> > > > Normally we would not even need to check for status error.  But that is
> > > > special cased because clearing that status is left to the state machine.  
> > > 
> > > That however looks wrong because the DOE Interrupt Status bit is never
> > > cleared after a DOE Error is signaled.  The state machine performs an
> > > explicit abort upon an error by setting the DOE Abort bit, but that
> > > doesn't seem to clear DOE Interrupt Status:
> > > 
> > > Per section 6.30.2, "At any time, the system firmware/software is
> > > permitted to set the DOE Abort bit in the DOE Control Register,
> > > and the DOE instance must Clear the Data Object Ready bit,
> > > if not already Clear, and Clear the DOE Error bit, if already Set,
> > > in the DOE Status Register, within 1 second."  
> > 
> > I thought that meant the hardware (the DOE instance) must clear those bits
> > within 1 second?
> > 
> > > 
> > > No mention of the DOE Interrupt Status bit, so we cannot assume that
> > > it's cleared by a DOE Abort and we must clear it explicitly.  
> > 
> > Oh...  yea.  Jonathan?  We discussed this before and I was convinced it worked
> > but I think Lukas is correct here.
> 
> Hmm. I thought we were good as well, but Lukas is correct in saying
> the interrupt status bit isn't cleared (which is 'novel' given that the
> associated bit telling you what the interrupt means will be cleared).
> 
> I'm not sure I want to think around the race conditions that result...
> 
> > 
> > Should we drop the special case in pci_doe_irq_handler() and just clear the
> > status always?  Or should we wait and clear it in pci_doe_abort_start()?
> 
> I don't think it matters.  pci_doe_irq_handler() seems a little cleaner.

I agree and that is what V10 does.

> 
> I've not figured out completely if there are races however...

This is why I reworked the handling of cur_task in those error cases.

> 
> It is set when not already set and we get transitions of any of the following:
> - DOE error bit set (this can't happen until abort so no race here)
> 
> - Data Object Ready bit is set: Can this happen with the DOE error set? I don't
>   immediately see language saying it can't. However, I don't think it can
>   for any of the challenge response protocols yet defined (and there are other
>   problems if anyone wants to implement unsolicited messages)
> 
> - DOE busy bit has cleared - can definitely happen after an abort (which is
>   fine as nothing to do anyway, so we'll handle a pointless interrupt).
>   Could it in theory happen when error is set? I think not but only because
>   of the statement  "Clear this bit when it is able to receive a new data
>   object."
> 
> So I think we are fine doing it pre-abort,

That is what I thought for V10, especially after reworking the cur_task locking.
An extra interrupt would either start processing the next task or return with
nothing to do.

> but wouldn't put it past a hardware
> designer to find some path through that which results in a bonus interrupt
> and potentially us resetting twice.
> 
> If we clear it at the end of abort instead, what happens?
> Definitely no interrupts until we clear it. As we are doing query response
> protocols only, no new data until state machine moves on, so fine there.
> 
> So what about just doing it unconditionally..
> 
> +	case DOE_WAIT_ABORT:
> +	case DOE_WAIT_ABORT_ON_ERR:
> +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +
> +		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> +		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> 
> 	here...
> 
> +			/* Back to normal state - carry on */
> +			retire_cur_task(doe_mb);
> 
> This feels a little bit more 'standard' as we are allowing new interrupts
> only after everything is back to a nice state.

As I reworked the cur_task locking I really thought about locking cur_task
throughout doe_statemachine_work().  It seems a lot safer for a lot of reasons.
Doing so would make the extra work item no big deal.

So I looked at this again because you got me worried.  If mod_delayed_work()
can cause doe_statemachine_work() to run while another thread is in the middle
of processing the interrupt, there is a chance that signal_task_complete() is
called a second time on a given task pointer.

However, I _don't_ _think_ that can happen, because I don't think
mod_delayed_work() can cause the work item to run while it is already running.

So unless I misunderstand how mod_delayed_work() works we are guaranteed that
the extra interrupt will see the correct mailbox state and do the right thing.

Ira

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes.
  2022-06-06 19:56               ` Ira Weiny
@ 2022-06-07  9:58                 ` Jonathan Cameron
  0 siblings, 0 replies; 42+ messages in thread
From: Jonathan Cameron @ 2022-06-07  9:58 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Lukas Wunner, Dan Williams, Bjorn Helgaas, Christoph Hellwig,
	Alison Schofield, Vishal Verma, Ben Widawsky, linux-kernel,
	linux-cxl, linux-pci

On Mon, 6 Jun 2022 12:56:05 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> On Mon, Jun 06, 2022 at 03:46:46PM +0100, Jonathan Cameron wrote:
> > On Wed, 1 Jun 2022 10:16:15 -0700
> > Ira Weiny <ira.weiny@intel.com> wrote:
> >   
> > > On Wed, Jun 01, 2022 at 09:18:08AM +0200, Lukas Wunner wrote:  
> > > > On Tue, May 31, 2022 at 07:59:21PM -0700, Ira Weiny wrote:    
> > > > > On Tue, May 31, 2022 at 11:33:50AM +0100, Jonathan Cameron wrote:    
> > > > > > On Mon, 30 May 2022 21:06:57 +0200 Lukas Wunner <lukas@wunner.de> wrote:    
> > > > > > > On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:    
> > >   
> > > > 
> > > >     
> > > > > > > > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > > > > > > +{
> > > > > > > > +	struct pci_doe_mb *doe_mb = data;
> > > > > > > > +	struct pci_dev *pdev = doe_mb->pdev;
> > > > > > > > +	int offset = doe_mb->cap_offset;
> > > > > > > > +	u32 val;
> > > > > > > > +
> > > > > > > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > > > > > +
> > > > > > > > +	/* Leave the error case to be handled outside IRQ */
> > > > > > > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > > > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > > > > +		return IRQ_HANDLED;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > > > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > > > > > > +					PCI_DOE_STATUS_INT_STATUS);
> > > > > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > > > > +		return IRQ_HANDLED;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	return IRQ_NONE;
> > > > > > > > +}      
> > > > > > > 
> > > > > > > PCIe 6.0, table 7-316 says that an interrupt is also raised when
> > > > > > > "the DOE Busy bit has been Cleared", yet such an interrupt is
> > > > > > > not handled here.  It is incorrectly treated as a spurious
> > > > > > > interrupt by returning IRQ_NONE.  The right thing to do
> > > > > > > is probably to wake the state machine in case it's polling
> > > > > > > for the Busy flag to clear.    
> > > > > > 
> > > > > > Ah. I remember testing this via a lot of hacking on the QEMU code
> > > > > > to inject the various races that can occur (it was really ugly to do).
> > > > > > 
> > > > > > Guess we lost the handling at some point.  I think your fix
> > > > > > is the right one.    
> > > > > 
> > > > > Perhaps I am missing something, but digging into this more I disagree
> > > > > that the handler fails to handle this case.  If I read the spec correctly,
> > > > > DOE Interrupt Status must be set when an interrupt is generated.
> > > > > The handler wakes the state machine in that case.  The state machine
> > > > > then checks for busy if there is work to be done.    
> > > > 
> > > > Right, I was mistaken, sorry for the noise.    
> > > 
> > > NP I'm not always following this either.
> > >   
> > > > 
> > > >     
> > > > > Normally we would not even need to check for status error.  But that is
> > > > > special cased because clearing that status is left to the state machine.    
> > > > 
> > > > That however looks wrong because the DOE Interrupt Status bit is never
> > > > cleared after a DOE Error is signaled.  The state machine performs an
> > > > explicit abort upon an error by setting the DOE Abort bit, but that
> > > > doesn't seem to clear DOE Interrupt Status:
> > > > 
> > > > Per section 6.30.2, "At any time, the system firmware/software is
> > > > permitted to set the DOE Abort bit in the DOE Control Register,
> > > > and the DOE instance must Clear the Data Object Ready bit,
> > > > if not already Clear, and Clear the DOE Error bit, if already Set,
> > > > in the DOE Status Register, within 1 second."    
> > > 
> > > I thought that meant the hardware (the DOE instance) must clear those bits
> > > within 1 second?
> > >   
> > > > 
> > > > No mention of the DOE Interrupt Status bit, so we cannot assume that
> > > > it's cleared by a DOE Abort and we must clear it explicitly.    
> > > 
> > > Oh...  yea.  Jonathan?  We discussed this before and I was convinced it worked
> > > but I think Lukas is correct here.  
> > 
> > Hmm. I thought we were good as well, but Lukas is correct in saying
> > the interrupt status bit isn't cleared (which is 'novel' given that the
> > associated bit telling you what the interrupt means will be cleared).
> > 
> > I'm not sure I want to think around the race conditions that result...
> >   
> > > 
> > > Should we drop the special case in pci_doe_irq_handler() and just clear the
> > > status always?  Or should we wait and clear it in pci_doe_abort_start()?
> > 
> > I don't think it matters.  pci_doe_irq_handler() seems a little cleaner.  
> 
> I agree and that is what V10 does.
> 
> > 
> > I've not figured out completely if there are races however...  
> 
> This is why I reworked the handling of cur_task in those error cases.
> 
> > 
> > It is set when not already set and we get transitions of any of the following:
> > - DOE error bit set (this can't happen until abort so no race here)
> > 
> > - Data Object Ready bit is set: Can this happen with the DOE error set? I don't
> >   immediately see language saying it can't. However, I don't think it can
> >   for any of the challenge response protocols yet defined (and there are other
> >   problems if anyone wants to implement unsolicited messages)
> > 
> > - DOE busy bit has cleared - can definitely happen after an abort (which is
> >   fine as nothing to do anyway, so we'll handle a pointless interrupt).
> >   Could it in theory happen when error is set? I think not but only because
> >   of the statement  "Clear this bit when it is able to receive a new data
> >   object."
> > 
> > So I think we are fine doing it pre-abort,
> 
> That is what I thought for V10, especially after reworking the cur_task locking.
> An extra interrupt would either start processing the next task or return with
> nothing to do.
> 
> > but wouldn't put it past a hardware
> > designer to find some path through that which results in a bonus interrupt
> > and potentially us resetting twice.
> > 
> > If we clear it at the end of abort instead, what happens?
> > Definitely no interrupts until we clear it. As we are doing query response
> > protocols only, no new data until state machine moves on, so fine there.
> > 
> > So what about just doing it unconditionally..
> > 
> > +	case DOE_WAIT_ABORT:
> > +	case DOE_WAIT_ABORT_ON_ERR:
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> > +		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> > 
> > 	here...
> > 
> > +			/* Back to normal state - carry on */
> > +			retire_cur_task(doe_mb);
> > 
> > This feels a little bit more 'standard' as we are allowing new interrupts
> > only after everything is back to a nice state.  
> 
> As I reworked the cur_task locking I really thought about locking cur_task
> throughout doe_statemachine_work().  It seems a lot safer for a lot of reasons.
> Doing so would make the extra work item no big deal.
> 
> So I looked at this again because you got me worried.  If mod_delayed_work()
> can cause doe_statemachine_work() to run while another thread is in the middle
> of processing the interrupt, there is a chance that signal_task_complete() is
> called a second time on a given task pointer.
> 
> However, I _don't_ _think_ that can happen, because I don't think
> mod_delayed_work() can cause the work item to run while it is already running.

You are correct. I remember looking into that exact question for
a different project a while ago.

> 
> So unless I misunderstand how mod_delayed_work() works we are guaranteed that
> the extra interrupt will see the correct mailbox state and do the right thing.

Agreed.  As far as I can tell we are fine.  More eyes are always good though,
if anyone else wants to take a look!

Jonathan

p.s. I liked the original heavyweight approach of queuing the whole thing on a
mutex, as it was a lot easier to reason about :)  It was ugly though!


> 
> Ira


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2022-06-07  9:58 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-14 20:32 [PATCH V8 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
2022-04-14 20:32 ` [PATCH V8 01/10] PCI: Add vendor ID for the PCI SIG ira.weiny
2022-04-14 20:32 ` [PATCH V8 02/10] PCI: Replace magic constant for PCI Sig Vendor ID ira.weiny
2022-04-14 20:32 ` [PATCH V8 03/10] PCI: Create PCI library functions in support of DOE mailboxes ira.weiny
2022-04-28 21:27   ` Bjorn Helgaas
2022-05-02  5:36     ` ira.weiny
2022-05-30 19:06   ` Lukas Wunner
2022-05-31 10:33     ` Jonathan Cameron
2022-06-01  2:59       ` Ira Weiny
2022-06-01  7:18         ` Lukas Wunner
2022-06-01 14:23           ` Jonathan Cameron
2022-06-01 17:16           ` Ira Weiny
2022-06-01 17:56             ` Lukas Wunner
2022-06-01 20:17               ` Ira Weiny
2022-06-06 14:46             ` Jonathan Cameron
2022-06-06 19:56               ` Ira Weiny
2022-06-07  9:58                 ` Jonathan Cameron
2022-05-31 23:43     ` Ira Weiny
2022-04-14 20:32 ` [PATCH V8 04/10] cxl/pci: Create auxiliary devices for each DOE mailbox ira.weiny
2022-04-27 17:19   ` Jonathan Cameron
2022-04-28 21:09     ` ira.weiny
2022-04-29 16:38       ` Jonathan Cameron
2022-04-29 17:01         ` Dan Williams
2022-05-03 16:14           ` Jonathan Cameron
2022-04-29 15:55   ` Jonathan Cameron
2022-04-29 17:20     ` Ira Weiny
2022-05-03 15:32       ` Jonathan Cameron
2022-04-14 20:32 ` [PATCH V8 05/10] cxl/pci: Create DOE auxiliary driver ira.weiny
2022-04-27 17:43   ` Jonathan Cameron
2022-04-28 14:48     ` ira.weiny
2022-04-28 15:17       ` Jonathan Cameron
2022-04-14 20:32 ` [PATCH V8 06/10] cxl/pci: Find the DOE mailbox which supports CDAT ira.weiny
2022-04-27 17:49   ` Jonathan Cameron
2022-05-09 21:25     ` Ira Weiny
2022-04-14 20:32 ` [PATCH V8 07/10] cxl/mem: Read CDAT table ira.weiny
2022-04-27 17:55   ` Jonathan Cameron
2022-04-14 20:32 ` [PATCH V8 08/10] cxl/cdat: Introduce cxl_cdat_valid() ira.weiny
2022-04-27 17:56   ` Jonathan Cameron
2022-04-14 20:32 ` [PATCH V8 09/10] cxl/mem: Retry reading CDAT on failure ira.weiny
2022-04-27 17:57   ` Jonathan Cameron
2022-04-14 20:32 ` [PATCH V8 10/10] cxl/port: Parse out DSMAS data from CDAT table ira.weiny
2022-04-27 18:01   ` Jonathan Cameron

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).