linux-pci.vger.kernel.org archive mirror
* [PATCH 0/5] CXL: Read CDAT and DSMAS data from the device.
@ 2021-11-05 23:50 ira.weiny
  2021-11-05 23:50 ` [PATCH 1/5] PCI: Add vendor ID for the PCI SIG ira.weiny
                   ` (4 more replies)
  0 siblings, 5 replies; 37+ messages in thread
From: ira.weiny @ 2021-11-05 23:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ira Weiny, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, Jonathan Cameron, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

This work builds on Jonathan's V4 series here[1].  The big change is a
conversion to an auxiliary bus infrastructure, which allows the DOE code to
live in a separate driver object that is attached to the DOE devices created
by any device.

The series creates a new DOE auxiliary bus driver.  The CXL devices are
modified to create DOE auxiliary devices to be driven by the new DOE driver.

After the devices are created and the driver attaches, the CDAT data is read
from the device and the DSMAS information is parsed from that CDAT blob for
later use.

This work was tested using QEMU with additional patches [2, 3].

[1] https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com
[2] https://lore.kernel.org/qemu-devel/20210202005948.241655-1-ben.widawsky@intel.com/
[3] https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/

Ira Weiny (1):
  cxl/cdat: Parse out DSMAS data from CDAT table

Jonathan Cameron (4):
  PCI: Add vendor ID for the PCI SIG
  PCI/DOE: Add Data Object Exchange Aux Driver
  cxl/pci: Add DOE Auxiliary Devices
  cxl/mem: Add CDAT table reading from DOE

 drivers/cxl/Kconfig           |   1 +
 drivers/cxl/cdat.h            |  81 ++++
 drivers/cxl/core/memdev.c     | 157 ++++++++
 drivers/cxl/cxl.h             |  20 +
 drivers/cxl/cxlmem.h          |  48 +++
 drivers/cxl/pci.c             | 212 ++++++++++
 drivers/pci/Kconfig           |  10 +
 drivers/pci/Makefile          |   3 +
 drivers/pci/doe.c             | 701 ++++++++++++++++++++++++++++++++++
 include/linux/pci-doe.h       |  63 +++
 include/linux/pci_ids.h       |   1 +
 include/uapi/linux/pci_regs.h |  29 +-
 12 files changed, 1325 insertions(+), 1 deletion(-)
 create mode 100644 drivers/cxl/cdat.h
 create mode 100644 drivers/pci/doe.c
 create mode 100644 include/linux/pci-doe.h

-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/5] PCI: Add vendor ID for the PCI SIG
  2021-11-05 23:50 [PATCH 0/5] CXL: Read CDAT and DSMAS data from the device ira.weiny
@ 2021-11-05 23:50 ` ira.weiny
  2021-11-17 21:50   ` Bjorn Helgaas
  2021-11-05 23:50 ` [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver ira.weiny
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 37+ messages in thread
From: ira.weiny @ 2021-11-05 23:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jonathan Cameron, Alison Schofield, Vishal Verma, Ira Weiny,
	Ben Widawsky, Bjorn Helgaas, linux-cxl, linux-pci

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

This ID is used in DOE headers to identify protocols that are defined
within the PCI Express Base Specification.

Specified in Table 7-x2 of the Data Object Exchange ECN (approved 12 March
2020) available from https://members.pcisig.com/wg/PCI-SIG/document/14143

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 include/linux/pci_ids.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 011f2f1ea5bb..849f514cd7db 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -149,6 +149,7 @@
 #define PCI_CLASS_OTHERS		0xff
 
 /* Vendors and devices.  Sort key: vendor first, device next. */
+#define PCI_VENDOR_ID_PCI_SIG		0x0001
 
 #define PCI_VENDOR_ID_LOONGSON		0x0014
 
-- 
2.28.0.rc0.12.gb6a658bd00c9



* [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver
  2021-11-05 23:50 [PATCH 0/5] CXL: Read CDAT and DSMAS data from the device ira.weiny
  2021-11-05 23:50 ` [PATCH 1/5] PCI: Add vendor ID for the PCI SIG ira.weiny
@ 2021-11-05 23:50 ` ira.weiny
  2021-11-08 12:15   ` Jonathan Cameron
  2021-11-16 23:48   ` Bjorn Helgaas
  2021-11-05 23:50 ` [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices ira.weiny
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 37+ messages in thread
From: ira.weiny @ 2021-11-05 23:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jonathan Cameron, Ira Weiny, Alison Schofield, Vishal Verma,
	Ben Widawsky, Bjorn Helgaas, linux-cxl, linux-pci

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Introduced in a PCI ECN [1], DOE provides a config space based mailbox
with standard protocol discovery.  Each mailbox is accessed through a
DOE Extended Capability.

Define an auxiliary device driver which controls the DOE auxiliary devices
registered on the auxiliary bus.

A DOE mailbox is allowed to support any number of protocols while some
DOE protocol specifications apply additional restrictions.

The supported protocols are queried and cached.  pci_doe_supports_prot()
can be used to determine whether the DOE device supports a specified
protocol.

A synchronous interface is provided in pci_doe_exchange_sync() to
perform a single query / response exchange from the driver through the
device specified.

Testing was conducted against QEMU using:

https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/

This code is based on Jonathan's V4 series here:

https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/

[1] https://members.pcisig.com/wg/PCI-SIG/document/14143
    Data Object Exchange (DOE) - Approved 12 March 2020

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

---
Changes from Jonathan's V4
	Move the DOE MB code into the DOE auxiliary driver
	Remove Task List in favor of a wait queue

Changes from Ben
	remove CXL references
	propagate rc from pci functions on error
---
 drivers/pci/Kconfig           |  10 +
 drivers/pci/Makefile          |   3 +
 drivers/pci/doe.c             | 701 ++++++++++++++++++++++++++++++++++
 include/linux/pci-doe.h       |  63 +++
 include/uapi/linux/pci_regs.h |  29 +-
 5 files changed, 805 insertions(+), 1 deletion(-)
 create mode 100644 drivers/pci/doe.c
 create mode 100644 include/linux/pci-doe.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 0c473d75e625..b512295538ba 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -118,6 +118,16 @@ config XEN_PCIDEV_FRONTEND
 	  The PCI device frontend driver allows the kernel to import arbitrary
 	  PCI devices from a PCI backend to support PCI driver domains.
 
+config PCI_DOE_DRIVER
+	tristate "PCI Data Object Exchange (DOE) driver"
+	select AUXILIARY_BUS
+	help
+	  Driver for DOE auxiliary devices.
+
+	  DOE provides a simple mailbox in PCI config space that is used by a
+	  number of different protocols.  DOE is defined in the Data Object
+	  Exchange ECN to the PCIe r5.0 spec.
+
 config PCI_ATS
 	bool
 
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index d62c4ac4ae1b..afd9d7bd2b82 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -28,8 +28,11 @@ obj-$(CONFIG_PCI_STUB)		+= pci-stub.o
 obj-$(CONFIG_PCI_PF_STUB)	+= pci-pf-stub.o
 obj-$(CONFIG_PCI_ECAM)		+= ecam.o
 obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
+obj-$(CONFIG_PCI_DOE_DRIVER)	+= pci-doe.o
 obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
 
+pci-doe-y := doe.o
+
 # Endpoint library must be initialized before its users
 obj-$(CONFIG_PCI_ENDPOINT)	+= endpoint/
 
diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
new file mode 100644
index 000000000000..2e702fdc7879
--- /dev/null
+++ b/drivers/pci/doe.c
@@ -0,0 +1,701 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Data Object Exchange ECN
+ * https://members.pcisig.com/wg/PCI-SIG/document/14143
+ *
+ * Copyright (C) 2021 Huawei
+ *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
+ */
+
+#include <linux/bitfield.h>
+#include <linux/delay.h>
+#include <linux/jiffies.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci-doe.h>
+#include <linux/workqueue.h>
+#include <linux/module.h>
+
+#define PCI_DOE_PROTOCOL_DISCOVERY 0
+
+#define PCI_DOE_BUSY_MAX_RETRIES 16
+#define PCI_DOE_POLL_INTERVAL (HZ / 128)
+
+/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */
+#define PCI_DOE_TIMEOUT HZ
+
+enum pci_doe_state {
+	DOE_IDLE,
+	DOE_WAIT_RESP,
+	DOE_WAIT_ABORT,
+	DOE_WAIT_ABORT_ON_ERR,
+};
+
+/*
+ * struct pci_doe_task - description of a query / response task
+ * @ex: The details of the task to be done
+ * @rv: Return value.  Length of received response or error
+ * @cb: Callback for completion of task
+ * @private: Private data passed to callback on completion
+ */
+struct pci_doe_task {
+	struct pci_doe_exchange *ex;
+	int rv;
+	void (*cb)(void *private);
+	void *private;
+};
+
+/**
+ * struct pci_doe - A single DOE mailbox driver
+ *
+ * @doe_dev: The DOE Auxiliary device being driven
+ * @abort_c: Completion used for initial abort handling
+ * @irq: Interrupt used for signaling DOE ready or abort
+ * @irq_name: Name used to identify the irq for a particular DOE
+ * @prots: Array of identifiers for protocols supported
+ * @num_prots: Size of prots array
+ * @cur_task: Current task the state machine is working on
+ * @wq: Wait queue to wait on if a query is in progress
+ * @state_lock: Protect the state of cur_task, abort, and dead
+ * @statemachine: Work item for the DOE state machine
+ * @state: Current state of this DOE
+ * @timeout_jiffies: 1 second after GO set
+ * @busy_retries: Count of retry attempts
+ * @abort: Request a manual abort (e.g. on init)
+ * @dead: Used to mark a DOE for which an ABORT has timed out. Further messages
+ *        will immediately be aborted with error
+ */
+struct pci_doe {
+	struct pci_doe_dev *doe_dev;
+	struct completion abort_c;
+	int irq;
+	char *irq_name;
+	struct pci_doe_protocol *prots;
+	int num_prots;
+
+	struct pci_doe_task *cur_task;
+	wait_queue_head_t wq;
+	struct mutex state_lock;
+	struct delayed_work statemachine;
+	enum pci_doe_state state;
+	unsigned long timeout_jiffies;
+	unsigned int busy_retries;
+	unsigned int abort:1;
+	unsigned int dead:1;
+};
+
+static irqreturn_t pci_doe_irq(int irq, void *data)
+{
+	struct pci_doe *doe = data;
+	struct pci_dev *pdev = doe->doe_dev->pdev;
+	int offset = doe->doe_dev->cap_offset;
+	u32 val;
+
+	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
+		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val);
+		mod_delayed_work(system_wq, &doe->statemachine, 0);
+		return IRQ_HANDLED;
+	}
+	/* Leave the error case to be handled outside IRQ */
+	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
+		mod_delayed_work(system_wq, &doe->statemachine, 0);
+		return IRQ_HANDLED;
+	}
+
+	/*
+	 * Busy being cleared can result in an interrupt, but as
+	 * the original Busy may not have been detected, there is no
+	 * way to separate such an interrupt from a spurious interrupt.
+	 */
+	return IRQ_HANDLED;
+}
+
+/*
+ * Only call when safe to directly access the DOE, either because no tasks yet
+ * queued, or called from doe_statemachine_work() which has exclusive access to
+ * the DOE config space.
+ */
+static void pci_doe_abort_start(struct pci_doe *doe)
+{
+	struct pci_dev *pdev = doe->doe_dev->pdev;
+	int offset = doe->doe_dev->cap_offset;
+	u32 val;
+
+	val = PCI_DOE_CTRL_ABORT;
+	if (doe->irq)
+		val |= PCI_DOE_CTRL_INT_EN;
+	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
+
+	doe->timeout_jiffies = jiffies + HZ;
+	schedule_delayed_work(&doe->statemachine, HZ);
+}
+
+static int pci_doe_send_req(struct pci_doe *doe, struct pci_doe_exchange *ex)
+{
+	struct pci_dev *pdev = doe->doe_dev->pdev;
+	int offset = doe->doe_dev->cap_offset;
+	u32 val;
+	int i;
+
+	/*
+	 * Check the DOE busy bit is not set. If it is set, this could indicate
+	 * someone other than Linux (e.g. firmware) is using the mailbox. Note
+	 * it is expected that firmware and OS will negotiate access rights via
+	 * an, as yet to be defined method.
+	 */
+	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+	if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
+		return -EBUSY;
+
+	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
+		return -EIO;
+
+	/* Write DOE Header */
+	val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, ex->prot.vid) |
+		FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, ex->prot.type);
+	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val);
+	/* Length is 2 DW of header + length of payload in DW */
+	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
+			       FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH,
+					  2 + ex->request_pl_sz / sizeof(u32)));
+	for (i = 0; i < ex->request_pl_sz / sizeof(u32); i++)
+		pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
+				       ex->request_pl[i]);
+
+	val = PCI_DOE_CTRL_GO;
+	if (doe->irq)
+		val |= PCI_DOE_CTRL_INT_EN;
+
+	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
+	/* Request is sent - now wait for poll or IRQ */
+	return 0;
+}
+
+static int pci_doe_recv_resp(struct pci_doe *doe, struct pci_doe_exchange *ex)
+{
+	struct pci_dev *pdev = doe->doe_dev->pdev;
+	int offset = doe->doe_dev->cap_offset;
+	size_t length;
+	u32 val;
+	int i;
+
+	/* Read the first dword to get the protocol */
+	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+	if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != ex->prot.vid) ||
+	    (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != ex->prot.type)) {
+		pci_err(pdev,
+			"Expected [VID, Protocol] = [%x, %x], got [%x, %x]\n",
+			ex->prot.vid, ex->prot.type,
+			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
+			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
+		return -EIO;
+	}
+
+	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+	/* Read the second dword to get the length */
+	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+
+	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
+	if (length > SZ_1M || length < 2)
+		return -EIO;
+
+	/* First 2 dwords have already been read */
+	length -= 2;
+	/* Read the rest of the response payload */
+	for (i = 0; i < min(length, ex->response_pl_sz / sizeof(u32)); i++) {
+		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
+				      &ex->response_pl[i]);
+		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+	}
+
+	/* Flush excess length */
+	for (; i < length; i++) {
+		pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+	}
+	/* Final error check to pick up on any since Data Object Ready */
+	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
+		return -EIO;
+
+	return min(length, ex->response_pl_sz / sizeof(u32)) * sizeof(u32);
+}
+
+static void pci_doe_task_complete(void *private)
+{
+	complete(private);
+}
+
+static void doe_statemachine_work(struct work_struct *work)
+{
+	struct delayed_work *w = to_delayed_work(work);
+	struct pci_doe *doe = container_of(w, struct pci_doe, statemachine);
+	struct pci_dev *pdev = doe->doe_dev->pdev;
+	int offset = doe->doe_dev->cap_offset;
+	struct pci_doe_task *task;
+	bool abort;
+	u32 val;
+	int rc;
+
+	mutex_lock(&doe->state_lock);
+	task = doe->cur_task;
+	abort = doe->abort;
+	doe->abort = false;
+	mutex_unlock(&doe->state_lock);
+
+	if (abort) {
+		/*
+		 * Currently only used during init - care needed if
+		 * pci_doe_abort() is generally exposed as it would impact
+		 * queries in flight.
+		 */
+		WARN_ON(task);
+		doe->state = DOE_WAIT_ABORT;
+		pci_doe_abort_start(doe);
+		return;
+	}
+
+	switch (doe->state) {
+	case DOE_IDLE:
+		if (task == NULL)
+			return;
+
+		/* Nothing currently in flight so queue a task */
+		rc = pci_doe_send_req(doe, task->ex);
+		/*
+		 * The specification does not provide any guidance on how long
+		 * some other entity could keep the DOE busy, so try for 1
+		 * second then fail. Busy handling is best effort only, because
+		 * there is no way of avoiding racing against another user of
+		 * the DOE.
+		 */
+		if (rc == -EBUSY) {
+			doe->busy_retries++;
+			if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
+				/* Long enough, fail this request */
+				pci_WARN(pdev, true, "DOE busy for too long\n");
+				doe->busy_retries = 0;
+				goto err_busy;
+			}
+			schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES);
+			return;
+		}
+		if (rc)
+			goto err_abort;
+		doe->busy_retries = 0;
+
+		doe->state = DOE_WAIT_RESP;
+		doe->timeout_jiffies = jiffies + HZ;
+		/* Now poll or wait for IRQ with timeout */
+		if (doe->irq > 0)
+			schedule_delayed_work(w, PCI_DOE_TIMEOUT);
+		else
+			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
+		return;
+
+	case DOE_WAIT_RESP:
+		/* Not possible to get here with NULL task */
+		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+		if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
+			rc = -EIO;
+			goto err_abort;
+		}
+
+		if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
+			/* If not yet at timeout reschedule otherwise abort */
+			if (time_after(jiffies, doe->timeout_jiffies)) {
+				rc = -ETIMEDOUT;
+				goto err_abort;
+			}
+			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
+			return;
+		}
+
+		rc = pci_doe_recv_resp(doe, task->ex);
+		if (rc < 0)
+			goto err_abort;
+
+		doe->state = DOE_IDLE;
+
+		mutex_lock(&doe->state_lock);
+		doe->cur_task = NULL;
+		mutex_unlock(&doe->state_lock);
+		wake_up_interruptible(&doe->wq);
+
+		/* Set the return value to the length of received payload */
+		task->rv = rc;
+		task->cb(task->private);
+
+		return;
+
+	case DOE_WAIT_ABORT:
+	case DOE_WAIT_ABORT_ON_ERR:
+		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+
+		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
+		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
+			/* Back to normal state - carry on */
+			mutex_lock(&doe->state_lock);
+			doe->cur_task = NULL;
+			mutex_unlock(&doe->state_lock);
+			wake_up_interruptible(&doe->wq);
+
+			/*
+			 * For deliberately triggered abort, someone is
+			 * waiting.
+			 */
+			if (doe->state == DOE_WAIT_ABORT)
+				complete(&doe->abort_c);
+
+			doe->state = DOE_IDLE;
+			return;
+		}
+		if (time_after(jiffies, doe->timeout_jiffies)) {
+			/* Task has timed out and is dead - abort */
+			pci_err(pdev, "DOE ABORT timed out\n");
+			mutex_lock(&doe->state_lock);
+			doe->dead = true;
+			doe->cur_task = NULL;
+			mutex_unlock(&doe->state_lock);
+			wake_up_interruptible(&doe->wq);
+
+			if (doe->state == DOE_WAIT_ABORT)
+				complete(&doe->abort_c);
+		}
+		return;
+	}
+
+err_abort:
+	doe->state = DOE_WAIT_ABORT_ON_ERR;
+	pci_doe_abort_start(doe);
+err_busy:
+	task->rv = rc;
+	task->cb(task->private);
+	/* If here via err_busy, signal the task done. */
+	if (doe->state == DOE_IDLE) {
+		mutex_lock(&doe->state_lock);
+		doe->cur_task = NULL;
+		mutex_unlock(&doe->state_lock);
+		wake_up_interruptible(&doe->wq);
+	}
+}
+
+/**
+ * pci_doe_exchange_sync() - Send a request, then wait for and receive a response
+ * @doe_dev: DOE mailbox device
+ * @ex: Description of the buffers and Vendor ID + type used in this
+ *      request/response pair
+ *
+ * Excess data will be discarded.
+ *
+ * RETURNS: payload in bytes on success, < 0 on error
+ */
+int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex)
+{
+	struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
+	struct pci_doe_task task;
+	DECLARE_COMPLETION_ONSTACK(c);
+
+	if (!doe)
+		return -EAGAIN;
+
+	/* DOE requests must be a whole number of DW */
+	if (ex->request_pl_sz % sizeof(u32))
+		return -EINVAL;
+
+	task.ex = ex;
+	task.cb = pci_doe_task_complete;
+	task.private = &c;
+
+again:
+	mutex_lock(&doe->state_lock);
+	if (doe->cur_task) {
+		mutex_unlock(&doe->state_lock);
+		wait_event_interruptible(doe->wq, doe->cur_task == NULL);
+		goto again;
+	}
+
+	if (doe->dead) {
+		mutex_unlock(&doe->state_lock);
+		return -EIO;
+	}
+	doe->cur_task = &task;
+	schedule_delayed_work(&doe->statemachine, 0);
+	mutex_unlock(&doe->state_lock);
+
+	wait_for_completion(&c);
+
+	return task.rv;
+}
+EXPORT_SYMBOL_GPL(pci_doe_exchange_sync);
+
+/**
+ * pci_doe_supports_prot() - Return if the DOE instance supports the given protocol
+ * @doe_dev: DOE mailbox device
+ * @vid: Protocol Vendor ID
+ * @type: protocol type
+ *
+ * If supported, the DOE device can then be passed to pci_doe_exchange_sync()
+ * to execute a mailbox exchange through that DOE mailbox.
+ *
+ * RETURNS: True if the DOE device supports the protocol specified
+ */
+bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type)
+{
+	struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
+	int i;
+
+	if (!doe)
+		return false;
+
+	for (i = 0; i < doe->num_prots; i++)
+		if ((doe->prots[i].vid == vid) &&
+		    (doe->prots[i].type == type))
+			return true;
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
+
+static int pci_doe_discovery(struct pci_doe *doe, u8 *index, u16 *vid,
+			     u8 *protocol)
+{
+	u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX, *index);
+	u32 response_pl;
+	struct pci_doe_exchange ex = {
+		.prot.vid = PCI_VENDOR_ID_PCI_SIG,
+		.prot.type = PCI_DOE_PROTOCOL_DISCOVERY,
+		.request_pl = &request_pl,
+		.request_pl_sz = sizeof(request_pl),
+		.response_pl = &response_pl,
+		.response_pl_sz = sizeof(response_pl),
+	};
+	int ret;
+
+	ret = pci_doe_exchange_sync(doe->doe_dev, &ex);
+	if (ret < 0)
+		return ret;
+
+	if (ret != sizeof(response_pl))
+		return -EIO;
+
+	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
+	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL, response_pl);
+	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX, response_pl);
+
+	return 0;
+}
+
+static int pci_doe_cache_protocols(struct pci_doe *doe)
+{
+	u8 index = 0;
+	int rc;
+
+	/* Discovery protocol must always be supported and must report itself */
+	doe->num_prots = 1;
+	doe->prots = kcalloc(doe->num_prots, sizeof(*doe->prots), GFP_KERNEL);
+	if (doe->prots == NULL)
+		return -ENOMEM;
+
+	do {
+		struct pci_doe_protocol *prot;
+
+		prot = &doe->prots[doe->num_prots - 1];
+		rc = pci_doe_discovery(doe, &index, &prot->vid, &prot->type);
+		if (rc)
+			goto err_free_prots;
+
+		if (index) {
+			struct pci_doe_protocol *prot_new;
+
+			doe->num_prots++;
+			prot_new = krealloc(doe->prots,
+					    sizeof(*doe->prots) * doe->num_prots,
+					    GFP_KERNEL);
+			if (prot_new == NULL) {
+				rc = -ENOMEM;
+				goto err_free_prots;
+			}
+			doe->prots = prot_new;
+		}
+	} while (index);
+
+	return 0;
+
+err_free_prots:
+	kfree(doe->prots);
+	doe->num_prots = 0;
+	doe->prots = NULL;
+	return rc;
+}
+
+static int pci_doe_abort(struct pci_doe *doe)
+{
+	reinit_completion(&doe->abort_c);
+	mutex_lock(&doe->state_lock);
+	doe->abort = true;
+	mutex_unlock(&doe->state_lock);
+	schedule_delayed_work(&doe->statemachine, 0);
+	wait_for_completion(&doe->abort_c);
+
+	if (doe->dead)
+		return -EIO;
+
+	return 0;
+}
+
+static void pci_doe_release_irq(struct pci_doe *doe)
+{
+	if (doe->irq > 0)
+		free_irq(doe->irq, doe);
+}
+
+static int pci_doe_register(struct pci_doe *doe)
+{
+	struct pci_dev *pdev = doe->doe_dev->pdev;
+	bool poll = !pci_dev_msi_enabled(pdev);
+	int offset = doe->doe_dev->cap_offset;
+	int rc, irq;
+	u32 val;
+
+	pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
+
+	if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) {
+		irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val));
+		if (irq < 0)
+			return irq;
+
+		doe->irq_name = kasprintf(GFP_KERNEL, "DOE[%s]",
+					  doe->doe_dev->adev.name);
+		if (!doe->irq_name)
+			return -ENOMEM;
+
+		rc = request_irq(irq, pci_doe_irq, 0, doe->irq_name, doe);
+		if (rc)
+			goto err_free_name;
+
+		doe->irq = irq;
+		pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
+				       PCI_DOE_CTRL_INT_EN);
+	}
+
+	/* Reset the mailbox by issuing an abort */
+	rc = pci_doe_abort(doe);
+	if (rc)
+		goto err_free_irqs;
+
+	/* Ensure the pci device remains until this driver is done with it */
+	get_device(&pdev->dev);
+
+	return 0;
+
+err_free_irqs:
+	pci_doe_release_irq(doe);
+err_free_name:
+	kfree(doe->irq_name);
+	return rc;
+}
+
+static void pci_doe_unregister(struct pci_doe *doe)
+{
+	pci_doe_release_irq(doe);
+	kfree(doe->irq_name);
+	put_device(&doe->doe_dev->pdev->dev);
+}
+
+/*
+ * pci_doe_probe() - Set up the Mailbox
+ * @aux_dev: Auxiliary Device
+ * @id: Auxiliary device ID
+ *
+ * Probe the mailbox found for all protocols and set up the Mailbox
+ *
+ * RETURNS: 0 on success, < 0 on error
+ */
+static int pci_doe_probe(struct auxiliary_device *aux_dev,
+			 const struct auxiliary_device_id *id)
+{
+	struct pci_doe_dev *doe_dev = container_of(aux_dev,
+					struct pci_doe_dev,
+					adev);
+	struct pci_doe *doe;
+	int rc;
+
+	doe = kzalloc(sizeof(*doe), GFP_KERNEL);
+	if (!doe)
+		return -ENOMEM;
+
+	mutex_init(&doe->state_lock);
+	init_completion(&doe->abort_c);
+	doe->doe_dev = doe_dev;
+	init_waitqueue_head(&doe->wq);
+	INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work);
+	dev_set_drvdata(&aux_dev->dev, doe);
+
+	rc = pci_doe_register(doe);
+	if (rc)
+		goto err_free;
+
+	rc = pci_doe_cache_protocols(doe);
+	if (rc) {
+		pci_doe_unregister(doe);
+		goto err_free;
+	}
+
+	return 0;
+
+err_free:
+	kfree(doe);
+	return rc;
+}
+
+static void pci_doe_remove(struct auxiliary_device *aux_dev)
+{
+	struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev);
+
+	/* First halt the state machine */
+	cancel_delayed_work_sync(&doe->statemachine);
+	kfree(doe->prots);
+	pci_doe_unregister(doe);
+	kfree(doe);
+}
+
+static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = {
+	{.name = "cxl_pci.doe", },
+	{},
+};
+
+MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table);
+
+struct auxiliary_driver pci_doe_auxiliary_drv = {
+	.name = "pci_doe_drv",
+	.id_table = pci_doe_auxiliary_id_table,
+	.probe = pci_doe_probe,
+	.remove = pci_doe_remove
+};
+
+static int __init pci_doe_init_module(void)
+{
+	int ret;
+
+	ret = auxiliary_driver_register(&pci_doe_auxiliary_drv);
+	if (ret) {
+		pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n",
+		       ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void __exit pci_doe_exit_module(void)
+{
+	auxiliary_driver_unregister(&pci_doe_auxiliary_drv);
+}
+
+module_init(pci_doe_init_module);
+module_exit(pci_doe_exit_module);
+MODULE_LICENSE("GPL v2");
diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
new file mode 100644
index 000000000000..8380b7ad33d4
--- /dev/null
+++ b/include/linux/pci-doe.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Data Object Exchange was added as an ECN to the PCIe r5.0 spec.
+ *
+ * Copyright (C) 2021 Huawei
+ *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
+ */
+
+#include <linux/completion.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/auxiliary_bus.h>
+
+#ifndef LINUX_PCI_DOE_H
+#define LINUX_PCI_DOE_H
+
+#define DOE_DEV_NAME "doe"
+
+struct pci_doe_protocol {
+	u16 vid;
+	u8 type;
+};
+
+/**
+ * struct pci_doe_exchange - represents a single query/response
+ *
+ * @prot: DOE Protocol
+ * @request_pl: The request payload
+ * @request_pl_sz: Size of the request payload
+ * @response_pl: The response payload
+ * @response_pl_sz: Size of the response payload
+ */
+struct pci_doe_exchange {
+	struct pci_doe_protocol prot;
+	u32 *request_pl;
+	size_t request_pl_sz;
+	u32 *response_pl;
+	size_t response_pl_sz;
+};
+
+/**
+ * struct pci_doe_dev - DOE mailbox device
+ *
+ * @adev: Auxiliary Device
+ * @pdev: PCI device this belongs to
+ * @cap_offset: Capability offset
+ *
+ * This represents a single DOE mailbox device.  Devices should create this
+ * device and register it on the Auxiliary bus for the DOE driver to maintain.
+ *
+ */
+struct pci_doe_dev {
+	struct auxiliary_device adev;
+	struct pci_dev *pdev;
+	int cap_offset;
+};
+
+/* Library operations */
+int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev,
+				 struct pci_doe_exchange *ex);
+bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type);
+
+#endif
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index e709ae8235e7..1073cd1916e1 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -730,7 +730,8 @@
 #define PCI_EXT_CAP_ID_DVSEC	0x23	/* Designated Vendor-Specific */
 #define PCI_EXT_CAP_ID_DLF	0x25	/* Data Link Feature */
 #define PCI_EXT_CAP_ID_PL_16GT	0x26	/* Physical Layer 16.0 GT/s */
-#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_PL_16GT
+#define PCI_EXT_CAP_ID_DOE	0x2E	/* Data Object Exchange */
+#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_DOE
 
 #define PCI_EXT_CAP_DSN_SIZEOF	12
 #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
@@ -1092,4 +1093,30 @@
 #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK		0x000000F0
 #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT	4
 
+/* Data Object Exchange */
+#define PCI_DOE_CAP            0x04    /* DOE Capabilities Register */
+#define  PCI_DOE_CAP_INT                       0x00000001  /* Interrupt Support */
+#define  PCI_DOE_CAP_IRQ                       0x00000ffe  /* Interrupt Message Number */
+#define PCI_DOE_CTRL           0x08    /* DOE Control Register */
+#define  PCI_DOE_CTRL_ABORT                    0x00000001  /* DOE Abort */
+#define  PCI_DOE_CTRL_INT_EN                   0x00000002  /* DOE Interrupt Enable */
+#define  PCI_DOE_CTRL_GO                       0x80000000  /* DOE Go */
+#define PCI_DOE_STATUS         0x0c    /* DOE Status Register */
+#define  PCI_DOE_STATUS_BUSY                   0x00000001  /* DOE Busy */
+#define  PCI_DOE_STATUS_INT_STATUS             0x00000002  /* DOE Interrupt Status */
+#define  PCI_DOE_STATUS_ERROR                  0x00000004  /* DOE Error */
+#define  PCI_DOE_STATUS_DATA_OBJECT_READY      0x80000000  /* Data Object Ready */
+#define PCI_DOE_WRITE          0x10    /* DOE Write Data Mailbox Register */
+#define PCI_DOE_READ           0x14    /* DOE Read Data Mailbox Register */
+
+/* DOE Data Object - note not actually registers */
+#define PCI_DOE_DATA_OBJECT_HEADER_1_VID       0x0000ffff
+#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE      0x00ff0000
+#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH    0x0003ffff
+
+#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX   0x000000ff
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID     0x0000ffff
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL        0x00ff0000
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX 0xff000000
+
 #endif /* LINUX_PCI_REGS_H */
-- 
2.28.0.rc0.12.gb6a658bd00c9



* [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices
  2021-11-05 23:50 [PATCH 0/5] CXL: Read CDAT and DSMAS data from the device ira.weiny
  2021-11-05 23:50 ` [PATCH 1/5] PCI: Add vendor ID for the PCI SIG ira.weiny
  2021-11-05 23:50 ` [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver ira.weiny
@ 2021-11-05 23:50 ` ira.weiny
  2021-11-08 13:09   ` Jonathan Cameron
  2021-11-16 23:48   ` Bjorn Helgaas
  2021-11-05 23:50 ` [PATCH 4/5] cxl/mem: Add CDAT table reading from DOE ira.weiny
  2021-11-05 23:50 ` [PATCH 5/5] cxl/cdat: Parse out DSMAS data from CDAT table ira.weiny
  4 siblings, 2 replies; 37+ messages in thread
From: ira.weiny @ 2021-11-05 23:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jonathan Cameron, Ira Weiny, Alison Schofield, Vishal Verma,
	Ben Widawsky, Bjorn Helgaas, linux-cxl, linux-pci

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

CXL devices have DOE mailboxes.  Create auxiliary devices which can be
driven by the generic DOE auxiliary driver.

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

---
Changes from V4:
	Make this an Auxiliary Driver rather than library functions
	Split this out into its own patch
	Base on the new cxl_dev_state structure

Changes from Ben
	s/CXL_DOE_DEV_NAME/DOE_DEV_NAME/
---
 drivers/cxl/Kconfig |   1 +
 drivers/cxl/cxl.h   |  13 +++++
 drivers/cxl/pci.c   | 120 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 134 insertions(+)

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 67c91378f2dd..9d53720bea07 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -16,6 +16,7 @@ if CXL_BUS
 config CXL_MEM
 	tristate "CXL.mem: Memory Devices"
 	default CXL_BUS
+	select PCI_DOE_DRIVER
 	help
 	  The CXL.mem protocol allows a device to act as a provider of
 	  "System RAM" and/or "Persistent Memory" that is fully coherent
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 5e2e93451928..f1241a7f2b7b 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -75,6 +75,19 @@ static inline int cxl_hdm_decoder_count(u32 cap_hdr)
 #define CXLDEV_MBOX_BG_CMD_STATUS_OFFSET 0x18
 #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
 
+/*
+ * Address space properties derived from:
+ * CXL 2.0 8.2.5.12.7 CXL HDM Decoder 0 Control Register
+ */
+#define CXL_ADDRSPACE_RAM   BIT(0)
+#define CXL_ADDRSPACE_PMEM  BIT(1)
+#define CXL_ADDRSPACE_TYPE2 BIT(2)
+#define CXL_ADDRSPACE_TYPE3 BIT(3)
+#define CXL_ADDRSPACE_MASK  GENMASK(3, 0)
+
+#define CXL_DOE_PROTOCOL_COMPLIANCE 0
+#define CXL_DOE_PROTOCOL_TABLE_ACCESS 2
+
 #define CXL_COMPONENT_REGS() \
 	void __iomem *hdm_decoder
 
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 8dc91fd3396a..df524b74f1d2 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -6,6 +6,7 @@
 #include <linux/mutex.h>
 #include <linux/list.h>
 #include <linux/pci.h>
+#include <linux/pci-doe.h>
 #include <linux/io.h>
 #include "cxlmem.h"
 #include "pci.h"
@@ -471,6 +472,120 @@ static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
 	return rc;
 }
 
+static void cxl_mem_free_irq_vectors(void *data)
+{
+	pci_free_irq_vectors(data);
+}
+
+static void cxl_destroy_doe_device(void *ad)
+{
+	struct auxiliary_device *adev = ad;
+
+	auxiliary_device_delete(adev);
+	auxiliary_device_uninit(adev);
+}
+
+static DEFINE_IDA(cxl_doe_adev_ida);
+static void __doe_dev_release(struct auxiliary_device *adev)
+{
+	struct pci_doe_dev *doe_dev = container_of(adev, struct pci_doe_dev,
+						   adev);
+
+	ida_free(&cxl_doe_adev_ida, adev->id);
+	kfree(doe_dev);
+}
+
+static void cxl_doe_dev_release(struct device *dev)
+{
+	struct auxiliary_device *adev = container_of(dev,
+						struct auxiliary_device,
+						dev);
+	__doe_dev_release(adev);
+}
+
+static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
+{
+	struct device *dev = cxlds->dev;
+	struct pci_dev *pdev = to_pci_dev(dev);
+	int irqs, rc;
+	u16 pos = 0;
+
+	/*
+	 * An implementation of a CXL Type 3 device may support an unknown
+	 * number of interrupts. Assume that number is not that large and
+	 * request them all.
+	 */
+	irqs = pci_msix_vec_count(pdev);
+	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);
+	if (rc != irqs) {
+		/* No interrupt available - carry on */
+		dev_dbg(dev, "No interrupts available for DOE\n");
+	} else {
+		/*
+		 * Enabling bus mastering could be done within the DOE
+		 * initialization, but as it potentially has other impacts
+		 * keep it within the driver.
+		 */
+		pci_set_master(pdev);
+		rc = devm_add_action_or_reset(dev,
+					      cxl_mem_free_irq_vectors,
+					      pdev);
+		if (rc)
+			return rc;
+	}
+
+	pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
+
+	while (pos > 0) {
+		struct auxiliary_device *adev;
+		struct pci_doe_dev *new_dev;
+		int id;
+
+		new_dev = kzalloc(sizeof(*new_dev), GFP_KERNEL);
+		if (!new_dev)
+			return -ENOMEM;
+
+		new_dev->pdev = pdev;
+		new_dev->cap_offset = pos;
+
+		/* Set up struct auxiliary_device */
+		adev = &new_dev->adev;
+		id = ida_alloc(&cxl_doe_adev_ida, GFP_KERNEL);
+		if (id < 0) {
+			kfree(new_dev);
+			return -ENOMEM;
+		}
+
+		adev->id = id;
+		adev->name = DOE_DEV_NAME;
+		adev->dev.release = cxl_doe_dev_release;
+		adev->dev.parent = dev;
+
+		if (auxiliary_device_init(adev)) {
+			__doe_dev_release(adev);
+			return -EIO;
+		}
+
+		if (auxiliary_device_add(adev)) {
+			auxiliary_device_uninit(adev);
+			return -EIO;
+		}
+
+		rc = devm_add_action_or_reset(dev, cxl_destroy_doe_device, adev);
+		if (rc)
+			return rc;
+
+		if (device_attach(&adev->dev) != 1)
+			dev_err(&adev->dev,
+				"Failed to attach a driver to DOE device %d\n",
+				adev->id);
+
+		pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
+	}
+
+	return 0;
+}
+
 static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct cxl_register_map map;
@@ -517,6 +632,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
+	rc = cxl_setup_doe_devices(cxlds);
+	if (rc)
+		return rc;
+
 	cxlmd = devm_cxl_add_memdev(cxlds);
 	if (IS_ERR(cxlmd))
 		return PTR_ERR(cxlmd);
@@ -546,3 +665,4 @@ static struct pci_driver cxl_pci_driver = {
 MODULE_LICENSE("GPL v2");
 module_pci_driver(cxl_pci_driver);
 MODULE_IMPORT_NS(CXL);
+MODULE_SOFTDEP("pre: pci_doe");
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 4/5] cxl/mem: Add CDAT table reading from DOE
  2021-11-05 23:50 [PATCH 0/5] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (2 preceding siblings ...)
  2021-11-05 23:50 ` [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices ira.weiny
@ 2021-11-05 23:50 ` ira.weiny
  2021-11-08 13:21   ` Jonathan Cameron
                     ` (2 more replies)
  2021-11-05 23:50 ` [PATCH 5/5] cxl/cdat: Parse out DSMAS data from CDAT table ira.weiny
  4 siblings, 3 replies; 37+ messages in thread
From: ira.weiny @ 2021-11-05 23:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jonathan Cameron, Ira Weiny, Alison Schofield, Vishal Verma,
	Ben Widawsky, Bjorn Helgaas, linux-cxl, linux-pci

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Read the raw CDAT table data from the cxl_mem state object.  Currently
this is only supported by a PCI CXL object through a DOE mailbox which
supports CDAT, but any cxl_mem type object can provide this data later
if need be, for example for testing.

Cache this data for later parsing.  Provide a sysfs binary attribute to
allow dumping of the CDAT.

Binary dumping is modeled on /sys/firmware/acpi/tables/

The ability to dump this table will be very useful for emulation of
real devices once they become available, as QEMU CXL Type 3 device
emulation will be able to load this file in.

This does not support table updates at runtime. It will always provide
whatever was there when first cached. Handling of table updates can be
implemented later.

Once there are more users, this code can move out to drivers/cxl/cdat.c
or similar.

Finally create a complete list of DOE defines within cdat.h for anyone
wishing to decode the CDAT table.

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

---
Changes from V4:
	Split this into its own patch
	Rearchitect this such that the memdev driver calls into the DOE
	driver via the cxl_mem state object.  This allows CDAT data to
	come from any type of cxl_mem object, not just PCI DOE.
	Rebase on new struct cxl_dev_state
---
 drivers/cxl/cdat.h        | 81 +++++++++++++++++++++++++++++++++
 drivers/cxl/core/memdev.c | 46 +++++++++++++++++++
 drivers/cxl/cxl.h         |  7 +++
 drivers/cxl/cxlmem.h      | 25 +++++++++++
 drivers/cxl/pci.c         | 94 ++++++++++++++++++++++++++++++++++++++-
 5 files changed, 252 insertions(+), 1 deletion(-)
 create mode 100644 drivers/cxl/cdat.h

diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
new file mode 100644
index 000000000000..ee78eb822166
--- /dev/null
+++ b/drivers/cxl/cdat.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Coherent Device Attribute table (CDAT)
+ *
+ * Specification available from UEFI.org
+ *
+ * Whilst CDAT is defined as a single table, the access via DOE mailboxes is
+ * done one entry at a time, where the first entry is the header.
+ */
+
+#define CXL_DOE_TABLE_ACCESS_REQ_CODE		0x000000ff
+#define   CXL_DOE_TABLE_ACCESS_REQ_CODE_READ	0
+#define CXL_DOE_TABLE_ACCESS_TABLE_TYPE		0x0000ff00
+#define   CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA	0
+#define CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE	0xffff0000
+
+/*
+ * CDAT entries are little endian and are read from PCI config space which
+ * is also little endian.
+ * As such, on a big endian system these will have been reversed.
+ * This prevents us from making easy use of packed structures.
+ * Style follows pci_regs.h
+ */
+
+#define CDAT_HEADER_LENGTH_DW 4
+#define CDAT_HEADER_LENGTH_BYTES (CDAT_HEADER_LENGTH_DW * sizeof(u32))
+#define CDAT_HEADER_DW0_LENGTH		0xffffffff
+#define CDAT_HEADER_DW1_REVISION	0x000000ff
+#define CDAT_HEADER_DW1_CHECKSUM	0x0000ff00
+/* CDAT_HEADER_DW2_RESERVED	*/
+#define CDAT_HEADER_DW3_SEQUENCE	0xffffffff
+
+/* All structures have a common first DW */
+#define CDAT_STRUCTURE_DW0_TYPE		0x000000ff
+#define   CDAT_STRUCTURE_DW0_TYPE_DSMAS 0
+#define   CDAT_STRUCTURE_DW0_TYPE_DSLBIS 1
+#define   CDAT_STRUCTURE_DW0_TYPE_DSMSCIS 2
+#define   CDAT_STRUCTURE_DW0_TYPE_DSIS 3
+#define   CDAT_STRUCTURE_DW0_TYPE_DSEMTS 4
+#define   CDAT_STRUCTURE_DW0_TYPE_SSLBIS 5
+
+#define CDAT_STRUCTURE_DW0_LENGTH	0xffff0000
+
+/* Device Scoped Memory Affinity Structure */
+#define CDAT_DSMAS_DW1_DSMAD_HANDLE	0x000000ff
+#define CDAT_DSMAS_DW1_FLAGS		0x0000ff00
+#define CDAT_DSMAS_DPA_OFFSET(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSMAS_DPA_LEN(entry) ((u64)((entry)[5]) << 32 | (entry)[4])
+#define CDAT_DSMAS_NON_VOLATILE(flags)  ((flags & 0x04) >> 2)
+
+/* Device Scoped Latency and Bandwidth Information Structure */
+#define CDAT_DSLBIS_DW1_HANDLE		0x000000ff
+#define CDAT_DSLBIS_DW1_FLAGS		0x0000ff00
+#define CDAT_DSLBIS_DW1_DATA_TYPE	0x00ff0000
+#define CDAT_DSLBIS_BASE_UNIT(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSLBIS_DW4_ENTRY_0		0x0000ffff
+#define CDAT_DSLBIS_DW4_ENTRY_1		0xffff0000
+#define CDAT_DSLBIS_DW5_ENTRY_2		0x0000ffff
+
+/* Device Scoped Memory Side Cache Information Structure */
+#define CDAT_DSMSCIS_DW1_HANDLE		0x000000ff
+#define CDAT_DSMSCIS_MEMORY_SIDE_CACHE_SIZE(entry) \
+	((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSMSCIS_DW4_MEMORY_SIDE_CACHE_ATTRS 0xffffffff
+
+/* Device Scoped Initiator Structure */
+#define CDAT_DSIS_DW1_FLAGS		0x000000ff
+#define CDAT_DSIS_DW1_HANDLE		0x0000ff00
+
+/* Device Scoped EFI Memory Type Structure */
+#define CDAT_DSEMTS_DW1_HANDLE		0x000000ff
+#define CDAT_DSEMTS_DW1_EFI_MEMORY_TYPE_ATTR	0x0000ff00
+#define CDAT_DSEMTS_DPA_OFFSET(entry)	((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSEMTS_DPA_LENGTH(entry)	((u64)((entry)[5]) << 32 | (entry)[4])
+
+/* Switch Scoped Latency and Bandwidth Information Structure */
+#define CDAT_SSLBIS_DW1_DATA_TYPE	0x000000ff
+#define CDAT_SSLBIS_BASE_UNIT(entry)	((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_SSLBIS_ENTRY_PORT_X(entry, i) ((entry)[4 + (i) * 2] & 0x0000ffff)
+#define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
+#define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 5341b0ba99a7..c35de9e8298e 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -86,6 +86,35 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
 	return sysfs_emit(buf, "%#llx\n", len);
 }
 
+static ssize_t CDAT_read(struct file *filp, struct kobject *kobj,
+			 struct bin_attribute *bin_attr, char *buf,
+			 loff_t offset, size_t count)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+
+	if (!cxlmd->cdat_table)
+		return 0;
+
+	return memory_read_from_buffer(buf, count, &offset,
+				       cxlmd->cdat_table,
+				       cxlmd->cdat_length);
+}
+
+static BIN_ATTR_RO(CDAT, 0);
+
+static umode_t cxl_memdev_bin_attr_is_visible(struct kobject *kobj,
+					      struct bin_attribute *attr, int i)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+
+	if ((attr == &bin_attr_CDAT) && cxlmd->cdat_table)
+		return 0400;
+
+	return 0;
+}
+
 static struct device_attribute dev_attr_pmem_size =
 	__ATTR(size, 0444, pmem_size_show, NULL);
 
@@ -96,6 +125,11 @@ static struct attribute *cxl_memdev_attributes[] = {
 	NULL,
 };
 
+static struct bin_attribute *cxl_memdev_bin_attributes[] = {
+	&bin_attr_CDAT,
+	NULL,
+};
+
 static struct attribute *cxl_memdev_pmem_attributes[] = {
 	&dev_attr_pmem_size.attr,
 	NULL,
@@ -108,6 +142,8 @@ static struct attribute *cxl_memdev_ram_attributes[] = {
 
 static struct attribute_group cxl_memdev_attribute_group = {
 	.attrs = cxl_memdev_attributes,
+	.bin_attrs = cxl_memdev_bin_attributes,
+	.is_bin_visible = cxl_memdev_bin_attr_is_visible,
 };
 
 static struct attribute_group cxl_memdev_ram_attribute_group = {
@@ -293,6 +329,16 @@ devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
 	if (rc)
 		goto err;
 
+	/* Cache the data early to ensure is_visible() works */
+	if (!cxl_mem_cdat_get_length(cxlds, &cxlmd->cdat_length)) {
+		cxlmd->cdat_table = devm_kzalloc(dev, cxlmd->cdat_length, GFP_KERNEL);
+		if (!cxlmd->cdat_table) {
+			rc = -ENOMEM;
+			goto err;
+		}
+		cxl_mem_cdat_read_table(cxlds, cxlmd->cdat_table, cxlmd->cdat_length);
+	}
+
 	/*
 	 * Activate ioctl operations, no cxl_memdev_rwsem manipulation
 	 * needed as this is ordered with cdev_add() publishing the device.
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f1241a7f2b7b..f5dd38c6ce0f 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -88,6 +88,13 @@ static inline int cxl_hdm_decoder_count(u32 cap_hdr)
 #define CXL_DOE_PROTOCOL_COMPLIANCE 0
 #define CXL_DOE_PROTOCOL_TABLE_ACCESS 2
 
+/* Common to request and response */
+#define CXL_DOE_TABLE_ACCESS_3_CODE GENMASK(7, 0)
+#define   CXL_DOE_TABLE_ACCESS_3_CODE_READ 0
+#define CXL_DOE_TABLE_ACCESS_3_TYPE GENMASK(15, 8)
+#define   CXL_DOE_TABLE_ACCESS_3_TYPE_CDAT 0
+#define CXL_DOE_TABLE_ACCESS_3_ENTRY_HANDLE GENMASK(31, 16)
+
 #define CXL_COMPONENT_REGS() \
 	void __iomem *hdm_decoder
 
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 8d96d009ad90..f6c62cd537bb 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -34,12 +34,16 @@
  * @dev: driver core device object
  * @cdev: char dev core object for ioctl operations
  * @cxlds: The device state backing this device
+ * @cdat_table: cache of CDAT table
+ * @cdat_length: length of cached CDAT table
  * @id: id number of this memdev instance.
  */
 struct cxl_memdev {
 	struct device dev;
 	struct cdev cdev;
 	struct cxl_dev_state *cxlds;
+	void *cdat_table;
+	size_t cdat_length;
 	int id;
 };
 
@@ -97,6 +101,7 @@ struct cxl_mbox_cmd {
  * Currently only memory devices are represented.
  *
  * @dev: The device associated with this CXL state
+ * @cdat_doe: Auxiliary DOE device capable of reading CDAT
  * @regs: Parsed register blocks
  * @payload_size: Size of space for payload
  *                (CXL 2.0 8.2.8.4.3 Mailbox Capabilities Register)
@@ -117,6 +122,10 @@ struct cxl_mbox_cmd {
  * @next_volatile_bytes: volatile capacity change pending device reset
  * @next_persistent_bytes: persistent capacity change pending device reset
  * @mbox_send: @dev specific transport for transmitting mailbox commands
+ * @cdat_get_length: @dev specific function for reading the CDAT table length
+ *                   returns -errno if CDAT not supported on this device
+ * @cdat_read_table: @dev specific function for reading the table
+ *                   returns -errno if CDAT not supported on this device
  *
  * See section 8.2.9.5.2 Capacity Configuration and Label Storage for
  * details on capacity parameters.
@@ -124,6 +133,7 @@ struct cxl_mbox_cmd {
 struct cxl_dev_state {
 	struct device *dev;
 
+	struct pci_doe_dev *cdat_doe;
 	struct cxl_regs regs;
 
 	size_t payload_size;
@@ -146,6 +156,8 @@ struct cxl_dev_state {
 	u64 next_persistent_bytes;
 
 	int (*mbox_send)(struct cxl_dev_state *cxlds, struct cxl_mbox_cmd *cmd);
+	int (*cdat_get_length)(struct cxl_dev_state *cxlds, size_t *length);
+	int (*cdat_read_table)(struct cxl_dev_state *cxlds, u32 *data, size_t length);
 };
 
 enum cxl_opcode {
@@ -264,4 +276,17 @@ int cxl_mem_create_range_info(struct cxl_dev_state *cxlds);
 struct cxl_dev_state *cxl_dev_state_create(struct device *dev);
 void set_exclusive_cxl_commands(struct cxl_dev_state *cxlds, unsigned long *cmds);
 void clear_exclusive_cxl_commands(struct cxl_dev_state *cxlds, unsigned long *cmds);
+
+static inline int cxl_mem_cdat_get_length(struct cxl_dev_state *cxlds, size_t *length)
+{
+	if (cxlds->cdat_get_length)
+		return cxlds->cdat_get_length(cxlds, length);
+	return -EOPNOTSUPP;
+}
+static inline int cxl_mem_cdat_read_table(struct cxl_dev_state *cxlds, u32 *data, size_t length)
+{
+	if (cxlds->cdat_read_table)
+		return cxlds->cdat_read_table(cxlds, data, length);
+	return -EOPNOTSUPP;
+}
 #endif /* __CXL_MEM_H__ */
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index df524b74f1d2..086532a42480 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -11,6 +11,7 @@
 #include "cxlmem.h"
 #include "pci.h"
 #include "cxl.h"
+#include "cdat.h"
 
 /**
  * DOC: cxl pci
@@ -575,17 +576,106 @@ static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
 		if (rc)
 			return rc;
 
-		if (device_attach(&adev->dev) != 1)
+		if (device_attach(&adev->dev) != 1) {
 			dev_err(&adev->dev,
 				"Failed to attach a driver to DOE device %d\n",
 				adev->id);
+			goto next;
+		}
+
+		if (pci_doe_supports_prot(new_dev, PCI_DVSEC_VENDOR_ID_CXL,
+					  CXL_DOE_PROTOCOL_TABLE_ACCESS))
+			cxlds->cdat_doe = new_dev;
 
+next:
 		pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
 	}
 
 	return 0;
 }
 
+#define CDAT_DOE_REQ(entry_handle)					\
+	(FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE,			\
+		    CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) |		\
+	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_TABLE_TYPE,			\
+		    CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA) |		\
+	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE, (entry_handle)))
+
+static int cxl_cdat_get_length(struct cxl_dev_state *cxlds, size_t *length)
+{
+	struct pci_doe_dev *doe_dev = cxlds->cdat_doe;
+	u32 cdat_request_pl = CDAT_DOE_REQ(0);
+	u32 cdat_response_pl[32];
+	struct pci_doe_exchange ex = {
+		.prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
+		.prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
+		.request_pl = &cdat_request_pl,
+		.request_pl_sz = sizeof(cdat_request_pl),
+		.response_pl = cdat_response_pl,
+		.response_pl_sz = sizeof(cdat_response_pl),
+	};
+
+	ssize_t rc;
+
+	rc = pci_doe_exchange_sync(doe_dev, &ex);
+	if (rc < 0)
+		return rc;
+	if (rc < 1)
+		return -EIO;
+
+	*length = cdat_response_pl[1];
+	dev_dbg(cxlds->dev, "CDAT length %zu\n", *length);
+	return 0;
+}
+
+static int cxl_cdat_read_table(struct cxl_dev_state *cxlds, u32 *data, size_t length)
+{
+	struct pci_doe_dev *doe_dev = cxlds->cdat_doe;
+	int entry_handle = 0;
+	int rc;
+
+	do {
+		u32 cdat_request_pl = CDAT_DOE_REQ(entry_handle);
+		u32 cdat_response_pl[32];
+		struct pci_doe_exchange ex = {
+			.prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
+			.prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
+			.request_pl = &cdat_request_pl,
+			.request_pl_sz = sizeof(cdat_request_pl),
+			.response_pl = cdat_response_pl,
+			.response_pl_sz = sizeof(cdat_response_pl),
+		};
+		size_t entry_dw;
+		u32 *entry;
+
+		rc = pci_doe_exchange_sync(doe_dev, &ex);
+		if (rc < 0)
+			return rc;
+
+		entry = cdat_response_pl + 1;
+		entry_dw = rc / sizeof(u32);
+		/* Skip Header */
+		entry_dw -= 1;
+		entry_dw = min(length / 4, entry_dw);
+		memcpy(data, entry, entry_dw * sizeof(u32));
+		length -= entry_dw * sizeof(u32);
+		data += entry_dw;
+		entry_handle = FIELD_GET(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE, cdat_response_pl[0]);
+
+	} while (entry_handle != 0xFFFF);
+
+	return 0;
+}
+
+static void cxl_setup_cdat(struct cxl_dev_state *cxlds)
+{
+	if (!cxlds->cdat_doe)
+		return;
+
+	cxlds->cdat_get_length = cxl_cdat_get_length;
+	cxlds->cdat_read_table = cxl_cdat_read_table;
+}
+
 static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct cxl_register_map map;
@@ -636,6 +726,8 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
+	cxl_setup_cdat(cxlds);
+
 	cxlmd = devm_cxl_add_memdev(cxlds);
 	if (IS_ERR(cxlmd))
 		return PTR_ERR(cxlmd);
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 5/5] cxl/cdat: Parse out DSMAS data from CDAT table
  2021-11-05 23:50 [PATCH 0/5] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (3 preceding siblings ...)
  2021-11-05 23:50 ` [PATCH 4/5] cxl/mem: Add CDAT table reading from DOE ira.weiny
@ 2021-11-05 23:50 ` ira.weiny
  2021-11-08 14:52   ` Jonathan Cameron
                     ` (2 more replies)
  4 siblings, 3 replies; 37+ messages in thread
From: ira.weiny @ 2021-11-05 23:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ira Weiny, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, Jonathan Cameron, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

Parse and cache the DSMAS data from the CDAT table.  Store this data in
unmarshaled data structures for later use.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V4
	New patch
---
 drivers/cxl/core/memdev.c | 111 ++++++++++++++++++++++++++++++++++++++
 drivers/cxl/cxlmem.h      |  23 ++++++++
 2 files changed, 134 insertions(+)

diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index c35de9e8298e..e5a2d30a3491 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -6,6 +6,7 @@
 #include <linux/idr.h>
 #include <linux/pci.h>
 #include <cxlmem.h>
+#include "cdat.h"
 #include "core.h"
 
 static DECLARE_RWSEM(cxl_memdev_rwsem);
@@ -312,6 +313,112 @@ static const struct file_operations cxl_memdev_fops = {
 	.llseek = noop_llseek,
 };
 
+static bool cdat_hdr_valid(struct cxl_memdev *cxlmd)
+{
+	u32 *data = cxlmd->cdat_table;
+	u8 *data8 = (u8 *)data;
+	u32 length, seq;
+	u8 rev, cs;
+	u8 check;
+	int i;
+
+	length = FIELD_GET(CDAT_HEADER_DW0_LENGTH, data[0]);
+	if (length < CDAT_HEADER_LENGTH_BYTES)
+		return false;
+
+	rev = FIELD_GET(CDAT_HEADER_DW1_REVISION, data[1]);
+	cs = FIELD_GET(CDAT_HEADER_DW1_CHECKSUM, data[1]);
+	seq = FIELD_GET(CDAT_HEADER_DW3_SEQUENCE, data[3]);
+
+	/* Store the sequence for now. */
+	cxlmd->cdat_seq = seq;
+
+	for (check = 0, i = 0; i < length; i++)
+		check += data8[i];
+
+	return check == 0;
+}
+
+static int parse_dsmas(struct cxl_memdev *cxlmd)
+{
+	struct cxl_dsmas *dsmas_ary = NULL;
+	u32 *data = cxlmd->cdat_table;
+	int bytes_left = cxlmd->cdat_length;
+	int nr_dsmas = 0;
+	size_t dsmas_byte_size;
+	int rc = 0;
+
+	if (!data || !cdat_hdr_valid(cxlmd))
+		return -ENXIO;
+
+	/* Skip header */
+	data += CDAT_HEADER_LENGTH_DW;
+	bytes_left -= CDAT_HEADER_LENGTH_BYTES;
+
+	while (bytes_left > 0) {
+		u32 *cur_rec = data;
+		u8 type = FIELD_GET(CDAT_STRUCTURE_DW0_TYPE, cur_rec[0]);
+		u16 length = FIELD_GET(CDAT_STRUCTURE_DW0_LENGTH, cur_rec[0]);
+
+		if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {
+			struct cxl_dsmas *new_ary;
+			u8 flags;
+
+			new_ary = krealloc(dsmas_ary,
+					   sizeof(*dsmas_ary) * (nr_dsmas+1),
+					   GFP_KERNEL);
+			if (!new_ary) {
+				dev_err(&cxlmd->dev,
+					"Failed to allocate memory for DSMAS data\n");
+				rc = -ENOMEM;
+				goto free_dsmas;
+			}
+			dsmas_ary = new_ary;
+
+			flags = FIELD_GET(CDAT_DSMAS_DW1_FLAGS, cur_rec[1]);
+
+			dsmas_ary[nr_dsmas].dpa_base = CDAT_DSMAS_DPA_OFFSET(cur_rec);
+			dsmas_ary[nr_dsmas].dpa_length = CDAT_DSMAS_DPA_LEN(cur_rec);
+			dsmas_ary[nr_dsmas].non_volatile = CDAT_DSMAS_NON_VOLATILE(flags);
+
+			dev_dbg(&cxlmd->dev, "DSMAS %d: %llx:%llx %s\n",
+				nr_dsmas,
+				dsmas_ary[nr_dsmas].dpa_base,
+				dsmas_ary[nr_dsmas].dpa_base +
+					dsmas_ary[nr_dsmas].dpa_length,
+				(dsmas_ary[nr_dsmas].non_volatile ?
+					"Persistent" : "Volatile")
+				);
+
+			nr_dsmas++;
+		}
+
+		data += (length/sizeof(u32));
+		bytes_left -= length;
+	}
+
+	if (nr_dsmas == 0) {
+		rc = -ENXIO;
+		goto free_dsmas;
+	}
+
+	dev_dbg(&cxlmd->dev, "Found %d DSMAS entries\n", nr_dsmas);
+
+	dsmas_byte_size = sizeof(*dsmas_ary) * nr_dsmas;
+	cxlmd->dsmas_ary = devm_kzalloc(&cxlmd->dev, dsmas_byte_size, GFP_KERNEL);
+	if (!cxlmd->dsmas_ary) {
+		rc = -ENOMEM;
+		goto free_dsmas;
+	}
+
+	memcpy(cxlmd->dsmas_ary, dsmas_ary, dsmas_byte_size);
+	cxlmd->nr_dsmas = nr_dsmas;
+
+free_dsmas:
+	kfree(dsmas_ary);
+	return rc;
+}
+
 struct cxl_memdev *
 devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
 {
@@ -339,6 +446,10 @@ devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
 		cxl_mem_cdat_read_table(cxlds, cxlmd->cdat_table, cxlmd->cdat_length);
 	}
 
+	rc = parse_dsmas(cxlmd);
+	if (rc)
+		dev_err(dev, "No DSMAS data found: %d\n", rc);
+
 	/*
 	 * Activate ioctl operations, no cxl_memdev_rwsem manipulation
 	 * needed as this is ordered with cdev_add() publishing the device.
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index f6c62cd537bb..d68da2610265 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -29,6 +29,23 @@
 	(FIELD_GET(CXLMDEV_RESET_NEEDED_MASK, status) !=                       \
 	 CXLMDEV_RESET_NEEDED_NOT)
 
+/**
+ * struct cxl_dsmas - host unmarshaled version of DSMAS data
+ *
+ * As defined in the Coherent Device Attribute Table (CDAT) specification this
+ * represents a single DSMAS entry in that table.
+ *
+ * @dpa_base: The lowest DPA address associated with this DSMAD
+ * @dpa_length: Length in bytes of this DSMAD
+ * @non_volatile: If set, the memory region represents Non-Volatile memory
+ */
+struct cxl_dsmas {
+	u64 dpa_base;
+	u64 dpa_length;
+	/* Flags */
+	unsigned int non_volatile:1;
+};
+
 /**
  * struct cxl_memdev - CXL bus object representing a Type-3 Memory Device
  * @dev: driver core device object
@@ -36,6 +53,9 @@
  * @cxlds: The device state backing this device
  * @cdat_table: cache of CDAT table
  * @cdat_length: length of cached CDAT table
+ * @cdat_seq: Last read Sequence number of the CDAT table
+ * @dsmas_ary: Array of DSMAS entries as parsed from the CDAT table
+ * @nr_dsmas: Number of entries in dsmas_ary
  * @id: id number of this memdev instance.
  */
 struct cxl_memdev {
@@ -44,6 +64,9 @@ struct cxl_memdev {
 	struct cxl_dev_state *cxlds;
 	void *cdat_table;
 	size_t cdat_length;
+	u32 cdat_seq;
+	struct cxl_dsmas *dsmas_ary;
+	int nr_dsmas;
 	int id;
 };
 
-- 
2.28.0.rc0.12.gb6a658bd00c9


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver
  2021-11-05 23:50 ` [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver ira.weiny
@ 2021-11-08 12:15   ` Jonathan Cameron
  2021-11-10  5:45     ` Ira Weiny
  2021-11-16 23:48   ` Bjorn Helgaas
  1 sibling, 1 reply; 37+ messages in thread
From: Jonathan Cameron @ 2021-11-08 12:15 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Fri, 5 Nov 2021 16:50:53 -0700
<ira.weiny@intel.com> wrote:

> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Introduced in a PCI ECN [1], DOE provides a config space based mailbox
> with standard protocol discovery.  Each mailbox is accessed through a
> DOE Extended Capability.
> 
> Define an auxiliary device driver which controls DOE auxiliary devices
> registered on the auxiliary bus.
> 
> A DOE mailbox is allowed to support any number of protocols while some
> DOE protocol specifications apply additional restrictions.
> 
> The protocols supported are queried and cached.  pci_doe_supports_prot()
> can be used to determine if the DOE device supports the protocol
> specified.
> 
> A synchronous interface is provided in pci_doe_exchange_sync() to
> perform a single query / response exchange from the driver through the
> device specified.
> 
> Testing was conducted against QEMU using:
> 
> https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/
> 
> This code is based on Jonathan's V4 series here:
> 
> https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/
> 
> [1] https://members.pcisig.com/wg/PCI-SIG/document/14143
>     Data Object Exchange (DOE) - Approved 12 March 2020
> 
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Hi Ira,

Thanks for taking this on!

I'm sure at least half the comments below are about things I wrote
then forgot about. I'm not sure if it's a good thing but I've ignored
this for long enough I'm almost reviewing it as fresh code :(

I was carrying a local patch for the interrupt handler, having
figured out I'd misread the spec.  Note that I've since concluded
my local patch has its own issues (it was unnecessarily complex),
so I've made some suggestions below that I'm fairly sure
fix things up.  Note these paths are hard to test and require adding
some fiddly state machines to QEMU to open up race windows...

> 
> ---
> Changes from Jonathan's V4
> 	Move the DOE MB code into the DOE auxiliary driver
> 	Remove Task List in favor of a wait queue
> 
> Changes from Ben
> 	remove CXL references
> 	propagate rc from pci functions on error

...


> diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> new file mode 100644
> index 000000000000..2e702fdc7879
> --- /dev/null
> +++ b/drivers/pci/doe.c
> @@ -0,0 +1,701 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Data Object Exchange ECN
> + * https://members.pcisig.com/wg/PCI-SIG/document/14143
> + *
> + * Copyright (C) 2021 Huawei
> + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> + */
> +
> +#include <linux/bitfield.h>
> +#include <linux/delay.h>
> +#include <linux/jiffies.h>
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pci-doe.h>
> +#include <linux/workqueue.h>
> +#include <linux/module.h>
> +
> +#define PCI_DOE_PROTOCOL_DISCOVERY 0
> +
> +#define PCI_DOE_BUSY_MAX_RETRIES 16
> +#define PCI_DOE_POLL_INTERVAL (HZ / 128)
> +
> +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */
> +#define PCI_DOE_TIMEOUT HZ
> +
> +enum pci_doe_state {
> +	DOE_IDLE,
> +	DOE_WAIT_RESP,
> +	DOE_WAIT_ABORT,
> +	DOE_WAIT_ABORT_ON_ERR,
> +};
> +
> +/*

/**

Given it's in kernel-doc syntax, we might as well mark it as such.

> + * struct pci_doe_task - description of a query / response task
> + * @ex: The details of the task to be done
> + * @rv: Return value.  Length of received response or error
> + * @cb: Callback for completion of task
> + * @private: Private data passed to callback on completion
> + */
> +struct pci_doe_task {
> +	struct pci_doe_exchange *ex;
> +	int rv;
> +	void (*cb)(void *private);
> +	void *private;
> +};
> +
> +/**
> + * struct pci_doe - A single DOE mailbox driver
> + *
> + * @doe_dev: The DOE Auxiliary device being driven
> + * @abort_c: Completion used for initial abort handling
> + * @irq: Interrupt used for signaling DOE ready or abort
> + * @irq_name: Name used to identify the irq for a particular DOE
> + * @prots: Array of identifiers for protocols supported
> + * @num_prots: Size of prots array
> + * @cur_task: Current task the state machine is working on
> + * @wq: Wait queue to wait on if a query is in progress
> + * @state_lock: Protect the state of cur_task, abort, and dead
> + * @statemachine: Work item for the DOE state machine
> + * @state: Current state of this DOE
> + * @timeout_jiffies: 1 second after GO set
> + * @busy_retries: Count of retry attempts
> + * @abort: Request a manual abort (e.g. on init)
> + * @dead: Used to mark a DOE for which an ABORT has timed out. Further messages
> + *        will immediately be aborted with error
> + */
> +struct pci_doe {
> +	struct pci_doe_dev *doe_dev;
> +	struct completion abort_c;
> +	int irq;
> +	char *irq_name;
> +	struct pci_doe_protocol *prots;
> +	int num_prots;
> +
> +	struct pci_doe_task *cur_task;
> +	wait_queue_head_t wq;
> +	struct mutex state_lock;
> +	struct delayed_work statemachine;
> +	enum pci_doe_state state;
> +	unsigned long timeout_jiffies;
> +	unsigned int busy_retries;
> +	unsigned int abort:1;
> +	unsigned int dead:1;
> +};
> +
> +static irqreturn_t pci_doe_irq(int irq, void *data)

I was carrying a rework of this locally because I managed
to convince myself this is wrong.  It's been a while and naturally
I didn't write a comprehensive set of notes on why it was wrong...
(Note you can't trigger the problem paths in QEMU without some
nasty hacks as it relies on opening up race windows that make
limited sense for the QEMU implementation).

It's all centered on some details of exactly what causes an interrupt
on a DOE.  Section 6.xx.3 Interrupt Generation states:

If enabled, an interrupt message must be triggered every time the
logical AND of the following conditions transitions from FALSE to TRUE:

* The associated vector is unmasked ...
* The value of the DOE interrupt enable bit is 1b
* The value of the DOE interrupt status bit is 1b
(only the last one really matters to us I think).

The interrupt status bit is an OR condition: it must be set when the
Data Object Ready bit is set, the DOE Error bit is set, or the DOE Busy
bit is cleared.

> +{
> +	struct pci_doe *doe = data;
> +	struct pci_dev *pdev = doe->doe_dev->pdev;
> +	int offset = doe->doe_dev->cap_offset;
> +	u32 val;
> +
> +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {

So this bit is set on any of: BUSY dropped, READY or ERROR.
If it's set on BUSY drop, but then in between the read above and this clear
READY becomes true, then my reading is that we will not get another interrupt.
That is fine because we will read it again in the state machine and see the
new state. We could do more of the dance in the interrupt handler by doing
a reread after clearing INT_STATUS, but I think it's cleaner to leave
it in the state machine.

It might look nicer here to only write BIT(1) - RW1C, but that doesn't matter as
all the rest of the register is RO.

> +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val);
> +		mod_delayed_work(system_wq, &doe->statemachine, 0);
> +		return IRQ_HANDLED;
> +	}
> +	/* Leave the error case to be handled outside IRQ */
> +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {

I don't think we can get here, because INT_STATUS is already true by this
point. So this check should be done before the general one above, to avoid
clearing the interrupt (we don't want more interrupts during the abort,
though I'd hope the hardware wouldn't generate them).

So move this before the previous check.

> +		mod_delayed_work(system_wq, &doe->statemachine, 0);
> +		return IRQ_HANDLED;
> +	}
> +
> +	/*
> +	 * Busy being cleared can result in an interrupt, but as
> +	 * the original Busy may not have been detected, there is no
> +	 * way to separate such an interrupt from a spurious interrupt.
> +	 */

This is misleading, as a Busy bit clear would have resulted in INT_STATUS
being true above (that was a misread of the spec on my part in v4).
So I don't think we can get here via any valid path.

return IRQ_NONE; should be safe.


> +	return IRQ_HANDLED;
> +}

Summary of above suggested changes:
1) Move the DOE_STATUS_ERROR block before the DOE_STATUS_INT_STATUS one
2) Possibly use
   pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, PCI_DOE_STATUS_INT_STATUS);
   to be explicit about the write-one-to-clear bit.
3) IRQ_NONE for the final return path as I'm fairly sure there is no valid route to that.
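To make the suggestion concrete, here is a standalone sketch of that
ordering (config-space access and work scheduling are stubbed out with
plain variables; all names and bit values here are illustrative, not the
actual driver code):

```c
#include <stdint.h>

/* Illustrative stand-ins for the DOE Status register bits */
#define DOE_STATUS_ERROR      (1u << 2)
#define DOE_STATUS_INT_STATUS (1u << 1)

#define IRQ_NONE    0
#define IRQ_HANDLED 1

static uint32_t fake_status;    /* stands in for a pci_read_config_dword() */
static int statemachine_kicks;  /* counts mod_delayed_work() calls */

/* Suggested ordering: check ERROR first, then clear only INT_STATUS (RW1C) */
static int doe_irq_sketch(void)
{
	uint32_t val = fake_status;

	if (val & DOE_STATUS_ERROR) {
		/* leave INT_STATUS set; abort handling happens in the state machine */
		statemachine_kicks++;
		return IRQ_HANDLED;
	}
	if (val & DOE_STATUS_INT_STATUS) {
		/* write one to clear only the interrupt status bit */
		fake_status &= ~DOE_STATUS_INT_STATUS;
		statemachine_kicks++;
		return IRQ_HANDLED;
	}
	/* no valid route here, so treat as a spurious interrupt */
	return IRQ_NONE;
}
```

The key property is that an error interrupt is acknowledged by the abort
path rather than by the handler clearing INT_STATUS.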
   
...

> +
> +static void pci_doe_task_complete(void *private)
> +{
> +	complete(private);
> +}

I wonder why this is up here? I'd move it down to just above the _sync()
function where it's used. This one was definitely one of mine :)

> +
> +static void doe_statemachine_work(struct work_struct *work)

I developed an interesting "relationship" with this state machine during
the original development ;)  I've just walked the paths and convinced
myself it works so all good.

> +{
> +	struct delayed_work *w = to_delayed_work(work);
> +	struct pci_doe *doe = container_of(w, struct pci_doe, statemachine);
> +	struct pci_dev *pdev = doe->doe_dev->pdev;
> +	int offset = doe->doe_dev->cap_offset;
> +	struct pci_doe_task *task;
> +	bool abort;
> +	u32 val;
> +	int rc;
> +
> +	mutex_lock(&doe->state_lock);
> +	task = doe->cur_task;
> +	abort = doe->abort;
> +	doe->abort = false;
> +	mutex_unlock(&doe->state_lock);
> +
> +	if (abort) {
> +		/*
> +		 * Currently only used during init - care needed if
> +		 * pci_doe_abort() is generally exposed as it would impact
> +		 * queries in flight.
> +		 */
> +		WARN_ON(task);
> +		doe->state = DOE_WAIT_ABORT;
> +		pci_doe_abort_start(doe);
> +		return;
> +	}
> +
> +	switch (doe->state) {
> +	case DOE_IDLE:
> +		if (task == NULL)
> +			return;
> +
> +		/* Nothing currently in flight so queue a task */
> +		rc = pci_doe_send_req(doe, task->ex);
> +		/*
> +		 * The specification does not provide any guidance on how long
> +		 * some other entity could keep the DOE busy, so try for 1
> +		 * second then fail. Busy handling is best effort only, because
> +		 * there is no way of avoiding racing against another user of
> +		 * the DOE.
> +		 */
> +		if (rc == -EBUSY) {
> +			doe->busy_retries++;
> +			if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
> +				/* Long enough, fail this request */
> +				pci_WARN(pdev, true, "DOE busy for too long\n");
> +				doe->busy_retries = 0;
> +				goto err_busy;
> +			}
> +			schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES);
> +			return;
> +		}
> +		if (rc)
> +			goto err_abort;
> +		doe->busy_retries = 0;
> +
> +		doe->state = DOE_WAIT_RESP;
> +		doe->timeout_jiffies = jiffies + HZ;
> +		/* Now poll or wait for IRQ with timeout */
> +		if (doe->irq > 0)
> +			schedule_delayed_work(w, PCI_DOE_TIMEOUT);
> +		else
> +			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> +		return;
> +
> +	case DOE_WAIT_RESP:
> +		/* Not possible to get here with NULL task */
> +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +		if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> +			rc = -EIO;
> +			goto err_abort;
> +		}
> +
> +		if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
> +			/* If not yet at timeout reschedule otherwise abort */
> +			if (time_after(jiffies, doe->timeout_jiffies)) {
> +				rc = -ETIMEDOUT;
> +				goto err_abort;
> +			}
> +			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> +			return;
> +		}
> +
> +		rc  = pci_doe_recv_resp(doe, task->ex);
> +		if (rc < 0)
> +			goto err_abort;
> +
> +		doe->state = DOE_IDLE;
> +
> +		mutex_lock(&doe->state_lock);
> +		doe->cur_task = NULL;
> +		mutex_unlock(&doe->state_lock);
> +		wake_up_interruptible(&doe->wq);
> +
> +		/* Set the return value to the length of received payload */
> +		task->rv = rc;
> +		task->cb(task->private);
> +
> +		return;
> +
> +	case DOE_WAIT_ABORT:
> +	case DOE_WAIT_ABORT_ON_ERR:
> +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +
> +		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> +		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> +			/* Back to normal state - carry on */
> +			mutex_lock(&doe->state_lock);
> +			doe->cur_task = NULL;
> +			mutex_unlock(&doe->state_lock);
> +			wake_up_interruptible(&doe->wq);
> +
> +			/*
> +			 * For deliberately triggered abort, someone is
> +			 * waiting.
> +			 */
> +			if (doe->state == DOE_WAIT_ABORT)
> +				complete(&doe->abort_c);
> +
> +			doe->state = DOE_IDLE;
> +			return;
> +		}
> +		if (time_after(jiffies, doe->timeout_jiffies)) {
> +			/* Task has timed out and is dead - abort */
> +			pci_err(pdev, "DOE ABORT timed out\n");
> +			mutex_lock(&doe->state_lock);
> +			doe->dead = true;
> +			doe->cur_task = NULL;
> +			mutex_unlock(&doe->state_lock);
> +			wake_up_interruptible(&doe->wq);
> +
> +			if (doe->state == DOE_WAIT_ABORT)
> +				complete(&doe->abort_c);
> +		}
> +		return;
> +	}
> +
> +err_abort:
> +	doe->state = DOE_WAIT_ABORT_ON_ERR;
> +	pci_doe_abort_start(doe);
> +err_busy:
> +	task->rv = rc;
> +	task->cb(task->private);
> +	/* If here via err_busy, signal the task done. */
> +	if (doe->state == DOE_IDLE) {
> +		mutex_lock(&doe->state_lock);
> +		doe->cur_task = NULL;
> +		mutex_unlock(&doe->state_lock);
> +		wake_up_interruptible(&doe->wq);
> +	}
> +}
> +
> +/**
> + * pci_doe_exchange_sync() - Send a request, then wait for and receive a response
> + * @doe: DOE mailbox state structure
> + * @ex: Description of the buffers and Vendor ID + type used in this
> + *      request/response pair
> + *
> + * Excess data will be discarded.
> + *
> + * RETURNS: payload in bytes on success, < 0 on error
> + */
> +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex)
> +{
> +	struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> +	struct pci_doe_task task;
> +	DECLARE_COMPLETION_ONSTACK(c);
> +
> +	if (!doe)
> +		return -EAGAIN;
> +
> +	/* DOE requests must be a whole number of DW */
> +	if (ex->request_pl_sz % sizeof(u32))
> +		return -EINVAL;
> +
> +	task.ex = ex;
> +	task.cb = pci_doe_task_complete;
> +	task.private = &c;
> +
> +again:

Hmm.   Whether having this code at this layer makes sense hinges on
whether we want to easily support async use of the DOE in future.

In v4 some of the async handling had ended up in this function and
should probably have been factored out to give us a 
'queue up work' then 'wait for completion' sequence.

Given there is now more to be done in here perhaps we need to think
about such a separation to keep it clear that this is fundamentally
a synchronous wrapper around an asynchronous operation.
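To illustrate the separation, a userspace-runnable sketch of a
'submit task' / 'wait for completion' split (the kernel completion,
waitqueue and locking are replaced by plain flags, and all names are
hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical async core: a task carries its own completion callback. */
struct doe_task {
	int rv;
	void (*cb)(void *private);
	void *private;
};

static struct doe_task *cur_task;

/* Async half: queue the task for the state machine. */
static int doe_submit_task(struct doe_task *task)
{
	if (cur_task)
		return -1;      /* would wait on doe->wq in the real code */
	cur_task = task;
	return 0;
}

/* Stands in for the state machine completing the current task. */
static void doe_statemachine_step(int result)
{
	struct doe_task *task = cur_task;

	cur_task = NULL;
	task->rv = result;
	task->cb(task->private);
}

static void doe_task_complete(void *private)
{
	*(int *)private = 1;    /* complete(&c) in the real code */
}

/* Sync half: just submit, then wait for the callback to fire. */
static int doe_exchange_sync_sketch(void)
{
	int done = 0;
	struct doe_task task = { .cb = doe_task_complete, .private = &done };

	if (doe_submit_task(&task))
		return -1;
	doe_statemachine_step(16);  /* wait_for_completion(), here inlined */
	assert(done);
	return task.rv;
}
```

An async user would then call the submit half directly with its own
callback, without touching the sync wrapper.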

> +	mutex_lock(&doe->state_lock);
> +	if (doe->cur_task) {
> +		mutex_unlock(&doe->state_lock);
> +		wait_event_interruptible(doe->wq, doe->cur_task == NULL);
> +		goto again;
> +	}
> +
> +	if (doe->dead) {
> +		mutex_unlock(&doe->state_lock);
> +		return -EIO;
> +	}
> +	doe->cur_task = &task;
> +	schedule_delayed_work(&doe->statemachine, 0);
> +	mutex_unlock(&doe->state_lock);
> +
> +	wait_for_completion(&c);
> +
> +	return task.rv;
> +}
> +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync);
> +
> +/**
> + * pci_doe_supports_prot() - Return if the DOE instance supports the given protocol
> + * @pdev: Device on which to find the DOE instance
> + * @vid: Protocol Vendor ID
> + * @type: protocol type
> + *
> + * This device can then be passed to pci_doe_exchange_sync() to execute a mailbox
> + * exchange through that DOE mailbox.
> + *
> + * RETURNS: True if the DOE device supports the protocol specified
> + */
> +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type)
> +{
> +	struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> +	int i;
> +
> +	if (!doe)
> +		return false;

How would this happen?  I don't think it can...  Probably
false paranoia from me...

> +
> +	for (i = 0; i < doe->num_prots; i++)
> +		if ((doe->prots[i].vid == vid) &&
> +		    (doe->prots[i].type == type))
> +			return true;
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(pci_doe_supports_prot);

...

> +static void pci_doe_release_irq(struct pci_doe *doe)
> +{
> +	if (doe->irq > 0)
> +		free_irq(doe->irq, doe);

Is this trivial wrapper worth bothering with?  Maybe just
put the code inline?

> +}
> +

...

> +
> +static void pci_doe_unregister(struct pci_doe *doe)
> +{
> +	pci_doe_release_irq(doe);
> +	kfree(doe->irq_name);
> +	put_device(&doe->doe_dev->pdev->dev);

This makes me wonder if we should be doing the get_device()
earlier in probe?  Limited harm in moving it to near the start
and then ending up with it being 'obviously' correct...

> +}
> +
> +/*
> + * pci_doe_probe() - Set up the Mailbox
> + * @aux_dev: Auxiliary Device
> + * @id: Auxiliary device ID
> + *
> + * Probe the mailbox found for all protocols and set up the Mailbox
> + *
> + * RETURNS: 0 on success, < 0 on error
> + */
> +static int pci_doe_probe(struct auxiliary_device *aux_dev,
> +			 const struct auxiliary_device_id *id)
> +{
> +	struct pci_doe_dev *doe_dev = container_of(aux_dev,
> +					struct pci_doe_dev,
> +					adev);
> +	struct pci_doe *doe;
> +	int rc;
> +
> +	doe = kzalloc(sizeof(*doe), GFP_KERNEL);

Could go devm_ for this I think, though may not be worthwhile.

> +	if (!doe)
> +		return -ENOMEM;
> +
> +	mutex_init(&doe->state_lock);
> +	init_completion(&doe->abort_c);
> +	doe->doe_dev = doe_dev;
> +	init_waitqueue_head(&doe->wq);
> +	INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work);
> +	dev_set_drvdata(&aux_dev->dev, doe);
> +
> +	rc = pci_doe_register(doe);
> +	if (rc)
> +		goto err_free;
> +
> +	rc = pci_doe_cache_protocols(doe);
> +	if (rc) {
> +		pci_doe_unregister(doe);

Mixture of different forms of error handling here.
I'd move this below and add an err_unregister label.
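i.e. roughly this shape — a standalone sketch with the real helpers
stubbed out so the unwind order is visible (the stub names and error
values are mine, not the driver's):

```c
#include <stdlib.h>

/* Stubs standing in for the real helpers, so unwind order can be shown. */
static int register_ok, cache_ok;
static int unregistered;

static int pci_doe_register_stub(void)        { return register_ok ? 0 : -5; }
static int pci_doe_cache_protocols_stub(void) { return cache_ok ? 0 : -5; }
static void pci_doe_unregister_stub(void)     { unregistered = 1; }

/* One style of error handling: every failure path exits via a label. */
static int doe_probe_sketch(void)
{
	void *doe = calloc(1, 64);
	int rc;

	if (!doe)
		return -12;     /* -ENOMEM */

	rc = pci_doe_register_stub();
	if (rc)
		goto err_free;

	rc = pci_doe_cache_protocols_stub();
	if (rc)
		goto err_unregister;

	free(doe);              /* kept alive in the real driver, freed here only
				   so the sketch doesn't leak */
	return 0;

err_unregister:
	pci_doe_unregister_stub();
err_free:
	free(doe);
	return rc;
}
```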

> +		goto err_free;
> +	}
> +
> +	return 0;
> +
> +err_free:
> +	kfree(doe);
> +	return rc;
> +}
> +
> +static void pci_doe_remove(struct auxiliary_device *aux_dev)
> +{
> +	struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev);
> +
> +	/* First halt the state machine */
> +	cancel_delayed_work_sync(&doe->statemachine);
> +	kfree(doe->prots);

Logical flow to me is unregister first, free protocols second
(to reverse what we do in probe)

> +	pci_doe_unregister(doe);
> +	kfree(doe);
> +}
> +
> +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = {
> +	{.name = "cxl_pci.doe", },

I'd like to hear from Bjorn on whether registering this from the CXL
device is the right approach or if we should perhaps just do it directly from
somewhere in PCI. (really applies to patch 3) I'll talk more about this there.

> +	{},
> +};
> +
> +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table);
> +
> +struct auxiliary_driver pci_doe_auxiliary_drv = {
> +	.name = "pci_doe_drv",

I would assume this is only used in contexts where the _drv is
obvious?  I would go with "pci_doe".

> +	.id_table = pci_doe_auxiliary_id_table,
> +	.probe = pci_doe_probe,
> +	.remove = pci_doe_remove
> +};
> +
> +static int __init pci_doe_init_module(void)
> +{
> +	int ret;
> +
> +	ret = auxiliary_driver_register(&pci_doe_auxiliary_drv);
> +	if (ret) {
> +		pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n",
> +		       ret);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void __exit pci_doe_exit_module(void)
> +{
> +	auxiliary_driver_unregister(&pci_doe_auxiliary_drv);
> +}
> +
> +module_init(pci_doe_init_module);
> +module_exit(pci_doe_exit_module);

Seems like the auxiliary bus would benefit from a
module_auxiliary_driver() macro to cover this simple registration stuff
similar to module_i2c_driver() etc.

Mind you, looking at 5.15 this would be the only user, so maybe one
for the 'next' case on the basis that two instances prove it's 'common' ;)

> +MODULE_LICENSE("GPL v2");
> diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
> new file mode 100644
> index 000000000000..8380b7ad33d4
> --- /dev/null
> +++ b/include/linux/pci-doe.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec.
> + *
> + * Copyright (C) 2021 Huawei
> + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> + */
> +
> +#include <linux/completion.h>
> +#include <linux/list.h>
> +#include <linux/mutex.h>

Not used in this header that I can see, so push down to the c files.

> +#include <linux/auxiliary_bus.h>
> +
> +#ifndef LINUX_PCI_DOE_H
> +#define LINUX_PCI_DOE_H
> +
> +#define DOE_DEV_NAME "doe"

Not sure this is used?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices
  2021-11-05 23:50 ` [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices ira.weiny
@ 2021-11-08 13:09   ` Jonathan Cameron
  2021-11-11  1:31     ` Ira Weiny
  2021-11-16 23:48   ` Bjorn Helgaas
  1 sibling, 1 reply; 37+ messages in thread
From: Jonathan Cameron @ 2021-11-08 13:09 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Fri, 5 Nov 2021 16:50:54 -0700
<ira.weiny@intel.com> wrote:

> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> CXL devices have DOE mailboxes.  Create auxiliary devices which can be
> driven by the generic DOE auxiliary driver.

I'd like Bjorn's input on the balance here between what is done
in cxl/pci.c and what should be in the PCI core code somewhere.

The tricky bit preventing this being done entirely as part of 
PCI device instantiation is the interrupts.

> 
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Mostly new code, so not sure I should really be listed on this
one but I don't mind either way.

A few comments inline but overall this ended up nice and clean.

> 
> ---
> Changes from V4:
> 	Make this an Auxiliary Driver rather than library functions
> 	Split this out into it's own patch
> 	Base on the new cxl_dev_state structure
> 
> Changes from Ben
> 	s/CXL_DOE_DEV_NAME/DOE_DEV_NAME/
> ---
>  drivers/cxl/Kconfig |   1 +
>  drivers/cxl/cxl.h   |  13 +++++
>  drivers/cxl/pci.c   | 120 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 134 insertions(+)
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 67c91378f2dd..9d53720bea07 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -16,6 +16,7 @@ if CXL_BUS
>  config CXL_MEM
>  	tristate "CXL.mem: Memory Devices"
>  	default CXL_BUS
> +	select PCI_DOE_DRIVER
>  	help
>  	  The CXL.mem protocol allows a device to act as a provider of
>  	  "System RAM" and/or "Persistent Memory" that is fully coherent
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 5e2e93451928..f1241a7f2b7b 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -75,6 +75,19 @@ static inline int cxl_hdm_decoder_count(u32 cap_hdr)
>  #define CXLDEV_MBOX_BG_CMD_STATUS_OFFSET 0x18
>  #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
>  
> +/*
> + * Address space properties derived from:
> + * CXL 2.0 8.2.5.12.7 CXL HDM Decoder 0 Control Register
> + */
> +#define CXL_ADDRSPACE_RAM   BIT(0)
> +#define CXL_ADDRSPACE_PMEM  BIT(1)
> +#define CXL_ADDRSPACE_TYPE2 BIT(2)
> +#define CXL_ADDRSPACE_TYPE3 BIT(3)
> +#define CXL_ADDRSPACE_MASK  GENMASK(3, 0)

Stray.

> +
> +#define CXL_DOE_PROTOCOL_COMPLIANCE 0
> +#define CXL_DOE_PROTOCOL_TABLE_ACCESS 2
> +
>  #define CXL_COMPONENT_REGS() \
>  	void __iomem *hdm_decoder
>  
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 8dc91fd3396a..df524b74f1d2 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -6,6 +6,7 @@
>  #include <linux/mutex.h>
>  #include <linux/list.h>
>  #include <linux/pci.h>
> +#include <linux/pci-doe.h>
>  #include <linux/io.h>
>  #include "cxlmem.h"
>  #include "pci.h"
> @@ -471,6 +472,120 @@ static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
>  	return rc;
>  }
>  
> +static void cxl_mem_free_irq_vectors(void *data)
> +{
> +	pci_free_irq_vectors(data);
> +}
> +
> +static void cxl_destroy_doe_device(void *ad)
> +{
> +	struct auxiliary_device *adev = ad;
Local variable doesn't add anything, just pass it directly
into the functions as a void *.

> +
> +	auxiliary_device_delete(adev);
> +	auxiliary_device_uninit(adev);

Both needed?  These are just wrappers around
put_device() and device_del()

Normally after device_add() succeeded we only ever call device_del(),
as per the docs for device_add():
https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L3277

> +}
> +
> +static DEFINE_IDA(cxl_doe_adev_ida);
> +static void __doe_dev_release(struct auxiliary_device *adev)
> +{
> +	struct pci_doe_dev *doe_dev = container_of(adev, struct pci_doe_dev,
> +						   adev);
> +
> +	ida_free(&cxl_doe_adev_ida, adev->id);
> +	kfree(doe_dev);
> +}
> +
> +static void cxl_doe_dev_release(struct device *dev)
> +{
> +	struct auxiliary_device *adev = container_of(dev,
> +						struct auxiliary_device,
> +						dev);
> +	__doe_dev_release(adev);
> +}
> +
> +static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)

Pass in the struct device, or maybe even the struct pci_dev as
nothing in here is using the cxl_dev_state.

> +{
> +	struct device *dev = cxlds->dev;
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	int irqs, rc;
> +	u16 pos = 0;
> +
> +	/*
> +	 * An implementation of a cxl type3 device may support an unknown
> +	 * number of interrupts. Assume that number is not that large and
> +	 * request them all.
> +	 */
> +	irqs = pci_msix_vec_count(pdev);
> +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);
> +	if (rc != irqs) {
> +		/* No interrupt available - carry on */
> +		dev_dbg(dev, "No interrupts available for DOE\n");
> +	} else {
> +		/*
> +		 * Enabling bus mastering could be done within the DOE
> +		 * initialization, but as it potentially has other impacts
> +		 * keep it within the driver.
> +		 */
> +		pci_set_master(pdev);
> +		rc = devm_add_action_or_reset(dev,
> +					      cxl_mem_free_irq_vectors,
> +					      pdev);
> +		if (rc)
> +			return rc;
> +	}
> +

Above here is driver specific...
Everything from here on is generic, so perhaps move it to the PCI core?
Alternatively wait until we have users that aren't CXL.

> +	pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
> +
> +	while (pos > 0) {
> +		struct auxiliary_device *adev;
> +		struct pci_doe_dev *new_dev;
> +		int id;
> +
> +		new_dev = kzalloc(sizeof(*new_dev), GFP_KERNEL);
> +		if (!new_dev)
> +			return -ENOMEM;
> +
> +		new_dev->pdev = pdev;
> +		new_dev->cap_offset = pos;
> +
> +		/* Set up struct auxiliary_device */
> +		adev = &new_dev->adev;
> +		id = ida_alloc(&cxl_doe_adev_ida, GFP_KERNEL);
> +		if (id < 0) {
> +			kfree(new_dev);
> +			return -ENOMEM;
> +		}
> +
> +		adev->id = id;
> +		adev->name = DOE_DEV_NAME;
> +		adev->dev.release = cxl_doe_dev_release;
> +		adev->dev.parent = dev;
> +
> +		if (auxiliary_device_init(adev)) {
> +			__doe_dev_release(adev);
> +			return -EIO;
> +		}
> +
> +		if (auxiliary_device_add(adev)) {
> +			auxiliary_device_uninit(adev);
> +			return -EIO;
> +		}
> +
> +		rc = devm_add_action_or_reset(dev, cxl_destroy_doe_device, adev);
> +		if (rc)
> +			return rc;
> +
> +		if (device_attach(&adev->dev) != 1)
> +			dev_err(&adev->dev,
> +				"Failed to attach a driver to DOE device %d\n",
> +				adev->id);

I wondered about this and how it would happen.
Given there is only a soft dependency between the drivers it's possible, but
error or info?  I'd go with dev_info().  If it were an error I'd bail out and
use deferred probing to try again when it will succeed.

> +
> +		pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
> +	}
> +
> +	return 0;
> +}
> +
>  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  {
>  	struct cxl_register_map map;
> @@ -517,6 +632,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>  
> +	rc = cxl_setup_doe_devices(cxlds);
> +	if (rc)
> +		return rc;
> +
>  	cxlmd = devm_cxl_add_memdev(cxlds);
>  	if (IS_ERR(cxlmd))
>  		return PTR_ERR(cxlmd);
> @@ -546,3 +665,4 @@ static struct pci_driver cxl_pci_driver = {
>  MODULE_LICENSE("GPL v2");
>  module_pci_driver(cxl_pci_driver);
>  MODULE_IMPORT_NS(CXL);
> +MODULE_SOFTDEP("pre: pci_doe");



* Re: [PATCH 4/5] cxl/mem: Add CDAT table reading from DOE
  2021-11-05 23:50 ` [PATCH 4/5] cxl/mem: Add CDAT table reading from DOE ira.weiny
@ 2021-11-08 13:21   ` Jonathan Cameron
  2021-11-08 23:19     ` Ira Weiny
  2021-11-08 15:02   ` Jonathan Cameron
  2021-11-19 14:40   ` Jonathan Cameron
  2 siblings, 1 reply; 37+ messages in thread
From: Jonathan Cameron @ 2021-11-08 13:21 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Fri, 5 Nov 2021 16:50:55 -0700
<ira.weiny@intel.com> wrote:

> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Read CDAT raw table data from the cxl_mem state object.  Currently this
> is only supported by a PCI CXL object through a DOE mailbox which supports
> CDAT.  But any cxl_mem type object can provide this data later if need
> be.  For example for testing.
> 
> Cache this data for later parsing.  Provide a sysfs binary attribute to
> allow dumping of the CDAT.
> 
> Binary dumping is modeled on /sys/firmware/ACPI/tables/
> 
> The ability to dump this table will be very useful for emulation of real
> devices once they become available as QEMU CXL type 3 device emulation will
> be able to load this file in.
> 
> This does not support table updates at runtime. It will always provide
> whatever was there when first cached. Handling of table updates can be
> implemented later.
> 
> Once there are more users, this code can move out to driver/cxl/cdat.c
> or similar.
> 
> Finally create a complete list of DOE defines within cdat.h for anyone
> wishing to decode the CDAT table.
> 
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Hi Ira,

A few other comments inline, some of which are really updates of
earlier comments now I see how it is fitting together.

Jonathan

> 
> ---
> Changes from V4:
> 	Split this into it's own patch
> 	Rearchitect this such that the memdev driver calls into the DOE
> 	driver via the cxl_mem state object.  This allows CDAT data to
> 	come from any type of cxl_mem object not just PCI DOE.

Ah.  Is this to allow mocking? Or is there another architected source
of this information that I've missed?

> 	Rebase on new struct cxl_dev_state
> ---

...

> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index df524b74f1d2..086532a42480 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -11,6 +11,7 @@
>  #include "cxlmem.h"
>  #include "pci.h"
>  #include "cxl.h"
> +#include "cdat.h"
>  
>  /**
>   * DOC: cxl pci
> @@ -575,17 +576,106 @@ static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
>  		if (rc)
>  			return rc;
>  
> -		if (device_attach(&adev->dev) != 1)
> +		if (device_attach(&adev->dev) != 1) {
>  			dev_err(&adev->dev,
>  				"Failed to attach a driver to DOE device %d\n",
>  				adev->id);
> +			goto next;
> +		}
> +
> +		if (pci_doe_supports_prot(new_dev, PCI_DVSEC_VENDOR_ID_CXL,
> +					  CXL_DOE_PROTOCOL_TABLE_ACCESS))
> +			cxlds->cdat_doe = new_dev;

Ah.  If we did try to make this block generic, we'd then need a lookup
function to call after the generic part.  I guess it is getting more
complex, so maybe not having it generic is the right choice for now.

Also, this explains why you passed cxlds in.  So ignore that comment on
the previous.

>  
> +next:
>  		pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
>  	}
>  
>  	return 0;
>  }
>  
> +#define CDAT_DOE_REQ(entry_handle)					\
> +	(FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE,			\
> +		    CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) |		\
> +	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_TABLE_TYPE,			\
> +		    CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA) |		\
> +	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE, (entry_handle)))
> +
> +static int cxl_cdat_get_length(struct cxl_dev_state *cxlds, size_t *length)
> +{
> +	struct pci_doe_dev *doe_dev = cxlds->cdat_doe;
> +	u32 cdat_request_pl = CDAT_DOE_REQ(0);
> +	u32 cdat_response_pl[32];
> +	struct pci_doe_exchange ex = {
> +		.prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
> +		.prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
> +		.request_pl = &cdat_request_pl,
> +		.request_pl_sz = sizeof(cdat_request_pl),
> +		.response_pl = cdat_response_pl,
> +		.response_pl_sz = sizeof(cdat_response_pl),
> +	};
> +
> +	ssize_t rc;
> +
> +	rc = pci_doe_exchange_sync(doe_dev, &ex);
> +	if (rc < 0)
> +		return rc;
> +	if (rc < 1)
> +		return -EIO;
> +
> +	*length = cdat_response_pl[1];
> +	dev_dbg(cxlds->dev, "CDAT length %zu\n", *length);

Probably not useful any more... 

> +	return 0;
> +}
> +



* Re: [PATCH 5/5] cxl/cdat: Parse out DSMAS data from CDAT table
  2021-11-05 23:50 ` [PATCH 5/5] cxl/cdat: Parse out DSMAS data from CDAT table ira.weiny
@ 2021-11-08 14:52   ` Jonathan Cameron
  2021-11-11  3:58     ` Ira Weiny
  2021-11-18 17:02   ` Jonathan Cameron
  2021-11-19 14:55   ` Jonathan Cameron
  2 siblings, 1 reply; 37+ messages in thread
From: Jonathan Cameron @ 2021-11-08 14:52 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Fri, 5 Nov 2021 16:50:56 -0700
<ira.weiny@intel.com> wrote:

> From: Ira Weiny <ira.weiny@intel.com>
> 
> Parse and cache the DSMAS data from the CDAT table.  Store this data in
> Unmarshaled data structures for use later.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

A few minor comments inline.  In particular I think we need to decide
whether failure to parse is an error or not.  Right now it's reported as an
error but then we carry on anyway.

Jonathan

> 
> ---
> Changes from V4
> 	New patch
> ---
>  drivers/cxl/core/memdev.c | 111 ++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/cxlmem.h      |  23 ++++++++
>  2 files changed, 134 insertions(+)
> 
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index c35de9e8298e..e5a2d30a3491 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -6,6 +6,7 @@

...

> +
> +static int parse_dsmas(struct cxl_memdev *cxlmd)
> +{
> +	struct cxl_dsmas *dsmas_ary = NULL;
> +	u32 *data = cxlmd->cdat_table;
> +	int bytes_left = cxlmd->cdat_length;
> +	int nr_dsmas = 0;
> +	size_t dsmas_byte_size;
> +	int rc = 0;
> +
> +	if (!data || !cdat_hdr_valid(cxlmd))

If that's invalid, the right answer might be to run it again, as we probably
just raced with an update...  Perhaps try it a couple of times before
failing hard?
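A bounded retry along these lines (the retry count and helper name are
hypothetical, and the re-read is stubbed out) would cover the
racing-update case:

```c
#define CDAT_RETRIES 3

static int valid_after;   /* test knob: header becomes valid after N reads */

/* Stands in for re-reading the table and running cdat_hdr_valid() on it. */
static int read_and_validate_cdat(void)
{
	if (valid_after > 0) {
		valid_after--;
		return 0;     /* invalid: checksum raced with an update */
	}
	return 1;
}

static int parse_dsmas_retry_sketch(void)
{
	int i;

	for (i = 0; i < CDAT_RETRIES; i++)
		if (read_and_validate_cdat())
			return 0;     /* header stable; go on and parse */
	return -6;                    /* -ENXIO: now fail hard */
}
```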

> +		return -ENXIO;
> +
> +	/* Skip header */
> +	data += CDAT_HEADER_LENGTH_DW;
> +	bytes_left -= CDAT_HEADER_LENGTH_BYTES;
> +
> +	while (bytes_left > 0) {
> +		u32 *cur_rec = data;
> +		u8 type = FIELD_GET(CDAT_STRUCTURE_DW0_TYPE, cur_rec[0]);
> +		u16 length = FIELD_GET(CDAT_STRUCTURE_DW0_LENGTH, cur_rec[0]);
> +
> +		if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {
> +			struct cxl_dsmas *new_ary;
> +			u8 flags;
> +
> +			new_ary = krealloc(dsmas_ary,
> +					   sizeof(*dsmas_ary) * (nr_dsmas+1),

Spaces around the +

You could do this with devm_krealloc() and then just assign it at the end
rather than allocate a new one and copy.


> +					   GFP_KERNEL);
> +			if (!new_ary) {
> +				dev_err(&cxlmd->dev,
> +					"Failed to allocate memory for DSMAS data\n");
> +				rc = -ENOMEM;
> +				goto free_dsmas;
> +			}
> +			dsmas_ary = new_ary;
> +
> +			flags = FIELD_GET(CDAT_DSMAS_DW1_FLAGS, cur_rec[1]);
> +
> +			dsmas_ary[nr_dsmas].dpa_base = CDAT_DSMAS_DPA_OFFSET(cur_rec);
> +			dsmas_ary[nr_dsmas].dpa_length = CDAT_DSMAS_DPA_LEN(cur_rec);
> +			dsmas_ary[nr_dsmas].non_volatile = CDAT_DSMAS_NON_VOLATILE(flags);
> +
> +			dev_dbg(&cxlmd->dev, "DSMAS %d: %llx:%llx %s\n",
> +				nr_dsmas,
> +				dsmas_ary[nr_dsmas].dpa_base,
> +				dsmas_ary[nr_dsmas].dpa_base +
> +					dsmas_ary[nr_dsmas].dpa_length,
> +				(dsmas_ary[nr_dsmas].non_volatile ?
> +					"Persistent" : "Volatile")
> +				);
> +
> +			nr_dsmas++;
> +		}
> +
> +		data += (length/sizeof(u32));

spaces around /

> +		bytes_left -= length;
> +	}
> +
> +	if (nr_dsmas == 0) {
> +		rc = -ENXIO;
> +		goto free_dsmas;
> +	}
> +
> +	dev_dbg(&cxlmd->dev, "Found %d DSMAS entries\n", nr_dsmas);
> +
> +	dsmas_byte_size = sizeof(*dsmas_ary) * nr_dsmas;
> +	cxlmd->dsmas_ary = devm_kzalloc(&cxlmd->dev, dsmas_byte_size, GFP_KERNEL);

As above, you could have done a devm_krealloc() and then just assigned here.
Side effect of that being that direct returns should be fine.  However, that relies on
treating an error from this function as an error that will result in failures below.


> +	if (!cxlmd->dsmas_ary) {
> +		rc = -ENOMEM;
> +		goto free_dsmas;
> +	}
> +
> +	memcpy(cxlmd->dsmas_ary, dsmas_ary, dsmas_byte_size);
> +	cxlmd->nr_dsmas = nr_dsmas;
> +
> +free_dsmas:
> +	kfree(dsmas_ary);
> +	return rc;
> +}
> +
>  struct cxl_memdev *
>  devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
>  {
> @@ -339,6 +446,10 @@ devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
>  		cxl_mem_cdat_read_table(cxlds, cxlmd->cdat_table, cxlmd->cdat_length);
>  	}
>  
> +	rc = parse_dsmas(cxlmd);
> +	if (rc)
> +		dev_err(dev, "No DSMAS data found: %d\n", rc);

dev_info() maybe as it's not being treated as an error?

However, I think it should be treated as an error.  It's a device failure if
we can't parse this (and the table access protocol is available).

If it turns out we need to quirk some devices, then fair enough.



> +
>  	/*
>  	 * Activate ioctl operations, no cxl_memdev_rwsem manipulation
>  	 * needed as this is ordered with cdev_add() publishing the device.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/5] cxl/mem: Add CDAT table reading from DOE
  2021-11-05 23:50 ` [PATCH 4/5] cxl/mem: Add CDAT table reading from DOE ira.weiny
  2021-11-08 13:21   ` Jonathan Cameron
@ 2021-11-08 15:02   ` Jonathan Cameron
  2021-11-08 22:25     ` Ira Weiny
  2021-11-19 14:40   ` Jonathan Cameron
  2 siblings, 1 reply; 37+ messages in thread
From: Jonathan Cameron @ 2021-11-08 15:02 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Fri, 5 Nov 2021 16:50:55 -0700
<ira.weiny@intel.com> wrote:

> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Read CDAT raw table data from the cxl_mem state object.  Currently this
> is only supported by a PCI CXL object through a DOE mailbox which supports
> CDAT.  But any cxl_mem type object can provide this data later if need
> be.  For example for testing.
> 
> Cache this data for later parsing.  Provide a sysfs binary attribute to
> allow dumping of the CDAT.
> 
> Binary dumping is modeled on /sys/firmware/ACPI/tables/
> 
> The ability to dump this table will be very useful for emulation of real
> devices once they become available as QEMU CXL type 3 device emulation will
> be able to load this file in.
> 
> This does not support table updates at runtime. It will always provide
> whatever was there when first cached. Handling of table updates can be
> implemented later.
> 
> Once there are more users, this code can move out to driver/cxl/cdat.c
> or similar.
> 
> Finally create a complete list of DOE defines within cdat.h for anyone
> wishing to decode the CDAT table.
> 
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

A few more things came to mind whilst reading the rest of the series. In particular
lifetime management for the doe structures.

>   * DOC: cxl pci
> @@ -575,17 +576,106 @@ static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
>  		if (rc)
>  			return rc;
>  
> -		if (device_attach(&adev->dev) != 1)
> +		if (device_attach(&adev->dev) != 1) {
>  			dev_err(&adev->dev,
>  				"Failed to attach a driver to DOE device %d\n",
>  				adev->id);
> +			goto next;
> +		}
> +
> +		if (pci_doe_supports_prot(new_dev, PCI_DVSEC_VENDOR_ID_CXL,
> +					  CXL_DOE_PROTOCOL_TABLE_ACCESS))
> +			cxlds->cdat_doe = new_dev;

I'm probably missing something, but what prevents new_dev from going away after
this assignment?  Perhaps a force unbind or driver removal.  Should we get a
reference?

Also it's possible we'll have multiple CDAT supporting DOEs so
I'd suggest checking if cxlds->cdat_doe is already set before setting it.

We could break out of the loop early, but I want to bolt the CMA doe detection
in there so I'd rather we didn't.  This is all subject to whether we attempt
to generalize this support and move it over to the PCI side of things.

>  
> +next:
>  		pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
>  	}
>  
>  	return 0;
>  }
>  


* Re: [PATCH 4/5] cxl/mem: Add CDAT table reading from DOE
  2021-11-08 15:02   ` Jonathan Cameron
@ 2021-11-08 22:25     ` Ira Weiny
  2021-11-09 11:09       ` Jonathan Cameron
  0 siblings, 1 reply; 37+ messages in thread
From: Ira Weiny @ 2021-11-08 22:25 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Mon, Nov 08, 2021 at 03:02:36PM +0000, Jonathan Cameron wrote:
> On Fri, 5 Nov 2021 16:50:55 -0700
> <ira.weiny@intel.com> wrote:
> 
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 

[snip]

> 
> A few more things came to mind whilst reading the rest of the series. In particular
> lifetime management for the doe structures.

Thanks for the review.  I'm just working through all the comments, so I'm
somewhat working backwards.

> 
> >   * DOC: cxl pci
> > @@ -575,17 +576,106 @@ static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
> >  		if (rc)
> >  			return rc;
> >  
> > -		if (device_attach(&adev->dev) != 1)
> > +		if (device_attach(&adev->dev) != 1) {
> >  			dev_err(&adev->dev,
> >  				"Failed to attach a driver to DOE device %d\n",
> >  				adev->id);
> > +			goto next;
> > +		}
> > +
> > +		if (pci_doe_supports_prot(new_dev, PCI_DVSEC_VENDOR_ID_CXL,
> > +					  CXL_DOE_PROTOCOL_TABLE_ACCESS))
> > +			cxlds->cdat_doe = new_dev;
> 
> I'm probably missing something, but what prevents new_dev from going away after
> this assignment?
> Perhaps a force unbind or driver removal.  Should we get a
> reference?

I had a get_device() here at one point but took it out...  Because I was
thinking that new_dev's lifetime was equal to cxlds because cxlds 'owned' the
DOE devices.  However this is totally not true.  And there is a race between
the device going away and cxlds going away which could be a problem.

> 
> Also it's possible we'll have multiple CDAT supporting DOEs so
> I'd suggest checking if cxlds->cdata_doe is already set before setting it.

Sure.

> 
> We could break out of the loop early, but I want to bolt the CMA doe detection
> in there so I'd rather we didn't.  This is all subject to whether we attempt
> to generalize this support and move it over to the PCI side of things.

I'm not 100% sure about moving it to the PCI side but it does make some sense
because really the auxiliary devices are only bounded by the PCI device being
available.  None of the CXL stuff needs to exist for the DOE driver to talk to
the device but the pdev does need to be there...  :-/

This is all part of what drove the cxl_mem rename because that structure was
really confusing me.  Dan got me straightened out but I did not revisit this
series after that.  Now off the top of my head I'm not sure that cxlds needs to
be involved in the auxiliary device creation.  OTOH I was making it a central
place for in kernel users to know where/how to get information from DOE
mailboxes.  Hence caching which of these devices had CDAT capability.[1]

Since you seem to have arrived at this conclusion before me where in the PCI
code do you think is appropriate for this?

Ira

[1] I'm not really sure what is going to happen if multiple DOE boxes have CDAT
capability.  This seems like a recipe for confusion.

> 
> >  
> > +next:
> >  		pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
> >  	}
> >  
> >  	return 0;
> >  }
> >  


* Re: [PATCH 4/5] cxl/mem: Add CDAT table reading from DOE
  2021-11-08 13:21   ` Jonathan Cameron
@ 2021-11-08 23:19     ` Ira Weiny
  0 siblings, 0 replies; 37+ messages in thread
From: Ira Weiny @ 2021-11-08 23:19 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Mon, Nov 08, 2021 at 01:21:51PM +0000, Jonathan Cameron wrote:
> On Fri, 5 Nov 2021 16:50:55 -0700
> <ira.weiny@intel.com> wrote:
> 
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 
> > Read CDAT raw table data from the cxl_mem state object.  Currently this
> > is only supported by a PCI CXL object through a DOE mailbox which supports
> > CDAT.  But any cxl_mem type object can provide this data later if need
> > be.  For example for testing.
> > 
> > Cache this data for later parsing.  Provide a sysfs binary attribute to
> > allow dumping of the CDAT.
> > 
> > Binary dumping is modeled on /sys/firmware/ACPI/tables/
> > 
> > The ability to dump this table will be very useful for emulation of real
> > devices once they become available as QEMU CXL type 3 device emulation will
> > be able to load this file in.
> > 
> > This does not support table updates at runtime. It will always provide
> > whatever was there when first cached. Handling of table updates can be
> > implemented later.
> > 
> > Once there are more users, this code can move out to driver/cxl/cdat.c
> > or similar.
> > 
> > Finally create a complete list of DOE defines within cdat.h for anyone
> > wishing to decode the CDAT table.
> > 
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Hi Ira,
> 
> A few other comments inline, some of which are really updates of
> earlier comments now I see how it is fitting together.
> 
> Jonathan
> 
> > 
> > ---
> > Changes from V4:
> > 	Split this into it's own patch
> > 	Rearchitect this such that the memdev driver calls into the DOE
> > 	driver via the cxl_mem state object.  This allows CDAT data to
> > 	come from any type of cxl_mem object not just PCI DOE.
> 
> Ah.  Is this to allow mocking? Or is there another architected source
> of this information that I've missed?

Right now, yes, as the testing stuff could mock this.  But I did not plumb that
up yet.

> 
> > 	Rebase on new struct cxl_dev_state
> > ---
> 
> ...
> 
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index df524b74f1d2..086532a42480 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -11,6 +11,7 @@
> >  #include "cxlmem.h"
> >  #include "pci.h"
> >  #include "cxl.h"
> > +#include "cdat.h"
> >  
> >  /**
> >   * DOC: cxl pci
> > @@ -575,17 +576,106 @@ static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
> >  		if (rc)
> >  			return rc;
> >  
> > -		if (device_attach(&adev->dev) != 1)
> > +		if (device_attach(&adev->dev) != 1) {
> >  			dev_err(&adev->dev,
> >  				"Failed to attach a driver to DOE device %d\n",
> >  				adev->id);
> > +			goto next;
> > +		}
> > +
> > +		if (pci_doe_supports_prot(new_dev, PCI_DVSEC_VENDOR_ID_CXL,
> > +					  CXL_DOE_PROTOCOL_TABLE_ACCESS))
> > +			cxlds->cdat_doe = new_dev;
> 
> Ah. If we did try to make this block generic, we'd then need a lookup
> function to call after the generic part.  I guess it is getting more
> complex so maybe not having it generic is the right choice for now.

It is complex and I'm still concerned about getting the driver attached such
that this works.  But the MODULE_SOFTDEP should take care of this.

> 
> Also, this explains why you passed cxlds in.  So ignore that comment on
> the previous.

Yea.

> 
> >  
> > +next:
> >  		pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
> >  	}
> >  
> >  	return 0;
> >  }
> >  
> > +#define CDAT_DOE_REQ(entry_handle)					\
> > +	(FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE,			\
> > +		    CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) |		\
> > +	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_TABLE_TYPE,			\
> > +		    CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA) |		\
> > +	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE, (entry_handle)))
> > +
> > +static int cxl_cdat_get_length(struct cxl_dev_state *cxlds, size_t *length)
> > +{
> > +	struct pci_doe_dev *doe_dev = cxlds->cdat_doe;
> > +	u32 cdat_request_pl = CDAT_DOE_REQ(0);
> > +	u32 cdat_response_pl[32];
> > +	struct pci_doe_exchange ex = {
> > +		.prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
> > +		.prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
> > +		.request_pl = &cdat_request_pl,
> > +		.request_pl_sz = sizeof(cdat_request_pl),
> > +		.response_pl = cdat_response_pl,
> > +		.response_pl_sz = sizeof(cdat_response_pl),
> > +	};
> > +
> > +	ssize_t rc;
> > +
> > +	rc = pci_doe_exchange_sync(doe_dev, &ex);
> > +	if (rc < 0)
> > +		return rc;
> > +	if (rc < 1)
> > +		return -EIO;
> > +
> > +	*length = cdat_response_pl[1];
> > +	dev_dbg(cxlds->dev, "CDAT length %zu\n", *length);
> 
> Probably not useful any more... 

yea. removed.

Ira

> 
> > +	return 0;
> > +}
> > +
> 


* Re: [PATCH 4/5] cxl/mem: Add CDAT table reading from DOE
  2021-11-08 22:25     ` Ira Weiny
@ 2021-11-09 11:09       ` Jonathan Cameron
  0 siblings, 0 replies; 37+ messages in thread
From: Jonathan Cameron @ 2021-11-09 11:09 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

...

> > We could break out of the loop early, but I want to bolt the CMA doe detection
> > in there so I'd rather we didn't.  This is all subject to whether we attempt
> > to generalize this support and move it over to the PCI side of things.  
> 
> I'm not 100% sure about moving it to the PCI side but it does make some sense
> because really the auxiliary devices are only bounded by the PCI device being
> available.  None of the CXL stuff needs to exist for the DOE driver to talk to
> the device but the pdev does need to be there...  :-/

This will become more relevant with CMA etc on top of this series as that
is not CXL specific, so definitely shouldn't live in here.

> 
> This is all part of what drove the cxl_mem rename because that structure was
> really confusing me.  Dan got me straightened out but I did not revisit this
> series after that.  Now off the top of my head I'm not sure that cxlds needs to
> be involved in the auxiliary device creation.  OTOH I was making it a central
> place for in kernel users to know where/how to get information from DOE
> mailboxes.  Hence caching which of these devices had CDAT capability.[1]
Caching a particular instance makes sense (with a reference taken).

I'd expect something similar to the divide between
pci_alloc_irq_vectors() which enumerates them in the pci core, and
actually getting for a particular instance with request_irq()

So maybe
pci_alloc_doe_instances() which adds the auxiliary devices to the bus.

and

pci_doe_get(vendor_id, protocol_id);
with the _get() implemented using auxiliary_find_device() with an
appropriate match function.


> 
> Since you seem to have arrived at this conclusion before me where in the PCI
> code do you think is appropriate for this?

I'm not sure to be honest.  Given the dependency on MSI/MSIX it may be that the best
we can do is to provide some utility functions for the auxiliary device
creation and then every driver for devices with a DOE would need to call
them manually.  As this isn't dependent on the DOE driver, it would need
to be tied to the PCI core rather than that, possibly stubbed if
PCI_DOE isn't built.

> 
> Ira
> 
> [1] I'm not really sure what is going to happen if multiple DOE boxes have CDAT
> capability.  This seems like a recipe for confusion.

They will all report the same thing so just use the first one.
I can't really think why someone would do this deliberately but I can conceive of
people deciding to support multiple because they have a sneaky firmware running
somewhere and they want to avoid mediating between that and the OS. Mind you
that needs something to indicate to the OS which one it is, which is still
an open problem.

Jonathan

> 
> >   
> > >  
> > > +next:
> > >  		pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
> > >  	}
> > >  
> > >  	return 0;
> > >  }
> > >    



* Re: [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver
  2021-11-08 12:15   ` Jonathan Cameron
@ 2021-11-10  5:45     ` Ira Weiny
  2021-11-18 18:48       ` Jonathan Cameron
  0 siblings, 1 reply; 37+ messages in thread
From: Ira Weiny @ 2021-11-10  5:45 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Mon, Nov 08, 2021 at 12:15:46PM +0000, Jonathan Cameron wrote:
> On Fri, 5 Nov 2021 16:50:53 -0700
> <ira.weiny@intel.com> wrote:
> 
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 
> > Introduced in a PCI ECN [1], DOE provides a config space based mailbox
> > with standard protocol discovery.  Each mailbox is accessed through a
> > DOE Extended Capability.
> > 
> > Define an auxiliary device driver which control DOE auxiliary devices
> > registered on the auxiliary bus.
> > 
> > A DOE mailbox is allowed to support any number of protocols while some
> > DOE protocol specifications apply additional restrictions.
> > 
> > The protocols supported are queried and cached.  pci_doe_supports_prot()
> > can be used to determine if the DOE device supports the protocol
> > specified.
> > 
> > A synchronous interface is provided in pci_doe_exchange_sync() to
> > perform a single query / response exchange from the driver through the
> > device specified.
> > 
> > Testing was conducted against QEMU using:
> > 
> > https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/
> > 
> > This code is based on Jonathan's V4 series here:
> > 
> > https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/
> > 
> > [1] https://members.pcisig.com/wg/PCI-SIG/document/14143
> >     Data Object Exchange (DOE) - Approved 12 March 2020
> > 
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Hi Ira,
> 
> Thanks for taking this on!

NP I'm just sorry I'm so slow to get it moving.

> 
> I'm sure at least half the comments below are about things I wrote
> then forgot about. I'm not sure if it's a good thing but I've ignored
> this for long enough I'm almost reviewing it as fresh code :(
> 
> I was carrying a local patch for the interrupt handler having 
> figured out I'd misread the spec.   Note that I've since concluded
> my local patch has its own issues (it was unnecessarily complex)
> so I've made some suggestions below that I'm fairly sure
> fix things up.  Note these paths are hard to test and require adding
> some fiddly state machines to QEMU to open up race windows...
> 
> > 
> > ---
> > Changes from Jonathan's V4
> > 	Move the DOE MB code into the DOE auxiliary driver
> > 	Remove Task List in favor of a wait queue
> > 
> > Changes from Ben
> > 	remove CXL references
> > 	propagate rc from pci functions on error
> 
> ...
> 
> 
> > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> > new file mode 100644
> > index 000000000000..2e702fdc7879
> > --- /dev/null
> > +++ b/drivers/pci/doe.c
> > @@ -0,0 +1,701 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Data Object Exchange ECN
> > + * https://members.pcisig.com/wg/PCI-SIG/document/14143
> > + *
> > + * Copyright (C) 2021 Huawei
> > + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > + */
> > +
> > +#include <linux/bitfield.h>
> > +#include <linux/delay.h>
> > +#include <linux/jiffies.h>
> > +#include <linux/list.h>
> > +#include <linux/mutex.h>
> > +#include <linux/pci.h>
> > +#include <linux/pci-doe.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/module.h>
> > +
> > +#define PCI_DOE_PROTOCOL_DISCOVERY 0
> > +
> > +#define PCI_DOE_BUSY_MAX_RETRIES 16
> > +#define PCI_DOE_POLL_INTERVAL (HZ / 128)
> > +
> > +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */
> > +#define PCI_DOE_TIMEOUT HZ
> > +
> > +enum pci_doe_state {
> > +	DOE_IDLE,
> > +	DOE_WAIT_RESP,
> > +	DOE_WAIT_ABORT,
> > +	DOE_WAIT_ABORT_ON_ERR,
> > +};
> > +
> > +/*
> 
> /**
> 
> Given it's in kernel-doc syntax, we might as well mark it as such.

Yep done.

> 
> > + * struct pci_doe_task - description of a query / response task
> > + * @ex: The details of the task to be done
> > + * @rv: Return value.  Length of received response or error
> > + * @cb: Callback for completion of task
> > + * @private: Private data passed to callback on completion
> > + */
> > +struct pci_doe_task {
> > +	struct pci_doe_exchange *ex;
> > +	int rv;
> > +	void (*cb)(void *private);
> > +	void *private;
> > +};
> > +
> > +/**
> > + * struct pci_doe - A single DOE mailbox driver
> > + *
> > + * @doe_dev: The DOE Auxiliary device being driven
> > + * @abort_c: Completion used for initial abort handling
> > + * @irq: Interrupt used for signaling DOE ready or abort
> > + * @irq_name: Name used to identify the irq for a particular DOE
> > + * @prots: Array of identifiers for protocols supported
> > + * @num_prots: Size of prots array
> > + * @cur_task: Current task the state machine is working on
> > + * @wq: Wait queue to wait on if a query is in progress
> > + * @state_lock: Protect the state of cur_task, abort, and dead
> > + * @statemachine: Work item for the DOE state machine
> > + * @state: Current state of this DOE
> > + * @timeout_jiffies: 1 second after GO set
> > + * @busy_retries: Count of retry attempts
> > + * @abort: Request a manual abort (e.g. on init)
> > + * @dead: Used to mark a DOE for which an ABORT has timed out. Further messages
> > + *        will immediately be aborted with error
> > + */
> > +struct pci_doe {
> > +	struct pci_doe_dev *doe_dev;
> > +	struct completion abort_c;
> > +	int irq;
> > +	char *irq_name;
> > +	struct pci_doe_protocol *prots;
> > +	int num_prots;
> > +
> > +	struct pci_doe_task *cur_task;
> > +	wait_queue_head_t wq;
> > +	struct mutex state_lock;
> > +	struct delayed_work statemachine;
> > +	enum pci_doe_state state;
> > +	unsigned long timeout_jiffies;
> > +	unsigned int busy_retries;
> > +	unsigned int abort:1;
> > +	unsigned int dead:1;
> > +};
> > +
> > +static irqreturn_t pci_doe_irq(int irq, void *data)
> 
> I was carrying a rework of this locally because I managed
> to convince myself this is wrong.  It's been a while and naturally
> I didn't write a comprehensive set of notes on why it was wrong...
> (Note you can't trigger the problem paths in QEMU without some
> nasty hacks as it relies on opening up race windows that make
> limited sense for the QEMU implementation).
> 
> It's all centered on some details of exactly what causes an interrupt
> on a DOE.  Section 6.xx.3 Interrupt Generation states:
> 
> If enabled, an interrupt message must be triggered every time the
> logical AND of the following conditions transitions from FALSE to TRUE:
> 
> * The associated vector is unmasked ...
> * The value of the DOE interrupt enable bit is 1b
> * The value of the DOE interrupt status bit is 1b
> (only the last one really matters to us, I think).
> 
> The interrupt status bit is an OR conditional.
> 
> Must be set if any of: Data Object Ready bit set, DOE Error bit set, or DOE Busy bit cleared.
> 
> > +{
> > +	struct pci_doe *doe = data;
> > +	struct pci_dev *pdev = doe->doe_dev->pdev;
> > +	int offset = doe->doe_dev->cap_offset;
> > +	u32 val;
> > +
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> 
> So this bit is set on any of: BUSY dropped, READY or ERROR.
> If it's set on BUSY drop, but then in between the read above and this clear
> READY becomes true, then my reading is that we will not get another interrupt.
> That is fine because we will read it again in the state machine and see the
> new state. We could do more of the dance in the interrupt handler by doing
> a reread after clearing INT_STATUS, but I think it's cleaner to leave
> it in the state machine.
> 
> It might look nicer here to only write BIT(1) - RW1C, but that doesn't matter as
> all the rest of the register is RO.
> 
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val);
> > +		mod_delayed_work(system_wq, &doe->statemachine, 0);
> > +		return IRQ_HANDLED;
> > +	}
> > +	/* Leave the error case to be handled outside IRQ */
> > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> 
> I don't think we can get here because INT_STATUS is already true.
> So we should do this before the above general check to avoid clearing
> the interrupt (we don't want more interrupts during the abort, though
> I'd hope the hardware wouldn't generate them).
> 
> So move this before the previous check.
> 
> > +		mod_delayed_work(system_wq, &doe->statemachine, 0);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	/*
> > +	 * Busy being cleared can result in an interrupt, but as
> > +	 * the original Busy may not have been detected, there is no
> > +	 * way to separate such an interrupt from a spurious interrupt.
> > +	 */
> 
> This is misleading - as Busy bit clear would have resulted in INT_STATUS being true above
> (that was a misread of the spec from me in v4).
> So I don't think we can get here in any valid path.
> 
> return IRQ_NONE; should be safe.
> 
> 
> > +	return IRQ_HANDLED;
> > +}
> 
> Summary of above suggested changes:
> 1) Move the DOE_STATUS_ERROR block before the DOE_STATUS_INT_STATUS one
> 2) Possibly uses
>    pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, PCI_DOE_STATUS_INT_STATUS);
>    to be explicit on the write one to clear bit.
> 3) IRQ_NONE for the final return path as I'm fairly sure there is no valid route to that.
>    

Done.

But just to ensure that I understand: if STATUS_ERROR is indicated, we are
basically not clearing the IRQ because we are resetting the mailbox?  Because
with this new code I don't see a pci_write_config_dword to clear INT_STATUS.

But if we are resetting the mailbox I think that is ok.

> ...
> 
> > +
> > +static void pci_doe_task_complete(void *private)
> > +{
> > +	complete(private);
> > +}
> 
> I wonder why this is up here? I'd move it down to just above the _sync()
> function where it's used. This one was definitely one of mine :)

Done.

> 
> > +
> > +static void doe_statemachine_work(struct work_struct *work)
> 
> I developed an interesting "relationship" with this state machine during
> the original development ;)  I've just walked the paths and convinced
> myself it works so all good.

Sweet!  :-D

> 
> > +{
> > +	struct delayed_work *w = to_delayed_work(work);
> > +	struct pci_doe *doe = container_of(w, struct pci_doe, statemachine);
> > +	struct pci_dev *pdev = doe->doe_dev->pdev;
> > +	int offset = doe->doe_dev->cap_offset;
> > +	struct pci_doe_task *task;
> > +	bool abort;
> > +	u32 val;
> > +	int rc;
> > +
> > +	mutex_lock(&doe->state_lock);
> > +	task = doe->cur_task;
> > +	abort = doe->abort;
> > +	doe->abort = false;
> > +	mutex_unlock(&doe->state_lock);
> > +
> > +	if (abort) {
> > +		/*
> > +		 * Currently only used during init - care needed if
> > +		 * pci_doe_abort() is generally exposed as it would impact
> > +		 * queries in flight.
> > +		 */
> > +		WARN_ON(task);
> > +		doe->state = DOE_WAIT_ABORT;
> > +		pci_doe_abort_start(doe);
> > +		return;
> > +	}
> > +
> > +	switch (doe->state) {
> > +	case DOE_IDLE:
> > +		if (task == NULL)
> > +			return;
> > +
> > +		/* Nothing currently in flight so queue a task */
> > +		rc = pci_doe_send_req(doe, task->ex);
> > +		/*
> > +		 * The specification does not provide any guidance on how long
> > +		 * some other entity could keep the DOE busy, so try for 1
> > +		 * second then fail. Busy handling is best effort only, because
> > +		 * there is no way of avoiding racing against another user of
> > +		 * the DOE.
> > +		 */
> > +		if (rc == -EBUSY) {
> > +			doe->busy_retries++;
> > +			if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
> > +				/* Long enough, fail this request */
> > +				pci_WARN(pdev, true, "DOE busy for too long\n");
> > +				doe->busy_retries = 0;
> > +				goto err_busy;
> > +			}
> > +			schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES);
> > +			return;
> > +		}
> > +		if (rc)
> > +			goto err_abort;
> > +		doe->busy_retries = 0;
> > +
> > +		doe->state = DOE_WAIT_RESP;
> > +		doe->timeout_jiffies = jiffies + HZ;
> > +		/* Now poll or wait for IRQ with timeout */
> > +		if (doe->irq > 0)
> > +			schedule_delayed_work(w, PCI_DOE_TIMEOUT);
> > +		else
> > +			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> > +		return;
> > +
> > +	case DOE_WAIT_RESP:
> > +		/* Not possible to get here with NULL task */
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +		if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > +			rc = -EIO;
> > +			goto err_abort;
> > +		}
> > +
> > +		if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
> > +			/* If not yet at timeout reschedule otherwise abort */
> > +			if (time_after(jiffies, doe->timeout_jiffies)) {
> > +				rc = -ETIMEDOUT;
> > +				goto err_abort;
> > +			}
> > +			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> > +			return;
> > +		}
> > +
> > +		rc  = pci_doe_recv_resp(doe, task->ex);
> > +		if (rc < 0)
> > +			goto err_abort;
> > +
> > +		doe->state = DOE_IDLE;
> > +
> > +		mutex_lock(&doe->state_lock);
> > +		doe->cur_task = NULL;
> > +		mutex_unlock(&doe->state_lock);
> > +		wake_up_interruptible(&doe->wq);
> > +
> > +		/* Set the return value to the length of received payload */
> > +		task->rv = rc;
> > +		task->cb(task->private);
> > +
> > +		return;
> > +
> > +	case DOE_WAIT_ABORT:
> > +	case DOE_WAIT_ABORT_ON_ERR:
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> > +		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> > +			/* Back to normal state - carry on */
> > +			mutex_lock(&doe->state_lock);
> > +			doe->cur_task = NULL;
> > +			mutex_unlock(&doe->state_lock);
> > +			wake_up_interruptible(&doe->wq);
> > +
> > +			/*
> > +			 * For deliberately triggered abort, someone is
> > +			 * waiting.
> > +			 */
> > +			if (doe->state == DOE_WAIT_ABORT)
> > +				complete(&doe->abort_c);
> > +
> > +			doe->state = DOE_IDLE;
> > +			return;
> > +		}
> > +		if (time_after(jiffies, doe->timeout_jiffies)) {
> > +			/* Task has timed out and is dead - abort */
> > +			pci_err(pdev, "DOE ABORT timed out\n");
> > +			mutex_lock(&doe->state_lock);
> > +			doe->dead = true;
> > +			doe->cur_task = NULL;
> > +			mutex_unlock(&doe->state_lock);
> > +			wake_up_interruptible(&doe->wq);
> > +
> > +			if (doe->state == DOE_WAIT_ABORT)
> > +				complete(&doe->abort_c);
> > +		}
> > +		return;
> > +	}
> > +
> > +err_abort:
> > +	doe->state = DOE_WAIT_ABORT_ON_ERR;
> > +	pci_doe_abort_start(doe);
> > +err_busy:
> > +	task->rv = rc;
> > +	task->cb(task->private);
> > +	/* If here via err_busy, signal the task done. */
> > +	if (doe->state == DOE_IDLE) {
> > +		mutex_lock(&doe->state_lock);
> > +		doe->cur_task = NULL;
> > +		mutex_unlock(&doe->state_lock);
> > +		wake_up_interruptible(&doe->wq);
> > +	}
> > +}
> > +
> > +/**
> > + * pci_doe_exchange_sync() - Send a request, then wait for and receive a response
> > + * @doe: DOE mailbox state structure
> > + * @ex: Description of the buffers and Vendor ID + type used in this
> > + *      request/response pair
> > + *
> > + * Excess data will be discarded.
> > + *
> > + * RETURNS: payload in bytes on success, < 0 on error
> > + */
> > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex)
> > +{
> > +	struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> > +	struct pci_doe_task task;
> > +	DECLARE_COMPLETION_ONSTACK(c);
> > +
> > +	if (!doe)
> > +		return -EAGAIN;
> > +
> > +	/* DOE requests must be a whole number of DW */
> > +	if (ex->request_pl_sz % sizeof(u32))
> > +		return -EINVAL;
> > +
> > +	task.ex = ex;
> > +	task.cb = pci_doe_task_complete;
> > +	task.private = &c;
> > +
> > +again:
> 
> Hmm.   Whether having this code at this layer makes sense hinges on
> whether we want to easily support async use of the DOE in future.

I struggled with this.  I was trying to strike a balance between making this a
synchronous call with only one outstanding task and leaving the state machine
alone.

FWIW I think the queue you had was just fine even though there was only this
synchronous call.

> 
> In v4 some of the async handling had ended up in this function and
> should probably have been factored out to give us a 
> 'queue up work' then 'wait for completion' sequence.
> 
> Given there is now more to be done in here perhaps we need to think
> about such a separation to keep it clear that this is fundamentally
> a synchronous wrapper around an asynchronous operation.

I think that would be moving back in the direction of having a queue like you
defined in V4.  Eliminating the queue effectively reduced this function to
sleeping until the state machine is available.  Doing anything more would have
meant changing the state machine you wrote, and I did not want to do that.

Dan should we move back to having a queue_task/wait_task like Jonathan had
before?

> 
> > +	mutex_lock(&doe->state_lock);
> > +	if (doe->cur_task) {
> > +		mutex_unlock(&doe->state_lock);
> > +		wait_event_interruptible(doe->wq, doe->cur_task == NULL);
> > +		goto again;
> > +	}
> > +
> > +	if (doe->dead) {
> > +		mutex_unlock(&doe->state_lock);
> > +		return -EIO;
> > +	}
> > +	doe->cur_task = &task;
> > +	schedule_delayed_work(&doe->statemachine, 0);
> > +	mutex_unlock(&doe->state_lock);
> > +
> > +	wait_for_completion(&c);
> > +
> > +	return task.rv;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync);
> > +
> > +/**
> > + * pci_doe_supports_prot() - Return if the DOE instance supports the given protocol
> > + * @pdev: Device on which to find the DOE instance
> > + * @vid: Protocol Vendor ID
> > + * @type: protocol type
> > + *
> > + * This device can then be passed to pci_doe_exchange_sync() to execute a mailbox
> > + * exchange through that DOE mailbox.
> > + *
> > + * RETURNS: True if the DOE device supports the protocol specified
> > + */
> > +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type)
> > +{
> > +	struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> > +	int i;
> > +
> > +	if (!doe)
> > +		return false;
> 
> How would this happen?  I don't think it can...  Probably
> false paranoia from me...

The driver may not be loaded at this point.  The call operates on the aux
device, not the driver.  Without a driver loaded I don't think we should return
any protocol support, even if the driver was loaded previously and some
protocols were supported then.

> 
> > +
> > +	for (i = 0; i < doe->num_prots; i++)
> > +		if ((doe->prots[i].vid == vid) &&
> > +		    (doe->prots[i].type == type))
> > +			return true;
> > +
> > +	return false;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
> 
> ...
> 
> > +static void pci_doe_release_irq(struct pci_doe *doe)
> > +{
> > +	if (doe->irq > 0)
> > +		free_irq(doe->irq, doe);
> 
> Is this trivial wrapper worth bothering with?  Maybe just
> put the code inline?

Personally I like it this way because it is called in 2 places.

> 
> > +}
> > +
> 
> ...
> 
> > +
> > +static void pci_doe_unregister(struct pci_doe *doe)
> > +{
> > +	pci_doe_release_irq(doe);
> > +	kfree(doe->irq_name);
> > +	put_device(&doe->doe_dev->pdev->dev);
> 
> This makes me wonder if we should be doing the get_device()
> earlier in probe?  Limited harm in moving it to near the start
> and then ending up with it being 'obviously' correct...

Well...  get_device() is in pci_doe_register()...  And it does its own irq
unwinding.

I guess we could call pci_doe_unregister() from that if we refactored this...

How about this?  (Diff to this code)

diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
index 76acf4063b6b..6f2a419b3c93 100644
--- a/drivers/pci/doe.c
+++ b/drivers/pci/doe.c
@@ -545,10 +545,12 @@ static int pci_doe_abort(struct pci_doe *doe)
        return 0;
 }
 
-static void pci_doe_release_irq(struct pci_doe *doe)
+static void pci_doe_unregister(struct pci_doe *doe)
 {
        if (doe->irq > 0)
                free_irq(doe->irq, doe);
+       kfree(doe->irq_name);
+       put_device(&doe->doe_dev->pdev->dev);
 }
 
 static int pci_doe_register(struct pci_doe *doe)
@@ -559,21 +561,28 @@ static int pci_doe_register(struct pci_doe *doe)
        int rc, irq;
        u32 val;
 
+       /* Ensure the pci device remains until this driver is done with it */
+       get_device(&pdev->dev);
+
        pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
 
        if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) {
                irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val));
-               if (irq < 0)
-                       return irq;
+               if (irq < 0) {
+                       rc = irq;
+                       goto unregister;
+               }
 
                doe->irq_name = kasprintf(GFP_KERNEL, "DOE[%s]",
                                          doe->doe_dev->adev.name);
-               if (!doe->irq_name)
-                       return -ENOMEM;
+               if (!doe->irq_name) {
+                       rc = -ENOMEM;
+                       goto unregister;
+               }
 
                rc = request_irq(irq, pci_doe_irq, 0, doe->irq_name, doe);
                if (rc)
-                       goto err_free_name;
+                       goto unregister;
 
                doe->irq = irq;
                pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
@@ -583,27 +592,15 @@ static int pci_doe_register(struct pci_doe *doe)
        /* Reset the mailbox by issuing an abort */
        rc = pci_doe_abort(doe);
        if (rc)
-               goto err_free_irq;
-
-       /* Ensure the pci device remains until this driver is done with it */
-       get_device(&pdev->dev);
+               goto unregister;
 
        return 0;
 
-err_free_irq:
-       pci_doe_release_irq(doe);
-err_free_name:
-       kfree(doe->irq_name);
+unregister:
+       pci_doe_unregister(doe);
        return rc;
 }
 
-static void pci_doe_unregister(struct pci_doe *doe)
-{
-       pci_doe_release_irq(doe);
-       kfree(doe->irq_name);
-       put_device(&doe->doe_dev->pdev->dev);
-}
-
 /*
  * pci_doe_probe() - Set up the Mailbox
  * @aux_dev: Auxiliary Device


> 
> > +}
> > +
> > +/*
> > + * pci_doe_probe() - Set up the Mailbox
> > + * @aux_dev: Auxiliary Device
> > + * @id: Auxiliary device ID
> > + *
> > + * Probe the mailbox found for all protocols and set up the Mailbox
> > + *
> > + * RETURNS: 0 on success, < 0 on error
> > + */
> > +static int pci_doe_probe(struct auxiliary_device *aux_dev,
> > +			 const struct auxiliary_device_id *id)
> > +{
> > +	struct pci_doe_dev *doe_dev = container_of(aux_dev,
> > +					struct pci_doe_dev,
> > +					adev);
> > +	struct pci_doe *doe;
> > +	int rc;
> > +
> > +	doe = kzalloc(sizeof(*doe), GFP_KERNEL);
> 
> Could go devm_ for this I think, though may not be worthwhile.

Yes I think it is worth it...  I should use it more.

BTW why did you not use devm_krealloc() for the protocols?

I did not realize that call existed before you mentioned it in the other patch
review.

Any issue with using it there?

> 
> > +	if (!doe)
> > +		return -ENOMEM;
> > +
> > +	mutex_init(&doe->state_lock);
> > +	init_completion(&doe->abort_c);
> > +	doe->doe_dev = doe_dev;
> > +	init_waitqueue_head(&doe->wq);
> > +	INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work);
> > +	dev_set_drvdata(&aux_dev->dev, doe);
> > +
> > +	rc = pci_doe_register(doe);
> > +	if (rc)
> > +		goto err_free;
> > +
> > +	rc = pci_doe_cache_protocols(doe);
> > +	if (rc) {
> > +		pci_doe_unregister(doe);
> 
> Mixture of different forms of error handling here.
> I'd move this below and add an err_unregister label.

Actually with the devm_kzalloc() we don't need the goto at all.  We can just
return.  I _think_?  Right?
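
Something like this, an untested sketch assuming the
pci_doe_register()/pci_doe_unregister() shape from the diff above, where the
managed allocation means no goto labels are needed at all:

```c
static int pci_doe_probe(struct auxiliary_device *aux_dev,
			 const struct auxiliary_device_id *id)
{
	struct pci_doe_dev *doe_dev =
		container_of(aux_dev, struct pci_doe_dev, adev);
	struct pci_doe *doe;
	int rc;

	/* Freed automatically when aux_dev goes away; no kfree() on error */
	doe = devm_kzalloc(&aux_dev->dev, sizeof(*doe), GFP_KERNEL);
	if (!doe)
		return -ENOMEM;

	mutex_init(&doe->state_lock);
	init_completion(&doe->abort_c);
	doe->doe_dev = doe_dev;
	init_waitqueue_head(&doe->wq);
	INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work);
	dev_set_drvdata(&aux_dev->dev, doe);

	rc = pci_doe_register(doe);
	if (rc)
		return rc;	/* register does its own unwinding */

	rc = pci_doe_cache_protocols(doe);
	if (rc) {
		pci_doe_unregister(doe);
		return rc;
	}

	return 0;
}
```

And pci_doe_remove() would drop its kfree(doe) as well.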

> 
> > +		goto err_free;
> > +	}
> > +
> > +	return 0;
> > +
> > +err_free:
> > +	kfree(doe);
> > +	return rc;
> > +}
> > +
> > +static void pci_doe_remove(struct auxiliary_device *aux_dev)
> > +{
> > +	struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev);
> > +
> > +	/* First halt the state machine */
> > +	cancel_delayed_work_sync(&doe->statemachine);
> > +	kfree(doe->prots);
> 
> Logical flow to me is unregister first, free protocols second
> (to reverse what we do in probe)

No, this is the reverse of the probe order, I think.

Order is
	register
	cache protocols

Then we
	free 'uncache' protocols
	unregister

Right?

> 
> > +	pci_doe_unregister(doe);
> > +	kfree(doe);
> > +}
> > +
> > +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = {
> > +	{.name = "cxl_pci.doe", },
> 
> I'd like to hear from Bjorn on whether registering this from the CXL
> device is the right approach or if we should perhaps just do it directly from
> somewhere in PCI. (really applies to patch 3) I'll talk more about this there.

Actually I think this could be left blank until the next patch...  It's just
odd to define an empty table here along with the next few structures.  But
technically none of this is needed until the devices are defined.

I'm ok waiting to see what Bjorn thinks regarding the CXL vs PCI placement
though.

> 
> > +	{},
> > +};
> > +
> > +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table);
> > +
> > +struct auxiliary_driver pci_doe_auxiliary_drv = {
> > +	.name = "pci_doe_drv",
> 
> I would assume this is only used in contexts where the _drv is
> obvious?  I would go with "pci_doe".

Sure. done.

> 
> > +	.id_table = pci_doe_auxiliary_id_table,
> > +	.probe = pci_doe_probe,
> > +	.remove = pci_doe_remove
> > +};
> > +
> > +static int __init pci_doe_init_module(void)
> > +{
> > +	int ret;
> > +
> > +	ret = auxiliary_driver_register(&pci_doe_auxiliary_drv);
> > +	if (ret) {
> > +		pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n",
> > +		       ret);
> > +		return ret;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static void __exit pci_doe_exit_module(void)
> > +{
> > +	auxiliary_driver_unregister(&pci_doe_auxiliary_drv);
> > +}
> > +
> > +module_init(pci_doe_init_module);
> > +module_exit(pci_doe_exit_module);
> 
> Seems like the auxiliary bus would benefit from a
> module_auxiliary_driver() macro to cover this simple registration stuff
> similar to module_i2c_driver() etc.
> 
> Mind you, looking at 5.15 this would be the only user, so maybe one
> for the 'next' case on basis two instances proves it's 'common' ;)

I'm inclined to leave this alone ATM.  I tried to clean up the auxiliary device
documentation and Greg KH asked for a bunch more work there.  So I'm behind on
that at the moment.

Later we can investigate that a bit I think.

> 
> > +MODULE_LICENSE("GPL v2");
> > diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
> > new file mode 100644
> > index 000000000000..8380b7ad33d4
> > --- /dev/null
> > +++ b/include/linux/pci-doe.h
> > @@ -0,0 +1,63 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec.
> > + *
> > + * Copyright (C) 2021 Huawei
> > + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > + */
> > +
> > +#include <linux/completion.h>
> > +#include <linux/list.h>
> > +#include <linux/mutex.h>
> 
> Not used in this header that I can see, so push down to the c files.

oops...  thanks.

> 
> > +#include <linux/auxiliary_bus.h>
> > +
> > +#ifndef LINUX_PCI_DOE_H
> > +#define LINUX_PCI_DOE_H
> > +
> > +#define DOE_DEV_NAME "doe"
> 
> Not sure this is used?

Used in the next patch...  and it kind of goes along with the table_id name...

For now I'll see about moving both of those to the next patch, where they make
more sense.

Thanks for the review,
Ira


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices
  2021-11-08 13:09   ` Jonathan Cameron
@ 2021-11-11  1:31     ` Ira Weiny
  2021-11-11 11:53       ` Jonathan Cameron
  0 siblings, 1 reply; 37+ messages in thread
From: Ira Weiny @ 2021-11-11  1:31 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Mon, Nov 08, 2021 at 01:09:18PM +0000, Jonathan Cameron wrote:
> On Fri, 5 Nov 2021 16:50:54 -0700
> <ira.weiny@intel.com> wrote:
> 
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 
> > CXL devices have DOE mailboxes.  Create auxiliary devices which can be
> > driven by the generic DOE auxiliary driver.
> 
> I'd like Bjorn's input on the balance here between what is done
> in cxl/pci.c and what should be in the PCI core code somewhere.
> 
> The tricky bit preventing this being done entirely as part of 
> PCI device instantiation is the interrupts.
> 
> > 
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Mostly new code, so not sure I should really be listed on this
> one but I don't mind either way.
> 
> A few comments inline but overall this ended up nice and clean.
> 
> > 
> > ---
> > Changes from V4:
> > 	Make this an Auxiliary Driver rather than library functions
> > 	Split this out into it's own patch
> > 	Base on the new cxl_dev_state structure
> > 
> > Changes from Ben
> > 	s/CXL_DOE_DEV_NAME/DOE_DEV_NAME/
> > ---
> >  drivers/cxl/Kconfig |   1 +
> >  drivers/cxl/cxl.h   |  13 +++++
> >  drivers/cxl/pci.c   | 120 ++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 134 insertions(+)
> > 
> > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > index 67c91378f2dd..9d53720bea07 100644
> > --- a/drivers/cxl/Kconfig
> > +++ b/drivers/cxl/Kconfig
> > @@ -16,6 +16,7 @@ if CXL_BUS
> >  config CXL_MEM
> >  	tristate "CXL.mem: Memory Devices"
> >  	default CXL_BUS
> > +	select PCI_DOE_DRIVER
> >  	help
> >  	  The CXL.mem protocol allows a device to act as a provider of
> >  	  "System RAM" and/or "Persistent Memory" that is fully coherent
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index 5e2e93451928..f1241a7f2b7b 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -75,6 +75,19 @@ static inline int cxl_hdm_decoder_count(u32 cap_hdr)
> >  #define CXLDEV_MBOX_BG_CMD_STATUS_OFFSET 0x18
> >  #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
> >  
> > +/*
> > + * Address space properties derived from:
> > + * CXL 2.0 8.2.5.12.7 CXL HDM Decoder 0 Control Register
> > + */
> > +#define CXL_ADDRSPACE_RAM   BIT(0)
> > +#define CXL_ADDRSPACE_PMEM  BIT(1)
> > +#define CXL_ADDRSPACE_TYPE2 BIT(2)
> > +#define CXL_ADDRSPACE_TYPE3 BIT(3)
> > +#define CXL_ADDRSPACE_MASK  GENMASK(3, 0)
> 
> Stray.

Not sure what you mean here???

There were a number of defines which were unused but I left them in.

This came right out of your patch 3.

https://lore.kernel.org/linux-cxl/20210524133938.2815206-4-Jonathan.Cameron@huawei.com/

I can remove these defines if you want?

> 
> > +
> > +#define CXL_DOE_PROTOCOL_COMPLIANCE 0
> > +#define CXL_DOE_PROTOCOL_TABLE_ACCESS 2
> > +
> >  #define CXL_COMPONENT_REGS() \
> >  	void __iomem *hdm_decoder
> >  
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index 8dc91fd3396a..df524b74f1d2 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -6,6 +6,7 @@
> >  #include <linux/mutex.h>
> >  #include <linux/list.h>
> >  #include <linux/pci.h>
> > +#include <linux/pci-doe.h>
> >  #include <linux/io.h>
> >  #include "cxlmem.h"
> >  #include "pci.h"
> > @@ -471,6 +472,120 @@ static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
> >  	return rc;
> >  }
> >  
> > +static void cxl_mem_free_irq_vectors(void *data)
> > +{
> > +	pci_free_irq_vectors(data);
> > +}
> > +
> > +static void cxl_destroy_doe_device(void *ad)
> > +{
> > +	struct auxiliary_device *adev = ad;
> Local variable doesn't add anything, just pass it directly
> into the functions as a void *.

Yea...  Thanks...  :-D

> 
> > +
> > +	auxiliary_device_delete(adev);
> > +	auxiliary_device_uninit(adev);
> 
> Both needed?  These are just wrappers around
> put_device() and device_del()

These are both needed per the Auxiliary Device doc.  :-/

> 
> Normally after device_add() suceeded we only ever call device_del()
> as per the docs for device_add()
> https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L3277

I think you are misreading that comment.  Here auxiliary_device_add() has
succeeded.  Therefore both device_del() and put_device() must be called.  In
the case of auxiliary_device_add() failing we only call
auxiliary_device_uninit() [put_device()].

So I think this is correct.

The other places I spot checked called device_del() _and_ put_device().

> 
> > +}
> > +
> > +static DEFINE_IDA(cxl_doe_adev_ida);
> > +static void __doe_dev_release(struct auxiliary_device *adev)
> > +{
> > +	struct pci_doe_dev *doe_dev = container_of(adev, struct pci_doe_dev,
> > +						   adev);
> > +
> > +	ida_free(&cxl_doe_adev_ida, adev->id);
> > +	kfree(doe_dev);
> > +}
> > +
> > +static void cxl_doe_dev_release(struct device *dev)
> > +{
> > +	struct auxiliary_device *adev = container_of(dev,
> > +						struct auxiliary_device,
> > +						dev);
> > +	__doe_dev_release(adev);
> > +}
> > +
> > +static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
> 
> Pass in the struct device, or maybe even the struct pci_dev as
> nothing in here is using the cxl_dev_state.

Ah, yeah, can I leave this as is because of the next patch?  Or I can change it
now and change it back to cxlds in the next patch.  But I would rather leave it.

> 
> > +{
> > +	struct device *dev = cxlds->dev;
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +	int irqs, rc;
> > +	u16 pos = 0;
> > +
> > +	/*
> > +	 * An implementation of a cxl type3 device may support an unknown
> > +	 * number of interrupts. Assume that number is not that large and
> > +	 * request them all.
> > +	 */
> > +	irqs = pci_msix_vec_count(pdev);
> > +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);
> > +	if (rc != irqs) {
> > +		/* No interrupt available - carry on */
> > +		dev_dbg(dev, "No interrupts available for DOE\n");
> > +	} else {
> > +		/*
> > +		 * Enabling bus mastering could be done within the DOE
> > +		 * initialization, but as it potentially has other impacts
> > +		 * keep it within the driver.
> > +		 */
> > +		pci_set_master(pdev);
> > +		rc = devm_add_action_or_reset(dev,
> > +					      cxl_mem_free_irq_vectors,
> > +					      pdev);
> > +		if (rc)
> > +			return rc;
> > +	}
> > +
> 
> Above here is driver specific...
> Everything from here is is generic so perhaps move it to the PCI core?
> Alternatively wait until we have users that aren't CXL.

I'm still looking for where in the PCI core this would be appropriate to
place...

> 
> > +	pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
> > +
> > +	while (pos > 0) {
> > +		struct auxiliary_device *adev;
> > +		struct pci_doe_dev *new_dev;
> > +		int id;
> > +
> > +		new_dev = kzalloc(sizeof(*new_dev), GFP_KERNEL);
> > +		if (!new_dev)
> > +			return -ENOMEM;
> > +
> > +		new_dev->pdev = pdev;
> > +		new_dev->cap_offset = pos;
> > +
> > +		/* Set up struct auxiliary_device */
> > +		adev = &new_dev->adev;
> > +		id = ida_alloc(&cxl_doe_adev_ida, GFP_KERNEL);
> > +		if (id < 0) {
> > +			kfree(new_dev);
> > +			return -ENOMEM;
> > +		}
> > +
> > +		adev->id = id;
> > +		adev->name = DOE_DEV_NAME;
> > +		adev->dev.release = cxl_doe_dev_release;
> > +		adev->dev.parent = dev;
> > +
> > +		if (auxiliary_device_init(adev)) {
> > +			__doe_dev_release(adev);
> > +			return -EIO;
> > +		}
> > +
> > +		if (auxiliary_device_add(adev)) {
> > +			auxiliary_device_uninit(adev);
> > +			return -EIO;
> > +		}
> > +
> > +		rc = devm_add_action_or_reset(dev, cxl_destroy_doe_device, adev);
> > +		if (rc)
> > +			return rc;
> > +
> > +		if (device_attach(&adev->dev) != 1)
> > +			dev_err(&adev->dev,
> > +				"Failed to attach a driver to DOE device %d\n",
> > +				adev->id);
> 
> I wondered about this and how it would happen.
> Given soft dependency only between the drivers it's possible but error or info?
> I'd go with dev_info().  It is an error I'd bail out and used deferred probing
> to try again when it will succeed.

I made this dev_err() on purpose.  And I don't know about deferred probing.
Maybe defer probing until the CDAT read, but even that I think is going to be a
pain.

The sequence I can think of is:

cxl_pci loaded
	[finds all devices]
	[soft loads pci_doe]
	[device_attach works]
Admin unloads pci_doe
	[hot-plug new device]
	[device_attach fails]
	[cdat will fail until driver is loaded]

I spoke with Dan about this and while this is unfortunate it is what the user
asked for.  So I prefer dev_err() above to make sure that there is an
indication of why this device is potentially not going to work.

Thanks for the review,
Ira

> 
> > +
> > +		pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  {
> >  	struct cxl_register_map map;
> > @@ -517,6 +632,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  	if (rc)
> >  		return rc;
> >  
> > +	rc = cxl_setup_doe_devices(cxlds);
> > +	if (rc)
> > +		return rc;
> > +
> >  	cxlmd = devm_cxl_add_memdev(cxlds);
> >  	if (IS_ERR(cxlmd))
> >  		return PTR_ERR(cxlmd);
> > @@ -546,3 +665,4 @@ static struct pci_driver cxl_pci_driver = {
> >  MODULE_LICENSE("GPL v2");
> >  module_pci_driver(cxl_pci_driver);
> >  MODULE_IMPORT_NS(CXL);
> > +MODULE_SOFTDEP("pre: pci_doe");
> 


* Re: [PATCH 5/5] cxl/cdat: Parse out DSMAS data from CDAT table
  2021-11-08 14:52   ` Jonathan Cameron
@ 2021-11-11  3:58     ` Ira Weiny
  2021-11-11 11:58       ` Jonathan Cameron
  0 siblings, 1 reply; 37+ messages in thread
From: Ira Weiny @ 2021-11-11  3:58 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Mon, Nov 08, 2021 at 02:52:39PM +0000, Jonathan Cameron wrote:
> On Fri, 5 Nov 2021 16:50:56 -0700
> <ira.weiny@intel.com> wrote:
> 
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Parse and cache the DSMAS data from the CDAT table.  Store this data in
> > Unmarshaled data structures for use later.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> A few minor comments inline.  In particular I think we need to conclude if
> failure to parse is an error or not.  Right now it's reported as an error
> but then we carry on anyway.

I report it as an error because if the device supports CDAT I made the
assumption that it was required for something up the stack.  However, I did not
want to make that decision at this point because all this code does is cache
the raw data.

So it may not be a fatal error depending on what the data is used for.  But IMO
it is still an error.

> 
> Jonathan
> 
> > 
> > ---
> > Changes from V4
> > 	New patch
> > ---
> >  drivers/cxl/core/memdev.c | 111 ++++++++++++++++++++++++++++++++++++++
> >  drivers/cxl/cxlmem.h      |  23 ++++++++
> >  2 files changed, 134 insertions(+)
> > 
> > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> > index c35de9e8298e..e5a2d30a3491 100644
> > --- a/drivers/cxl/core/memdev.c
> > +++ b/drivers/cxl/core/memdev.c
> > @@ -6,6 +6,7 @@
> 
> ...
> 
> > +
> > +static int parse_dsmas(struct cxl_memdev *cxlmd)
> > +{
> > +	struct cxl_dsmas *dsmas_ary = NULL;
> > +	u32 *data = cxlmd->cdat_table;
> > +	int bytes_left = cxlmd->cdat_length;
> > +	int nr_dsmas = 0;
> > +	size_t dsmas_byte_size;
> > +	int rc = 0;
> > +
> > +	if (!data || !cdat_hdr_valid(cxlmd))
> 
> If that's invalid, right answer might be to run it again as we probably
> just raced with an update...  Perhaps try it a couple of times before
> failing hard?

I find it odd that the mailbox would return invalid data even during an update.

That said perhaps validating the header should be done as part of reading the
CDAT.

Thoughts?  Should I push this back to the previous patch?

> 
> > +		return -ENXIO;
> > +
> > +	/* Skip header */
> > +	data += CDAT_HEADER_LENGTH_DW;
> > +	bytes_left -= CDAT_HEADER_LENGTH_BYTES;
> > +
> > +	while (bytes_left > 0) {
> > +		u32 *cur_rec = data;
> > +		u8 type = FIELD_GET(CDAT_STRUCTURE_DW0_TYPE, cur_rec[0]);
> > +		u16 length = FIELD_GET(CDAT_STRUCTURE_DW0_LENGTH, cur_rec[0]);
> > +
> > +		if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {
> > +			struct cxl_dsmas *new_ary;
> > +			u8 flags;
> > +
> > +			new_ary = krealloc(dsmas_ary,
> > +					   sizeof(*dsmas_ary) * (nr_dsmas+1),
> 
> Spaces around the +

Sure.

> You could do this with devm_krealloc() and then just assign it at the end
> rather than allocate a new one and copy.

I failed to see that call when I wrote this...  yes thanks!

> 
> 
> > +					   GFP_KERNEL);
> > +			if (!new_ary) {
> > +				dev_err(&cxlmd->dev,
> > +					"Failed to allocate memory for DSMAS data\n");
> > +				rc = -ENOMEM;
> > +				goto free_dsmas;
> > +			}
> > +			dsmas_ary = new_ary;
> > +
> > +			flags = FIELD_GET(CDAT_DSMAS_DW1_FLAGS, cur_rec[1]);
> > +
> > +			dsmas_ary[nr_dsmas].dpa_base = CDAT_DSMAS_DPA_OFFSET(cur_rec);
> > +			dsmas_ary[nr_dsmas].dpa_length = CDAT_DSMAS_DPA_LEN(cur_rec);
> > +			dsmas_ary[nr_dsmas].non_volatile = CDAT_DSMAS_NON_VOLATILE(flags);
> > +
> > +			dev_dbg(&cxlmd->dev, "DSMAS %d: %llx:%llx %s\n",
> > +				nr_dsmas,
> > +				dsmas_ary[nr_dsmas].dpa_base,
> > +				dsmas_ary[nr_dsmas].dpa_base +
> > +					dsmas_ary[nr_dsmas].dpa_length,
> > +				(dsmas_ary[nr_dsmas].non_volatile ?
> > +					"Persistent" : "Volatile")
> > +				);
> > +
> > +			nr_dsmas++;
> > +		}
> > +
> > +		data += (length/sizeof(u32));
> 
> spaces around /

Yep.

> 

> > +		bytes_left -= length;
> > +	}
> > +
> > +	if (nr_dsmas == 0) {
> > +		rc = -ENXIO;
> > +		goto free_dsmas;
> > +	}
> > +
> > +	dev_dbg(&cxlmd->dev, "Found %d DSMAS entries\n", nr_dsmas);
> > +
> > +	dsmas_byte_size = sizeof(*dsmas_ary) * nr_dsmas;
> > +	cxlmd->dsmas_ary = devm_kzalloc(&cxlmd->dev, dsmas_byte_size, GFP_KERNEL);
> 
> As above, you could have done a devm_krealloc() and then just assigned here.
> Side effect of that being direct returns should be fine.

Yep devm_krealloc is much cleaner.

> However, that relies
> treating an error from this function as an error that will result in failures below.
> 
> 
> > +	if (!cxlmd->dsmas_ary) {
> > +		rc = -ENOMEM;
> > +		goto free_dsmas;
> > +	}
> > +
> > +	memcpy(cxlmd->dsmas_ary, dsmas_ary, dsmas_byte_size);
> > +	cxlmd->nr_dsmas = nr_dsmas;
> > +
> > +free_dsmas:
> > +	kfree(dsmas_ary);
> > +	return rc;
> > +}
> > +
> >  struct cxl_memdev *
> >  devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
> >  {
> > @@ -339,6 +446,10 @@ devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
> >  		cxl_mem_cdat_read_table(cxlds, cxlmd->cdat_table, cxlmd->cdat_length);
> >  	}
> >  
> > +	rc = parse_dsmas(cxlmd);
> > +	if (rc)
> > +		dev_err(dev, "No DSMAS data found: %d\n", rc);
> 
> dev_info() maybe as it's not being treated as an error?

This is an error.  But not a fatal error.

> 
> However I think it should be treated as an error.  It's a device failure if
> we can't parse this (and table protocol is available)

Shouldn't we let the consumer of this data determine if this is a fatal error and
bail out at that point?

Ira

> 
> If it turns out we need to quirk some devices, then fair enough.
> 
> 
> 
> > +
> >  	/*
> >  	 * Activate ioctl operations, no cxl_memdev_rwsem manipulation
> >  	 * needed as this is ordered with cdev_add() publishing the device.
> 


* Re: [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices
  2021-11-11  1:31     ` Ira Weiny
@ 2021-11-11 11:53       ` Jonathan Cameron
  0 siblings, 0 replies; 37+ messages in thread
From: Jonathan Cameron @ 2021-11-11 11:53 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Wed, 10 Nov 2021 17:31:23 -0800
Ira Weiny <ira.weiny@intel.com> wrote:

> On Mon, Nov 08, 2021 at 01:09:18PM +0000, Jonathan Cameron wrote:
> > On Fri, 5 Nov 2021 16:50:54 -0700
> > <ira.weiny@intel.com> wrote:
> >   
> > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > 
> > > CXL devices have DOE mailboxes.  Create auxiliary devices which can be
> > > driven by the generic DOE auxiliary driver.  
> > 
> > I'd like Bjorn's input on the balance here between what is done
> > in cxl/pci.c and what should be in the PCI core code somewhere.
> > 
> > The tricky bit preventing this being done entirely as part of 
> > PCI device instantiation is the interrupts.
> >   
> > > 
> > > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>  
> > 
> > Mostly new code, so not sure I should really be listed on this
> > one but I don't mind either way.
> > 
> > A few comments inline but overall this ended up nice and clean.
> >   
> > > 
> > > ---
> > > Changes from V4:
> > > 	Make this an Auxiliary Driver rather than library functions
> > > 	Split this out into it's own patch
> > > 	Base on the new cxl_dev_state structure
> > > 
> > > Changes from Ben
> > > 	s/CXL_DOE_DEV_NAME/DOE_DEV_NAME/
> > > ---
> > >  drivers/cxl/Kconfig |   1 +
> > >  drivers/cxl/cxl.h   |  13 +++++
> > >  drivers/cxl/pci.c   | 120 ++++++++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 134 insertions(+)
> > > 
> > > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > > index 67c91378f2dd..9d53720bea07 100644
> > > --- a/drivers/cxl/Kconfig
> > > +++ b/drivers/cxl/Kconfig
> > > @@ -16,6 +16,7 @@ if CXL_BUS
> > >  config CXL_MEM
> > >  	tristate "CXL.mem: Memory Devices"
> > >  	default CXL_BUS
> > > +	select PCI_DOE_DRIVER
> > >  	help
> > >  	  The CXL.mem protocol allows a device to act as a provider of
> > >  	  "System RAM" and/or "Persistent Memory" that is fully coherent
> > > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > > index 5e2e93451928..f1241a7f2b7b 100644
> > > --- a/drivers/cxl/cxl.h
> > > +++ b/drivers/cxl/cxl.h
> > > @@ -75,6 +75,19 @@ static inline int cxl_hdm_decoder_count(u32 cap_hdr)
> > >  #define CXLDEV_MBOX_BG_CMD_STATUS_OFFSET 0x18
> > >  #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
> > >  
> > > +/*
> > > + * Address space properties derived from:
> > > + * CXL 2.0 8.2.5.12.7 CXL HDM Decoder 0 Control Register
> > > + */
> > > +#define CXL_ADDRSPACE_RAM   BIT(0)
> > > +#define CXL_ADDRSPACE_PMEM  BIT(1)
> > > +#define CXL_ADDRSPACE_TYPE2 BIT(2)
> > > +#define CXL_ADDRSPACE_TYPE3 BIT(3)
> > > +#define CXL_ADDRSPACE_MASK  GENMASK(3, 0)  
> > 
> > Stray.  
> 
> Not sure what you mean here???
> 
> There were a number of defines which were unused but I left them in.
> 
> This came right out of your patch 3.
> 
> https://lore.kernel.org/linux-cxl/20210524133938.2815206-4-Jonathan.Cameron@huawei.com/
> 
> I can remove these defines if you want?

They don't have anything to do with DOE that I can see. Probably a side effect
of a merge that went wrong and I didn't notice!




> > 
> > Normally after device_add() succeeded we only ever call device_del()
> > as per the docs for device_add()
> > https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L3277  
> 
> I think you are misreading that comment.  Here auxiliary_device_add() has
> succeeded.  Therefore both device_del() and put_device() must be called.  In
> the case of auxiliary_device_add() failing we only call
> auxiliary_device_uninit() [put_device()].
> 
> So I think this is correct.
> 
> The other places I spot checked called device_del() _and_ put_device().

Yeah. I had that wrong. Ref counts will be wrong otherwise.
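
For the record, roughly the pairing being described (a sketch only - the body of
cxl_destroy_doe_device(), registered via devm_add_action_or_reset() in the patch,
is not shown in this excerpt, so this is my guess at its shape):

```c
/*
 * Sketch, not patch code: once auxiliary_device_add() has succeeded,
 * teardown needs both calls, mirroring device_del() + put_device().
 */
static void cxl_destroy_doe_device(void *data)
{
	struct auxiliary_device *adev = data;

	auxiliary_device_delete(adev);	/* device_del() */
	auxiliary_device_uninit(adev);	/* put_device() -> ->release() */
}
```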


> 
> >   
> > > +}
> > > +
> > > +static DEFINE_IDA(cxl_doe_adev_ida);
> > > +static void __doe_dev_release(struct auxiliary_device *adev)
> > > +{
> > > +	struct pci_doe_dev *doe_dev = container_of(adev, struct pci_doe_dev,
> > > +						   adev);
> > > +
> > > +	ida_free(&cxl_doe_adev_ida, adev->id);
> > > +	kfree(doe_dev);
> > > +}
> > > +
> > > +static void cxl_doe_dev_release(struct device *dev)
> > > +{
> > > +	struct auxiliary_device *adev = container_of(dev,
> > > +						struct auxiliary_device,
> > > +						dev);
> > > +	__doe_dev_release(adev);
> > > +}
> > > +
> > > +static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)  
> > 
> > Pass in the struct device, or maybe even the struct pci_dev as
> > nothing in here is using the cxl_dev_state.  
> 
> Ah, yeah. Can I leave this as is, per the next patch?  Or I can change it now
> and change it back to cxlds in the next patch.  But I would rather leave it.

I think we will end up reworking this anyway, but maybe there will
still be a cxl_setup_doe_devices wrapper involved.

> 
> >   
> > > +{
> > > +	struct device *dev = cxlds->dev;
> > > +	struct pci_dev *pdev = to_pci_dev(dev);
> > > +	int irqs, rc;
> > > +	u16 pos = 0;
> > > +
> > > +	/*
> > > +	 * An implementation of a cxl type3 device may support an unknown
> > > +	 * number of interrupts. Assume that number is not that large and
> > > +	 * request them all.
> > > +	 */
> > > +	irqs = pci_msix_vec_count(pdev);
> > > +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);
> > > +	if (rc != irqs) {
> > > +		/* No interrupt available - carry on */
> > > +		dev_dbg(dev, "No interrupts available for DOE\n");
> > > +	} else {
> > > +		/*
> > > +		 * Enabling bus mastering could be done within the DOE
> > > +		 * initialization, but as it potentially has other impacts
> > > +		 * keep it within the driver.
> > > +		 */
> > > +		pci_set_master(pdev);
> > > +		rc = devm_add_action_or_reset(dev,
> > > +					      cxl_mem_free_irq_vectors,
> > > +					      pdev);
> > > +		if (rc)
> > > +			return rc;
> > > +	}
> > > +  
> > 
> > Above here is driver specific...
> > Everything from here on is generic, so perhaps move it to the PCI core?
> > Alternatively wait until we have users that aren't CXL.  
> 
> I'm still looking for where in the PCI core this would be appropriate to
> place...

Yeah, this needs Bjorn's input. One option would be to move from a soft
dependency to a hard one on the pci-doe module and just put this in there
as an exported utility function.

> 
> >   
> > > +	pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
> > > +
> > > +	while (pos > 0) {
> > > +		struct auxiliary_device *adev;
> > > +		struct pci_doe_dev *new_dev;
> > > +		int id;
> > > +
> > > +		new_dev = kzalloc(sizeof(*new_dev), GFP_KERNEL);
> > > +		if (!new_dev)
> > > +			return -ENOMEM;
> > > +
> > > +		new_dev->pdev = pdev;
> > > +		new_dev->cap_offset = pos;
> > > +
> > > +		/* Set up struct auxiliary_device */
> > > +		adev = &new_dev->adev;
> > > +		id = ida_alloc(&cxl_doe_adev_ida, GFP_KERNEL);
> > > +		if (id < 0) {
> > > +			kfree(new_dev);
> > > +			return -ENOMEM;
> > > +		}
> > > +
> > > +		adev->id = id;
> > > +		adev->name = DOE_DEV_NAME;
> > > +		adev->dev.release = cxl_doe_dev_release;
> > > +		adev->dev.parent = dev;
> > > +
> > > +		if (auxiliary_device_init(adev)) {
> > > +			__doe_dev_release(adev);
> > > +			return -EIO;
> > > +		}
> > > +
> > > +		if (auxiliary_device_add(adev)) {
> > > +			auxiliary_device_uninit(adev);
> > > +			return -EIO;
> > > +		}
> > > +
> > > +		rc = devm_add_action_or_reset(dev, cxl_destroy_doe_device, adev);
> > > +		if (rc)
> > > +			return rc;
> > > +
> > > +		if (device_attach(&adev->dev) != 1)
> > > +			dev_err(&adev->dev,
> > > +				"Failed to attach a driver to DOE device %d\n",
> > > +				adev->id);  
> > 
> > I wondered about this and how it would happen.
> > Given there is only a soft dependency between the drivers it's possible, but
> > is it an error or info?  I'd go with dev_info().  If it is an error I'd bail
> > out and use deferred probing to try again when it will succeed.
> 
> I made this dev_err() on purpose.  And I don't know about the deferred probing.
> Maybe defer probing on the CDAT read, but even that I think is going to be a
> pain.
> 
> The sequence I can think of is:
> 
> cxl_pci loaded
> 	[finds all devices]
> 	[soft loads pci_doe]
> 	[device_attach works]
> Admin unloads pci_doe
> 	[hot-plug new device]
> 	[device_attach fails]
> 	[cdat will fail until driver is loaded]
> 
> I spoke with Dan about this and while this is unfortunate it is what the user
> asked for.  So I prefer dev_err() above to make sure that there is an
> indication of why this device is potentially not going to work.
I'm fine with it being dev_err(), but make it a hard error if it happens.
I don't like potentially not working, and would rather see definitely not
working in this case - so have the function return an error code.
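
Concretely, something along these lines (a sketch of the suggestion; -ENODEV is
my guess at an appropriate error code, not something from the patch):

```c
		if (device_attach(&adev->dev) != 1) {
			dev_err(&adev->dev,
				"Failed to attach a driver to DOE device %d\n",
				adev->id);
			/* Hard failure instead of carrying on half-working */
			return -ENODEV;
		}
```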

Jonathan
> 
> Thanks for the review,
> Ira
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/5] cxl/cdat: Parse out DSMAS data from CDAT table
  2021-11-11  3:58     ` Ira Weiny
@ 2021-11-11 11:58       ` Jonathan Cameron
  0 siblings, 0 replies; 37+ messages in thread
From: Jonathan Cameron @ 2021-11-11 11:58 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Wed, 10 Nov 2021 19:58:24 -0800
Ira Weiny <ira.weiny@intel.com> wrote:

> On Mon, Nov 08, 2021 at 02:52:39PM +0000, Jonathan Cameron wrote:
> > On Fri, 5 Nov 2021 16:50:56 -0700
> > <ira.weiny@intel.com> wrote:
> >   
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > Parse and cache the DSMAS data from the CDAT table.  Store this data in
> > > unmarshaled data structures for use later.
> > > 
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>  
> > 
> > A few minor comments inline.  In particular I think we need to conclude if
> > failure to parse is an error or not.  Right now it's reported as an error
> > but then we carry on anyway.  
> 
> I report it as an error because if the device supports CDAT I made the
> assumption that it was required for something up the stack.  However, I did not
> want to make that decision at this point because all this code does is cache
> the raw data.
> 
> So it may not be a fatal error depending on what the data is used for.  But IMO
> it is still an error.
> 

dev_warn() is perhaps a good middle ground?  Something is wrong, but not fatal
here...


> > 
> > Jonathan
> >   
> > > 
> > > ---
> > > Changes from V4
> > > 	New patch
> > > ---
> > >  drivers/cxl/core/memdev.c | 111 ++++++++++++++++++++++++++++++++++++++
> > >  drivers/cxl/cxlmem.h      |  23 ++++++++
> > >  2 files changed, 134 insertions(+)
> > > 
> > > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> > > index c35de9e8298e..e5a2d30a3491 100644
> > > --- a/drivers/cxl/core/memdev.c
> > > +++ b/drivers/cxl/core/memdev.c
> > > @@ -6,6 +6,7 @@  
> > 
> > ...
> >   
> > > +
> > > +static int parse_dsmas(struct cxl_memdev *cxlmd)
> > > +{
> > > +	struct cxl_dsmas *dsmas_ary = NULL;
> > > +	u32 *data = cxlmd->cdat_table;
> > > +	int bytes_left = cxlmd->cdat_length;
> > > +	int nr_dsmas = 0;
> > > +	size_t dsmas_byte_size;
> > > +	int rc = 0;
> > > +
> > > +	if (!data || !cdat_hdr_valid(cxlmd))  
> > 
> > If that's invalid, the right answer might be to run it again as we probably
> > just raced with an update...  Perhaps try it a couple of times before
> > failing hard?  
> 
> I find it odd that the mailbox would return invalid data even during an update?

The read can take multiple exchanges.  It's not invalid as such, we just saw
parts of different valid states.  The checksum is there to protect against
such a race.  Lots of other ways it could have been designed, but that was the
choice made.

> 
> That said perhaps validating the header should be done as part of reading the
> CDAT.
> 
> Thoughts?  Should I push this back to the previous patch?

Agreed, it would make more sense to do it at the read.
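
To illustrate the race handling, a toy userspace model of retry-on-bad-checksum
(not kernel code; it assumes a CDAT-style checksum where all table bytes sum to
zero mod 256 - an assumption for illustration, check the CDAT specification for
the real rule):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Valid iff all bytes of the table sum to 0 mod 256 (assumed rule). */
static int cdat_checksum_ok(const uint8_t *table, size_t len)
{
	uint8_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum += table[i];
	return sum == 0;
}

/*
 * Retry the read a bounded number of times: a failed checksum likely
 * means we raced with a device-side update mid-read, so a fresh read
 * should see a consistent snapshot.
 */
static int read_table_retry(int (*read_fn)(uint8_t *, size_t),
			    uint8_t *buf, size_t len, int max_tries)
{
	int i;

	for (i = 0; i < max_tries; i++) {
		if (read_fn(buf, len) == 0 && cdat_checksum_ok(buf, len))
			return 0;
	}
	return -1;	/* fail hard after a couple of attempts */
}
```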

> 
> >   
> > > +		return -ENXIO;
> > > +
> > > +	/* Skip header */
> > > +	data += CDAT_HEADER_LENGTH_DW;
> > > +	bytes_left -= CDAT_HEADER_LENGTH_BYTES;
> > > +
> > > +	while (bytes_left > 0) {
> > > +		u32 *cur_rec = data;
> > > +		u8 type = FIELD_GET(CDAT_STRUCTURE_DW0_TYPE, cur_rec[0]);
> > > +		u16 length = FIELD_GET(CDAT_STRUCTURE_DW0_LENGTH, cur_rec[0]);
> > > +
> > > +		if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {
> > > +			struct cxl_dsmas *new_ary;
> > > +			u8 flags;
> > > +
> > > +			new_ary = krealloc(dsmas_ary,
> > > +					   sizeof(*dsmas_ary) * (nr_dsmas+1),  
> > 
> > Spaces around the +  
> 
> Sure.
> 
> > You could do this with devm_krealloc() and then just assign it at the end
> > rather than allocate a new one and copy.  
> 
> I failed to see that call when I wrote this...  yes thanks!

It's new.
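
For reference, a sketch of how the loop might look with devm_krealloc() (assumed
shape based on the snippet above, error handling abbreviated; devm_krealloc()
with a NULL pointer behaves as an initial devm allocation):

```c
		if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {
			struct cxl_dsmas *new_ary;

			new_ary = devm_krealloc(&cxlmd->dev, dsmas_ary,
						sizeof(*dsmas_ary) * (nr_dsmas + 1),
						GFP_KERNEL);
			if (!new_ary)
				return -ENOMEM;	/* devm frees prior allocation */
			dsmas_ary = new_ary;
			/* ... fill dsmas_ary[nr_dsmas] as before ... */
			nr_dsmas++;
		}
	/* ... after the loop: no copy, no kfree, direct returns are fine */
	cxlmd->dsmas_ary = dsmas_ary;
	cxlmd->nr_dsmas = nr_dsmas;
```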

> 
> > 
> >   
> > > +					   GFP_KERNEL);
> > > +			if (!new_ary) {
> > > +				dev_err(&cxlmd->dev,
> > > +					"Failed to allocate memory for DSMAS data\n");
> > > +				rc = -ENOMEM;
> > > +				goto free_dsmas;
> > > +			}
> > > +			dsmas_ary = new_ary;
> > > +
> > > +			flags = FIELD_GET(CDAT_DSMAS_DW1_FLAGS, cur_rec[1]);
> > > +
> > > +			dsmas_ary[nr_dsmas].dpa_base = CDAT_DSMAS_DPA_OFFSET(cur_rec);
> > > +			dsmas_ary[nr_dsmas].dpa_length = CDAT_DSMAS_DPA_LEN(cur_rec);
> > > +			dsmas_ary[nr_dsmas].non_volatile = CDAT_DSMAS_NON_VOLATILE(flags);
> > > +
> > > +			dev_dbg(&cxlmd->dev, "DSMAS %d: %llx:%llx %s\n",
> > > +				nr_dsmas,
> > > +				dsmas_ary[nr_dsmas].dpa_base,
> > > +				dsmas_ary[nr_dsmas].dpa_base +
> > > +					dsmas_ary[nr_dsmas].dpa_length,
> > > +				(dsmas_ary[nr_dsmas].non_volatile ?
> > > +					"Persistent" : "Volatile")
> > > +				);
> > > +
> > > +			nr_dsmas++;
> > > +		}
> > > +
> > > +		data += (length/sizeof(u32));  
> > 
> > spaces around /  
> 
> Yep.
> 
> >   
> 
> > > +		bytes_left -= length;
> > > +	}
> > > +
> > > +	if (nr_dsmas == 0) {
> > > +		rc = -ENXIO;
> > > +		goto free_dsmas;
> > > +	}
> > > +
> > > +	dev_dbg(&cxlmd->dev, "Found %d DSMAS entries\n", nr_dsmas);
> > > +
> > > +	dsmas_byte_size = sizeof(*dsmas_ary) * nr_dsmas;
> > > +	cxlmd->dsmas_ary = devm_kzalloc(&cxlmd->dev, dsmas_byte_size, GFP_KERNEL);  
> > 
> > As above, you could have done a devm_krealloc() and then just assigned here.
> > Side effect of that being direct returns should be fine.  
> 
> Yep devm_krealloc is much cleaner.
> 
> > However, that relies on
> > treating an error from this function as an error that will result in failures below.
> > 
> >   
> > > +	if (!cxlmd->dsmas_ary) {
> > > +		rc = -ENOMEM;
> > > +		goto free_dsmas;
> > > +	}
> > > +
> > > +	memcpy(cxlmd->dsmas_ary, dsmas_ary, dsmas_byte_size);
> > > +	cxlmd->nr_dsmas = nr_dsmas;
> > > +
> > > +free_dsmas:
> > > +	kfree(dsmas_ary);
> > > +	return rc;
> > > +}
> > > +
> > >  struct cxl_memdev *
> > >  devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
> > >  {
> > > @@ -339,6 +446,10 @@ devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
> > >  		cxl_mem_cdat_read_table(cxlds, cxlmd->cdat_table, cxlmd->cdat_length);
> > >  	}
> > >  
> > > +	rc = parse_dsmas(cxlmd);
> > > +	if (rc)
> > > +		dev_err(dev, "No DSMAS data found: %d\n", rc);  
> > 
> > dev_info() maybe as it's not being treated as an error?  
> 
> This is an error.  But not a fatal error.
> 
> > 
> > However I think it should be treated as an error.  It's a device failure if
> > we can't parse this (and the table protocol is available).
> 
> Shouldn't we let the consumer of this data determine if this is a fatal error and
> bail out at that point?

As above, dev_warn() seems more appropriate in that case to me.

Jonathan
> 
> Ira
> 
> > 
> > If it turns out we need to quirk some devices, then fair enough.
> > 
> > 
> >   
> > > +
> > >  	/*
> > >  	 * Activate ioctl operations, no cxl_memdev_rwsem manipulation
> > >  	 * needed as this is ordered with cdev_add() publishing the device.  
> >   


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver
  2021-11-05 23:50 ` [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver ira.weiny
  2021-11-08 12:15   ` Jonathan Cameron
@ 2021-11-16 23:48   ` Bjorn Helgaas
  2021-12-03 20:48     ` Dan Williams
  1 sibling, 1 reply; 37+ messages in thread
From: Bjorn Helgaas @ 2021-11-16 23:48 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Jonathan Cameron, Alison Schofield, Vishal Verma,
	Ben Widawsky, Bjorn Helgaas, linux-cxl, linux-pci

On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote:
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Introduced in a PCI ECN [1], DOE provides a config space based mailbox
> with standard protocol discovery.  Each mailbox is accessed through a
> DOE Extended Capability.
> 
> Define an auxiliary device driver which controls DOE auxiliary devices
> registered on the auxiliary bus.

What do we gain by making this an auxiliary driver?

This doesn't really feel like a "driver," and apparently it used to be
a library.  I'd like to see the rationale and benefits of the driver
approach (in the eventual commit log as well as the current email
thread).

> A DOE mailbox is allowed to support any number of protocols while some
> DOE protocol specifications apply additional restrictions.

This sounds something like a fancy version of VPD, and VPD has been a
huge headache.  I hope DOE avoids that ;)

> The protocols supported are queried and cached.  pci_doe_supports_prot()
> can be used to determine if the DOE device supports the protocol
> specified.
> 
> A synchronous interface is provided in pci_doe_exchange_sync() to
> perform a single query / response exchange from the driver through the
> device specified.
> 
> Testing was conducted against QEMU using:
> 
> https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/
> 
> This code is based on Jonathan's V4 series here:
> 
> https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/
> 
> [1] https://members.pcisig.com/wg/PCI-SIG/document/14143
>     Data Object Exchange (DOE) - Approved 12 March 2020
> 
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

I think these sign-offs are out of order, per
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=v5.14#n365
Last sign-off should be from the person who posted this.

> ---
> Changes from Jonathan's V4
> 	Move the DOE MB code into the DOE auxiliary driver
> 	Remove Task List in favor of a wait queue
> 
> Changes from Ben
> 	remove CXL references
> 	propagate rc from pci functions on error
> ---
>  drivers/pci/Kconfig           |  10 +
>  drivers/pci/Makefile          |   3 +
>  drivers/pci/doe.c             | 701 ++++++++++++++++++++++++++++++++++
>  include/linux/pci-doe.h       |  63 +++
>  include/uapi/linux/pci_regs.h |  29 +-
>  5 files changed, 805 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/pci/doe.c
>  create mode 100644 include/linux/pci-doe.h
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 0c473d75e625..b512295538ba 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -118,6 +118,16 @@ config XEN_PCIDEV_FRONTEND
>  	  The PCI device frontend driver allows the kernel to import arbitrary
>  	  PCI devices from a PCI backend to support PCI driver domains.
>  
> +config PCI_DOE_DRIVER
> +	tristate "PCI Data Object Exchange (DOE) driver"
> +	select AUXILIARY_BUS
> +	help
> +	  Driver for DOE auxiliary devices.
> +
> +	  DOE provides a simple mailbox in PCI config space that is used by a
> +	  number of different protocols.  DOE is defined in the Data Object
> +	  Exchange ECN to the PCIe r5.0 spec.
> +
>  config PCI_ATS
>  	bool
>  
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index d62c4ac4ae1b..afd9d7bd2b82 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -28,8 +28,11 @@ obj-$(CONFIG_PCI_STUB)		+= pci-stub.o
>  obj-$(CONFIG_PCI_PF_STUB)	+= pci-pf-stub.o
>  obj-$(CONFIG_PCI_ECAM)		+= ecam.o
>  obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
> +obj-$(CONFIG_PCI_DOE_DRIVER)	+= pci-doe.o
>  obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
>  
> +pci-doe-y := doe.o
> +
>  # Endpoint library must be initialized before its users
>  obj-$(CONFIG_PCI_ENDPOINT)	+= endpoint/
>  
> diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> new file mode 100644
> index 000000000000..2e702fdc7879
> --- /dev/null
> +++ b/drivers/pci/doe.c
> @@ -0,0 +1,701 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Data Object Exchange ECN
> + * https://members.pcisig.com/wg/PCI-SIG/document/14143
> + *
> + * Copyright (C) 2021 Huawei
> + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> + */
> +
> +#include <linux/bitfield.h>
> +#include <linux/delay.h>
> +#include <linux/jiffies.h>
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pci-doe.h>
> +#include <linux/workqueue.h>
> +#include <linux/module.h>
> +
> +#define PCI_DOE_PROTOCOL_DISCOVERY 0
> +
> +#define PCI_DOE_BUSY_MAX_RETRIES 16
> +#define PCI_DOE_POLL_INTERVAL (HZ / 128)
> +
> +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */
> +#define PCI_DOE_TIMEOUT HZ
> +
> +enum pci_doe_state {
> +	DOE_IDLE,
> +	DOE_WAIT_RESP,
> +	DOE_WAIT_ABORT,
> +	DOE_WAIT_ABORT_ON_ERR,
> +};
> +
> +/*
> + * struct pci_doe_task - description of a query / response task
> + * @ex: The details of the task to be done
> + * @rv: Return value.  Length of received response or error
> + * @cb: Callback for completion of task
> + * @private: Private data passed to callback on completion
> + */
> +struct pci_doe_task {
> +	struct pci_doe_exchange *ex;
> +	int rv;
> +	void (*cb)(void *private);
> +	void *private;
> +};
> +
> +/**
> + * struct pci_doe - A single DOE mailbox driver
> + *
> + * @doe_dev: The DOE Auxiliary device being driven
> + * @abort_c: Completion used for initial abort handling
> + * @irq: Interrupt used for signaling DOE ready or abort
> + * @irq_name: Name used to identify the irq for a particular DOE
> + * @prots: Array of identifiers for protocols supported
> + * @num_prots: Size of prots array
> + * @cur_task: Current task the state machine is working on
> + * @wq: Wait queue to wait on if a query is in progress
> + * @state_lock: Protect the state of cur_task, abort, and dead
> + * @statemachine: Work item for the DOE state machine
> + * @state: Current state of this DOE
> + * @timeout_jiffies: 1 second after GO set
> + * @busy_retries: Count of retry attempts
> + * @abort: Request a manual abort (e.g. on init)
> + * @dead: Used to mark a DOE for which an ABORT has timed out. Further messages
> + *        will immediately be aborted with error
> + */
> +struct pci_doe {
> +	struct pci_doe_dev *doe_dev;
> +	struct completion abort_c;
> +	int irq;
> +	char *irq_name;
> +	struct pci_doe_protocol *prots;
> +	int num_prots;
> +
> +	struct pci_doe_task *cur_task;
> +	wait_queue_head_t wq;
> +	struct mutex state_lock;
> +	struct delayed_work statemachine;
> +	enum pci_doe_state state;
> +	unsigned long timeout_jiffies;
> +	unsigned int busy_retries;
> +	unsigned int abort:1;
> +	unsigned int dead:1;
> +};
> +
> +static irqreturn_t pci_doe_irq(int irq, void *data)
> +{
> +	struct pci_doe *doe = data;
> +	struct pci_dev *pdev = doe->doe_dev->pdev;
> +	int offset = doe->doe_dev->cap_offset;
> +	u32 val;
> +
> +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val);
> +		mod_delayed_work(system_wq, &doe->statemachine, 0);
> +		return IRQ_HANDLED;
> +	}
> +	/* Leave the error case to be handled outside IRQ */
> +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> +		mod_delayed_work(system_wq, &doe->statemachine, 0);
> +		return IRQ_HANDLED;
> +	}
> +
> +	/*
> +	 * Busy being cleared can result in an interrupt, but as
> +	 * the original Busy may not have been detected, there is no
> +	 * way to separate such an interrupt from a spurious interrupt.
> +	 */
> +	return IRQ_HANDLED;
> +}
> +
> +/*
> + * Only call when safe to directly access the DOE, either because no tasks yet
> + * queued, or called from doe_statemachine_work() which has exclusive access to
> + * the DOE config space.
> + */
> +static void pci_doe_abort_start(struct pci_doe *doe)
> +{
> +	struct pci_dev *pdev = doe->doe_dev->pdev;
> +	int offset = doe->doe_dev->cap_offset;
> +	u32 val;
> +
> +	val = PCI_DOE_CTRL_ABORT;
> +	if (doe->irq)
> +		val |= PCI_DOE_CTRL_INT_EN;
> +	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> +
> +	doe->timeout_jiffies = jiffies + HZ;
> +	schedule_delayed_work(&doe->statemachine, HZ);
> +}
> +
> +static int pci_doe_send_req(struct pci_doe *doe, struct pci_doe_exchange *ex)
> +{
> +	struct pci_dev *pdev = doe->doe_dev->pdev;
> +	int offset = doe->doe_dev->cap_offset;
> +	u32 val;
> +	int i;
> +
> +	/*
> +	 * Check the DOE busy bit is not set. If it is set, this could indicate
> +	 * someone other than Linux (e.g. firmware) is using the mailbox. Note
> +	 * it is expected that firmware and OS will negotiate access rights via
> +	 * an as-yet-to-be-defined method.
> +	 */
> +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +	if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
> +		return -EBUSY;
> +
> +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> +		return -EIO;
> +
> +	/* Write DOE Header */
> +	val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, ex->prot.vid) |
> +		FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, ex->prot.type);
> +	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val);
> +	/* Length is 2 DW of header + length of payload in DW */
> +	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> +			       FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH,
> +					  2 + ex->request_pl_sz / sizeof(u32)));
> +	for (i = 0; i < ex->request_pl_sz / sizeof(u32); i++)
> +		pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> +				       ex->request_pl[i]);
> +
> +	val = PCI_DOE_CTRL_GO;
> +	if (doe->irq)
> +		val |= PCI_DOE_CTRL_INT_EN;
> +
> +	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> +	/* Request is sent - now wait for poll or IRQ */
> +	return 0;
> +}
> +
> +static int pci_doe_recv_resp(struct pci_doe *doe, struct pci_doe_exchange *ex)
> +{
> +	struct pci_dev *pdev = doe->doe_dev->pdev;
> +	int offset = doe->doe_dev->cap_offset;
> +	size_t length;
> +	u32 val;
> +	int i;
> +
> +	/* Read the first dword to get the protocol */
> +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> +	if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != ex->prot.vid) ||
> +	    (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != ex->prot.type)) {
> +		pci_err(pdev,
> +			"Expected [VID, Protocol] = [%x, %x], got [%x, %x]\n",

Maybe "%#x" so this is less ambiguous?

> +			ex->prot.vid, ex->prot.type,
> +			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
> +			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
> +		return -EIO;
> +	}
> +
> +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +	/* Read the second dword to get the length */
> +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +
> +	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> +	if (length > SZ_1M || length < 2)
> +		return -EIO;
> +
> +	/* First 2 dwords have already been read */
> +	length -= 2;
> +	/* Read the rest of the response payload */
> +	for (i = 0; i < min(length, ex->response_pl_sz / sizeof(u32)); i++) {
> +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> +				      &ex->response_pl[i]);
> +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +	}
> +
> +	/* Flush excess length */
> +	for (; i < length; i++) {
> +		pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +	}
> +	/* Final error check to pick up on any since Data Object Ready */
> +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> +		return -EIO;
> +
> +	return min(length, ex->response_pl_sz / sizeof(u32)) * sizeof(u32);
> +}
> +
> +static void pci_doe_task_complete(void *private)
> +{
> +	complete(private);
> +}
> +
> +static void doe_statemachine_work(struct work_struct *work)
> +{
> +	struct delayed_work *w = to_delayed_work(work);
> +	struct pci_doe *doe = container_of(w, struct pci_doe, statemachine);
> +	struct pci_dev *pdev = doe->doe_dev->pdev;
> +	int offset = doe->doe_dev->cap_offset;
> +	struct pci_doe_task *task;
> +	bool abort;
> +	u32 val;
> +	int rc;
> +
> +	mutex_lock(&doe->state_lock);
> +	task = doe->cur_task;
> +	abort = doe->abort;
> +	doe->abort = false;
> +	mutex_unlock(&doe->state_lock);
> +
> +	if (abort) {
> +		/*
> +		 * Currently only used during init - care needed if
> +		 * pci_doe_abort() is generally exposed as it would impact
> +		 * queries in flight.
> +		 */
> +		WARN_ON(task);
> +		doe->state = DOE_WAIT_ABORT;
> +		pci_doe_abort_start(doe);
> +		return;
> +	}
> +
> +	switch (doe->state) {
> +	case DOE_IDLE:
> +		if (task == NULL)
> +			return;
> +
> +		/* Nothing currently in flight so queue a task */
> +		rc = pci_doe_send_req(doe, task->ex);
> +		/*
> +		 * The specification does not provide any guidance on how long
> +		 * some other entity could keep the DOE busy, so try for 1
> +		 * second then fail. Busy handling is best effort only, because
> +		 * there is no way of avoiding racing against another user of
> +		 * the DOE.
> +		 */
> +		if (rc == -EBUSY) {
> +			doe->busy_retries++;
> +			if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
> +				/* Long enough, fail this request */
> +				pci_WARN(pdev, true, "DOE busy for too long\n");

I think pci_WARN() (as opposed to pci_warn()) gives us a register dump
or stacktrace.  That's useful if it might be a software problem.  But
this looks like an issue with the hardware or firmware on the adapter,
where the registers or stacktrace don't seem useful.

Maybe the busy duration would be useful?

> +				doe->busy_retries = 0;
> +				goto err_busy;
> +			}
> +			schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES);
> +			return;
> +		}
> +		if (rc)
> +			goto err_abort;
> +		doe->busy_retries = 0;
> +
> +		doe->state = DOE_WAIT_RESP;
> +		doe->timeout_jiffies = jiffies + HZ;
> +		/* Now poll or wait for IRQ with timeout */
> +		if (doe->irq > 0)
> +			schedule_delayed_work(w, PCI_DOE_TIMEOUT);
> +		else
> +			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> +		return;
> +
> +	case DOE_WAIT_RESP:
> +		/* Not possible to get here with NULL task */
> +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +		if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> +			rc = -EIO;
> +			goto err_abort;
> +		}
> +
> +		if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
> +			/* If not yet at timeout reschedule otherwise abort */
> +			if (time_after(jiffies, doe->timeout_jiffies)) {
> +				rc = -ETIMEDOUT;
> +				goto err_abort;
> +			}
> +			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> +			return;
> +		}
> +
> +		rc  = pci_doe_recv_resp(doe, task->ex);
> +		if (rc < 0)
> +			goto err_abort;
> +
> +		doe->state = DOE_IDLE;
> +
> +		mutex_lock(&doe->state_lock);
> +		doe->cur_task = NULL;
> +		mutex_unlock(&doe->state_lock);
> +		wake_up_interruptible(&doe->wq);
> +
> +		/* Set the return value to the length of received payload */
> +		task->rv = rc;
> +		task->cb(task->private);
> +
> +		return;
> +
> +	case DOE_WAIT_ABORT:
> +	case DOE_WAIT_ABORT_ON_ERR:
> +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +
> +		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> +		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> +			/* Back to normal state - carry on */
> +			mutex_lock(&doe->state_lock);
> +			doe->cur_task = NULL;
> +			mutex_unlock(&doe->state_lock);
> +			wake_up_interruptible(&doe->wq);
> +
> +			/*
> +			 * For deliberately triggered abort, someone is
> +			 * waiting.
> +			 */
> +			if (doe->state == DOE_WAIT_ABORT)
> +				complete(&doe->abort_c);
> +
> +			doe->state = DOE_IDLE;
> +			return;
> +		}
> +		if (time_after(jiffies, doe->timeout_jiffies)) {
> +			/* Task has timed out and is dead - abort */
> +			pci_err(pdev, "DOE ABORT timed out\n");
> +			mutex_lock(&doe->state_lock);
> +			doe->dead = true;
> +			doe->cur_task = NULL;
> +			mutex_unlock(&doe->state_lock);
> +			wake_up_interruptible(&doe->wq);
> +
> +			if (doe->state == DOE_WAIT_ABORT)
> +				complete(&doe->abort_c);
> +		}
> +		return;
> +	}
> +
> +err_abort:
> +	doe->state = DOE_WAIT_ABORT_ON_ERR;
> +	pci_doe_abort_start(doe);
> +err_busy:
> +	task->rv = rc;
> +	task->cb(task->private);
> +	/* If here via err_busy, signal the task done. */
> +	if (doe->state == DOE_IDLE) {
> +		mutex_lock(&doe->state_lock);
> +		doe->cur_task = NULL;
> +		mutex_unlock(&doe->state_lock);
> +		wake_up_interruptible(&doe->wq);
> +	}
> +}
> +
> +/**
> + * pci_doe_exchange_sync() - Send a request, then wait for and receive a response

Wrap to fit in 80 columns.  There are a few more.

> + * @doe: DOE mailbox state structure
> + * @ex: Description of the buffers and Vendor ID + type used in this
> + *      request/response pair
> + *
> + * Excess data will be discarded.
> + *
> + * RETURNS: payload in bytes on success, < 0 on error
> + */
> +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex)
> +{
> +	struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> +	struct pci_doe_task task;
> +	DECLARE_COMPLETION_ONSTACK(c);
> +
> +	if (!doe)
> +		return -EAGAIN;
> +
> +	/* DOE requests must be a whole number of DW */
> +	if (ex->request_pl_sz % sizeof(u32))
> +		return -EINVAL;
> +
> +	task.ex = ex;
> +	task.cb = pci_doe_task_complete;
> +	task.private = &c;
> +
> +again:
> +	mutex_lock(&doe->state_lock);
> +	if (doe->cur_task) {
> +		mutex_unlock(&doe->state_lock);
> +		wait_event_interruptible(doe->wq, doe->cur_task == NULL);
> +		goto again;
> +	}
> +
> +	if (doe->dead) {
> +		mutex_unlock(&doe->state_lock);
> +		return -EIO;
> +	}
> +	doe->cur_task = &task;
> +	schedule_delayed_work(&doe->statemachine, 0);
> +	mutex_unlock(&doe->state_lock);
> +
> +	wait_for_completion(&c);
> +
> +	return task.rv;
> +}
> +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync);
> +
> +/**
> + * pci_doe_supports_prot() - Return if the DOE instance supports the given protocol
> + * @pdev: Device on which to find the DOE instance
> + * @vid: Protocol Vendor ID
> + * @type: protocol type
> + *
> + * This device can then be passed to pci_doe_exchange_sync() to execute a mailbox
> + * exchange through that DOE mailbox.
> + *
> + * RETURNS: True if the DOE device supports the protocol specified
> + */
> +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type)
> +{
> +	struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> +	int i;
> +
> +	if (!doe)
> +		return false;
> +
> +	for (i = 0; i < doe->num_prots; i++)
> +		if ((doe->prots[i].vid == vid) &&
> +		    (doe->prots[i].type == type))
> +			return true;
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
> +
> +static int pci_doe_discovery(struct pci_doe *doe, u8 *index, u16 *vid,
> +			     u8 *protocol)
> +{
> +	u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX, *index);
> +	u32 response_pl;
> +	struct pci_doe_exchange ex = {
> +		.prot.vid = PCI_VENDOR_ID_PCI_SIG,
> +		.prot.type = PCI_DOE_PROTOCOL_DISCOVERY,
> +		.request_pl = &request_pl,
> +		.request_pl_sz = sizeof(request_pl),
> +		.response_pl = &response_pl,
> +		.response_pl_sz = sizeof(response_pl),
> +	};
> +	int ret;
> +
> +	ret = pci_doe_exchange_sync(doe->doe_dev, &ex);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (ret != sizeof(response_pl))
> +		return -EIO;
> +
> +	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> +	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL, response_pl);
> +	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX, response_pl);
> +
> +	return 0;
> +}
> +
> +static int pci_doe_cache_protocols(struct pci_doe *doe)
> +{
> +	u8 index = 0;
> +	int rc;
> +
> +	/* Discovery protocol must always be supported and must report itself */
> +	doe->num_prots = 1;
> +	doe->prots = kcalloc(doe->num_prots, sizeof(*doe->prots), GFP_KERNEL);
> +	if (doe->prots == NULL)
> +		return -ENOMEM;
> +
> +	do {
> +		struct pci_doe_protocol *prot;
> +
> +		prot = &doe->prots[doe->num_prots - 1];
> +		rc = pci_doe_discovery(doe, &index, &prot->vid, &prot->type);
> +		if (rc)
> +			goto err_free_prots;
> +
> +		if (index) {
> +			struct pci_doe_protocol *prot_new;
> +
> +			doe->num_prots++;
> +			prot_new = krealloc(doe->prots,
> +					    sizeof(*doe->prots) * doe->num_prots,
> +					    GFP_KERNEL);
> +			if (prot_new == NULL) {
> +				rc = -ENOMEM;
> +				goto err_free_prots;
> +			}
> +			doe->prots = prot_new;
> +		}
> +	} while (index);
> +
> +	return 0;
> +
> +err_free_prots:
> +	kfree(doe->prots);
> +	doe->num_prots = 0;
> +	doe->prots = NULL;
> +	return rc;
> +}
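As an aside, the grow-by-one discovery loop above can be sketched in plain
user-space C (calloc/realloc standing in for kcalloc/krealloc; the discovery
call here is a hypothetical stub, not the real mailbox exchange):

```c
#include <assert.h>
#include <stdlib.h>

struct doe_protocol { unsigned short vid; unsigned char type; };

/* Hypothetical stand-in for pci_doe_discovery(): reports the protocol at
 * *index and advances *index, writing 0 when the list is exhausted. */
static int fake_discovery(unsigned char *index, unsigned short *vid,
			  unsigned char *type)
{
	static const struct doe_protocol table[] = {
		{ 0x0001, 0 },	/* PCI-SIG discovery reports itself */
		{ 0x1e98, 2 },	/* e.g. a CXL table-access protocol */
	};
	*vid = table[*index].vid;
	*type = table[*index].type;
	*index = (*index + 1 < 2) ? *index + 1 : 0;
	return 0;
}

int cache_protocols(struct doe_protocol **prots, int *num)
{
	unsigned char index = 0;

	/* Discovery must always be supported, so start with room for one */
	*num = 1;
	*prots = calloc(*num, sizeof(**prots));
	if (!*prots)
		return -1;

	do {
		struct doe_protocol *p = &(*prots)[*num - 1];

		if (fake_discovery(&index, &p->vid, &p->type))
			goto err;

		if (index) {
			struct doe_protocol *grown;

			/* Non-zero next index: grow by one and loop */
			(*num)++;
			grown = realloc(*prots, sizeof(**prots) * *num);
			if (!grown)
				goto err;
			*prots = grown;
		}
	} while (index);

	return 0;
err:
	free(*prots);
	*prots = NULL;
	*num = 0;
	return -1;
}
```

The shape mirrors the quoted function: the array always has one free slot at
the tail, filled by the next discovery round until the device reports a next
index of zero.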
> +
> +static int pci_doe_abort(struct pci_doe *doe)
> +{
> +	reinit_completion(&doe->abort_c);
> +	mutex_lock(&doe->state_lock);
> +	doe->abort = true;
> +	mutex_unlock(&doe->state_lock);
> +	schedule_delayed_work(&doe->statemachine, 0);
> +	wait_for_completion(&doe->abort_c);
> +
> +	if (doe->dead)
> +		return -EIO;
> +
> +	return 0;
> +}
> +
> +static void pci_doe_release_irq(struct pci_doe *doe)
> +{
> +	if (doe->irq > 0)
> +		free_irq(doe->irq, doe);
> +}
> +
> +static int pci_doe_register(struct pci_doe *doe)
> +{
> +	struct pci_dev *pdev = doe->doe_dev->pdev;
> +	bool poll = !pci_dev_msi_enabled(pdev);
> +	int offset = doe->doe_dev->cap_offset;
> +	int rc, irq;
> +	u32 val;
> +
> +	pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
> +
> +	if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) {
> +		irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val));
> +		if (irq < 0)
> +			return irq;
> +
> +		doe->irq_name = kasprintf(GFP_KERNEL, "DOE[%s]",
> +					  doe->doe_dev->adev.name);
> +		if (!doe->irq_name)
> +			return -ENOMEM;
> +
> +		rc = request_irq(irq, pci_doe_irq, 0, doe->irq_name, doe);
> +		if (rc)
> +			goto err_free_name;
> +
> +		doe->irq = irq;
> +		pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
> +				       PCI_DOE_CTRL_INT_EN);
> +	}
> +
> +	/* Reset the mailbox by issuing an abort */
> +	rc = pci_doe_abort(doe);
> +	if (rc)
> +		goto err_free_irqs;
> +
> +	/* Ensure the pci device remains until this driver is done with it */

s/pci/PCI/

> +	get_device(&pdev->dev);

pci_dev_get()

There's kind of a mix of both in drivers/pci/, but since it exists, we
might as well use it.

> +
> +	return 0;
> +
> +err_free_irqs:
> +	pci_doe_release_irq(doe);
> +err_free_name:
> +	kfree(doe->irq_name);
> +	return rc;
> +}
> +
> +static void pci_doe_unregister(struct pci_doe *doe)
> +{
> +	pci_doe_release_irq(doe);
> +	kfree(doe->irq_name);
> +	put_device(&doe->doe_dev->pdev->dev);

pci_dev_put()

> +}
> +
> +/*
> + * pci_doe_probe() - Set up the Mailbox
> + * @aux_dev: Auxiliary Device
> + * @id: Auxiliary device ID
> + *
> + * Probe the mailbox found for all protocols and set up the Mailbox
> + *
> + * RETURNS: 0 on success, < 0 on error
> + */
> +static int pci_doe_probe(struct auxiliary_device *aux_dev,
> +			 const struct auxiliary_device_id *id)
> +{
> +	struct pci_doe_dev *doe_dev = container_of(aux_dev,
> +					struct pci_doe_dev,
> +					adev);
> +	struct pci_doe *doe;
> +	int rc;
> +
> +	doe = kzalloc(sizeof(*doe), GFP_KERNEL);
> +	if (!doe)
> +		return -ENOMEM;
> +
> +	mutex_init(&doe->state_lock);
> +	init_completion(&doe->abort_c);
> +	doe->doe_dev = doe_dev;
> +	init_waitqueue_head(&doe->wq);
> +	INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work);
> +	dev_set_drvdata(&aux_dev->dev, doe);
> +
> +	rc = pci_doe_register(doe);
> +	if (rc)
> +		goto err_free;
> +
> +	rc = pci_doe_cache_protocols(doe);
> +	if (rc) {
> +		pci_doe_unregister(doe);
> +		goto err_free;
> +	}
> +
> +	return 0;
> +
> +err_free:
> +	kfree(doe);
> +	return rc;
> +}
> +
> +static void pci_doe_remove(struct auxiliary_device *aux_dev)
> +{
> +	struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev);
> +
> +	/* First halt the state machine */
> +	cancel_delayed_work_sync(&doe->statemachine);
> +	kfree(doe->prots);
> +	pci_doe_unregister(doe);
> +	kfree(doe);
> +}
> +
> +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = {
> +	{.name = "cxl_pci.doe", },
> +	{},
> +};
> +
> +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table);
> +
> +struct auxiliary_driver pci_doe_auxiliary_drv = {
> +	.name = "pci_doe_drv",
> +	.id_table = pci_doe_auxiliary_id_table,
> +	.probe = pci_doe_probe,
> +	.remove = pci_doe_remove
> +};
> +
> +static int __init pci_doe_init_module(void)
> +{
> +	int ret;
> +
> +	ret = auxiliary_driver_register(&pci_doe_auxiliary_drv);
> +	if (ret) {
> +		pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n",
> +		       ret);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void __exit pci_doe_exit_module(void)
> +{
> +	auxiliary_driver_unregister(&pci_doe_auxiliary_drv);
> +}
> +
> +module_init(pci_doe_init_module);
> +module_exit(pci_doe_exit_module);
> +MODULE_LICENSE("GPL v2");

What is the benefit of this being loadable as opposed to being static?

> diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
> new file mode 100644
> index 000000000000..8380b7ad33d4
> --- /dev/null
> +++ b/include/linux/pci-doe.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec.
> + *
> + * Copyright (C) 2021 Huawei
> + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> + */
> +
> +#include <linux/completion.h>
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +#include <linux/auxiliary_bus.h>
> +
> +#ifndef LINUX_PCI_DOE_H
> +#define LINUX_PCI_DOE_H
> +
> +#define DOE_DEV_NAME "doe"

This isn't used here and should move to the patch that needs it.

> +struct pci_doe_protocol {
> +	u16 vid;
> +	u8 type;
> +};
> +
> +/**
> + * struct pci_doe_exchange - represents a single query/response
> + *
> + * @prot: DOE Protocol
> + * @request_pl: The request payload
> + * @request_pl_sz: Size of the request payload
> + * @response_pl: The response payload
> + * @response_pl_sz: Size of the response payload
> + */
> +struct pci_doe_exchange {
> +	struct pci_doe_protocol prot;
> +	u32 *request_pl;
> +	size_t request_pl_sz;
> +	u32 *response_pl;
> +	size_t response_pl_sz;
> +};
> +
> +/**
> + * struct pci_doe_dev - DOE mailbox device
> + *
> + * @adrv: Auxiliary Driver data
> + * @pdev: PCI device this belongs to
> + * @offset: Capability offset
> + *
> + * This represents a single DOE mailbox device.  Devices should create this
> + * device and register it on the Auxiliary bus for the DOE driver to maintain.
> + *
> + */
> +struct pci_doe_dev {
> +	struct auxiliary_device adev;
> +	struct pci_dev *pdev;
> +	int cap_offset;
> +};
> +
> +/* Library operations */
> +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev,
> +				 struct pci_doe_exchange *ex);
> +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type);
> +
> +#endif
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index e709ae8235e7..1073cd1916e1 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -730,7 +730,8 @@
>  #define PCI_EXT_CAP_ID_DVSEC	0x23	/* Designated Vendor-Specific */
>  #define PCI_EXT_CAP_ID_DLF	0x25	/* Data Link Feature */
>  #define PCI_EXT_CAP_ID_PL_16GT	0x26	/* Physical Layer 16.0 GT/s */
> -#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_PL_16GT
> +#define PCI_EXT_CAP_ID_DOE	0x2E	/* Data Object Exchange */
> +#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_DOE
>  
>  #define PCI_EXT_CAP_DSN_SIZEOF	12
>  #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
> @@ -1092,4 +1093,30 @@
>  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK		0x000000F0
>  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT	4
>  
> +/* Data Object Exchange */
> +#define PCI_DOE_CAP            0x04    /* DOE Capabilities Register */
> +#define  PCI_DOE_CAP_INT                       0x00000001  /* Interrupt Support */
> +#define  PCI_DOE_CAP_IRQ                       0x00000ffe  /* Interrupt Message Number */
> +#define PCI_DOE_CTRL           0x08    /* DOE Control Register */
> +#define  PCI_DOE_CTRL_ABORT                    0x00000001  /* DOE Abort */
> +#define  PCI_DOE_CTRL_INT_EN                   0x00000002  /* DOE Interrupt Enable */
> +#define  PCI_DOE_CTRL_GO                       0x80000000  /* DOE Go */
> +#define PCI_DOE_STATUS         0x0c    /* DOE Status Register */
> +#define  PCI_DOE_STATUS_BUSY                   0x00000001  /* DOE Busy */
> +#define  PCI_DOE_STATUS_INT_STATUS             0x00000002  /* DOE Interrupt Status */
> +#define  PCI_DOE_STATUS_ERROR                  0x00000004  /* DOE Error */
> +#define  PCI_DOE_STATUS_DATA_OBJECT_READY      0x80000000  /* Data Object Ready */
> +#define PCI_DOE_WRITE          0x10    /* DOE Write Data Mailbox Register */
> +#define PCI_DOE_READ           0x14    /* DOE Read Data Mailbox Register */
> +
> +/* DOE Data Object - note not actually registers */
> +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID       0x0000ffff
> +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE      0x00ff0000
> +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH    0x0003ffff
> +
> +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX   0x000000ff
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID     0x0000ffff
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL        0x00ff0000
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX 0xff000000

Conventional indentation is via tabs when possible; the above all
appear to be spaces.
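For illustration, the discovery response dword defined by the masks above
unpacks as follows (plain C with an open-coded FIELD_GET(); the mask values
are copied from the hunk, the helper names are made up):

```c
#include <assert.h>
#include <stdint.h>

#define DISC_RSP_3_VID		0x0000ffffu
#define DISC_RSP_3_PROTOCOL	0x00ff0000u
#define DISC_RSP_3_NEXT_INDEX	0xff000000u

/* Open-coded FIELD_GET(): mask, then shift down by the mask's low bit */
static uint32_t field_get(uint32_t mask, uint32_t val)
{
	return (val & mask) >> __builtin_ctz(mask);
}

void parse_disc_rsp(uint32_t rsp, uint16_t *vid, uint8_t *prot,
		    uint8_t *next)
{
	*vid = field_get(DISC_RSP_3_VID, rsp);
	*prot = field_get(DISC_RSP_3_PROTOCOL, rsp);
	*next = field_get(DISC_RSP_3_NEXT_INDEX, rsp);
}
```

A response of 0x01021e98 thus decodes to vendor 0x1e98, protocol type 2,
next index 1, matching how pci_doe_discovery() above walks the list.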

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices
  2021-11-05 23:50 ` [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices ira.weiny
  2021-11-08 13:09   ` Jonathan Cameron
@ 2021-11-16 23:48   ` Bjorn Helgaas
  2021-11-17 12:23     ` Jonathan Cameron
  1 sibling, 1 reply; 37+ messages in thread
From: Bjorn Helgaas @ 2021-11-16 23:48 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Jonathan Cameron, Alison Schofield, Vishal Verma,
	Ben Widawsky, Bjorn Helgaas, linux-cxl, linux-pci

On Fri, Nov 05, 2021 at 04:50:54PM -0700, ira.weiny@intel.com wrote:
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> CXL devices have DOE mailboxes.  Create auxiliary devices which can be
> driven by the generic DOE auxiliary driver.

I admit to not being thrilled about joining the elite group of six
users of auxiliary_device_init(), and I don't know exactly what
benefits the auxiliary devices have.

Based on the ECN, it sounds like any PCI device can have DOE
capabilities, so I suspect the support for it should be in
drivers/pci/, not drivers/cxl/.  I don't really see anything
CXL-specific below.

What do these DOE capabilities look like in lspci?  I don't see any
support in the current version (which looks like it's a year old).

> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> ---
> Changes from V4:
> 	Make this an Auxiliary Driver rather than library functions
> 	Split this out into its own patch
> 	Base on the new cxl_dev_state structure
> 
> Changes from Ben
> 	s/CXL_DOE_DEV_NAME/DOE_DEV_NAME/
> ---
>  drivers/cxl/Kconfig |   1 +
>  drivers/cxl/cxl.h   |  13 +++++
>  drivers/cxl/pci.c   | 120 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 134 insertions(+)
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 67c91378f2dd..9d53720bea07 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -16,6 +16,7 @@ if CXL_BUS
>  config CXL_MEM
>  	tristate "CXL.mem: Memory Devices"
>  	default CXL_BUS
> +	select PCI_DOE_DRIVER
>  	help
>  	  The CXL.mem protocol allows a device to act as a provider of
>  	  "System RAM" and/or "Persistent Memory" that is fully coherent
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 5e2e93451928..f1241a7f2b7b 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -75,6 +75,19 @@ static inline int cxl_hdm_decoder_count(u32 cap_hdr)
>  #define CXLDEV_MBOX_BG_CMD_STATUS_OFFSET 0x18
>  #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
>  
> +/*
> + * Address space properties derived from:
> + * CXL 2.0 8.2.5.12.7 CXL HDM Decoder 0 Control Register
> + */
> +#define CXL_ADDRSPACE_RAM   BIT(0)
> +#define CXL_ADDRSPACE_PMEM  BIT(1)
> +#define CXL_ADDRSPACE_TYPE2 BIT(2)
> +#define CXL_ADDRSPACE_TYPE3 BIT(3)
> +#define CXL_ADDRSPACE_MASK  GENMASK(3, 0)
> +
> +#define CXL_DOE_PROTOCOL_COMPLIANCE 0
> +#define CXL_DOE_PROTOCOL_TABLE_ACCESS 2

None of these are used here, so they belong in a different patch.

>  #define CXL_COMPONENT_REGS() \
>  	void __iomem *hdm_decoder
>  
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 8dc91fd3396a..df524b74f1d2 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -6,6 +6,7 @@
>  #include <linux/mutex.h>
>  #include <linux/list.h>
>  #include <linux/pci.h>
> +#include <linux/pci-doe.h>
>  #include <linux/io.h>
>  #include "cxlmem.h"
>  #include "pci.h"
> @@ -471,6 +472,120 @@ static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
>  	return rc;
>  }
>  
> +static void cxl_mem_free_irq_vectors(void *data)
> +{
> +	pci_free_irq_vectors(data);
> +}
> +
> +static void cxl_destroy_doe_device(void *ad)
> +{
> +	struct auxiliary_device *adev = ad;
> +
> +	auxiliary_device_delete(adev);
> +	auxiliary_device_uninit(adev);
> +}
> +
> +static DEFINE_IDA(cxl_doe_adev_ida);
> +static void __doe_dev_release(struct auxiliary_device *adev)

Why the "__" prefix?  I don't see any similar name that requires
disambiguation.

> +{
> +	struct pci_doe_dev *doe_dev = container_of(adev, struct pci_doe_dev,
> +						   adev);
> +
> +	ida_free(&cxl_doe_adev_ida, adev->id);
> +	kfree(doe_dev);
> +}
> +
> +static void cxl_doe_dev_release(struct device *dev)
> +{
> +	struct auxiliary_device *adev = container_of(dev,
> +						struct auxiliary_device,
> +						dev);
> +	__doe_dev_release(adev);
> +}
> +
> +static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
> +{
> +	struct device *dev = cxlds->dev;
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	int irqs, rc;
> +	u16 pos = 0;
> +
> +	/*
> +	 * An implementation of a cxl type3 device may support an unknown
> +	 * number of interrupts. Assume that number is not that large and
> +	 * request them all.
> +	 */
> +	irqs = pci_msix_vec_count(pdev);
> +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);
> +	if (rc != irqs) {
> +		/* No interrupt available - carry on */
> +		dev_dbg(dev, "No interrupts available for DOE\n");
> +	} else {
> +		/*
> +		 * Enabling bus mastering could be done within the DOE
> +		 * initialization, but as it potentially has other impacts
> +		 * keep it within the driver.
> +		 */
> +		pci_set_master(pdev);

This enables the device to perform DMA, which doesn't seem to have
anything to do with the rest of this code.  Can it go somewhere near
something to do with DMA?

> +		rc = devm_add_action_or_reset(dev,
> +					      cxl_mem_free_irq_vectors,
> +					      pdev);
> +		if (rc)
> +			return rc;
> +	}
> +
> +	pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
> +
> +	while (pos > 0) {
> +		struct auxiliary_device *adev;
> +		struct pci_doe_dev *new_dev;
> +		int id;
> +
> +		new_dev = kzalloc(sizeof(*new_dev), GFP_KERNEL);
> +		if (!new_dev)
> +			return -ENOMEM;
> +
> +		new_dev->pdev = pdev;
> +		new_dev->cap_offset = pos;
> +
> +		/* Set up struct auxiliary_device */
> +		adev = &new_dev->adev;
> +		id = ida_alloc(&cxl_doe_adev_ida, GFP_KERNEL);
> +		if (id < 0) {
> +			kfree(new_dev);
> +			return -ENOMEM;
> +		}
> +
> +		adev->id = id;
> +		adev->name = DOE_DEV_NAME;
> +		adev->dev.release = cxl_doe_dev_release;
> +		adev->dev.parent = dev;
> +
> +		if (auxiliary_device_init(adev)) {
> +			__doe_dev_release(adev);
> +			return -EIO;
> +		}
> +
> +		if (auxiliary_device_add(adev)) {
> +			auxiliary_device_uninit(adev);
> +			return -EIO;
> +		}
> +
> +		rc = devm_add_action_or_reset(dev, cxl_destroy_doe_device, adev);
> +		if (rc)
> +			return rc;
> +
> +		if (device_attach(&adev->dev) != 1)
> +			dev_err(&adev->dev,
> +				"Failed to attach a driver to DOE device %d\n",
> +				adev->id);
> +
> +		pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);

So we get an auxiliary device for every instance of a DOE capability?
I think the commit log should mention something about how many are
created (e.g., "one per DOE capability"), how they are named, whether
they appear in sysfs, how drivers bind to them, etc.

I assume there needs to be some coordination between possible multiple
users of a DOE capability?  How does that work?

> +	}
> +
> +	return 0;
> +}
> +
>  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  {
>  	struct cxl_register_map map;
> @@ -517,6 +632,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>  
> +	rc = cxl_setup_doe_devices(cxlds);
> +	if (rc)
> +		return rc;
> +
>  	cxlmd = devm_cxl_add_memdev(cxlds);
>  	if (IS_ERR(cxlmd))
>  		return PTR_ERR(cxlmd);
> @@ -546,3 +665,4 @@ static struct pci_driver cxl_pci_driver = {
>  MODULE_LICENSE("GPL v2");
>  module_pci_driver(cxl_pci_driver);
>  MODULE_IMPORT_NS(CXL);
> +MODULE_SOFTDEP("pre: pci_doe");
> -- 
> 2.28.0.rc0.12.gb6a658bd00c9
> 


* Re: [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices
  2021-11-16 23:48   ` Bjorn Helgaas
@ 2021-11-17 12:23     ` Jonathan Cameron
  2021-11-17 22:15       ` Bjorn Helgaas
  0 siblings, 1 reply; 37+ messages in thread
From: Jonathan Cameron @ 2021-11-17 12:23 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: ira.weiny, Dan Williams, Alison Schofield, Vishal Verma,
	Ben Widawsky, Bjorn Helgaas, linux-cxl, linux-pci

On Tue, 16 Nov 2021 17:48:29 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Fri, Nov 05, 2021 at 04:50:54PM -0700, ira.weiny@intel.com wrote:
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 
> > CXL devices have DOE mailboxes.  Create auxiliary devices which can be
> > driven by the generic DOE auxiliary driver.  
> 
> I admit to not being thrilled about joining the elite group of six
> users of auxiliary_device_init(), and I don't know exactly what
> benefits the auxiliary devices have.

One for Dan...

> 
> Based on the ECN, it sounds like any PCI device can have DOE
> capabilities, so I suspect the support for it should be in
> drivers/pci/, not drivers/cxl/.  I don't really see anything
> CXL-specific below.

Agreed, though how it all gets tied together isn't totally clear
to me yet. The messy bit is interrupts, given that I don't think we have
a model for enabling those anywhere other than in individual PCI drivers.

> 
> What do these DOE capabilities look like in lspci?  I don't see any
> support in the current version (which looks like it's a year old).

I don't think anyone has added support yet, but it would be simple to do.
Given the possibility of breaking things if we actually exercise the discovery
protocol, we'll be constrained to just reporting that a DOE instance exists,
which is of limited use.

> 
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 
> > ---
> > Changes from V4:
> > 	Make this an Auxiliary Driver rather than library functions
> > 	Split this out into its own patch
> > 	Base on the new cxl_dev_state structure
> > 
> > Changes from Ben
> > 	s/CXL_DOE_DEV_NAME/DOE_DEV_NAME/
> > ---
> >  drivers/cxl/Kconfig |   1 +
> >  drivers/cxl/cxl.h   |  13 +++++
> >  drivers/cxl/pci.c   | 120 ++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 134 insertions(+)
> > 
> > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > index 67c91378f2dd..9d53720bea07 100644
> > --- a/drivers/cxl/Kconfig
> > +++ b/drivers/cxl/Kconfig
> > @@ -16,6 +16,7 @@ if CXL_BUS
> >  config CXL_MEM
> >  	tristate "CXL.mem: Memory Devices"
> >  	default CXL_BUS
> > +	select PCI_DOE_DRIVER
> >  	help
> >  	  The CXL.mem protocol allows a device to act as a provider of
> >  	  "System RAM" and/or "Persistent Memory" that is fully coherent
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index 5e2e93451928..f1241a7f2b7b 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -75,6 +75,19 @@ static inline int cxl_hdm_decoder_count(u32 cap_hdr)
> >  #define CXLDEV_MBOX_BG_CMD_STATUS_OFFSET 0x18
> >  #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
> >  
> > +/*
> > + * Address space properties derived from:
> > + * CXL 2.0 8.2.5.12.7 CXL HDM Decoder 0 Control Register
> > + */
> > +#define CXL_ADDRSPACE_RAM   BIT(0)
> > +#define CXL_ADDRSPACE_PMEM  BIT(1)
> > +#define CXL_ADDRSPACE_TYPE2 BIT(2)
> > +#define CXL_ADDRSPACE_TYPE3 BIT(3)
> > +#define CXL_ADDRSPACE_MASK  GENMASK(3, 0)
> > +
> > +#define CXL_DOE_PROTOCOL_COMPLIANCE 0
> > +#define CXL_DOE_PROTOCOL_TABLE_ACCESS 2  
> 
> None of these are used here, so they belong in a different patch.
> 
> >  #define CXL_COMPONENT_REGS() \
> >  	void __iomem *hdm_decoder
> >  
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index 8dc91fd3396a..df524b74f1d2 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -6,6 +6,7 @@
> >  #include <linux/mutex.h>
> >  #include <linux/list.h>
> >  #include <linux/pci.h>
> > +#include <linux/pci-doe.h>
> >  #include <linux/io.h>
> >  #include "cxlmem.h"
> >  #include "pci.h"
> > @@ -471,6 +472,120 @@ static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
> >  	return rc;
> >  }
> >  
> > +static void cxl_mem_free_irq_vectors(void *data)
> > +{
> > +	pci_free_irq_vectors(data);
> > +}
> > +
> > +static void cxl_destroy_doe_device(void *ad)
> > +{
> > +	struct auxiliary_device *adev = ad;
> > +
> > +	auxiliary_device_delete(adev);
> > +	auxiliary_device_uninit(adev);
> > +}
> > +
> > +static DEFINE_IDA(cxl_doe_adev_ida);
> > +static void __doe_dev_release(struct auxiliary_device *adev)  
> 
> Why the "__" prefix?  I don't see any similar name that requires
> disambiguation.
> 
> > +{
> > +	struct pci_doe_dev *doe_dev = container_of(adev, struct pci_doe_dev,
> > +						   adev);
> > +
> > +	ida_free(&cxl_doe_adev_ida, adev->id);
> > +	kfree(doe_dev);
> > +}
> > +
> > +static void cxl_doe_dev_release(struct device *dev)
> > +{
> > +	struct auxiliary_device *adev = container_of(dev,
> > +						struct auxiliary_device,
> > +						dev);
> > +	__doe_dev_release(adev);
> > +}
> > +
> > +static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
> > +{
> > +	struct device *dev = cxlds->dev;
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +	int irqs, rc;
> > +	u16 pos = 0;
> > +
> > +	/*
> > +	 * An implementation of a cxl type3 device may support an unknown
> > +	 * number of interrupts. Assume that number is not that large and
> > +	 * request them all.
> > +	 */
> > +	irqs = pci_msix_vec_count(pdev);
> > +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);
> > +	if (rc != irqs) {
> > +		/* No interrupt available - carry on */
> > +		dev_dbg(dev, "No interrupts available for DOE\n");
> > +	} else {
> > +		/*
> > +		 * Enabling bus mastering could be done within the DOE
> > +		 * initialization, but as it potentially has other impacts
> > +		 * keep it within the driver.
> > +		 */
> > +		pci_set_master(pdev);  
> 
> This enables the device to perform DMA, which doesn't seem to have
> anything to do with the rest of this code.  Can it go somewhere near
> something to do with DMA?

Needed for MSI/MSI-X as well.  The driver doesn't do DMA for anything else.
Hence it's here in the interrupt enable path.

> 
> > +		rc = devm_add_action_or_reset(dev,
> > +					      cxl_mem_free_irq_vectors,
> > +					      pdev);
> > +		if (rc)
> > +			return rc;
> > +	}
> > +
> > +	pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
> > +
> > +	while (pos > 0) {
> > +		struct auxiliary_device *adev;
> > +		struct pci_doe_dev *new_dev;
> > +		int id;
> > +
> > +		new_dev = kzalloc(sizeof(*new_dev), GFP_KERNEL);
> > +		if (!new_dev)
> > +			return -ENOMEM;
> > +
> > +		new_dev->pdev = pdev;
> > +		new_dev->cap_offset = pos;
> > +
> > +		/* Set up struct auxiliary_device */
> > +		adev = &new_dev->adev;
> > +		id = ida_alloc(&cxl_doe_adev_ida, GFP_KERNEL);
> > +		if (id < 0) {
> > +			kfree(new_dev);
> > +			return -ENOMEM;
> > +		}
> > +
> > +		adev->id = id;
> > +		adev->name = DOE_DEV_NAME;
> > +		adev->dev.release = cxl_doe_dev_release;
> > +		adev->dev.parent = dev;
> > +
> > +		if (auxiliary_device_init(adev)) {
> > +			__doe_dev_release(adev);
> > +			return -EIO;
> > +		}
> > +
> > +		if (auxiliary_device_add(adev)) {
> > +			auxiliary_device_uninit(adev);
> > +			return -EIO;
> > +		}
> > +
> > +		rc = devm_add_action_or_reset(dev, cxl_destroy_doe_device, adev);
> > +		if (rc)
> > +			return rc;
> > +
> > +		if (device_attach(&adev->dev) != 1)
> > +			dev_err(&adev->dev,
> > +				"Failed to attach a driver to DOE device %d\n",
> > +				adev->id);
> > +
> > +		pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);  
> 
> So we get an auxiliary device for every instance of a DOE capability?
> I think the commit log should mention something about how many are
> created (e.g., "one per DOE capability"), how they are named, whether
> they appear in sysfs, how drivers bind to them, etc.
> 
> I assume there needs to be some coordination between possible multiple
> users of a DOE capability?  How does that work?

The DOE handling implementation makes everything synchronous, so multiple
users may each have to wait to queue their query/response exchanges.

The fun of non-OS software accessing these is still an open question.

Jonathan

> 
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  {
> >  	struct cxl_register_map map;
> > @@ -517,6 +632,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  	if (rc)
> >  		return rc;
> >  
> > +	rc = cxl_setup_doe_devices(cxlds);
> > +	if (rc)
> > +		return rc;
> > +
> >  	cxlmd = devm_cxl_add_memdev(cxlds);
> >  	if (IS_ERR(cxlmd))
> >  		return PTR_ERR(cxlmd);
> > @@ -546,3 +665,4 @@ static struct pci_driver cxl_pci_driver = {
> >  MODULE_LICENSE("GPL v2");
> >  module_pci_driver(cxl_pci_driver);
> >  MODULE_IMPORT_NS(CXL);
> > +MODULE_SOFTDEP("pre: pci_doe");
> > -- 
> > 2.28.0.rc0.12.gb6a658bd00c9
> >   



* Re: [PATCH 1/5] PCI: Add vendor ID for the PCI SIG
  2021-11-05 23:50 ` [PATCH 1/5] PCI: Add vendor ID for the PCI SIG ira.weiny
@ 2021-11-17 21:50   ` Bjorn Helgaas
  0 siblings, 0 replies; 37+ messages in thread
From: Bjorn Helgaas @ 2021-11-17 21:50 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Jonathan Cameron, Alison Schofield, Vishal Verma,
	Ben Widawsky, Bjorn Helgaas, linux-cxl, linux-pci

On Fri, Nov 05, 2021 at 04:50:52PM -0700, ira.weiny@intel.com wrote:
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> This ID is used in DOE headers to identify protocols that are defined
> within the PCI Express Base Specification.
> 
> Specified in Table 7-x2 of the Data Object Exchange ECN (approved 12 March
> 2020) available from https://members.pcisig.com/wg/PCI-SIG/document/14143
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  include/linux/pci_ids.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index 011f2f1ea5bb..849f514cd7db 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -149,6 +149,7 @@
>  #define PCI_CLASS_OTHERS		0xff
>  
>  /* Vendors and devices.  Sort key: vendor first, device next. */
> +#define PCI_VENDOR_ID_PCI_SIG		0x0001

We should probably also use this in pci_bus_crs_vendor_id().

>  #define PCI_VENDOR_ID_LOONGSON		0x0014
>  
> -- 
> 2.28.0.rc0.12.gb6a658bd00c9
> 


* Re: [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices
  2021-11-17 12:23     ` Jonathan Cameron
@ 2021-11-17 22:15       ` Bjorn Helgaas
  2021-11-18 10:51         ` Jonathan Cameron
  2021-11-19  6:48         ` Christoph Hellwig
  0 siblings, 2 replies; 37+ messages in thread
From: Bjorn Helgaas @ 2021-11-17 22:15 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: ira.weiny, Dan Williams, Alison Schofield, Vishal Verma,
	Ben Widawsky, Bjorn Helgaas, linux-cxl, linux-pci,
	Christoph Hellwig, Thomas Gleixner

[+cc Christoph, Thomas for INTx/MSI/bus mastering question below]

On Wed, Nov 17, 2021 at 12:23:35PM +0000, Jonathan Cameron wrote:
> On Tue, 16 Nov 2021 17:48:29 -0600
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Fri, Nov 05, 2021 at 04:50:54PM -0700, ira.weiny@intel.com wrote:
> > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > 
> > > CXL devices have DOE mailboxes.  Create auxiliary devices which can be
> > > driven by the generic DOE auxiliary driver.  

> > Based on the ECN, it sounds like any PCI device can have DOE
> > capabilities, so I suspect the support for it should be in
> > drivers/pci/, not drivers/cxl/.  I don't really see anything
> > CXL-specific below.
> 
> Agreed, though how it all gets tied together isn't totally clear
> to me yet. The messy bit is interrupts, given that I don't think we have
> a model for enabling those anywhere other than in individual PCI drivers.

Ah.  Yeah, that is a little messy.  The only real precedent where the
PCI core and a driver might need to coordinate on interrupts is the
portdrv.  So far we've pretended that bridges do not have
device-specific functionality that might require interrupts.  I don't
think that's actually true, but we haven't integrated drivers for the
tuning, performance monitoring, and similar features that bridges may
have.  Yet.

In any case, I think the argument that DOE capabilities are not
CXL-specific still holds.

> > What do these DOE capabilities look like in lspci?  I don't see any
> > support in the current version (which looks like it's a year old).
> 
> I don't think anyone has added support yet, but it would be simple to do.
> Given the possibility of breaking things if we actually exercise the discovery
> protocol, we'll be constrained to just reporting that a DOE instance exists,
> which is of limited use.

I think it's essential that lspci at least show the existence of DOE
capabilities and the safe-to-read registers (Capabilities, Control,
Status).

There's a very long lead time between adding the support and getting
updated versions of lspci into distros.

> > > +	 * An implementation of a cxl type3 device may support an unknown
> > > +	 * number of interrupts. Assume that number is not that large and
> > > +	 * request them all.
> > > +	 */
> > > +	irqs = pci_msix_vec_count(pdev);
> > > +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);
> > > +	if (rc != irqs) {
> > > +		/* No interrupt available - carry on */
> > > +		dev_dbg(dev, "No interrupts available for DOE\n");
> > > +	} else {
> > > +		/*
> > > +		 * Enabling bus mastering could be done within the DOE
> > > +		 * initialization, but as it potentially has other impacts
> > > +		 * keep it within the driver.
> > > +		 */
> > > +		pci_set_master(pdev);  
> > 
> > This enables the device to perform DMA, which doesn't seem to have
> > anything to do with the rest of this code.  Can it go somewhere
> > near something to do with DMA?
> 
> Needed for MSI/MSI-X as well.  The driver doesn't do DMA for anything
> else.  Hence it's here in the interrupt enable path.

Oh, right, of course.  A hint here that MSI/MSI-X depends on bus
mastering would save me the trouble.

I wonder if the infrastructure, e.g., something inside
pci_alloc_irq_vectors_affinity() should do this for us.  The
connection is "obvious" but not mentioned in
Documentation/PCI/msi-howto.rst and I'm not sure how callers that
supply PCI_IRQ_ALL_TYPES would know whether they got a single MSI
vector (which requires bus mastering) or an INTx vector (which does
not).

> > So we get an auxiliary device for every instance of a DOE
> > capability?  I think the commit log should mention something about
> > how many are created (e.g., "one per DOE capability"), how they
> > are named, whether they appear in sysfs, how drivers bind to them,
> > etc.
> > 
> > I assume there needs to be some coordination between possible
> > multiple users of a DOE capability?  How does that work?
> 
> The DOE handling implementation makes everything synchronous - so
> multiple users may each have to wait on queueing their query /
> response exchanges.
> 
> The fun of non-OS software accessing these is still an open
> question.

Sounds like something that potentially could be wrapped up in a safe
but slow interface that could be usable by others, including lspci?

Bjorn

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices
  2021-11-17 22:15       ` Bjorn Helgaas
@ 2021-11-18 10:51         ` Jonathan Cameron
  2021-11-19  6:48         ` Christoph Hellwig
  1 sibling, 0 replies; 37+ messages in thread
From: Jonathan Cameron @ 2021-11-18 10:51 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: ira.weiny, Dan Williams, Alison Schofield, Vishal Verma,
	Ben Widawsky, Bjorn Helgaas, linux-cxl, linux-pci,
	Christoph Hellwig, Thomas Gleixner

On Wed, 17 Nov 2021 16:15:36 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> [+cc Christoph, Thomas for INTx/MSI/bus mastering question below]
> 
> On Wed, Nov 17, 2021 at 12:23:35PM +0000, Jonathan Cameron wrote:
> > On Tue, 16 Nov 2021 17:48:29 -0600
> > Bjorn Helgaas <helgaas@kernel.org> wrote:  
> > > On Fri, Nov 05, 2021 at 04:50:54PM -0700, ira.weiny@intel.com wrote:  
> > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > > 
> > > > CXL devices have DOE mailboxes.  Create auxiliary devices which can be
> > > > driven by the generic DOE auxiliary driver.    
> 
> > > Based on the ECN, it sounds like any PCI device can have DOE
> > > capabilities, so I suspect the support for it should be in
> > > drivers/pci/, not drivers/cxl/.  I don't really see anything
> > > CXL-specific below.  
> > 
> > Agreed though how it all gets tied together isn't totally clear
> > to me yet. The messy bit is interrupts given I don't think we have
> > a model for enabling those anywhere other than in individual PCI drivers.  
> 
> Ah.  Yeah, that is a little messy.  The only real precedent where the
> PCI core and a driver might need to coordinate on interrupts is the
> portdrv.  So far we've pretended that bridges do not have
> device-specific functionality that might require interrupts.  I don't
> think that's actually true, but we haven't integrated drivers for the
> tuning, performance monitoring, and similar features that bridges may
> have.  Yet.

Upstream ports of CXL switches will have DOE / CDAT - though no one has
one yet, and we haven't emulated one in QEMU yet either (will do that
shortly).  Also CMA (and possibly IDE) will be a requirement for switches
soon.  Still, all that stuff is at least in various specs, so it can
probably fit in the existing portdrv framework.

I dropped work on our RP / portdrv PMU from a few years back because we broke the
hardware out as a separate RCiEP.  Hacking custom support into portdrv wasn't
pretty IIRC. 

> 
> In any case, I think the argument that DOE capabilities are not
> CXL-specific still holds.

Agreed.

> 
> > > What do these DOE capabilities look like in lspci?  I don't see any
> > > support in the current version (which looks like it's a year old).  
> > 
> > I don't think anyone has added support yet, but it would be simple to do.
> > Given possibility of breaking things if we actually exercise the discovery
> > protocol, we'll be constrained to just reporting that there are DOE
> > instances, which is of limited use.
> 
> I think it's essential that lspci at least show the existence of DOE
> capabilities and the safe-to-read registers (Capabilities, Control,
> Status).
> 
> There's a very long lead time between adding the support and getting
> updated versions of lspci into distros.

I'll add lspci support to my todo list.

> 
> > > > +	 * An implementation of a cxl type3 device may support an unknown
> > > > +	 * number of interrupts. Assume that number is not that large and
> > > > +	 * request them all.
> > > > +	 */
> > > > +	irqs = pci_msix_vec_count(pdev);
> > > > +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);
> > > > +	if (rc != irqs) {
> > > > +		/* No interrupt available - carry on */
> > > > +		dev_dbg(dev, "No interrupts available for DOE\n");
> > > > +	} else {
> > > > +		/*
> > > > +		 * Enabling bus mastering could be done within the DOE
> > > > +		 * initialization, but as it potentially has other impacts
> > > > +		 * keep it within the driver.
> > > > +		 */
> > > > +		pci_set_master(pdev);    
> > > 
> > > This enables the device to perform DMA, which doesn't seem to have
> > > anything to do with the rest of this code.  Can it go somewhere
> > > near something to do with DMA?  
> > 
> > Needed for MSI/MSI-X as well.  The driver doesn't do DMA for anything
> > else.  Hence it's here in the interrupt enable path.
> 
> Oh, right, of course.  A hint here that MSI/MSI-X depends on bus
> mastering would save me the trouble.

Ira, please add a comment when you respin.  Thanks!

> 
> I wonder if the infrastructure, e.g., something inside
> pci_alloc_irq_vectors_affinity() should do this for us.  The
> connection is "obvious" but not mentioned in
> Documentation/PCI/msi-howto.rst and I'm not sure how callers that
> supply PCI_IRQ_ALL_TYPES would know whether they got a single MSI
> vector (which requires bus mastering) or an INTx vector (which does
> not).

I wonder if there are drivers that deliberately delay pci_set_master()
until more stuff is set up.  I'd hope no device is broken enough that
it would matter, but 'maybe'?

In this particular case we don't run into the 'what type of interrupt'
question, as the spec requires MSI / MSI-X, but agreed that could be
a problem more generally.

I wonder how many types of EP actually have interrupts but no DMA though?
There are the memory buffers used for some types of P2P + the bridges you
mention.

> 
> > > So we get an auxiliary device for every instance of a DOE
> > > capability?  I think the commit log should mention something about
> > > how many are created (e.g., "one per DOE capability"), how they
> > > are named, whether they appear in sysfs, how drivers bind to them,
> > > etc.
> > > 
> > > I assume there needs to be some coordination between possible
> > > multiple users of a DOE capability?  How does that work?  
> > 
> > The DOE handling implementation makes everything synchronous - so
> > multiple users may each have to wait on queueing their query /
> > response exchanges.
> > 
> > The fun of non-OS software accessing these is still an open
> > question.
> 
> Sounds like something that potentially could be wrapped up in a safe
> but slow interface that could be usable by others, including lspci?

So one version of this patch set had a generic IOCTL interface, but
discussions around that (we also briefly touched on it at the plumbers
microconf) were heading in the direction of protocol specific interfaces.

If we put that back we'd need a more complex means of controlling access,
possibly only allowing the discovery protocols.
The problem is that some protocols have strict ordering requirements, so
exposing anything close to raw access could, for example, let userspace
break establishment of secure channels or collapse an existing secure channel.

A simple answer might be to have the DOE driver expose the discovered protocols
via sysfs.  I don't think they will change over time so simply caching them
at first load should be fine.

Jonathan

> 
> Bjorn


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/5] cxl/cdat: Parse out DSMAS data from CDAT table
  2021-11-05 23:50 ` [PATCH 5/5] cxl/cdat: Parse out DSMAS data from CDAT table ira.weiny
  2021-11-08 14:52   ` Jonathan Cameron
@ 2021-11-18 17:02   ` Jonathan Cameron
  2021-11-19 14:55   ` Jonathan Cameron
  2 siblings, 0 replies; 37+ messages in thread
From: Jonathan Cameron @ 2021-11-18 17:02 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Fri, 5 Nov 2021 16:50:56 -0700
<ira.weiny@intel.com> wrote:

> From: Ira Weiny <ira.weiny@intel.com>
> 
> Parse and cache the DSMAS data from the CDAT table.  Store this data in
> unmarshaled data structures for use later.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 

> +static bool cdat_hdr_valid(struct cxl_memdev *cxlmd)
> +{
> +	u32 *data = cxlmd->cdat_table;
> +	u8 *data8 = (u8 *)data;
> +	u32 length, seq;
> +	u8 rev, cs;
> +	u8 check;
> +	int i;
> +
> +	length = FIELD_GET(CDAT_HEADER_DW0_LENGTH, data[0]);
> +	if (length < CDAT_HEADER_LENGTH_BYTES)
> +		return false;
> +
> +	rev = FIELD_GET(CDAT_HEADER_DW1_REVISION, data[1]);
> +	cs = FIELD_GET(CDAT_HEADER_DW1_CHECKSUM, data[1]);
rev and cs are both parsed out here but never used...

W=1 is complaining at me, hence I noticed whilst rebasing this
series.

Jonathan

> +	seq = FIELD_GET(CDAT_HEADER_DW3_SEQUENCE, data[3]);
> +
> +	/* Store the sequence for now. */
> +	cxlmd->cdat_seq = seq;
> +
> +	for (check = 0, i = 0; i < length; i++)
> +		check += data8[i];
> +
> +	return check == 0;
> +}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver
  2021-11-10  5:45     ` Ira Weiny
@ 2021-11-18 18:48       ` Jonathan Cameron
  0 siblings, 0 replies; 37+ messages in thread
From: Jonathan Cameron @ 2021-11-18 18:48 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

....

Sorry for the delay, I managed to miss this email.
Only realized I hadn't replied because I wanted to point out the docs
issue I'd missed in the original review...  See below.

> > I was carrying a rework of this locally because I managed
> > to convince myself this is wrong.  It's been a while and naturally
> > I didn't write a comprehensive set of notes on why it was wrong...
> > (Note you can't trigger the problem paths in QEMU without some
> > nasty hacks as it relies on opening up race windows that make
> > limited sense for the QEMU implementation).
> > 
> > It's all centered on some details of exactly what causes an interrupt
> > on a DOE.  Section 6.xx.3 Interrupt Generation states:
> > 
> > If enabled, an interrupt message must be triggered every time the
> > logical AND of the following conditions transitions from FALSE to TRUE:
> > 
> > * The associated vector is unmasked ...
> > * The value of the DOE interrupt enable bit is 1b
> > * The value of the DOE interrupt status bit is 1b
> > (only the last one really matters to us, I think).
> > 
> > The interrupt status bit is an OR conditional.
> > 
> > Must be set.. Data Object Read bit or DOE error bit set or DOE busy bit cleared.
> >   
> > > +{
> > > +	struct pci_doe *doe = data;
> > > +	struct pci_dev *pdev = doe->doe_dev->pdev;
> > > +	int offset = doe->doe_dev->cap_offset;
> > > +	u32 val;
> > > +
> > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {  
> > 
> > So this bit is set on any of: BUSY dropped, READY or ERROR.
> > If it's set on BUSY drop, but then in between the read above and this clear
> > READY becomes true, then my reading is that we will not get another interrupt.
> > That is fine because we will read it again in the state machine and see the
> > new state. We could do more of the dance in the interrupt controller by doing
> > a reread after clear of INT_STATUS but I think it's cleaner to leave
> > it in the state machine.
> > 
> > It might look nicer here to only write BIT(1) - RW1C, but that doesn't matter as
> > all the rest of the register is RO.
> >   
> > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val);
> > > +		mod_delayed_work(system_wq, &doe->statemachine, 0);
> > > +		return IRQ_HANDLED;
> > > +	}
> > > +	/* Leave the error case to be handled outside IRQ */
> > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {  
> > 
> > I don't think we can get here because INT_STATUS is already true.
> > So we should do this before the above general check to avoid clearing
> > the interrupt (we don't want more interrupts during the abort though
> > I'd hope the hardware wouldn't generate them).
> > 
> > So move this before the previous check.
> >   
> > > +		mod_delayed_work(system_wq, &doe->statemachine, 0);
> > > +		return IRQ_HANDLED;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Busy being cleared can result in an interrupt, but as
> > > +	 * the original Busy may not have been detected, there is no
> > > +	 * way to separate such an interrupt from a spurious interrupt.
> > > +	 */  
> > 
> > This is misleading - as Busy bit clear would have resulted in INT_STATUS being true above
> > (that was a misread of the spec from me in v4).
> > So I don't think we can get here in any valid path.
> > 
> > return IRQ_NONE; should be safe.
> > 
> >   
> > > +	return IRQ_HANDLED;
> > > +}  
> > 
> > Summary of above suggested changes:
> > 1) Move the DOE_STATUS_ERROR block before the DOE_STATUS_INT_STATUS one
> > 2) Possibly uses
> >    pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, PCI_DOE_STATUS_INT_STATUS);
> >    to be explicit on the write one to clear bit.
> > 3) IRQ_NONE for the final return path as I'm fairly sure there is no valid route to that.
> >      
> 
> Done.
> 
> But just to ensure that I understand.  If STATUS_ERROR is indicated we are
> basically not clearing the irq because we are resetting the mailbox?  Because
> with this new code I don't see a pci_write_config_dword to clear INT_STATUS.

Exactly.

> 
> But if we are resetting the mailbox I think that is ok.
> 
> > ...
> >   
...

> > > +/**
> > > + * pci_doe_exchange_sync() - Send a request, then wait for and receive a response
> > > + * @doe: DOE mailbox state structure

This should be doe_dev - another thing I noticed during build tests of a
rebase, as I wanted to put the CMA stuff on top of this.


> > > + * @ex: Description of the buffers and Vendor ID + type used in this
> > > + *      request/response pair
> > > + *
> > > + * Excess data will be discarded.
> > > + *
> > > + * RETURNS: payload in bytes on success, < 0 on error
> > > + */
> > > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex)
> > > +{
> > > +	struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> > > +	struct pci_doe_task task;
> > > +	DECLARE_COMPLETION_ONSTACK(c);
> > > +
> > > +	if (!doe)
> > > +		return -EAGAIN;
> > > +
> > > +	/* DOE requests must be a whole number of DW */
> > > +	if (ex->request_pl_sz % sizeof(u32))
> > > +		return -EINVAL;
> > > +
> > > +	task.ex = ex;
> > > +	task.cb = pci_doe_task_complete;
> > > +	task.private = &c;
> > > +
> > > +again:  
> > 
> > Hmm.   Whether having this code at this layer makes sense hinges on
> > whether we want to easily support async use of the DOE in future.  
> 
> I struggled with this.  I was trying to strike a balance with making this a
> synchronous call with only 1 outstanding task while leaving the statemachine
> alone.
> 
> FWIW I think the queue you had was just fine even though there was only this
> synchronous call.

We can put it back easily if we ever need it. Until then this is fine.

I'm not convinced any DOE use will be sufficiently high bandwidth that
it really matters if we support async accessors.


> 
> > 
> > In v4 some of the async handling had ended up in this function and
> > should probably have been factored out to give us a 
> > 'queue up work' then 'wait for completion' sequence.
> > 
> > Given there is now more to be done in here perhaps we need to think
> > about such a separation to keep it clear that this is fundamentally
> > a synchronous wrapper around an asynchronous operation.  
> 
> I think that would be moving back in a direction of having a queue like you
> defined in V4.  Eliminating the queue really reduced this function to
> sleeping until the state machine is available.  Doing anything more would have
> messed with the state machine you wrote and I did not want to do that.
> 
> Dan should we move back to having a queue_task/wait_task like Jonathan had
> before?
> 
> >   
> > > +	mutex_lock(&doe->state_lock);
> > > +	if (doe->cur_task) {
> > > +		mutex_unlock(&doe->state_lock);
> > > +		wait_event_interruptible(doe->wq, doe->cur_task == NULL);
> > > +		goto again;
> > > +	}
> > > +
> > > +	if (doe->dead) {
> > > +		mutex_unlock(&doe->state_lock);
> > > +		return -EIO;
> > > +	}
> > > +	doe->cur_task = &task;
> > > +	schedule_delayed_work(&doe->statemachine, 0);
> > > +	mutex_unlock(&doe->state_lock);
> > > +
> > > +	wait_for_completion(&c);
> > > +
> > > +	return task.rv;
> > > +}
> > > +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync);

> > ...
> >   
> > > +
> > > +static void pci_doe_unregister(struct pci_doe *doe)
> > > +{
> > > +	pci_doe_release_irq(doe);
> > > +	kfree(doe->irq_name);
> > > +	put_device(&doe->doe_dev->pdev->dev);  
> > 
> > This makes me wonder if we should be doing the get_device()
> > earlier in probe?  Limited harm in moving it to near the start
> > and then ending up with it being 'obviously' correct...  
> 
> Well...  get_device() is in pci_doe_register...  And it does its own irq
> unwinding.
> 
> I guess we could call pci_doe_unregister() from that if we refactored this...
> 
> How about this?  (Diff to this code)

While it should work, I think I'd keep the error handling paths explicit
and not rely on irq_name == NULL and doe->irq == 0 making the error path
calls of pci_doe_unregister() safe.  That takes some thought when reviewing
(a little bit) whereas explicit error handling doesn't take as much.

I just don't like unregister having put_device() last when it isn't
the first thing done in register().  So moving that first perhaps gets
a reference before we strictly speaking need one, but it makes it clear
we can definitely release that reference where it's done in unregister.


> 
> diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> index 76acf4063b6b..6f2a419b3c93 100644
> --- a/drivers/pci/doe.c
> +++ b/drivers/pci/doe.c
> @@ -545,10 +545,12 @@ static int pci_doe_abort(struct pci_doe *doe)
>         return 0;
>  }
>  
> -static void pci_doe_release_irq(struct pci_doe *doe)
> +static void pci_doe_unregister(struct pci_doe *doe)
>  {
>         if (doe->irq > 0)
>                 free_irq(doe->irq, doe);
> +       kfree(doe->irq_name);
> +       put_device(&doe->doe_dev->pdev->dev);
>  }
>  
>  static int pci_doe_register(struct pci_doe *doe)
> @@ -559,21 +561,28 @@ static int pci_doe_register(struct pci_doe *doe)
>         int rc, irq;
>         u32 val;
>  
> +       /* Ensure the pci device remains until this driver is done with it */
> +       get_device(&pdev->dev);
> +
>         pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
>  
>         if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) {
>                 irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val));
> -               if (irq < 0)
> -                       return irq;
> +               if (irq < 0) {
> +                       rc = irq;
> +                       goto unregister;
> +               }
>  
>                 doe->irq_name = kasprintf(GFP_KERNEL, "DOE[%s]",
>                                           doe->doe_dev->adev.name);
> -               if (!doe->irq_name)
> -                       return -ENOMEM;
> +               if (!doe->irq_name) {
> +                       rc = -ENOMEM;
> +                       goto unregister;
> +               }
>  
>                 rc = request_irq(irq, pci_doe_irq, 0, doe->irq_name, doe);
>                 if (rc)
> -                       goto err_free_name;
> +                       goto unregister;
>  
>                 doe->irq = irq;
>                 pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
> @@ -583,27 +592,15 @@ static int pci_doe_register(struct pci_doe *doe)
>         /* Reset the mailbox by issuing an abort */
>         rc = pci_doe_abort(doe);
>         if (rc)
> -               goto err_free_irq;
> -
> -       /* Ensure the pci device remains until this driver is done with it */
> -       get_device(&pdev->dev);
> +               goto unregister;
>  
>         return 0;
>  
> -err_free_irq:
> -       pci_doe_release_irq(doe);
> -err_free_name:
> -       kfree(doe->irq_name);
> +unregister:
> +       pci_doe_unregister(doe);
>         return rc;
>  }
>  
> -static void pci_doe_unregister(struct pci_doe *doe)
> -{
> -       pci_doe_release_irq(doe);
> -       kfree(doe->irq_name);
> -       put_device(&doe->doe_dev->pdev->dev);
> -}
> -
>  /*
>   * pci_doe_probe() - Set up the Mailbox
>   * @aux_dev: Auxiliary Device
> 
> 
> >   
> > > +}
> > > +
> > > +/*
> > > + * pci_doe_probe() - Set up the Mailbox
> > > + * @aux_dev: Auxiliary Device
> > > + * @id: Auxiliary device ID
> > > + *
> > > + * Probe the mailbox found for all protocols and set up the Mailbox
> > > + *
> > > + * RETURNS: 0 on success, < 0 on error
> > > + */
> > > +static int pci_doe_probe(struct auxiliary_device *aux_dev,
> > > +			 const struct auxiliary_device_id *id)
> > > +{
> > > +	struct pci_doe_dev *doe_dev = container_of(aux_dev,
> > > +					struct pci_doe_dev,
> > > +					adev);
> > > +	struct pci_doe *doe;
> > > +	int rc;
> > > +
> > > +	doe = kzalloc(sizeof(*doe), GFP_KERNEL);  
> > 
> > Could go devm_ for this I think, though may not be worthwhile.  
> 
> Yes I think it is worth it...  I should use it more.
> 
> BTW why did you not use devm_krealloc() for the protocols?

I have a sneaky suspicion this has been around long enough that it
predates devm_krealloc.

> 
> I did not realize that call existed before you mentioned it in the other patch
> review.

It is rather new and shiny :)

> 
> Any issue with using it there?

Should be fine I think.

> 
> >   
> > > +	if (!doe)
> > > +		return -ENOMEM;
> > > +
> > > +	mutex_init(&doe->state_lock);
> > > +	init_completion(&doe->abort_c);
> > > +	doe->doe_dev = doe_dev;
> > > +	init_waitqueue_head(&doe->wq);
> > > +	INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work);
> > > +	dev_set_drvdata(&aux_dev->dev, doe);
> > > +
> > > +	rc = pci_doe_register(doe);
> > > +	if (rc)
> > > +		goto err_free;
> > > +
> > > +	rc = pci_doe_cache_protocols(doe);
> > > +	if (rc) {
> > > +		pci_doe_unregister(doe);  
> > 
> > Mixture of different forms of error handling here.
> > I'd move this below and add an err_unregister label.  
> 
> Actually with the devm_kzalloc() we don't need the goto at all.  We can just
> return.  I _think_?  Right?

yes

> 
> >   
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	return 0;
> > > +
> > > +err_free:
> > > +	kfree(doe);
> > > +	return rc;
> > > +}
> > > +
> > > +static void pci_doe_remove(struct auxiliary_device *aux_dev)
> > > +{
> > > +	struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev);
> > > +
> > > +	/* First halt the state machine */
> > > +	cancel_delayed_work_sync(&doe->statemachine);
> > > +	kfree(doe->prots);  
> > 
> > Logical flow to me is unregister first, free protocols second
> > (to reverse what we do in probe)  
> 
> No this is the reverse of the probe order I think.
> 
> Order is
> 	register
> 	cache protocols
> 
> Then we
> 	free 'uncache' protocols
> 	unregister
> 
> Right?

Huh. No idea what I was going on about.

> 
> >   
> > > +	pci_doe_unregister(doe);
> > > +	kfree(doe);
> > > +}
> > > +
> > > +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = {
> > > +	{.name = "cxl_pci.doe", },  
> > 
> > I'd like to hear from Bjorn on whether registering this from the CXL
> > device is the right approach or if we should perhaps just do it directly from
> > somewhere in PCI. (really applies to patch 3) I'll talk more about this there.  
> 
> Actually I think this could be left blank until the next patch...  It's just
> odd to define an empty table in the next few structures.  But technically this
> is not needed until the devices are defined.
> 
> I'm ok waiting to see what Bjorn thinks regarding the CXL vs PCI placement
> though.
> 
> >   
> > > +	{},
> > > +};
> > > +
> > > +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table);
> > > +
> > > +struct auxiliary_driver pci_doe_auxiliary_drv = {
> > > +	.name = "pci_doe_drv",  
> > 
> > I would assume this is only used in contexts where the _drv is
> > obvious?  I would go with "pci_doe".  
> 
> Sure. done.
> 
> >   
> > > +	.id_table = pci_doe_auxiliary_id_table,
> > > +	.probe = pci_doe_probe,
> > > +	.remove = pci_doe_remove
> > > +};
> > > +
> > > +static int __init pci_doe_init_module(void)
> > > +{
> > > +	int ret;
> > > +
> > > +	ret = auxiliary_driver_register(&pci_doe_auxiliary_drv);
> > > +	if (ret) {
> > > +		pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n",
> > > +		       ret);
> > > +		return ret;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void __exit pci_doe_exit_module(void)
> > > +{
> > > +	auxiliary_driver_unregister(&pci_doe_auxiliary_drv);
> > > +}
> > > +
> > > +module_init(pci_doe_init_module);
> > > +module_exit(pci_doe_exit_module);  
> > 
> > Seems like the auxiliary bus would benefit from a
> > module_auxiliary_driver() macro to cover this simple registration stuff
> > similar to module_i2c_driver() etc.
> > 
> > Mind you, looking at 5.15 this would be the only user, so maybe one
> > for the 'next' case on basis two instances proves it's 'common' ;)  
> 
> I'm inclined to leave this alone ATM.  I tried to clean up the auxiliary device
> documentation and got a bunch more work asked of me by Greg KH.  So I'm behind
> on that ATM.
> 
> Later we can investigate that a bit I think.

Sure. No rush on that.

> 
> >   
> > > +MODULE_LICENSE("GPL v2");
> > > diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
> > > new file mode 100644
> > > index 000000000000..8380b7ad33d4
> > > --- /dev/null
> > > +++ b/include/linux/pci-doe.h
> > > @@ -0,0 +1,63 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +/*
> > > + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec.
> > > + *
> > > + * Copyright (C) 2021 Huawei
> > > + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > + */
> > > +
> > > +#include <linux/completion.h>
> > > +#include <linux/list.h>
> > > +#include <linux/mutex.h>  
> > 
> > Not used in this header that I can see, so push down to the c files.  
> 
> oops...  thanks.
> 
> >   
> > > +#include <linux/auxiliary_bus.h>
> > > +
> > > +#ifndef LINUX_PCI_DOE_H
> > > +#define LINUX_PCI_DOE_H
> > > +
> > > +#define DOE_DEV_NAME "doe"  
> > 
> > Not sure this is used?  
> 
> Used in the next patch...  and it kind of goes along with the table_id name...
> 
> I'll see about moving both of those to the next patch where it makes more sense
> for now.
> 
> Thanks for the review,
> Ira
> 
Thanks,

Jonathan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices
  2021-11-17 22:15       ` Bjorn Helgaas
  2021-11-18 10:51         ` Jonathan Cameron
@ 2021-11-19  6:48         ` Christoph Hellwig
  2021-11-29 23:37           ` Dan Williams
  1 sibling, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2021-11-19  6:48 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jonathan Cameron, ira.weiny, Dan Williams, Alison Schofield,
	Vishal Verma, Ben Widawsky, Bjorn Helgaas, linux-cxl, linux-pci,
	Christoph Hellwig, Thomas Gleixner

On Wed, Nov 17, 2021 at 04:15:36PM -0600, Bjorn Helgaas wrote:
> > Agreed though how it all gets tied together isn't totally clear
> > to me yet. The messy bit is interrupts given I don't think we have
> > a model for enabling those anywhere other than in individual PCI drivers.
> 
> Ah.  Yeah, that is a little messy.  The only real precedent where the
> PCI core and a driver might need to coordinate on interrupts is the
> portdrv.  So far we've pretended that bridges do not have
> device-specific functionality that might require interrupts.  I don't
> think that's actually true, but we haven't integrated drivers for the
> tuning, performance monitoring, and similar features that bridges may
> have.  Yet.

And portdrv really is conceptually part of the PCI core, and should
eventually be fully integrated...

> In any case, I think the argument that DOE capabilities are not
> CXL-specific still holds.

Agreed.

> Oh, right, of course.  A hint here that MSI/MSI-X depends on bus
> mastering would save me the trouble.
> 
> I wonder if the infrastructure, e.g., something inside
> pci_alloc_irq_vectors_affinity() should do this for us.  The
> connection is "obvious" but not mentioned in
> Documentation/PCI/msi-howto.rst and I'm not sure how callers that
> supply PCI_IRQ_ALL_TYPES would know whether they got a single MSI
> vector (which requires bus mastering) or an INTx vector (which does
> not).

As a minimum step we should document this.  That being said,
I don't think we can just make the interrupt API call pci_set_master()
as there might be strange ordering requirements in the drivers.

> > > So we get an auxiliary device for every instance of a DOE
> > > capability?  I think the commit log should mention something about
> > > how many are created (e.g., "one per DOE capability"), how they
> > > are named, whether they appear in sysfs, how drivers bind to them,
> > > etc.
> > > 
> > > I assume there needs to be some coordination between possible
> > > multiple users of a DOE capability?  How does that work?
> > 
> > The DOE handling implementation makes everything synchronous - so
> > multiple users may each have to wait on queueing their query /
> > response exchanges.
> > 
> > The fun of non-OS software accessing these is still an open
> > question.
> 
> Sounds like something that potentially could be wrapped up in a safe
> but slow interface that could be usable by others, including lspci?

I guess we have to.  I think this interface is a nightmare.  Why oh why
does the PCI SIG keep doing these stupid things (see also VPD).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/5] cxl/mem: Add CDAT table reading from DOE
  2021-11-05 23:50 ` [PATCH 4/5] cxl/mem: Add CDAT table reading from DOE ira.weiny
  2021-11-08 13:21   ` Jonathan Cameron
  2021-11-08 15:02   ` Jonathan Cameron
@ 2021-11-19 14:40   ` Jonathan Cameron
  2 siblings, 0 replies; 37+ messages in thread
From: Jonathan Cameron @ 2021-11-19 14:40 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Fri, 5 Nov 2021 16:50:55 -0700
<ira.weiny@intel.com> wrote:

> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Read CDAT raw table data from the cxl_mem state object.  Currently this
> is only supported by a PCI CXL object through a DOE mailbox which supports
> CDAT.  But any cxl_mem type object can provide this data later if need
> be.  For example for testing.
> 
> Cache this data for later parsing.  Provide a sysfs binary attribute to
> allow dumping of the CDAT.
> 
> Binary dumping is modeled on /sys/firmware/ACPI/tables/
> 
> The ability to dump this table will be very useful for emulation of real
> devices once they become available as QEMU CXL type 3 device emulation will
> be able to load this file in.
> 
> This does not support table updates at runtime. It will always provide
> whatever was there when first cached. Handling of table updates can be
> implemented later.
> 
> Once there are more users, this code can move out to driver/cxl/cdat.c
> or similar.
> 
> Finally create a complete list of DOE defines within cdat.h for anyone
> wishing to decode the CDAT table.
> 
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 

>  
>  static struct attribute_group cxl_memdev_ram_attribute_group = {
> @@ -293,6 +329,16 @@ devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
>  	if (rc)
>  		goto err;
>  
> +	/* Cache the data early to ensure is_visible() works */
> +	if (!cxl_mem_cdat_get_length(cxlds, &cxlmd->cdat_length)) {
> +		cxlmd->cdat_table = devm_kzalloc(dev, cxlmd->cdat_length, GFP_KERNEL);

I think this devm_ call should be using the parent device, not this one.

As it stands it breaks Ben's mem.c driver, which probes for this device
and fails because you can't call probe for a device that already has things
in its devres queue.

Too many patches in flight at the same time make for some entertaining
rebases if you want them all at once.

Jonathan

> +		if (!cxlmd->cdat_table) {
> +			rc = -ENOMEM;
> +			goto err;
> +		}
> +		cxl_mem_cdat_read_table(cxlds, cxlmd->cdat_table, cxlmd->cdat_length);
> +	}
> +
>  	/*
>  	 * Activate ioctl operations, no cxl_memdev_rwsem manipulation
>  	 * needed as this is ordered with cdev_add() publishing the device.


* Re: [PATCH 5/5] cxl/cdat: Parse out DSMAS data from CDAT table
  2021-11-05 23:50 ` [PATCH 5/5] cxl/cdat: Parse out DSMAS data from CDAT table ira.weiny
  2021-11-08 14:52   ` Jonathan Cameron
  2021-11-18 17:02   ` Jonathan Cameron
@ 2021-11-19 14:55   ` Jonathan Cameron
  2 siblings, 0 replies; 37+ messages in thread
From: Jonathan Cameron @ 2021-11-19 14:55 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Alison Schofield, Vishal Verma, Ben Widawsky,
	Bjorn Helgaas, linux-cxl, linux-pci

On Fri, 5 Nov 2021 16:50:56 -0700
<ira.weiny@intel.com> wrote:

> From: Ira Weiny <ira.weiny@intel.com>
> 
> Parse and cache the DSMAS data from the CDAT table.  Store this data in
> Unmarshaled data structures for use later.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

More fun from clashing patch sets below.
I think this is wrong rather than the other patch, but I'm prepared to
be persuaded otherwise!

Ben, this is related to your mega RFC for regions etc.

Jonathan


> +static int parse_dsmas(struct cxl_memdev *cxlmd)
> +{
> +	struct cxl_dsmas *dsmas_ary = NULL;
> +	u32 *data = cxlmd->cdat_table;
> +	int bytes_left = cxlmd->cdat_length;
> +	int nr_dsmas = 0;
> +	size_t dsmas_byte_size;
> +	int rc = 0;
> +
> +	if (!data || !cdat_hdr_valid(cxlmd))
> +		return -ENXIO;
> +
> +	/* Skip header */
> +	data += CDAT_HEADER_LENGTH_DW;
> +	bytes_left -= CDAT_HEADER_LENGTH_BYTES;
> +
> +	while (bytes_left > 0) {
> +		u32 *cur_rec = data;
> +		u8 type = FIELD_GET(CDAT_STRUCTURE_DW0_TYPE, cur_rec[0]);
> +		u16 length = FIELD_GET(CDAT_STRUCTURE_DW0_LENGTH, cur_rec[0]);
> +
> +		if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {
> +			struct cxl_dsmas *new_ary;
> +			u8 flags;
> +
> +			new_ary = krealloc(dsmas_ary,
> +					   sizeof(*dsmas_ary) * (nr_dsmas+1),
> +					   GFP_KERNEL);
> +			if (!new_ary) {
> +				dev_err(&cxlmd->dev,
> +					"Failed to allocate memory for DSMAS data\n");
> +				rc = -ENOMEM;
> +				goto free_dsmas;
> +			}
> +			dsmas_ary = new_ary;
> +
> +			flags = FIELD_GET(CDAT_DSMAS_DW1_FLAGS, cur_rec[1]);
> +
> +			dsmas_ary[nr_dsmas].dpa_base = CDAT_DSMAS_DPA_OFFSET(cur_rec);
> +			dsmas_ary[nr_dsmas].dpa_length = CDAT_DSMAS_DPA_LEN(cur_rec);
> +			dsmas_ary[nr_dsmas].non_volatile = CDAT_DSMAS_NON_VOLATILE(flags);
> +
> +			dev_dbg(&cxlmd->dev, "DSMAS %d: %llx:%llx %s\n",
> +				nr_dsmas,
> +				dsmas_ary[nr_dsmas].dpa_base,
> +				dsmas_ary[nr_dsmas].dpa_base +
> +					dsmas_ary[nr_dsmas].dpa_length,
> +				(dsmas_ary[nr_dsmas].non_volatile ?
> +					"Persistent" : "Volatile")
> +				);
> +
> +			nr_dsmas++;
> +		}
> +
> +		data += (length/sizeof(u32));
> +		bytes_left -= length;
> +	}
> +
> +	if (nr_dsmas == 0) {
> +		rc = -ENXIO;
> +		goto free_dsmas;
> +	}
> +
> +	dev_dbg(&cxlmd->dev, "Found %d DSMAS entries\n", nr_dsmas);
> +
> +	dsmas_byte_size = sizeof(*dsmas_ary) * nr_dsmas;
> +	cxlmd->dsmas_ary = devm_kzalloc(&cxlmd->dev, dsmas_byte_size, GFP_KERNEL);

Here is another place where we need to hang this off cxlds->dev rather than this
one to avoid breaking Ben's code.


> +	if (!cxlmd->dsmas_ary) {
> +		rc = -ENOMEM;
> +		goto free_dsmas;
> +	}
> +
> +	memcpy(cxlmd->dsmas_ary, dsmas_ary, dsmas_byte_size);
> +	cxlmd->nr_dsmas = nr_dsmas;
> +
> +free_dsmas:
> +	kfree(dsmas_ary);
> +	return rc;
> +}
> +


* Re: [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices
  2021-11-19  6:48         ` Christoph Hellwig
@ 2021-11-29 23:37           ` Dan Williams
  2021-11-29 23:59             ` Dan Williams
  0 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2021-11-29 23:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bjorn Helgaas, Jonathan Cameron, Weiny, Ira, Alison Schofield,
	Vishal Verma, Ben Widawsky, Bjorn Helgaas, linux-cxl, Linux PCI,
	Thomas Gleixner

On Thu, Nov 18, 2021 at 10:48 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Nov 17, 2021 at 04:15:36PM -0600, Bjorn Helgaas wrote:
> > > Agreed though how it all gets tied together isn't totally clear
> > > to me yet. The messy bit is interrupts given I don't think we have
> > > a model for enabling those anywhere other than in individual PCI drivers.
> >
> > Ah.  Yeah, that is a little messy.  The only real precedent where the
> > PCI core and a driver might need to coordinate on interrupts is the
> > portdrv.  So far we've pretended that bridges do not have
> > device-specific functionality that might require interrupts.  I don't
> > think that's actually true, but we haven't integrated drivers for the
> > tuning, performance monitoring, and similar features that bridges may
> > have.  Yet.
>
> And portdrv really is conceptually part of the core PCI core, and
> should eventually be fully integrated..

What does a fully integrated portdrv look like? DOE enabling could
follow a similar model.

>
> > In any case, I think the argument that DOE capabilities are not
> > CXL-specific still holds.
>
> Agreed.

I don't think anyone is arguing that DOE is something CXL specific.
The enabling belongs only in drivers/pci/ as a DOE core, and then that
core is referenced by any other random PCI driver that needs to
interact with a DOE.

The question is what does that DOE core look like? A Linux device
representing the DOE capability and a common driver for the
data-transfer seems a reasonable abstraction to me and that's what
Auxiliary Bus offers.


* Re: [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices
  2021-11-29 23:37           ` Dan Williams
@ 2021-11-29 23:59             ` Dan Williams
  2021-11-30  6:42               ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2021-11-29 23:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bjorn Helgaas, Jonathan Cameron, Weiny, Ira, Alison Schofield,
	Vishal Verma, Ben Widawsky, Bjorn Helgaas, linux-cxl, Linux PCI,
	Thomas Gleixner

On Mon, Nov 29, 2021 at 3:37 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Thu, Nov 18, 2021 at 10:48 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Wed, Nov 17, 2021 at 04:15:36PM -0600, Bjorn Helgaas wrote:
> > > > Agreed though how it all gets tied together isn't totally clear
> > > > to me yet. The messy bit is interrupts given I don't think we have
> > > > a model for enabling those anywhere other than in individual PCI drivers.
> > >
> > > Ah.  Yeah, that is a little messy.  The only real precedent where the
> > > PCI core and a driver might need to coordinate on interrupts is the
> > > portdrv.  So far we've pretended that bridges do not have
> > > device-specific functionality that might require interrupts.  I don't
> > > think that's actually true, but we haven't integrated drivers for the
> > > tuning, performance monitoring, and similar features that bridges may
> > > have.  Yet.
> >
> > And portdrv really is conceptually part of the core PCI core, and
> > should eventually be fully integrated..
>
> What does a fully integrated portdrv look like? DOE enabling could
> follow a similar model.
>
> >
> > > In any case, I think the argument that DOE capabilities are not
> > > CXL-specific still holds.
> >
> > Agreed.
>
> I don't think anyone is arguing that DOE is something CXL specific.
> The enabling belongs only in drivers/pci/ as a DOE core, and then that
> core is referenced by any other random PCI driver that needs to
> interact with a DOE.
>
> The question is what does that DOE core look like? A Linux device
> representing the DOE capability and a common driver for the
> data-transfer seems a reasonable abstraction to me and that's what
> Auxiliary Bus offers.

I will also add that this is not just an argument to use a
device+driver organization for its own sake, it also allows for an
idiomatic ABI for determining when the kernel is using a device
capability vs when userspace is using it. For example,
IO_STRICT_DEVMEM currently locks out userspace MMIO when a kernel
driver is attached to a device. I see a need for a similar policy to
lock out userspace from configuration writes to the DOE while the
kernel is using the DOE. If userspace wants / needs access returned to
it then it can force unbind just that one DOE aux-driver rather than
unbind the driver for the entire device if DOE was just a library that
all drivers linked against.

DOE negotiates security features like SPDM and IDE. I think it is
important for the kernel to be able to control access to DOE instances
even though it has not cared about protecting itself from userspace
initiated configuration writes in the past.


* Re: [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices
  2021-11-29 23:59             ` Dan Williams
@ 2021-11-30  6:42               ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2021-11-30  6:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Bjorn Helgaas, Jonathan Cameron, Weiny, Ira,
	Alison Schofield, Vishal Verma, Ben Widawsky, Bjorn Helgaas,
	linux-cxl, Linux PCI, Thomas Gleixner

On Mon, Nov 29, 2021 at 03:59:25PM -0800, Dan Williams wrote:
> DOE negotiates security features like SPDM and IDE. I think it is
> important for the kernel to be able to control access to DOE instances
> even though it has not cared about protecting itself from userspace
> initiated configuration writes in the past.

I think DOE is pretty much a kernel-only feature and we can't allow
userspace access to it at all.


* Re: [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver
  2021-11-16 23:48   ` Bjorn Helgaas
@ 2021-12-03 20:48     ` Dan Williams
  2021-12-03 23:56       ` Bjorn Helgaas
  0 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2021-12-03 20:48 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Weiny, Ira, Jonathan Cameron, Alison Schofield, Vishal Verma,
	Ben Widawsky, Bjorn Helgaas, linux-cxl, Linux PCI

On Tue, Nov 16, 2021 at 3:48 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote:
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >
> > Introduced in a PCI ECN [1], DOE provides a config space based mailbox
> > with standard protocol discovery.  Each mailbox is accessed through a
> > DOE Extended Capability.
> >
> > Define an auxiliary device driver which control DOE auxiliary devices
> > registered on the auxiliary bus.
>
> What do we gain by making this an auxiliary driver?
>
> This doesn't really feel like a "driver," and apparently it used to be
> a library.  I'd like to see the rationale and benefits of the driver
> approach (in the eventual commit log as well as the current email
> thread).
>

I asked Ira to use the auxiliary bus for DOE primarily for the ABI it
offers for userspace to manage kernel vs userspace access to a device.
CONFIG_IO_STRICT_DEVMEM set the precedent that userspace can not
clobber mmio space that is actively claimed by a kernel driver. I
submit that DOE merits the same protection for DOE instances that the
kernel consumes.

Unlike other PCI configuration registers that root userspace has no
reason to touch unless it wants to actively break things, DOE is a
mechanism that root userspace may need to access directly in some
cases. There are a few examples that come to mind.

CXL Compliance Testing (see CXL 2.0 14.16.4 Compliance Mode DOE)
offers a mechanism to set different test modes for a DOE device. The
kernel has no reason to ever use that interface, and it has strong
reasons to want to block access to it in production. However, hardware
vendors also use debug Linux builds for hardware bringup. So I would
like to be able to say that the mechanism to gain access to the
compliance DOE is to detach the aux DOE driver from the right aux DOE
device. Could we build a custom way to do the same for the DOE
library, sure, but why re-invent the wheel when udev and the driver
model can handle this type of policy question already?

Another use case is SPDM where an agent can establish a secure message
passing channel to a device, or paravirtualized device to exchange
protected messages with the hypervisor. My expectation is that in
addition to the kernel establishing SPDM sessions for PCI IDE and
CXL.cachemem IDE (link Integrity and Data Encryption) there will be
use cases for root userspace to establish their own SPDM session. In
that scenario as well the kernel can be told to give up control of a
specific DOE instance by detaching the aux device for its driver, but
otherwise the kernel driver can be assured that userspace will not
clobber its communications with its own attempts to talk over the DOE.

Lastly, and perhaps this is minor, the PCI core is a built-in object
and aux-bus allows for breaking out device library functionality like
this into a dedicated module. But yes, that's not a good reason unto
itself because you could "auxify" almost anything past the point of
reason just to get more modules.

> > A DOE mailbox is allowed to support any number of protocols while some
> > DOE protocol specifications apply additional restrictions.
>
> This sounds something like a fancy version of VPD, and VPD has been a
> huge headache.  I hope DOE avoids that ;)

Please say a bit more, I think DOE is a rather large headache as
evidenced by us fine folks grappling with how to architect the Linux
enabling.


* Re: [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver
  2021-12-03 20:48     ` Dan Williams
@ 2021-12-03 23:56       ` Bjorn Helgaas
  2021-12-04 15:47         ` Dan Williams
  0 siblings, 1 reply; 37+ messages in thread
From: Bjorn Helgaas @ 2021-12-03 23:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Weiny, Ira, Jonathan Cameron, Alison Schofield, Vishal Verma,
	Ben Widawsky, Bjorn Helgaas, linux-cxl, Linux PCI

On Fri, Dec 03, 2021 at 12:48:18PM -0800, Dan Williams wrote:
> On Tue, Nov 16, 2021 at 3:48 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote:
> > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > >
> > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox
> > > with standard protocol discovery.  Each mailbox is accessed through a
> > > DOE Extended Capability.
> > >
> > > Define an auxiliary device driver which control DOE auxiliary devices
> > > registered on the auxiliary bus.
> >
> > What do we gain by making this an auxiliary driver?
> >
> > This doesn't really feel like a "driver," and apparently it used to be
> > a library.  I'd like to see the rationale and benefits of the driver
> > approach (in the eventual commit log as well as the current email
> > thread).
> 
> I asked Ira to use the auxiliary bus for DOE primarily for the ABI it
> offers for userspace to manage kernel vs userspace access to a device.
> CONFIG_IO_STRICT_DEVMEM set the precedent that userspace can not
> clobber mmio space that is actively claimed by a kernel driver. I
> submit that DOE merits the same protection for DOE instances that the
> kernel consumes.
>
> Unlike other PCI configuration registers that root userspace has no
> reason to touch unless it wants to actively break things, DOE is a
> mechanism that root userspace may need to access directly in some
> cases. There are a few examples that come to mind.

It's useful for root to read/write config registers with setpci, e.g.,
to test ASPM configuration, test power management behavior, etc.  That
can certainly break things and interfere with kernel access (and IMO
should taint the kernel) but we have so far accepted that risk.  I
think the same will be true for DOE.

In addition, I would think you might want a safe userspace interface
via sysfs, e.g., something like the "vpd" file, but I missed that if
it was in this series.

> CXL Compliance Testing (see CXL 2.0 14.16.4 Compliance Mode DOE)
> offers a mechanism to set different test modes for a DOE device. The
> kernel has no reason to ever use that interface, and it has strong
> reasons to want to block access to it in production. However, hardware
> vendors also use debug Linux builds for hardware bringup. So I would
> like to be able to say that the mechanism to gain access to the
> compliance DOE is to detach the aux DOE driver from the right aux DOE
> device. Could we build a custom way to do the same for the DOE
> library, sure, but why re-invent the wheel when udev and the driver
> model can handle this type of policy question already?
> 
> Another use case is SPDM where an agent can establish a secure message
> passing channel to a device, or paravirtualized device to exchange
> protected messages with the hypervisor. My expectation is that in
> addition to the kernel establishing SPDM sessions for PCI IDE and
> CXL.cachemem IDE (link Integrity and Data Encryption) there will be
> use cases for root userspace to establish their own SPDM session. In
> that scenario as well the kernel can be told to give up control of a
> specific DOE instance by detaching the aux device for its driver, but
> otherwise the kernel driver can be assured that userspace will not
> clobber its communications with its own attempts to talk over the DOE.

I assume the kernel needs to control access to DOE in all cases,
doesn't it?  For example, DOE can generate interrupts, and only the
kernel can field them.  Maybe if I saw the userspace interface this
would make more sense to me.  I'm hoping there's a reasonable "send
this query and give me the response" primitive that can be implemented
in the kernel, used by drivers, and exposed safely to userspace.

> Lastly, and perhaps this is minor, the PCI core is a built-in object
> and aux-bus allows for breaking out device library functionality like
> this into a dedicated module. But yes, that's not a good reason unto
> itself because you could "auxify" almost anything past the point of
> reason just to get more modules.
> 
> > > A DOE mailbox is allowed to support any number of protocols while some
> > > DOE protocol specifications apply additional restrictions.
> >
> > This sounds something like a fancy version of VPD, and VPD has been a
> > huge headache.  I hope DOE avoids that ;)
> 
> Please say a bit more, I think DOE is a rather large headache as
> evidenced by us fine folks grappling with how to architect the Linux
> enabling.

VPD is not widely used, so gets poor testing.  The device doesn't tell
us how much VPD it has, so we have to read until the end.  The data is
theoretically self-describing (series of elements, each containing a
type and size), but of course some devices don't format it correctly,
so we read to the absolute limit, which takes a long time.

Typical contents of unimplemented or uninitialized VPD are all 0x00 or
all 0xff, neither of which is defined as a valid "end of VPD" signal,
so we have hacky "this doesn't look like VPD" code.

The PCI core doesn't need VPD itself and should only need to look for
the size of each element and the "end" tag, but it works around some
issues by doing more interpretation of the data.  The spec is a little
ambiguous and leaves room for vendors to use types not mentioned in
the spec.

Some devices share VPD hardware across functions but don't protect it
correctly (a hardware defect, granted).

VPD has address/data registers, so it requires locking, polling for
completion, and timeouts.

Bjorn


* Re: [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver
  2021-12-03 23:56       ` Bjorn Helgaas
@ 2021-12-04 15:47         ` Dan Williams
  2021-12-06 12:27           ` Jonathan Cameron
  0 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2021-12-04 15:47 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Weiny, Ira, Jonathan Cameron, Alison Schofield, Vishal Verma,
	Ben Widawsky, Bjorn Helgaas, linux-cxl, Linux PCI

On Fri, Dec 3, 2021 at 3:56 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Fri, Dec 03, 2021 at 12:48:18PM -0800, Dan Williams wrote:
> > On Tue, Nov 16, 2021 at 3:48 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote:
> > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > >
> > > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox
> > > > with standard protocol discovery.  Each mailbox is accessed through a
> > > > DOE Extended Capability.
> > > >
> > > > Define an auxiliary device driver which control DOE auxiliary devices
> > > > registered on the auxiliary bus.
> > >
> > > What do we gain by making this an auxiliary driver?
> > >
> > > This doesn't really feel like a "driver," and apparently it used to be
> > > a library.  I'd like to see the rationale and benefits of the driver
> > > approach (in the eventual commit log as well as the current email
> > > thread).
> >
> > I asked Ira to use the auxiliary bus for DOE primarily for the ABI it
> > offers for userspace to manage kernel vs userspace access to a device.
> > CONFIG_IO_STRICT_DEVMEM set the precedent that userspace can not
> > clobber mmio space that is actively claimed by a kernel driver. I
> > submit that DOE merits the same protection for DOE instances that the
> > kernel consumes.
> >
> > Unlike other PCI configuration registers that root userspace has no
> > reason to touch unless it wants to actively break things, DOE is a
> > mechanism that root userspace may need to access directly in some
> > cases. There are a few examples that come to mind.
>
> It's useful for root to read/write config registers with setpci, e.g.,
> to test ASPM configuration, test power management behavior, etc.  That
> can certainly break things and interfere with kernel access (and IMO
> should taint the kernel) but we have so far accepted that risk.  I
> think the same will be true for DOE.

I think DOE is a demonstrable step worse than those examples and
pushes into the unacceptable risk category. It invites a communication
protocol with an unbounded range of side effects (especially when
controlling a device like a CXL memory expander that affects "System
RAM" directly). Part of what drives platform / device vendors to the
standards negotiation table is the OS encouraging common interfaces.
If Linux provides an unfettered DOE interface it reduces an incentive
for device vendors to collaborate with the kernel community.

I do like the taint proposal though, if I can't convince you that DOE
merits explicit root userspace lockout beyond
CONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY, I would settle for the kernel
warning loudly about DOE usage that falls outside the kernel's
expectations.

> In addition, I would think you might want a safe userspace interface
> via sysfs, e.g., something like the "vpd" file, but I missed that if
> it was in this series.

I do not think the kernel is well served by a generic userspace
passthrough for DOE for the same reason that there is no generic
passthrough for the ACPI _DSM facility. When the kernel becomes a
generic pipe to a vendor specific interface it impedes the kernel from
developing standard interfaces across vendors. Each vendor will ship
their own quirky feature and corresponding vendor-specific tool with
minimal incentive to coordinate with other vendors doing similar
things. At a minimum the userspace interface for DOE should be at a
level above the raw transport and be enabled per standardized /
published DOE protocol.  I.e. a userspace interface to read the CDAT
table retrieved over DOE, or a userspace interface to enumerate IDE
capabilities, etc.

> > CXL Compliance Testing (see CXL 2.0 14.16.4 Compliance Mode DOE)
> > offers a mechanism to set different test modes for a DOE device. The
> > kernel has no reason to ever use that interface, and it has strong
> > reasons to want to block access to it in production. However, hardware
> > vendors also use debug Linux builds for hardware bringup. So I would
> > like to be able to say that the mechanism to gain access to the
> > compliance DOE is to detach the aux DOE driver from the right aux DOE
> > device. Could we build a custom way to do the same for the DOE
> > library, sure, but why re-invent the wheel when udev and the driver
> > model can handle this type of policy question already?
> >
> > Another use case is SPDM where an agent can establish a secure message
> > passing channel to a device, or paravirtualized device to exchange
> > protected messages with the hypervisor. My expectation is that in
> > addition to the kernel establishing SPDM sessions for PCI IDE and
> > CXL.cachemem IDE (link Integrity and Data Encryption) there will be
> > use cases for root userspace to establish their own SPDM session. In
> > that scenario as well the kernel can be told to give up control of a
> > specific DOE instance by detaching the aux device for its driver, but
> > otherwise the kernel driver can be assured that userspace will not
> > clobber its communications with its own attempts to talk over the DOE.
>
> I assume the kernel needs to control access to DOE in all cases,
> doesn't it?  For example, DOE can generate interrupts, and only the
> kernel can field them.  Maybe if I saw the userspace interface this
> would make more sense to me.  I'm hoping there's a reasonable "send
> this query and give me the response" primitive that can be implemented
> in the kernel, used by drivers, and exposed safely to userspace.

A DOE can generate interrupts, but I have yet to see a protocol that
demands that. The userspace interface in the patches is just a binary
attribute to dump the "CDAT" table retrieved over DOE. No generic
passthrough is provided per the concerns above.

> > Lastly, and perhaps this is minor, the PCI core is a built-in object
> > and aux-bus allows for breaking out device library functionality like
> > this into a dedicated module. But yes, that's not a good reason unto
> > itself because you could "auxify" almost anything past the point of
> > reason just to get more modules.
> >
> > > > A DOE mailbox is allowed to support any number of protocols while some
> > > > DOE protocol specifications apply additional restrictions.
> > >
> > > This sounds something like a fancy version of VPD, and VPD has been a
> > > huge headache.  I hope DOE avoids that ;)
> >
> > Please say a bit more, I think DOE is a rather large headache as
> > evidenced by us fine folks grappling with how to architect the Linux
> > enabling.
>
> VPD is not widely used, so gets poor testing.  The device doesn't tell
> us how much VPD it has, so we have to read until the end.  The data is
> theoretically self-describing (series of elements, each containing a
> type and size), but of course some devices don't format it correctly,
> so we read to the absolute limit, which takes a long time.
>
> Typical contents of unimplemented or uninitialized VPD are all 0x00 or
> all 0xff, neither of which is defined as a valid "end of VPD" signal,
> so we have hacky "this doesn't look like VPD" code.
>
> The PCI core doesn't need VPD itself and should only need to look for
> the size of each element and the "end" tag, but it works around some
> issues by doing more interpretation of the data.  The spec is a little
> ambiguous and leaves room for vendors to use types not mentioned in
> the spec.
>
> Some devices share VPD hardware across functions but don't protect it
> correctly (a hardware defect, granted).
>
> VPD has address/data registers, so it requires locking, polling for
> completion, and timeouts.

Yikes, yes, that sounds like a headache.


* Re: [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver
  2021-12-04 15:47         ` Dan Williams
@ 2021-12-06 12:27           ` Jonathan Cameron
  0 siblings, 0 replies; 37+ messages in thread
From: Jonathan Cameron @ 2021-12-06 12:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Bjorn Helgaas, Weiny, Ira, Alison Schofield, Vishal Verma,
	Ben Widawsky, Bjorn Helgaas, linux-cxl, Linux PCI

On Sat, 4 Dec 2021 07:47:59 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> On Fri, Dec 3, 2021 at 3:56 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Fri, Dec 03, 2021 at 12:48:18PM -0800, Dan Williams wrote:  
> > > On Tue, Nov 16, 2021 at 3:48 PM Bjorn Helgaas <helgaas@kernel.org> wrote:  
> > > > On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote:  
> > > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > > >
> > > > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox
> > > > > with standard protocol discovery.  Each mailbox is accessed through a
> > > > > DOE Extended Capability.
> > > > >
> > > > > Define an auxiliary device driver which control DOE auxiliary devices
> > > > > registered on the auxiliary bus.  
> > > >
> > > > What do we gain by making this an auxiliary driver?
> > > >
> > > > This doesn't really feel like a "driver," and apparently it used to be
> > > > a library.  I'd like to see the rationale and benefits of the driver
> > > > approach (in the eventual commit log as well as the current email
> > > > thread).  
> > >
> > > I asked Ira to use the auxiliary bus for DOE primarily for the ABI it
> > > offers for userspace to manage kernel vs userspace access to a device.
> > > CONFIG_IO_STRICT_DEVMEM sets the precedent that userspace cannot
> > > clobber mmio space that is actively claimed by a kernel driver. I
> > > submit that DOE merits the same protection for DOE instances that the
> > > kernel consumes.
> > >
> > > Unlike other PCI configuration registers that root userspace has no
> > > reason to touch unless it wants to actively break things, DOE is a
> > > mechanism that root userspace may need to access directly in some
> > > cases. There are a few examples that come to mind.  
> >
> > It's useful for root to read/write config registers with setpci, e.g.,
> > to test ASPM configuration, test power management behavior, etc.  That
> > can certainly break things and interfere with kernel access (and IMO
> > should taint the kernel) but we have so far accepted that risk.  I
> > think the same will be true for DOE.  
> 
> I think DOE is a demonstrable step worse than those examples and
> pushes into the unacceptable risk category. It invites a communication
> protocol with an unbounded range of side effects (especially when
> controlling a device like a CXL memory expander that affects "System
> RAM" directly). Part of what drives platform / device vendors to the
> standards negotiation table is the OS encouraging common interfaces.
> If Linux provides an unfettered DOE interface it reduces an incentive
> for device vendors to collaborate with the kernel community.
> 
> I do like the taint proposal though, if I can't convince you that DOE
> merits explicit root userspace lockout beyond
> CONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY, I would settle for the kernel
> warning loudly about DOE usage that falls outside the kernel's
> expectations.
> 
> > In addition, I would think you might want a safe userspace interface
> > via sysfs, e.g., something like the "vpd" file, but I missed that if
> > it was in this series.  
> 
> I do not think the kernel is well served by a generic userspace
> passthrough for DOE for the same reason that there is no generic
> passthrough for the ACPI _DSM facility. When the kernel becomes a
> generic pipe to a vendor specific interface it impedes the kernel from
> developing standard interfaces across vendors. Each vendor will ship
> their own quirky feature and corresponding vendor-specific tool with
> minimal incentive to coordinate with other vendors doing similar
> things. At a minimum the userspace interface for DOE should be at a
> level above the raw transport and be enabled per standardized /
> published DOE protocol.  I.e. a userspace interface to read the CDAT
> table retrieved over DOE, or a userspace interface to enumerate IDE
> capabilities, etc.

I agree with Dan that a generic pass through is a bad idea, but we
do have code for one in an earlier version...
https://lore.kernel.org/linux-pci/20210524133938.2815206-5-Jonathan.Cameron@huawei.com/

We could take the approach of an allow list for this, if we can figure
out an appropriate way to manage that list.
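A minimal sketch of what such an allow list might look like (the function name and table entries are hypothetical, not an existing kernel interface; only the keying on the Vendor ID / Data Object Type pair that DOE Discovery reports comes from the DOE ECN):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Each DOE protocol is identified by a (Vendor ID, Data Object
 * Type) pair returned by the DOE Discovery protocol. */
struct doe_prot {
    uint16_t vid;
    uint8_t  type;
};

/* Hypothetical set of protocols userspace would be allowed to
 * drive directly; everything else stays kernel-only. */
static const struct doe_prot doe_allow_list[] = {
    { 0x0001, 0x00 },   /* PCI-SIG DOE Discovery itself */
    { 0x1e98, 0x02 },   /* CXL Table Access (CDAT), per CXL 2.0 */
};

static bool doe_passthrough_allowed(uint16_t vid, uint8_t type)
{
    for (size_t i = 0;
         i < sizeof(doe_allow_list) / sizeof(doe_allow_list[0]); i++) {
        if (doe_allow_list[i].vid == vid &&
            doe_allow_list[i].type == type)
            return true;
    }
    return false;
}
```

Managing the list (module parameter, sysfs, build-time) is the open question; the lookup itself is trivial.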

> 
> > > CXL Compliance Testing (see CXL 2.0 14.16.4 Compliance Mode DOE)
> > > offers a mechanism to set different test modes for a DOE device. The
> > > kernel has no reason to ever use that interface, and it has strong
> > > reasons to want to block access to it in production. However, hardware
> > > vendors also use debug Linux builds for hardware bringup. So I would
> > > like to be able to say that the mechanism to gain access to the
> > > compliance DOE is to detach the aux DOE driver from the right aux DOE
> > > device. Could we build a custom way to do the same for the DOE
> > > library, sure, but why re-invent the wheel when udev and the driver
> > > model can handle this type of policy question already?
> > >
> > > Another use case is SPDM where an agent can establish a secure message
> > > passing channel to a device, or paravirtualized device to exchange
> > > protected messages with the hypervisor. My expectation is that in
> > > addition to the kernel establishing SPDM sessions for PCI IDE and
> CXL.cachemem IDE (Link Integrity and Data Encryption) there will be
> > > use cases for root userspace to establish their own SPDM session. In
> > > that scenario as well the kernel can be told to give up control of a
> > > specific DOE instance by detaching the aux device for its driver, but
> > > otherwise the kernel driver can be assured that userspace will not
> > > clobber its communications with its own attempts to talk over the DOE.  
> >
> > I assume the kernel needs to control access to DOE in all cases,
> > doesn't it?  For example, DOE can generate interrupts, and only the
> > kernel can field them.  Maybe if I saw the userspace interface this
> > would make more sense to me.  I'm hoping there's a reasonable "send
> > this query and give me the response" primitive that can be implemented
> > in the kernel, used by drivers, and exposed safely to userspace.  
> 
> A DOE can generate interrupts, but I have yet to see a protocol that
> demands that. The userspace interface in the patches is just a binary
> attribute to dump the "CDAT" table retrieved over DOE. No generic
> passthrough is provided per the concerns above.

I don't think it would be that hard to set up a protocol specific interface
for establishment of secure channels to cover that particular case.  As long
as we ensure userspace can see / manage the crypto elements it won't matter
if the data is passed through another interface on its way to the DOE.

Jonathan


end of thread, other threads:[~2021-12-06 12:28 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-05 23:50 [PATCH 0/5] CXL: Read CDAT and DSMAS data from the device ira.weiny
2021-11-05 23:50 ` [PATCH 1/5] PCI: Add vendor ID for the PCI SIG ira.weiny
2021-11-17 21:50   ` Bjorn Helgaas
2021-11-05 23:50 ` [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver ira.weiny
2021-11-08 12:15   ` Jonathan Cameron
2021-11-10  5:45     ` Ira Weiny
2021-11-18 18:48       ` Jonathan Cameron
2021-11-16 23:48   ` Bjorn Helgaas
2021-12-03 20:48     ` Dan Williams
2021-12-03 23:56       ` Bjorn Helgaas
2021-12-04 15:47         ` Dan Williams
2021-12-06 12:27           ` Jonathan Cameron
2021-11-05 23:50 ` [PATCH 3/5] cxl/pci: Add DOE Auxiliary Devices ira.weiny
2021-11-08 13:09   ` Jonathan Cameron
2021-11-11  1:31     ` Ira Weiny
2021-11-11 11:53       ` Jonathan Cameron
2021-11-16 23:48   ` Bjorn Helgaas
2021-11-17 12:23     ` Jonathan Cameron
2021-11-17 22:15       ` Bjorn Helgaas
2021-11-18 10:51         ` Jonathan Cameron
2021-11-19  6:48         ` Christoph Hellwig
2021-11-29 23:37           ` Dan Williams
2021-11-29 23:59             ` Dan Williams
2021-11-30  6:42               ` Christoph Hellwig
2021-11-05 23:50 ` [PATCH 4/5] cxl/mem: Add CDAT table reading from DOE ira.weiny
2021-11-08 13:21   ` Jonathan Cameron
2021-11-08 23:19     ` Ira Weiny
2021-11-08 15:02   ` Jonathan Cameron
2021-11-08 22:25     ` Ira Weiny
2021-11-09 11:09       ` Jonathan Cameron
2021-11-19 14:40   ` Jonathan Cameron
2021-11-05 23:50 ` [PATCH 5/5] cxl/cdat: Parse out DSMAS data from CDAT table ira.weiny
2021-11-08 14:52   ` Jonathan Cameron
2021-11-11  3:58     ` Ira Weiny
2021-11-11 11:58       ` Jonathan Cameron
2021-11-18 17:02   ` Jonathan Cameron
2021-11-19 14:55   ` Jonathan Cameron
