* [PATCH V6 00/10] CXL: Read CDAT and DSMAS data from the device
@ 2022-02-01  7:19 ira.weiny
  2022-02-01  7:19 ` [PATCH V6 01/10] PCI: Add vendor ID for the PCI SIG ira.weiny
                   ` (9 more replies)
  0 siblings, 10 replies; 49+ messages in thread
From: ira.weiny @ 2022-02-01  7:19 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron, Bjorn Helgaas
  Cc: Alison Schofield, Vishal Verma, Ira Weiny, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

Changes from V5:[0]

	Rework the patch set to split PCI vs CXL changes
		Also make each change a bit more stand alone for easier review
	Add cxl_cdat structure
	Put CDAT related data structures in cdat.h
	Clarify some device lifetimes with comments
	Incorporate feedback from Jonathan, Bjorn and Dan
		The biggest change is placing the DOE scanning code into the
			pci_doe driver (part of the PCI core).
		Validate the CDAT when it is read rather than before DSMAS
			parsing
		Do not report DSMAS failure as an error, report a warning and
			keep going.
		Retry reading the table 1 time.
	Update commit messages and this cover letter

NOTE: Should we retry more than 1 time?


CXL drivers need various data which are provided through generic DOE mailboxes
as defined in the PCIe r5.0 ECN.[1]

One such piece of data is the Coherent Device Attribute Table (CDAT).  CDAT
data provides coherent information about the various devices in the system.  It
was developed because systems no longer have a priori knowledge of all coherent
devices within a system.  CDAT describes the coherent characteristics of the
components on the CXL bus separate from system configurations.  The OS can
then, for example, use this information to form correct interleave sets.

To begin reading the CDAT the OS must have support to access the DOE mailboxes
provided by the CXL devices.

The series creates a new PCI DOE auxiliary bus driver.  The CXL devices are
modified to create DOE auxiliary devices which are found and driven by the new
PCI DOE driver.

After the devices are created and the driver attaches, CDAT data is read from
the device and DSMAS information is parsed from that CDAT blob for later use.

Because DOE is not specific to CXL but is provided within the PCI spec, the DOE
driver is added to the PCI subsystem.  This is part of the reason the auxiliary
bus architecture was used.  It allows for a clean separation between
subsystems.  In addition, and more importantly, the auxiliary bus architecture
allows for root users to control which DOE mailboxes are controlled by the
kernel and which may be optionally used for direct access by user space.  One
such use case is to allow for CXL Compliance Testing (CXL 2.0 14.16.4
Compliance Mode DOE).  By default the kernel controls this mailbox and would
not allow access to it.  But a root user could detach the driver from this
mailbox and gain direct access to the mailbox on a test system.

sysfs shows this relationship.  The listings below are from a qemu system with
2 memory devices, mem0 and mem1.

$ ls -l /sys/bus/cxl/devices/mem*
lrwxrwxrwx 1 root root 0 Jan 25 16:15 /sys/bus/cxl/devices/mem0 -> ../../../devices/pci0000:34/0000:34:00.0/0000:35:00.0/mem0
lrwxrwxrwx 1 root root 0 Jan 25 16:15 /sys/bus/cxl/devices/mem1 -> ../../../devices/pci0000:34/0000:34:01.0/0000:36:00.0/mem1

$ ls -l /sys/bus/auxiliary/devices/
total 0
lrwxrwxrwx 1 root root 0 Jan 25 16:16 pci_doe.doe.0 -> ../../../devices/pci0000:34/0000:34:00.0/0000:35:00.0/pci_doe.doe.0
lrwxrwxrwx 1 root root 0 Jan 25 16:16 pci_doe.doe.1 -> ../../../devices/pci0000:34/0000:34:01.0/0000:36:00.0/pci_doe.doe.1
lrwxrwxrwx 1 root root 0 Jan 25 16:16 pci_doe.doe.2 -> ../../../devices/pci0000:34/0000:34:01.0/0000:36:00.0/pci_doe.doe.2
lrwxrwxrwx 1 root root 0 Jan 25 16:16 pci_doe.doe.3 -> ../../../devices/pci0000:34/0000:34:00.0/0000:35:00.0/pci_doe.doe.3

$ ls -l /sys/bus/auxiliary/drivers
total 0
drwxr-xr-x 2 root root 0 Jan 25 16:15 pci_doe.pci_doe
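
As a rough sketch of the detach step described above: the driver core exposes
an unbind attribute in every driver directory, so a root user on a test system
could release one mailbox by writing its auxiliary device name there.  The
helper names doe_unbind_path(), doe_write_name() and doe_unbind() are
hypothetical and not part of this series; the driver directory is taken from
the listing above.

```c
#include <stdio.h>

/* Driver directory as it appears in the sysfs listing above. */
#define DOE_DRV_DIR "/sys/bus/auxiliary/drivers/pci_doe.pci_doe"

/*
 * Hypothetical helper: build the path of the driver's unbind attribute.
 * Returns 0 on success, -1 if @buf is too small.
 */
int doe_unbind_path(char *buf, size_t len)
{
	int n = snprintf(buf, len, "%s/unbind", DOE_DRV_DIR);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write an auxiliary device name (for example
 * "pci_doe.doe.2") to @path.  Returns 0 on success, -1 on any error.
 */
int doe_write_name(const char *path, const char *dev_name)
{
	FILE *f = fopen(path, "w");
	int rc = 0;

	if (!f)
		return -1;
	if (fputs(dev_name, f) == EOF)
		rc = -1;
	if (fclose(f) == EOF)
		rc = -1;
	return rc;
}

/* Detach the pci_doe driver from one mailbox; requires root. */
int doe_unbind(const char *dev_name)
{
	char path[128];

	if (doe_unbind_path(path, sizeof(path)))
		return -1;
	return doe_write_name(path, dev_name);
}
```

Calling doe_unbind("pci_doe.doe.2") would leave the other three mailboxes under
kernel control.  Writing the same name to the sibling bind attribute should
reattach the driver.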


This work was built on Jonathan's V4 series here[2].  The big change is a
conversion to an Auxiliary bus infrastructure which allows the DOE code to be
in a separate driver object which is attached to any DOE devices created by any
device.

This work was tested using qemu with additional patches.[3, 4]


[0] https://lore.kernel.org/linux-cxl/20211105235056.3711389-1-ira.weiny@intel.com/
[1] https://pcisig.com/specifications
[2] https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com
[3] https://lore.kernel.org/qemu-devel/20210202005948.241655-1-ben.widawsky@intel.com/
[4] https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/


Ira Weiny (7):
  PCI: Replace magic constant for PCI Sig Vendor ID
  PCI/DOE: Introduce pci_doe_create_doe_devices
  cxl/pci: Create DOE auxiliary devices
  cxl/pci: Find the DOE mailbox which supports CDAT
  cxl/cdat: Introduce cdat_hdr_valid()
  cxl/mem: Retry reading CDAT on failure
  cxl/cdat: Parse out DSMAS data from CDAT table

Jonathan Cameron (3):
  PCI: Add vendor ID for the PCI SIG
  PCI/DOE: Add Data Object Exchange Aux Driver
  cxl/mem: Read CDAT table

 drivers/cxl/Kconfig           |   1 +
 drivers/cxl/cdat.h            | 120 +++++
 drivers/cxl/core/memdev.c     | 141 ++++++
 drivers/cxl/cxl.h             |   3 +
 drivers/cxl/cxlmem.h          |  27 ++
 drivers/cxl/pci.c             | 173 ++++++++
 drivers/pci/Kconfig           |  10 +
 drivers/pci/Makefile          |   3 +
 drivers/pci/doe.c             | 798 ++++++++++++++++++++++++++++++++++
 drivers/pci/probe.c           |   2 +-
 include/linux/pci-doe.h       |  63 +++
 include/linux/pci_ids.h       |   1 +
 include/uapi/linux/pci_regs.h |  29 +-
 13 files changed, 1369 insertions(+), 2 deletions(-)
 create mode 100644 drivers/cxl/cdat.h
 create mode 100644 drivers/pci/doe.c
 create mode 100644 include/linux/pci-doe.h

--
2.31.1



* [PATCH V6 01/10] PCI: Add vendor ID for the PCI SIG
  2022-02-01  7:19 [PATCH V6 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
@ 2022-02-01  7:19 ` ira.weiny
  2022-02-03 17:11   ` Bjorn Helgaas
  2022-02-01  7:19 ` [PATCH V6 02/10] PCI: Replace magic constant for PCI Sig Vendor ID ira.weiny
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: ira.weiny @ 2022-02-01  7:19 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron, Bjorn Helgaas
  Cc: Alison Schofield, Vishal Verma, Ira Weiny, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

This ID is used in DOE headers to identify protocols that are defined
within the PCI Express Base Specification.

Specified in Table 7-x2 of the Data Object Exchange ECN (approved 12 March
2020) available from https://members.pcisig.com/wg/PCI-SIG/document/14143

Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 include/linux/pci_ids.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 011f2f1ea5bb..849f514cd7db 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -149,6 +149,7 @@
 #define PCI_CLASS_OTHERS		0xff
 
 /* Vendors and devices.  Sort key: vendor first, device next. */
+#define PCI_VENDOR_ID_PCI_SIG		0x0001
 
 #define PCI_VENDOR_ID_LOONGSON		0x0014
 
-- 
2.31.1



* [PATCH V6 02/10] PCI: Replace magic constant for PCI Sig Vendor ID
  2022-02-01  7:19 [PATCH V6 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
  2022-02-01  7:19 ` [PATCH V6 01/10] PCI: Add vendor ID for the PCI SIG ira.weiny
@ 2022-02-01  7:19 ` ira.weiny
  2022-02-04 21:16   ` Dan Williams
  2022-02-04 21:49   ` Bjorn Helgaas
  2022-02-01  7:19 ` [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver ira.weiny
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 49+ messages in thread
From: ira.weiny @ 2022-02-01  7:19 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron, Bjorn Helgaas
  Cc: Alison Schofield, Vishal Verma, Ira Weiny, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

Based on Bjorn's suggestion[1], now that the PCI SIG Vendor ID is
defined, pci_bus_crs_vendor_id() should use the define rather than
the hard-coded magic value.

Replace the magic value in pci_bus_crs_vendor_id() with
PCI_VENDOR_ID_PCI_SIG.

[1] https://lore.kernel.org/linux-cxl/20211117215044.GA1777828@bhelgaas/

Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/pci/probe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 087d3658f75c..d92dbb136fc9 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2318,7 +2318,7 @@ EXPORT_SYMBOL(pci_alloc_dev);
 
 static bool pci_bus_crs_vendor_id(u32 l)
 {
-	return (l & 0xffff) == 0x0001;
+	return (l & 0xffff) == PCI_VENDOR_ID_PCI_SIG;
 }
 
 static bool pci_bus_wait_crs(struct pci_bus *bus, int devfn, u32 *l,
-- 
2.31.1



* [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver
  2022-02-01  7:19 [PATCH V6 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
  2022-02-01  7:19 ` [PATCH V6 01/10] PCI: Add vendor ID for the PCI SIG ira.weiny
  2022-02-01  7:19 ` [PATCH V6 02/10] PCI: Replace magic constant for PCI Sig Vendor ID ira.weiny
@ 2022-02-01  7:19 ` ira.weiny
  2022-02-03 22:40   ` Bjorn Helgaas
  2022-02-09  0:59   ` Dan Williams
  2022-02-01  7:19 ` [PATCH V6 04/10] PCI/DOE: Introduce pci_doe_create_doe_devices ira.weiny
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 49+ messages in thread
From: ira.weiny @ 2022-02-01  7:19 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron, Bjorn Helgaas
  Cc: Alison Schofield, Vishal Verma, Ira Weiny, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Introduced in a PCI ECN [1], DOE provides a config space based mailbox
with standard protocol discovery.  Each mailbox is accessed through a
DOE Extended Capability.

Define an auxiliary device driver which controls DOE auxiliary devices
registered on the auxiliary bus.

A DOE mailbox is allowed to support any number of protocols while some
DOE protocol specifications apply additional restrictions.

The protocols supported are queried and cached.  pci_doe_supports_prot()
can be used to determine if the DOE device supports the protocol
specified.
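
The query-and-cache step can be modelled in user space roughly as follows.
This is only a sketch: the response field layout (vendor ID in bits 15:0,
protocol in bits 23:16, next index in bits 31:24) is assumed from the DOE ECN,
the names doe_cache_prots(), doe_supports_prot(), fake_disc() and struct
fake_mbox are hypothetical, and a fixed-size caller buffer stands in for the
devm_krealloc() growth used by the driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Discovery payload fields; layout assumed from the DOE ECN. */
#define DISC_REQ_INDEX(i)	((uint32_t)(i) & 0xff)
#define DISC_RSP_VID(r)		((uint16_t)((r) & 0xffff))
#define DISC_RSP_PROT(r)	((uint8_t)(((r) >> 16) & 0xff))
#define DISC_RSP_NEXT(r)	((uint8_t)(((r) >> 24) & 0xff))

struct doe_prot {
	uint16_t vid;
	uint8_t type;
};

/* Hypothetical stand-in for one discovery exchange with a mailbox. */
typedef int (*doe_disc_fn)(uint32_t request, uint32_t *response, void *ctx);

/*
 * Walk the discovery chain from index 0 until the mailbox reports a next
 * index of 0.  Returns the number of protocols stored in @prots, or -1 if
 * an exchange fails or more than @max protocols are reported.
 */
int doe_cache_prots(doe_disc_fn xchg, void *ctx, struct doe_prot *prots,
		    size_t max)
{
	uint8_t index = 0;
	size_t n = 0;

	do {
		uint32_t rsp;

		if (n == max)
			return -1;
		if (xchg(DISC_REQ_INDEX(index), &rsp, ctx))
			return -1;

		prots[n].vid = DISC_RSP_VID(rsp);
		prots[n].type = DISC_RSP_PROT(rsp);
		n++;
		index = DISC_RSP_NEXT(rsp);
	} while (index);

	return (int)n;
}

/* Linear search of the cached table; returns 1 if found, 0 otherwise. */
int doe_supports_prot(const struct doe_prot *prots, size_t n, uint16_t vid,
		      uint8_t type)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (prots[i].vid == vid && prots[i].type == type)
			return 1;
	return 0;
}

/* Simulated mailbox backed by a table, for exercising the loop. */
struct fake_mbox {
	const struct doe_prot *tbl;
	size_t n;
};

int fake_disc(uint32_t request, uint32_t *response, void *ctx)
{
	struct fake_mbox *m = ctx;
	size_t i = request & 0xff;
	uint8_t next;

	if (i >= m->n)
		return -1;
	/* The last entry reports next index 0, which ends the walk. */
	next = (i + 1 < m->n) ? (uint8_t)(i + 1) : 0;
	*response = m->tbl[i].vid | ((uint32_t)m->tbl[i].type << 16) |
		    ((uint32_t)next << 24);
	return 0;
}
```

Because the discovery protocol must report itself, the walk always yields at
least one entry when the first exchange succeeds.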

A synchronous interface is provided in pci_doe_exchange_sync() to
perform a single query / response exchange from the driver through the
device specified.
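
For reviewers less familiar with the data object format, the framing of a
single exchange can be sketched in user space as below.  The field positions
(vendor ID in bits 15:0 and type in bits 23:16 of the first header dword,
length in dwords in bits 17:0 of the second) are assumed from the DOE ECN, and
doe_frame_request() and doe_response_payload_dw() are hypothetical names, not
functions from this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Header field layout; assumed from the DOE ECN. */
#define DOE_HDR1_TYPE_SHIFT	16
#define DOE_HDR2_LEN_MASK	0x0003ffffu
#define DOE_HDR_DW		2	/* two header dwords precede the payload */

/*
 * Lay out a request as the dwords that would be written to the write
 * mailbox register, in order.  Returns the total length in dwords, or -1
 * if the payload is not a whole number of dwords or does not fit.
 */
int doe_frame_request(uint16_t vid, uint8_t type, const uint32_t *pl,
		      size_t pl_bytes, uint32_t *out, size_t out_dw)
{
	size_t pl_dw = pl_bytes / sizeof(uint32_t);
	size_t i;

	if (pl_bytes % sizeof(uint32_t))
		return -1;
	if (DOE_HDR_DW + pl_dw > out_dw ||
	    DOE_HDR_DW + pl_dw > DOE_HDR2_LEN_MASK)
		return -1;

	out[0] = vid | ((uint32_t)type << DOE_HDR1_TYPE_SHIFT);
	/* The length field counts the two header dwords plus the payload. */
	out[1] = (uint32_t)(DOE_HDR_DW + pl_dw);
	for (i = 0; i < pl_dw; i++)
		out[DOE_HDR_DW + i] = pl[i];

	return (int)(DOE_HDR_DW + pl_dw);
}

/*
 * Extract the payload length in dwords from the second response header.
 * Returns -1 if the reported length cannot even cover the header.
 */
long doe_response_payload_dw(uint32_t hdr2)
{
	uint32_t len = hdr2 & DOE_HDR2_LEN_MASK;

	if (len < DOE_HDR_DW)
		return -1;
	return (long)(len - DOE_HDR_DW);
}
```

A discovery request with a one-dword payload therefore frames to three dwords
in total, and a response whose length field is below two is rejected.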

Testing was conducted against QEMU using:

https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/

This code is based on Jonathan's V4 series here:

https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/

[1] https://members.pcisig.com/wg/PCI-SIG/document/14143
    Data Object Exchange (DOE) - Approved 12 March 2020

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

---
NOTE: Bjorn mentioned that the Signed-off-by tags are backwards, but
	checkpatch complains no matter what I do.  Either the
	Co-developed-by is wrong or the Signed-off-by is wrong because
	Jonathan is the original author.  The above order is acceptable
	to checkpatch so I left it that way.

Changes from V5
	From Bjorn
		s/pci_WARN/pci_warn
			Add timeout period to print
		Trim to 80 chars
		Use Tabs for DOE define spacing
		Use %#x for clarity
	From Jonathan
		Addresses concerns about the order of unwinding stuff
		s/doe/doe_dev in pci_doe_exchange_sync
		Correct kernel Doc comment
		Move pci_doe_task_complete() down in the file.
		Rework pci_doe_irq()
			process STATUS_ERROR first
			Return IRQ_NONE if the irq is not processed
			Use PCI_DOE_STATUS_INT_STATUS explicitly to
				clear the irq
	Clean up goto label s/err_free_irqs/err_free_irq
	use devm_kzalloc for doe struct
	clean up error paths in pci_doe_probe
	s/pci_doe_drv/pci_doe
	remove include mutex.h
	remove device name and define, move it in the next patch which uses it
	use devm_kasprintf() for irq_name
	use devm_request_irq()
	remove pci_doe_unregister()
		[get/put]_device() were unneeded and with the use of
		devm_* this function can be removed completely.
	refactor pci_doe_register and s/pci_doe_register/pci_doe_reg_irq
		make this function just a registration of the irq and
		move pci_doe_abort() into pci_doe_probe()
	use devm_* to allocate the protocol array

Changes from Jonathan's V4
	Move the DOE MB code into the DOE auxiliary driver
	Remove Task List in favor of a wait queue

Changes from Ben
	remove CXL references
	propagate rc from pci functions on error
---
 drivers/pci/Kconfig           |  10 +
 drivers/pci/Makefile          |   3 +
 drivers/pci/doe.c             | 675 ++++++++++++++++++++++++++++++++++
 include/linux/pci-doe.h       |  60 +++
 include/uapi/linux/pci_regs.h |  29 +-
 5 files changed, 776 insertions(+), 1 deletion(-)
 create mode 100644 drivers/pci/doe.c
 create mode 100644 include/linux/pci-doe.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 43e615aa12ff..8de51b64067c 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -118,6 +118,16 @@ config XEN_PCIDEV_FRONTEND
 	  The PCI device frontend driver allows the kernel to import arbitrary
 	  PCI devices from a PCI backend to support PCI driver domains.
 
+config PCI_DOE_DRIVER
+	tristate "PCI Data Object Exchange (DOE) driver"
+	select AUXILIARY_BUS
+	help
+	  Driver for DOE auxiliary devices.
+
+	  DOE provides a simple mailbox in PCI config space that is used by a
+	  number of different protocols.  DOE is defined in the Data Object
+	  Exchange ECN to the PCIe r5.0 spec.
+
 config PCI_ATS
 	bool
 
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index d62c4ac4ae1b..afd9d7bd2b82 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -28,8 +28,11 @@ obj-$(CONFIG_PCI_STUB)		+= pci-stub.o
 obj-$(CONFIG_PCI_PF_STUB)	+= pci-pf-stub.o
 obj-$(CONFIG_PCI_ECAM)		+= ecam.o
 obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
+obj-$(CONFIG_PCI_DOE_DRIVER)	+= pci-doe.o
 obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
 
+pci-doe-y := doe.o
+
 # Endpoint library must be initialized before its users
 obj-$(CONFIG_PCI_ENDPOINT)	+= endpoint/
 
diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
new file mode 100644
index 000000000000..4ff54bade8ec
--- /dev/null
+++ b/drivers/pci/doe.c
@@ -0,0 +1,675 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Data Object Exchange ECN
+ * https://members.pcisig.com/wg/PCI-SIG/document/14143
+ *
+ * Copyright (C) 2021 Huawei
+ *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
+ */
+
+#include <linux/bitfield.h>
+#include <linux/delay.h>
+#include <linux/jiffies.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci-doe.h>
+#include <linux/workqueue.h>
+#include <linux/module.h>
+
+#define PCI_DOE_PROTOCOL_DISCOVERY 0
+
+#define PCI_DOE_BUSY_MAX_RETRIES 16
+#define PCI_DOE_POLL_INTERVAL (HZ / 128)
+
+/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */
+#define PCI_DOE_TIMEOUT HZ
+
+enum pci_doe_state {
+	DOE_IDLE,
+	DOE_WAIT_RESP,
+	DOE_WAIT_ABORT,
+	DOE_WAIT_ABORT_ON_ERR,
+};
+
+/**
+ * struct pci_doe_task - description of a query / response task
+ * @ex: The details of the task to be done
+ * @rv: Return value.  Length of received response or error
+ * @cb: Callback for completion of task
+ * @private: Private data passed to callback on completion
+ */
+struct pci_doe_task {
+	struct pci_doe_exchange *ex;
+	int rv;
+	void (*cb)(void *private);
+	void *private;
+};
+
+/**
+ * struct pci_doe - A single DOE mailbox driver
+ *
+ * @doe_dev: The DOE Auxiliary device being driven
+ * @abort_c: Completion used for initial abort handling
+ * @irq: Interrupt used for signaling DOE ready or abort
+ * @irq_name: Name used to identify the irq for a particular DOE
+ * @prots: Array of identifiers for protocols supported
+ * @num_prots: Size of prots array
+ * @cur_task: Current task the state machine is working on
+ * @wq: Wait queue to wait on if a query is in progress
+ * @state_lock: Protect the state of cur_task, abort, and dead
+ * @statemachine: Work item for the DOE state machine
+ * @state: Current state of this DOE
+ * @timeout_jiffies: 1 second after GO set
+ * @busy_retries: Count of retry attempts
+ * @abort: Request a manual abort (e.g. on init)
+ * @dead: Used to mark a DOE for which an ABORT has timed out. Further messages
+ *        will immediately be aborted with error
+ */
+struct pci_doe {
+	struct pci_doe_dev *doe_dev;
+	struct completion abort_c;
+	int irq;
+	char *irq_name;
+	struct pci_doe_protocol *prots;
+	int num_prots;
+
+	struct pci_doe_task *cur_task;
+	wait_queue_head_t wq;
+	struct mutex state_lock;
+	struct delayed_work statemachine;
+	enum pci_doe_state state;
+	unsigned long timeout_jiffies;
+	unsigned int busy_retries;
+	unsigned int abort:1;
+	unsigned int dead:1;
+};
+
+static irqreturn_t pci_doe_irq(int irq, void *data)
+{
+	struct pci_doe *doe = data;
+	struct pci_dev *pdev = doe->doe_dev->pdev;
+	int offset = doe->doe_dev->cap_offset;
+	u32 val;
+
+	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+
+	/* Leave the error case to be handled outside IRQ */
+	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
+		mod_delayed_work(system_wq, &doe->statemachine, 0);
+		return IRQ_HANDLED;
+	}
+
+	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
+		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
+					PCI_DOE_STATUS_INT_STATUS);
+		mod_delayed_work(system_wq, &doe->statemachine, 0);
+		return IRQ_HANDLED;
+	}
+
+	return IRQ_NONE;
+}
+
+/*
+ * Only call when safe to directly access the DOE, either because no tasks yet
+ * queued, or called from doe_statemachine_work() which has exclusive access to
+ * the DOE config space.
+ */
+static void pci_doe_abort_start(struct pci_doe *doe)
+{
+	struct pci_dev *pdev = doe->doe_dev->pdev;
+	int offset = doe->doe_dev->cap_offset;
+	u32 val;
+
+	val = PCI_DOE_CTRL_ABORT;
+	if (doe->irq)
+		val |= PCI_DOE_CTRL_INT_EN;
+	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
+
+	doe->timeout_jiffies = jiffies + HZ;
+	schedule_delayed_work(&doe->statemachine, HZ);
+}
+
+static int pci_doe_send_req(struct pci_doe *doe, struct pci_doe_exchange *ex)
+{
+	struct pci_dev *pdev = doe->doe_dev->pdev;
+	int offset = doe->doe_dev->cap_offset;
+	u32 val;
+	int i;
+
+	/*
+	 * Check the DOE busy bit is not set. If it is set, this could indicate
+	 * someone other than Linux (e.g. firmware) is using the mailbox. Note
+	 * it is expected that firmware and OS will negotiate access rights via
+	 * an as-yet-to-be-defined method.
+	 */
+	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+	if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
+		return -EBUSY;
+
+	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
+		return -EIO;
+
+	/* Write DOE Header */
+	val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, ex->prot.vid) |
+		FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, ex->prot.type);
+	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val);
+	/* Length is 2 DW of header + length of payload in DW */
+	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
+			       FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH,
+					  2 + ex->request_pl_sz /
+						sizeof(u32)));
+	for (i = 0; i < ex->request_pl_sz / sizeof(u32); i++)
+		pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
+				       ex->request_pl[i]);
+
+	val = PCI_DOE_CTRL_GO;
+	if (doe->irq)
+		val |= PCI_DOE_CTRL_INT_EN;
+
+	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
+	/* Request is sent - now wait for poll or IRQ */
+	return 0;
+}
+
+static int pci_doe_recv_resp(struct pci_doe *doe, struct pci_doe_exchange *ex)
+{
+	struct pci_dev *pdev = doe->doe_dev->pdev;
+	int offset = doe->doe_dev->cap_offset;
+	size_t length;
+	u32 val;
+	int i;
+
+	/* Read the first dword to get the protocol */
+	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+	if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != ex->prot.vid) ||
+	    (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != ex->prot.type)) {
+		pci_err(pdev,
+			"Expected [VID, Protocol] = [%#x, %#x], got [%#x, %#x]\n",
+			ex->prot.vid, ex->prot.type,
+			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
+			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
+		return -EIO;
+	}
+
+	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+	/* Read the second dword to get the length */
+	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+
+	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
+	if (length > SZ_1M || length < 2)
+		return -EIO;
+
+	/* First 2 dwords have already been read */
+	length -= 2;
+	/* Read the rest of the response payload */
+	for (i = 0; i < min(length, ex->response_pl_sz / sizeof(u32)); i++) {
+		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
+				      &ex->response_pl[i]);
+		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+	}
+
+	/* Flush excess length */
+	for (; i < length; i++) {
+		pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+	}
+	/* Final error check to pick up on any since Data Object Ready */
+	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
+		return -EIO;
+
+	return min(length, ex->response_pl_sz / sizeof(u32)) * sizeof(u32);
+}
+
+static void doe_statemachine_work(struct work_struct *work)
+{
+	struct delayed_work *w = to_delayed_work(work);
+	struct pci_doe *doe = container_of(w, struct pci_doe, statemachine);
+	struct pci_dev *pdev = doe->doe_dev->pdev;
+	int offset = doe->doe_dev->cap_offset;
+	struct pci_doe_task *task;
+	bool abort;
+	u32 val;
+	int rc;
+
+	mutex_lock(&doe->state_lock);
+	task = doe->cur_task;
+	abort = doe->abort;
+	doe->abort = false;
+	mutex_unlock(&doe->state_lock);
+
+	if (abort) {
+		/*
+		 * Currently only used during init - care needed if
+		 * pci_doe_abort() is generally exposed as it would impact
+		 * queries in flight.
+		 */
+		WARN_ON(task);
+		doe->state = DOE_WAIT_ABORT;
+		pci_doe_abort_start(doe);
+		return;
+	}
+
+	switch (doe->state) {
+	case DOE_IDLE:
+		if (task == NULL)
+			return;
+
+		/* Nothing currently in flight so queue a task */
+		rc = pci_doe_send_req(doe, task->ex);
+		/*
+		 * The specification does not provide any guidance on how long
+		 * some other entity could keep the DOE busy, so try for 1
+		 * second then fail. Busy handling is best effort only, because
+		 * there is no way of avoiding racing against another user of
+		 * the DOE.
+		 */
+		if (rc == -EBUSY) {
+			doe->busy_retries++;
+			if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
+				/* Long enough, fail this request */
+				pci_warn(pdev,
+					"DOE busy for too long (> 1 sec)\n");
+				doe->busy_retries = 0;
+				goto err_busy;
+			}
+			schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES);
+			return;
+		}
+		if (rc)
+			goto err_abort;
+		doe->busy_retries = 0;
+
+		doe->state = DOE_WAIT_RESP;
+		doe->timeout_jiffies = jiffies + HZ;
+		/* Now poll or wait for IRQ with timeout */
+		if (doe->irq > 0)
+			schedule_delayed_work(w, PCI_DOE_TIMEOUT);
+		else
+			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
+		return;
+
+	case DOE_WAIT_RESP:
+		/* Not possible to get here with NULL task */
+		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+		if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
+			rc = -EIO;
+			goto err_abort;
+		}
+
+		if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
+			/* If not yet at timeout reschedule otherwise abort */
+			if (time_after(jiffies, doe->timeout_jiffies)) {
+				rc = -ETIMEDOUT;
+				goto err_abort;
+			}
+			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
+			return;
+		}
+
+		rc  = pci_doe_recv_resp(doe, task->ex);
+		if (rc < 0)
+			goto err_abort;
+
+		doe->state = DOE_IDLE;
+
+		mutex_lock(&doe->state_lock);
+		doe->cur_task = NULL;
+		mutex_unlock(&doe->state_lock);
+		wake_up_interruptible(&doe->wq);
+
+		/* Set the return value to the length of received payload */
+		task->rv = rc;
+		task->cb(task->private);
+
+		return;
+
+	case DOE_WAIT_ABORT:
+	case DOE_WAIT_ABORT_ON_ERR:
+		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+
+		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
+		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
+			/* Back to normal state - carry on */
+			mutex_lock(&doe->state_lock);
+			doe->cur_task = NULL;
+			mutex_unlock(&doe->state_lock);
+			wake_up_interruptible(&doe->wq);
+
+			/*
+			 * For deliberately triggered abort, someone is
+			 * waiting.
+			 */
+			if (doe->state == DOE_WAIT_ABORT)
+				complete(&doe->abort_c);
+
+			doe->state = DOE_IDLE;
+			return;
+		}
+		if (time_after(jiffies, doe->timeout_jiffies)) {
+			/* Task has timed out and is dead - abort */
+			pci_err(pdev, "DOE ABORT timed out\n");
+			mutex_lock(&doe->state_lock);
+			doe->dead = true;
+			doe->cur_task = NULL;
+			mutex_unlock(&doe->state_lock);
+			wake_up_interruptible(&doe->wq);
+
+			if (doe->state == DOE_WAIT_ABORT)
+				complete(&doe->abort_c);
+		}
+		return;
+	}
+
+err_abort:
+	doe->state = DOE_WAIT_ABORT_ON_ERR;
+	pci_doe_abort_start(doe);
+err_busy:
+	task->rv = rc;
+	task->cb(task->private);
+	/* If here via err_busy, signal the task done. */
+	if (doe->state == DOE_IDLE) {
+		mutex_lock(&doe->state_lock);
+		doe->cur_task = NULL;
+		mutex_unlock(&doe->state_lock);
+		wake_up_interruptible(&doe->wq);
+	}
+}
+
+static void pci_doe_task_complete(void *private)
+{
+	complete(private);
+}
+
+/**
+ * pci_doe_exchange_sync() - Send a request, then wait for and receive a
+ *			     response
+ * @doe_dev: DOE mailbox state structure
+ * @ex: Description of the buffers and Vendor ID + type used in this
+ *      request/response pair
+ *
+ * Excess data will be discarded.
+ *
+ * RETURNS: payload in bytes on success, < 0 on error
+ */
+int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev,
+			  struct pci_doe_exchange *ex)
+{
+	struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
+	struct pci_doe_task task;
+	DECLARE_COMPLETION_ONSTACK(c);
+
+	if (!doe)
+		return -EAGAIN;
+
+	/* DOE requests must be a whole number of DW */
+	if (ex->request_pl_sz % sizeof(u32))
+		return -EINVAL;
+
+	task.ex = ex;
+	task.cb = pci_doe_task_complete;
+	task.private = &c;
+
+again:
+	mutex_lock(&doe->state_lock);
+	if (doe->cur_task) {
+		mutex_unlock(&doe->state_lock);
+		wait_event_interruptible(doe->wq, doe->cur_task == NULL);
+		goto again;
+	}
+
+	if (doe->dead) {
+		mutex_unlock(&doe->state_lock);
+		return -EIO;
+	}
+	doe->cur_task = &task;
+	schedule_delayed_work(&doe->statemachine, 0);
+	mutex_unlock(&doe->state_lock);
+
+	wait_for_completion(&c);
+
+	return task.rv;
+}
+EXPORT_SYMBOL_GPL(pci_doe_exchange_sync);
+
+/**
+ * pci_doe_supports_prot() - Return if the DOE instance supports the given
+ *			     protocol
+ * @doe_dev: DOE device to check for protocol support
+ * @vid: Protocol Vendor ID
+ * @type: protocol type
+ *
+ * This device can then be passed to pci_doe_exchange_sync() to execute a
+ * mailbox exchange through that DOE mailbox.
+ *
+ * RETURNS: True if the DOE device supports the protocol specified
+ */
+bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type)
+{
+	struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
+	int i;
+
+	if (!doe)
+		return false;
+
+	for (i = 0; i < doe->num_prots; i++)
+		if ((doe->prots[i].vid == vid) &&
+		    (doe->prots[i].type == type))
+			return true;
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
+
+static int pci_doe_discovery(struct pci_doe *doe, u8 *index, u16 *vid,
+			     u8 *protocol)
+{
+	u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX,
+				    *index);
+	u32 response_pl;
+	struct pci_doe_exchange ex = {
+		.prot.vid = PCI_VENDOR_ID_PCI_SIG,
+		.prot.type = PCI_DOE_PROTOCOL_DISCOVERY,
+		.request_pl = &request_pl,
+		.request_pl_sz = sizeof(request_pl),
+		.response_pl = &response_pl,
+		.response_pl_sz = sizeof(response_pl),
+	};
+	int ret;
+
+	ret = pci_doe_exchange_sync(doe->doe_dev, &ex);
+	if (ret < 0)
+		return ret;
+
+	if (ret != sizeof(response_pl))
+		return -EIO;
+
+	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
+	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
+			      response_pl);
+	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
+			   response_pl);
+
+	return 0;
+}
+
+static int pci_doe_cache_protocols(struct pci_doe *doe)
+{
+	u8 index = 0;
+	int num_prots;
+	int rc;
+
+	/* Discovery protocol must always be supported and must report itself */
+	num_prots = 1;
+	doe->prots = devm_kcalloc(&doe->doe_dev->adev.dev, num_prots,
+				  sizeof(*doe->prots), GFP_KERNEL);
+	if (doe->prots == NULL)
+		return -ENOMEM;
+
+	do {
+		struct pci_doe_protocol *prot;
+
+		prot = &doe->prots[num_prots - 1];
+		rc = pci_doe_discovery(doe, &index, &prot->vid, &prot->type);
+		if (rc)
+			return rc;
+
+		if (index) {
+			struct pci_doe_protocol *prot_new;
+
+			num_prots++;
+			prot_new = devm_krealloc(&doe->doe_dev->adev.dev,
+						 doe->prots,
+						 sizeof(*doe->prots) *
+							num_prots,
+						 GFP_KERNEL);
+			if (prot_new == NULL)
+				return -ENOMEM;
+			doe->prots = prot_new;
+		}
+	} while (index);
+
+	doe->num_prots = num_prots;
+	return 0;
+}
+
+static int pci_doe_abort(struct pci_doe *doe)
+{
+	reinit_completion(&doe->abort_c);
+	mutex_lock(&doe->state_lock);
+	doe->abort = true;
+	mutex_unlock(&doe->state_lock);
+	schedule_delayed_work(&doe->statemachine, 0);
+	wait_for_completion(&doe->abort_c);
+
+	if (doe->dead)
+		return -EIO;
+
+	return 0;
+}
+
+static int pci_doe_reg_irq(struct pci_doe *doe)
+{
+	struct pci_dev *pdev = doe->doe_dev->pdev;
+	bool poll = !pci_dev_msi_enabled(pdev);
+	int offset = doe->doe_dev->cap_offset;
+	int rc, irq;
+	u32 val;
+
+	pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
+
+	if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) {
+		irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val));
+		if (irq < 0)
+			return irq;
+
+		doe->irq_name = devm_kasprintf(&doe->doe_dev->adev.dev,
+						GFP_KERNEL,
+						"DOE[%s]",
+						doe->doe_dev->adev.name);
+		if (!doe->irq_name)
+			return -ENOMEM;
+
+		rc = devm_request_irq(&pdev->dev, irq, pci_doe_irq, 0,
+				      doe->irq_name, doe);
+		if (rc)
+			return rc;
+
+		doe->irq = irq;
+		pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
+				       PCI_DOE_CTRL_INT_EN);
+	}
+
+	return 0;
+}
+
+/*
+ * pci_doe_probe() - Set up the Mailbox
+ * @aux_dev: Auxiliary Device
+ * @id: Auxiliary device ID
+ *
+ * Probe the mailbox found for all protocols and set up the Mailbox
+ *
+ * RETURNS: 0 on success, < 0 on error
+ */
+static int pci_doe_probe(struct auxiliary_device *aux_dev,
+			 const struct auxiliary_device_id *id)
+{
+	struct pci_doe_dev *doe_dev = container_of(aux_dev,
+					struct pci_doe_dev,
+					adev);
+	struct pci_doe *doe;
+	int rc;
+
+	doe = devm_kzalloc(&aux_dev->dev, sizeof(*doe), GFP_KERNEL);
+	if (!doe)
+		return -ENOMEM;
+
+	mutex_init(&doe->state_lock);
+	init_completion(&doe->abort_c);
+	doe->doe_dev = doe_dev;
+	init_waitqueue_head(&doe->wq);
+	INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work);
+	dev_set_drvdata(&aux_dev->dev, doe);
+
+	rc = pci_doe_reg_irq(doe);
+	if (rc)
+		return rc;
+
+	/* Reset the mailbox by issuing an abort */
+	rc = pci_doe_abort(doe);
+	if (rc)
+		return rc;
+
+	rc = pci_doe_cache_protocols(doe);
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
+static void pci_doe_remove(struct auxiliary_device *aux_dev)
+{
+	struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev);
+
+	/* First halt the state machine */
+	cancel_delayed_work_sync(&doe->statemachine);
+}
+
+static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = {
+	{},
+};
+
+MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table);
+
+struct auxiliary_driver pci_doe_auxiliary_drv = {
+	.name = "pci_doe",
+	.id_table = pci_doe_auxiliary_id_table,
+	.probe = pci_doe_probe,
+	.remove = pci_doe_remove
+};
+
+static int __init pci_doe_init_module(void)
+{
+	int ret;
+
+	ret = auxiliary_driver_register(&pci_doe_auxiliary_drv);
+	if (ret) {
+		pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n",
+		       ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void __exit pci_doe_exit_module(void)
+{
+	auxiliary_driver_unregister(&pci_doe_auxiliary_drv);
+}
+
+module_init(pci_doe_init_module);
+module_exit(pci_doe_exit_module);
+MODULE_LICENSE("GPL v2");
diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
new file mode 100644
index 000000000000..2f52b31c6f32
--- /dev/null
+++ b/include/linux/pci-doe.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Data Object Exchange was added as an ECN to the PCIe r5.0 spec.
+ *
+ * Copyright (C) 2021 Huawei
+ *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
+ */
+
+#include <linux/completion.h>
+#include <linux/list.h>
+#include <linux/auxiliary_bus.h>
+
+#ifndef LINUX_PCI_DOE_H
+#define LINUX_PCI_DOE_H
+
+struct pci_doe_protocol {
+	u16 vid;
+	u8 type;
+};
+
+/**
+ * struct pci_doe_exchange - represents a single query/response
+ *
+ * @prot: DOE Protocol
+ * @request_pl: The request payload
+ * @request_pl_sz: Size of the request payload
+ * @response_pl: The response payload
+ * @response_pl_sz: Size of the response payload
+ */
+struct pci_doe_exchange {
+	struct pci_doe_protocol prot;
+	u32 *request_pl;
+	size_t request_pl_sz;
+	u32 *response_pl;
+	size_t response_pl_sz;
+};
+
+/**
+ * struct pci_doe_dev - DOE mailbox device
+ *
+ * @adev: Auxiliary device data
+ * @pdev: PCI device this belongs to
+ * @cap_offset: Capability offset
+ *
+ * This represents a single DOE mailbox device.  Devices should create this
+ * device and register it on the Auxiliary bus for the DOE driver to maintain.
+ *
+ */
+struct pci_doe_dev {
+	struct auxiliary_device adev;
+	struct pci_dev *pdev;
+	int cap_offset;
+};
+
+/* Library operations */
+int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev,
+				 struct pci_doe_exchange *ex);
+bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type);
+
+#endif
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index ff6ccbc6efe9..c04aad391669 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -736,7 +736,8 @@
 #define PCI_EXT_CAP_ID_DVSEC	0x23	/* Designated Vendor-Specific */
 #define PCI_EXT_CAP_ID_DLF	0x25	/* Data Link Feature */
 #define PCI_EXT_CAP_ID_PL_16GT	0x26	/* Physical Layer 16.0 GT/s */
-#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_PL_16GT
+#define PCI_EXT_CAP_ID_DOE	0x2E	/* Data Object Exchange */
+#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_DOE
 
 #define PCI_EXT_CAP_DSN_SIZEOF	12
 #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
@@ -1098,4 +1099,30 @@
 #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK		0x000000F0
 #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT	4
 
+/* Data Object Exchange */
+#define PCI_DOE_CAP		0x04    /* DOE Capabilities Register */
+#define  PCI_DOE_CAP_INT			0x00000001  /* Interrupt Support */
+#define  PCI_DOE_CAP_IRQ			0x00000ffe  /* Interrupt Message Number */
+#define PCI_DOE_CTRL		0x08    /* DOE Control Register */
+#define  PCI_DOE_CTRL_ABORT			0x00000001  /* DOE Abort */
+#define  PCI_DOE_CTRL_INT_EN			0x00000002  /* DOE Interrupt Enable */
+#define  PCI_DOE_CTRL_GO			0x80000000  /* DOE Go */
+#define PCI_DOE_STATUS		0x0c    /* DOE Status Register */
+#define  PCI_DOE_STATUS_BUSY			0x00000001  /* DOE Busy */
+#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */
+#define  PCI_DOE_STATUS_ERROR			0x00000004  /* DOE Error */
+#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */
+#define PCI_DOE_WRITE		0x10    /* DOE Write Data Mailbox Register */
+#define PCI_DOE_READ		0x14    /* DOE Read Data Mailbox Register */
+
+/* DOE Data Object - note not actually registers */
+#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
+#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
+#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
+
+#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
+
 #endif /* LINUX_PCI_REGS_H */
-- 
2.31.1


* [PATCH V6 04/10] PCI/DOE: Introduce pci_doe_create_doe_devices
  2022-02-01  7:19 [PATCH V6 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (2 preceding siblings ...)
  2022-02-01  7:19 ` [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver ira.weiny
@ 2022-02-01  7:19 ` ira.weiny
  2022-02-03 22:44   ` Bjorn Helgaas
  2022-02-01  7:19 ` [PATCH V6 05/10] cxl/pci: Create DOE auxiliary devices ira.weiny
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: ira.weiny @ 2022-02-01  7:19 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron, Bjorn Helgaas
  Cc: Alison Schofield, Vishal Verma, Ira Weiny, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

CXL and/or PCI devices can define DOE mailboxes.  Normally the kernel
will want to maintain control of all of these mailboxes.  However, in
a limited number of use cases users may want to allow user space access
to some of these mailboxes while the kernel retains control of the rest.
An example of this is for CXL Compliance Testing (see CXL 2.0 14.16.4
Compliance Mode DOE) which offers a mechanism to set different test
modes for a device.

Rather than re-invent the wheel the architecture creates auxiliary
devices for each DOE mailbox which can then be driven by a generic DOE
mailbox driver.  If access to an individual mailbox is required by user
space the driver for that mailbox can be unloaded and access handed to
user space.

Create the helper pci_doe_create_doe_devices() which iterates each DOE
mailbox found in the device and creates a DOE auxiliary device on the
auxiliary bus.  While doing so ensure that the auxiliary DOE driver
loads to drive that device.
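
As an illustration only (not part of this patch), the mailbox iteration
described above can be sketched in user space against a simulated copy of
extended config space.  The helper names find_next_ext_cap() and
count_doe_mailboxes() are hypothetical; the walk is modeled on what
pci_find_next_ext_capability() does, and the header layout (capability ID
in bits 15:0, next offset in bits 31:20) is the standard PCIe extended
capability header.

```c
#include <assert.h>
#include <stdint.h>

#define CFG_SPACE_SIZE		0x100	/* first extended capability lives here */
#define CFG_SPACE_EXP_SIZE	0x1000	/* size of extended config space */
#define EXT_CAP_ID_DOE		0x2E	/* same value as PCI_EXT_CAP_ID_DOE */

#define EXT_CAP_ID(hdr)		((hdr) & 0x0000ffff)
#define EXT_CAP_NEXT(hdr)	(((hdr) >> 20) & 0xffc)

/* Hypothetical stand-in for pci_read_config_dword() on a memory snapshot */
static uint32_t cfg_read32(const uint32_t *cfg, uint16_t pos)
{
	return cfg[pos / 4];
}

/*
 * Return the offset of the next capability with @cap_id after @start,
 * or 0 if there is none.  @start == 0 means "search from the beginning".
 * @cfg must hold all 4096 bytes of config space as 1024 dwords.
 */
static uint16_t find_next_ext_cap(const uint32_t *cfg, uint16_t start,
				  uint16_t cap_id)
{
	/* Each capability is at least 8 bytes, which bounds the walk */
	int ttl = (CFG_SPACE_EXP_SIZE - CFG_SPACE_SIZE) / 8;
	uint16_t pos = start ? start : CFG_SPACE_SIZE;
	uint32_t hdr = cfg_read32(cfg, pos);

	if (hdr == 0)
		return 0;

	while (ttl-- > 0) {
		if (EXT_CAP_ID(hdr) == cap_id && pos != start)
			return pos;

		pos = EXT_CAP_NEXT(hdr);
		if (pos < CFG_SPACE_SIZE)
			break;

		hdr = cfg_read32(cfg, pos);
	}

	return 0;
}

/* Same loop shape as the helper in this patch: one hit per DOE mailbox */
static int count_doe_mailboxes(const uint32_t *cfg)
{
	uint16_t pos = find_next_ext_cap(cfg, 0, EXT_CAP_ID_DOE);
	int count = 0;

	while (pos > 0) {
		count++;
		pos = find_next_ext_cap(cfg, pos, EXT_CAP_ID_DOE);
	}

	return count;
}
```

In the real helper each hit creates one auxiliary device rather than
incrementing a counter; the sketch only shows how the capability offsets
are enumerated.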

Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V5:
	Rebased to latest
	Split this off from the CXL specific patch.  This introduces the
	support in PCI and the CXL code can call it in a future patch.
	Remove soft dep from the cxl_pci code because use of the helper
		function ensures that the pci_doe driver is already
		loaded.
	From Jonathan and Bjorn
		Move DOE device creation to the PCI core via
			pci_doe_create_doe_devices() helper
		document need for pci_set_master()
	From Bjorn
		Reword commit message for clarity
		put DOE_DEV_NAME in this patch from the previous
		Remove '__' prefix
	From Jonathan
		remove CXL_ADDRSPACE_* defines

Changes from V4:
	Make this an Auxiliary Driver rather than library functions
	Split this out into its own patch
	Base on the new cxl_dev_state structure

Changes from Ben
	s/CXL_DOE_DEV_NAME/DOE_DEV_NAME/
---
 drivers/pci/doe.c       | 123 ++++++++++++++++++++++++++++++++++++++++
 include/linux/pci-doe.h |   3 +
 2 files changed, 126 insertions(+)

diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
index 4ff54bade8ec..1b2e69774ccf 100644
--- a/drivers/pci/doe.c
+++ b/drivers/pci/doe.c
@@ -383,6 +383,128 @@ static void pci_doe_task_complete(void *private)
 	complete(private);
 }
 
+static void pci_doe_free_irq_vectors(void *data)
+{
+	pci_free_irq_vectors(data);
+}
+
+static DEFINE_IDA(pci_doe_adev_ida);
+
+static void pci_doe_dev_release(struct device *dev)
+{
+	struct auxiliary_device *adev = container_of(dev,
+						struct auxiliary_device,
+						dev);
+	struct pci_doe_dev *doe_dev = container_of(adev, struct pci_doe_dev,
+						   adev);
+
+	ida_free(&pci_doe_adev_ida, adev->id);
+	kfree(doe_dev);
+}
+
+static void pci_doe_destroy_device(void *ad)
+{
+	auxiliary_device_delete(ad);
+	auxiliary_device_uninit(ad);
+}
+
+/**
+ * pci_doe_create_doe_devices - Create auxiliary DOE devices for all DOE
+ *                              mailboxes found
+ * @pdev: The PCI device to scan for DOE mailboxes
+ *
+ * There is no corresponding destroy of these devices.  This function associates
+ * the DOE auxiliary devices created with the pci_dev passed in.  That
+ * association is device managed (devm_*) such that the DOE auxiliary device
+ * lifetime is always greater than or equal to the lifetime of the pci_dev.
+ *
+ * RETURNS: 0 on success, -ERRNO on failure.
+ */
+int pci_doe_create_doe_devices(struct pci_dev *pdev)
+{
+	struct device *dev = &pdev->dev;
+	int irqs, rc;
+	u16 pos = 0;
+
+	/*
+	 * An implementation may support an unknown number of interrupts.
+	 * Assume that number is not that large and request them all.
+	 */
+	irqs = pci_msix_vec_count(pdev);
+	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);
+	if (rc != irqs) {
+		/* No interrupt available - carry on */
+		pci_dbg(pdev, "No interrupts available for DOE\n");
+	} else {
+		/*
+		 * Enabling bus mastering is required for MSI/MSI-X.  It could be
+		 * done later within the DOE initialization, but as it
+		 * potentially has other impacts keep it here when setting up
+		 * the IRQs.
+		 */
+		pci_set_master(pdev);
+		rc = devm_add_action_or_reset(dev,
+					      pci_doe_free_irq_vectors,
+					      pdev);
+		if (rc)
+			return rc;
+	}
+
+	pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
+
+	while (pos > 0) {
+		struct auxiliary_device *adev;
+		struct pci_doe_dev *new_dev;
+		int id;
+
+		new_dev = kzalloc(sizeof(*new_dev), GFP_KERNEL);
+		if (!new_dev)
+			return -ENOMEM;
+
+		new_dev->pdev = pdev;
+		new_dev->cap_offset = pos;
+
+		/* Set up struct auxiliary_device */
+		adev = &new_dev->adev;
+		id = ida_alloc(&pci_doe_adev_ida, GFP_KERNEL);
+		if (id < 0) {
+			kfree(new_dev);
+			return -ENOMEM;
+		}
+
+		adev->id = id;
+		adev->name = DOE_DEV_NAME;
+		adev->dev.release = pci_doe_dev_release;
+		adev->dev.parent = dev;
+
+		if (auxiliary_device_init(adev)) {
+			pci_doe_dev_release(&adev->dev);
+			return -EIO;
+		}
+
+		if (auxiliary_device_add(adev)) {
+			auxiliary_device_uninit(adev);
+			return -EIO;
+		}
+
+		rc = devm_add_action_or_reset(dev, pci_doe_destroy_device, adev);
+		if (rc)
+			return rc;
+
+		if (device_attach(&adev->dev) != 1) {
+			dev_err(&adev->dev,
+				"Failed to attach a driver to DOE device %d\n",
+				adev->id);
+			return -ENODEV;
+		}
+
+		pos = pci_find_next_ext_capability(pdev, pos, PCI_EXT_CAP_ID_DOE);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pci_doe_create_doe_devices);
+
 /**
  * pci_doe_exchange_sync() - Send a request, then wait for and receive a
  *			     response
@@ -639,6 +761,7 @@ static void pci_doe_remove(struct auxiliary_device *aux_dev)
 }
 
 static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = {
+	{.name = "pci_doe.doe", },
 	{},
 };
 
diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
index 2f52b31c6f32..9ae2e96a0211 100644
--- a/include/linux/pci-doe.h
+++ b/include/linux/pci-doe.h
@@ -13,6 +13,8 @@
 #ifndef LINUX_PCI_DOE_H
 #define LINUX_PCI_DOE_H
 
+#define DOE_DEV_NAME "doe"
+
 struct pci_doe_protocol {
 	u16 vid;
 	u8 type;
@@ -53,6 +55,7 @@ struct pci_doe_dev {
 };
 
 /* Library operations */
+int pci_doe_create_doe_devices(struct pci_dev *pdev);
 int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev,
 				 struct pci_doe_exchange *ex);
 bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type);
-- 
2.31.1


* [PATCH V6 05/10] cxl/pci: Create DOE auxiliary devices
  2022-02-01  7:19 [PATCH V6 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (3 preceding siblings ...)
  2022-02-01  7:19 ` [PATCH V6 04/10] PCI/DOE: Introduce pci_doe_create_doe_devices ira.weiny
@ 2022-02-01  7:19 ` ira.weiny
  2022-02-01  7:19 ` [PATCH V6 06/10] cxl/pci: Find the DOE mailbox which supports CDAT ira.weiny
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: ira.weiny @ 2022-02-01  7:19 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron, Bjorn Helgaas
  Cc: Alison Schofield, Vishal Verma, Ira Weiny, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

CXL devices will need DOE mailbox access to read things like CDAT.

Call the PCI core helper to find all DOE mailboxes on the device and
create the auxiliary devices for those mailboxes.

sysfs shows this relationship.  The following output is from a QEMU
system with 2 memory devices, mem0 and mem1.

$ ls -l /sys/bus/cxl/devices/mem*
lrwxrwxrwx 1 root root 0 Jan 25 16:15 /sys/bus/cxl/devices/mem0 -> ../../../devices/pci0000:34/0000:34:00.0/0000:35:00.0/mem0
lrwxrwxrwx 1 root root 0 Jan 25 16:15 /sys/bus/cxl/devices/mem1 -> ../../../devices/pci0000:34/0000:34:01.0/0000:36:00.0/mem1

$ ls -l /sys/bus/auxiliary/devices/
total 0
lrwxrwxrwx 1 root root 0 Jan 25 16:16 pci_doe.doe.0 -> ../../../devices/pci0000:34/0000:34:00.0/0000:35:00.0/pci_doe.doe.0
lrwxrwxrwx 1 root root 0 Jan 25 16:16 pci_doe.doe.1 -> ../../../devices/pci0000:34/0000:34:01.0/0000:36:00.0/pci_doe.doe.1
lrwxrwxrwx 1 root root 0 Jan 25 16:16 pci_doe.doe.2 -> ../../../devices/pci0000:34/0000:34:01.0/0000:36:00.0/pci_doe.doe.2
lrwxrwxrwx 1 root root 0 Jan 25 16:16 pci_doe.doe.3 -> ../../../devices/pci0000:34/0000:34:00.0/0000:35:00.0/pci_doe.doe.3

$ ls -l /sys/bus/auxiliary/drivers
total 0
drwxr-xr-x 2 root root 0 Jan 25 16:15 pci_doe.pci_doe

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V5:
	Split the CXL specific stuff off from the PCI DOE create
	auxiliary device code.
---
 drivers/cxl/Kconfig |  1 +
 drivers/cxl/pci.c   | 13 +++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index b88ab956bb7c..6088456fe0ca 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -16,6 +16,7 @@ if CXL_BUS
 config CXL_PCI
 	tristate "PCI manageability"
 	default CXL_BUS
+	select PCI_DOE_DRIVER
 	help
 	  The CXL specification defines a "CXL memory device" sub-class in the
 	  PCI "memory controller" base class of devices. Device's identified by
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 9252e1f4b18c..d4ae79b62a14 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -8,6 +8,7 @@
 #include <linux/mutex.h>
 #include <linux/list.h>
 #include <linux/pci.h>
+#include <linux/pci-doe.h>
 #include <linux/io.h>
 #include "cxlmem.h"
 #include "cxlpci.h"
@@ -535,6 +536,14 @@ static int cxl_dvsec_ranges(struct cxl_dev_state *cxlds)
 	return rc;
 }
 
+static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
+{
+	struct device *dev = cxlds->dev;
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	return pci_doe_create_doe_devices(pdev);
+}
+
 static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct cxl_register_map map;
@@ -603,6 +612,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
+	rc = cxl_setup_doe_devices(cxlds);
+	if (rc)
+		return rc;
+
 	rc = cxl_dvsec_ranges(cxlds);
 	if (rc)
 		dev_err(&pdev->dev,
-- 
2.31.1


* [PATCH V6 06/10] cxl/pci: Find the DOE mailbox which supports CDAT
  2022-02-01  7:19 [PATCH V6 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (4 preceding siblings ...)
  2022-02-01  7:19 ` [PATCH V6 05/10] cxl/pci: Create DOE auxiliary devices ira.weiny
@ 2022-02-01  7:19 ` ira.weiny
  2022-02-01 18:49   ` Ben Widawsky
  2022-02-01  7:19 ` [PATCH V6 07/10] cxl/mem: Read CDAT table ira.weiny
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: ira.weiny @ 2022-02-01  7:19 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron, Bjorn Helgaas
  Cc: Alison Schofield, Vishal Verma, Ira Weiny, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

Memory devices need the CDAT data from the device.  This data is read
from a DOE mailbox which supports the CDAT protocol.

Search the DOE auxiliary devices for the one which supports the CDAT
protocol.  Cache that device to be used for future queries.
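
As an illustration only, the search described above boils down to a
(vendor ID, protocol type) comparison against each mailbox's list of
supported protocols.  The sketch below is a user-space model: struct
doe_mailbox, mailbox_supports_prot() and find_cdat_mailbox() are
hypothetical stand-ins for the auxiliary device, pci_doe_supports_prot()
and the auxiliary_find_device() match callback, and 0x1e98 is assumed to
be the value of PCI_DVSEC_VENDOR_ID_CXL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VENDOR_ID_CXL			0x1e98	/* assumed PCI_DVSEC_VENDOR_ID_CXL */
#define DOE_PROTOCOL_TABLE_ACCESS	2	/* CXL_DOE_PROTOCOL_TABLE_ACCESS */

/* Mirrors struct pci_doe_protocol from pci-doe.h */
struct doe_protocol {
	uint16_t vid;
	uint8_t type;
};

/* Hypothetical model of one mailbox and its discovered protocols */
struct doe_mailbox {
	int id;
	const struct doe_protocol *prots;
	size_t num_prots;
};

static bool mailbox_supports_prot(const struct doe_mailbox *mb,
				  uint16_t vid, uint8_t type)
{
	for (size_t i = 0; i < mb->num_prots; i++) {
		if (mb->prots[i].vid == vid && mb->prots[i].type == type)
			return true;
	}

	return false;
}

/* Return the first mailbox speaking the CDAT protocol, or NULL */
static const struct doe_mailbox *
find_cdat_mailbox(const struct doe_mailbox *mbs, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (mailbox_supports_prot(&mbs[i], VENDOR_ID_CXL,
					  DOE_PROTOCOL_TABLE_ACCESS))
			return &mbs[i];
	}

	return NULL;
}
```

A NULL result corresponds to cxlds->cdat_doe being left unset, which the
patch treats as "no CDAT available" rather than as a probe failure.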

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/cxl/cxl.h    |  3 +++
 drivers/cxl/cxlmem.h |  2 ++
 drivers/cxl/pci.c    | 43 ++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 962629c5775f..7169101db553 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -90,6 +90,9 @@ static inline int cxl_hdm_decoder_count(u32 cap_hdr)
 #define CXLDEV_MBOX_BG_CMD_STATUS_OFFSET 0x18
 #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
 
+#define CXL_DOE_PROTOCOL_COMPLIANCE 0
+#define CXL_DOE_PROTOCOL_TABLE_ACCESS 2
+
 /*
  * Using struct_group() allows for per register-block-type helper routines,
  * without requiring block-type agnostic code to include the prefix.
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 5d33ce24fe09..0fefe43951e3 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -117,6 +117,7 @@ struct cxl_endpoint_dvsec_info {
  * Currently only memory devices are represented.
  *
  * @dev: The device associated with this CXL state
+ * @cdat_doe: Auxiliary DOE device capable of reading CDAT
  * @regs: Parsed register blocks
  * @cxl_dvsec: Offset to the PCIe device DVSEC
  * @payload_size: Size of space for payload
@@ -149,6 +150,7 @@ struct cxl_endpoint_dvsec_info {
 struct cxl_dev_state {
 	struct device *dev;
 
+	struct pci_doe_dev *cdat_doe;
 	struct cxl_regs regs;
 	int cxl_dvsec;
 
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index d4ae79b62a14..dcc55c4efd85 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -536,12 +536,53 @@ static int cxl_dvsec_ranges(struct cxl_dev_state *cxlds)
 	return rc;
 }
 
+static int cxl_match_cdat_doe_device(struct device *dev, const void *data)
+{
+	const struct cxl_dev_state *cxlds = data;
+	struct auxiliary_device *adev;
+	struct pci_doe_dev *doe_dev;
+
+	/* First determine if this auxiliary device belongs to the cxlds */
+	if (cxlds->dev != dev->parent)
+		return 0;
+
+	adev = to_auxiliary_dev(dev);
+	doe_dev = container_of(adev, struct pci_doe_dev, adev);
+
+	/* If it is one of ours check for the CDAT protocol */
+	if (pci_doe_supports_prot(doe_dev, PCI_DVSEC_VENDOR_ID_CXL,
+				  CXL_DOE_PROTOCOL_TABLE_ACCESS))
+		return 1;
+
+	return 0;
+}
+
 static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
 {
 	struct device *dev = cxlds->dev;
 	struct pci_dev *pdev = to_pci_dev(dev);
+	struct auxiliary_device *adev;
+	int rc;
 
-	return pci_doe_create_doe_devices(pdev);
+	rc = pci_doe_create_doe_devices(pdev);
+	if (rc)
+		return rc;
+
+	adev = auxiliary_find_device(NULL, cxlds, &cxl_match_cdat_doe_device);
+
+	if (adev) {
+		struct pci_doe_dev *doe_dev = container_of(adev,
+							   struct pci_doe_dev,
+							   adev);
+
+		/*
+		 * No reference need be taken.  The DOE device lifetime is
+		 * longer than the CXL device state lifetime
+		 */
+		cxlds->cdat_doe = doe_dev;
+	}
+
+	return 0;
 }
 
 static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
-- 
2.31.1


* [PATCH V6 07/10] cxl/mem: Read CDAT table
  2022-02-01  7:19 [PATCH V6 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (5 preceding siblings ...)
  2022-02-01  7:19 ` [PATCH V6 06/10] cxl/pci: Find the DOE mailbox which supports CDAT ira.weiny
@ 2022-02-01  7:19 ` ira.weiny
  2022-02-04 13:46   ` Jonathan Cameron
  2022-02-01  7:19 ` [PATCH V6 08/10] cxl/cdat: Introduce cdat_hdr_valid() ira.weiny
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: ira.weiny @ 2022-02-01  7:19 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron, Bjorn Helgaas
  Cc: Alison Schofield, Vishal Verma, Ira Weiny, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

The OS will need CDAT data from the CXL devices to properly set up
interleave sets.

Search the DOE driver/devices attached to the CXL device for one which
supports the CDAT protocol.  If found, read the CDAT data from that
mailbox.
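
As an illustration only, the request dword sent for each read can be
modeled in user space as below.  The masks are copied from cdat.h in this
patch; field_prep() and field_get() are simplified, hypothetical
replacements for the kernel's FIELD_PREP()/FIELD_GET(), and
cdat_doe_req() mirrors the CDAT_DOE_REQ() macro.

```c
#include <assert.h>
#include <stdint.h>

/* Masks copied from cdat.h in this patch */
#define TABLE_ACCESS_REQ_CODE		0x000000ffu
#define   TABLE_ACCESS_REQ_CODE_READ	0
#define TABLE_ACCESS_TABLE_TYPE		0x0000ff00u
#define   TABLE_ACCESS_TABLE_TYPE_CDATA	0
#define TABLE_ACCESS_ENTRY_HANDLE	0xffff0000u

#define CDAT_STRUCTURE_DW0_TYPE		0x000000ffu
#define CDAT_STRUCTURE_DW0_LENGTH	0xffff0000u

/* Bit position of the lowest set bit in @mask */
static unsigned int mask_shift(uint32_t mask)
{
	unsigned int shift = 0;

	while (mask && !(mask & 1)) {
		mask >>= 1;
		shift++;
	}

	return shift;
}

/* Place @val into the field described by @mask */
static uint32_t field_prep(uint32_t mask, uint32_t val)
{
	return (val << mask_shift(mask)) & mask;
}

/* Extract the field described by @mask from @reg */
static uint32_t field_get(uint32_t mask, uint32_t reg)
{
	return (reg & mask) >> mask_shift(mask);
}

/* Build a "read CDAT entry" request for @entry_handle */
static uint32_t cdat_doe_req(uint16_t entry_handle)
{
	return field_prep(TABLE_ACCESS_REQ_CODE, TABLE_ACCESS_REQ_CODE_READ) |
	       field_prep(TABLE_ACCESS_TABLE_TYPE,
			  TABLE_ACCESS_TABLE_TYPE_CDATA) |
	       field_prep(TABLE_ACCESS_ENTRY_HANDLE, entry_handle);
}
```

Because the read request code and the CDAT table type are both 0, only
the entry handle contributes bits: handle 0 (the CDAT header) encodes as
0x00000000.  The same helpers decode the common first dword of a returned
structure, for example field_get(CDAT_STRUCTURE_DW0_LENGTH, dw0).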

Currently this is only supported by a PCI CXL object through a DOE
mailbox which supports CDAT, but any cxl_mem type object can provide
this data later if need be, for example for testing.

Cache this data for later parsing.  Provide a sysfs binary attribute to
allow dumping of the CDAT.

Binary dumping is modeled on /sys/firmware/acpi/tables/.

The ability to dump this table will be very useful for emulating real
devices once they become available, as QEMU CXL type 3 device emulation
will be able to load this file.

This does not support table updates at runtime. It will always provide
whatever was there when first cached. Handling of table updates can be
implemented later.

Once there are more users, this code can move out to drivers/cxl/cdat.c
or similar.

Finally create a complete list of DOE defines within cdat.h for anyone
wishing to decode the CDAT table.

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

---
Changes from V5:
	Add proper guards around cdat.h
	Split out finding the CDAT DOE mailbox
	Use cxl_cdat to group CDAT data together
	Adjust to use auxiliary_find_device() to find the DOE device
		which supplies the CDAT protocol.
	Rebased to latest
	Remove dev_dbg(length)
	Remove unneeded DOE Table access defines
	Move CXL_DOE_PROTOCOL_TABLE_ACCESS define into this patch where
		it is used

Changes from V4:
	Split this into its own patch
	Rearchitect this such that the memdev driver calls into the DOE
	driver via the cxl_mem state object.  This allows CDAT data to
	come from any type of cxl_mem object not just PCI DOE.
	Rebase on new struct cxl_dev_state
---
 drivers/cxl/cdat.h        | 97 +++++++++++++++++++++++++++++++++++++++
 drivers/cxl/core/memdev.c | 56 ++++++++++++++++++++++
 drivers/cxl/cxlmem.h      | 25 ++++++++++
 drivers/cxl/pci.c         | 87 +++++++++++++++++++++++++++++++++++
 4 files changed, 265 insertions(+)
 create mode 100644 drivers/cxl/cdat.h

diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
new file mode 100644
index 000000000000..4722b6bbbaf0
--- /dev/null
+++ b/drivers/cxl/cdat.h
@@ -0,0 +1,97 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __CXL_CDAT_H__
+#define __CXL_CDAT_H__
+
+/*
+ * Coherent Device Attribute table (CDAT)
+ *
+ * Specification available from UEFI.org
+ *
+ * Whilst CDAT is defined as a single table, the access via DOE mailboxes is
+ * done one entry at a time, where the first entry is the header.
+ */
+
+#define CXL_DOE_TABLE_ACCESS_REQ_CODE		0x000000ff
+#define   CXL_DOE_TABLE_ACCESS_REQ_CODE_READ	0
+#define CXL_DOE_TABLE_ACCESS_TABLE_TYPE		0x0000ff00
+#define   CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA	0
+#define CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE	0xffff0000
+
+/*
+ * CDAT entries are little endian and are read from PCI config space which
+ * is also little endian.
+ * As such, on a big endian system these will have been reversed.
+ * This prevents us from making easy use of packed structures.
+ * Style from pci_regs.h
+ */
+
+#define CDAT_HEADER_LENGTH_DW 4
+#define CDAT_HEADER_LENGTH_BYTES (CDAT_HEADER_LENGTH_DW * sizeof(u32))
+#define CDAT_HEADER_DW0_LENGTH		0xffffffff
+#define CDAT_HEADER_DW1_REVISION	0x000000ff
+#define CDAT_HEADER_DW1_CHECKSUM	0x0000ff00
+/* CDAT_HEADER_DW2_RESERVED	*/
+#define CDAT_HEADER_DW3_SEQUENCE	0xffffffff
+
+/* All structures have a common first DW */
+#define CDAT_STRUCTURE_DW0_TYPE		0x000000ff
+#define   CDAT_STRUCTURE_DW0_TYPE_DSMAS 0
+#define   CDAT_STRUCTURE_DW0_TYPE_DSLBIS 1
+#define   CDAT_STRUCTURE_DW0_TYPE_DSMSCIS 2
+#define   CDAT_STRUCTURE_DW0_TYPE_DSIS 3
+#define   CDAT_STRUCTURE_DW0_TYPE_DSEMTS 4
+#define   CDAT_STRUCTURE_DW0_TYPE_SSLBIS 5
+
+#define CDAT_STRUCTURE_DW0_LENGTH	0xffff0000
+
+/* Device Scoped Memory Affinity Structure */
+#define CDAT_DSMAS_DW1_DSMAD_HANDLE	0x000000ff
+#define CDAT_DSMAS_DW1_FLAGS		0x0000ff00
+#define CDAT_DSMAS_DPA_OFFSET(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSMAS_DPA_LEN(entry) ((u64)((entry)[5]) << 32 | (entry)[4])
+#define CDAT_DSMAS_NON_VOLATILE(flags)  ((flags & 0x04) >> 2)
+
+/* Device Scoped Latency and Bandwidth Information Structure */
+#define CDAT_DSLBIS_DW1_HANDLE		0x000000ff
+#define CDAT_DSLBIS_DW1_FLAGS		0x0000ff00
+#define CDAT_DSLBIS_DW1_DATA_TYPE	0x00ff0000
+#define CDAT_DSLBIS_BASE_UNIT(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSLBIS_DW4_ENTRY_0		0x0000ffff
+#define CDAT_DSLBIS_DW4_ENTRY_1		0xffff0000
+#define CDAT_DSLBIS_DW5_ENTRY_2		0x0000ffff
+
+/* Device Scoped Memory Side Cache Information Structure */
+#define CDAT_DSMSCIS_DW1_HANDLE		0x000000ff
+#define CDAT_DSMSCIS_MEMORY_SIDE_CACHE_SIZE(entry) \
+	((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSMSCIS_DW4_MEMORY_SIDE_CACHE_ATTRS 0xffffffff
+
+/* Device Scoped Initiator Structure */
+#define CDAT_DSIS_DW1_FLAGS		0x000000ff
+#define CDAT_DSIS_DW1_HANDLE		0x0000ff00
+
+/* Device Scoped EFI Memory Type Structure */
+#define CDAT_DSEMTS_DW1_HANDLE		0x000000ff
+#define CDAT_DSEMTS_DW1_EFI_MEMORY_TYPE_ATTR	0x0000ff00
+#define CDAT_DSEMTS_DPA_OFFSET(entry)	((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSEMTS_DPA_LENGTH(entry)	((u64)((entry)[5]) << 32 | (entry)[4])
+
+/* Switch Scoped Latency and Bandwidth Information Structure */
+#define CDAT_SSLBIS_DW1_DATA_TYPE	0x000000ff
+#define CDAT_SSLBIS_BASE_UNIT(entry)	((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_SSLBIS_ENTRY_PORT_X(entry, i) ((entry)[4 + (i) * 2] & 0x0000ffff)
+#define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
+#define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
+
+/**
+ * struct cxl_cdat - CXL CDAT data
+ *
+ * @table: cache of CDAT table
+ * @length: length of cached CDAT table
+ */
+struct cxl_cdat {
+	void *table;
+	size_t length;
+};
+
+#endif /* !__CXL_CDAT_H__ */
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index ee0156419d06..a01068e98333 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -86,6 +86,35 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
 	return sysfs_emit(buf, "%#llx\n", len);
 }
 
+static ssize_t CDAT_read(struct file *filp, struct kobject *kobj,
+			 struct bin_attribute *bin_attr, char *buf,
+			 loff_t offset, size_t count)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+
+	if (!cxlmd->cdat.table)
+		return 0;
+
+	return memory_read_from_buffer(buf, count, &offset,
+				       cxlmd->cdat.table,
+				       cxlmd->cdat.length);
+}
+
+static BIN_ATTR_RO(CDAT, 0);
+
+static umode_t cxl_memdev_bin_attr_is_visible(struct kobject *kobj,
+					      struct bin_attribute *attr, int i)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+
+	if ((attr == &bin_attr_CDAT) && cxlmd->cdat.table)
+		return 0400;
+
+	return 0;
+}
+
 static struct device_attribute dev_attr_pmem_size =
 	__ATTR(size, 0444, pmem_size_show, NULL);
 
@@ -115,6 +144,11 @@ static struct attribute *cxl_memdev_attributes[] = {
 	NULL,
 };
 
+static struct bin_attribute *cxl_memdev_bin_attributes[] = {
+	&bin_attr_CDAT,
+	NULL,
+};
+
 static struct attribute *cxl_memdev_pmem_attributes[] = {
 	&dev_attr_pmem_size.attr,
 	NULL,
@@ -136,6 +170,8 @@ static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
 static struct attribute_group cxl_memdev_attribute_group = {
 	.attrs = cxl_memdev_attributes,
 	.is_visible = cxl_memdev_visible,
+	.bin_attrs = cxl_memdev_bin_attributes,
+	.is_bin_visible = cxl_memdev_bin_attr_is_visible,
 };
 
 static struct attribute_group cxl_memdev_ram_attribute_group = {
@@ -320,6 +356,21 @@ static const struct file_operations cxl_memdev_fops = {
 	.llseek = noop_llseek,
 };
 
+static int read_cdat_data(struct cxl_memdev *cxlmd, struct cxl_dev_state *cxlds)
+{
+	struct device *dev = &cxlmd->dev;
+	size_t cdat_length;
+
+	if (cxl_mem_cdat_get_length(cxlds, &cdat_length))
+		return 0;
+
+	cxlmd->cdat.table = devm_kzalloc(dev, cdat_length, GFP_KERNEL);
+	if (!cxlmd->cdat.table)
+		return -ENOMEM;
+	cxlmd->cdat.length = cdat_length;
+	return cxl_mem_cdat_read_table(cxlds, &cxlmd->cdat);
+}
+
 struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
 {
 	struct cxl_memdev *cxlmd;
@@ -336,6 +387,11 @@ struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
 	if (rc)
 		goto err;
 
+	/* Cache the data early to ensure is_visible() works */
+	rc = read_cdat_data(cxlmd, cxlds);
+	if (rc)
+		goto err;
+
 	/*
 	 * Activate ioctl operations, no cxl_memdev_rwsem manipulation
 	 * needed as this is ordered with cdev_add() publishing the device.
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 0fefe43951e3..15c653b20f37 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -5,6 +5,7 @@
 #include <uapi/linux/cxl_mem.h>
 #include <linux/cdev.h>
 #include "cxl.h"
+#include "cdat.h"
 
 /* CXL 2.0 8.2.8.5.1.1 Memory Device Status Register */
 #define CXLMDEV_STATUS_OFFSET 0x0
@@ -41,6 +42,7 @@ struct cxl_memdev {
 	struct device dev;
 	struct cdev cdev;
 	struct cxl_dev_state *cxlds;
+	struct cxl_cdat cdat;
 	struct work_struct detach_work;
 	int id;
 };
@@ -143,6 +145,10 @@ struct cxl_endpoint_dvsec_info {
  * @serial: PCIe Device Serial Number
  * @mbox_send: @dev specific transport for transmitting mailbox commands
  * @wait_media_ready: @dev specific method to await media ready
+ * @cdat_get_length: @dev specific function for reading the CDAT table length
+ *                   returns -errno if CDAT not supported on this device
+ * @cdat_read_table: @dev specific function for reading the table
+ *                   returns -errno if CDAT not supported on this device
  *
  * See section 8.2.9.5.2 Capacity Configuration and Label Storage for
  * details on capacity parameters.
@@ -179,6 +185,9 @@ struct cxl_dev_state {
 
 	int (*mbox_send)(struct cxl_dev_state *cxlds, struct cxl_mbox_cmd *cmd);
 	int (*wait_media_ready)(struct cxl_dev_state *cxlds);
+	int (*cdat_get_length)(struct cxl_dev_state *cxlds, size_t *length);
+	int (*cdat_read_table)(struct cxl_dev_state *cxlds,
+			       struct cxl_cdat *cdat);
 };
 
 enum cxl_opcode {
@@ -305,4 +314,20 @@ struct cxl_hdm {
 	unsigned int interleave_mask;
 	struct cxl_port *port;
 };
+
+static inline int cxl_mem_cdat_get_length(struct cxl_dev_state *cxlds, size_t *length)
+{
+	if (cxlds->cdat_get_length)
+		return cxlds->cdat_get_length(cxlds, length);
+	return -EOPNOTSUPP;
+}
+
+static inline int cxl_mem_cdat_read_table(struct cxl_dev_state *cxlds,
+					  struct cxl_cdat *cdat)
+{
+	if (cxlds->cdat_read_table)
+		return cxlds->cdat_read_table(cxlds, cdat);
+	return -EOPNOTSUPP;
+}
+
 #endif /* __CXL_MEM_H__ */
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index dcc55c4efd85..28b973a9e29e 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -13,6 +13,7 @@
 #include "cxlmem.h"
 #include "cxlpci.h"
 #include "cxl.h"
+#include "cdat.h"
 
 /**
  * DOC: cxl pci
@@ -585,6 +586,90 @@ static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
 	return 0;
 }
 
+#define CDAT_DOE_REQ(entry_handle)					\
+	(FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE,			\
+		    CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) |		\
+	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_TABLE_TYPE,			\
+		    CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA) |		\
+	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE, (entry_handle)))
+
+static int cxl_cdat_get_length(struct cxl_dev_state *cxlds, size_t *length)
+{
+	struct pci_doe_dev *doe_dev = cxlds->cdat_doe;
+	u32 cdat_request_pl = CDAT_DOE_REQ(0);
+	u32 cdat_response_pl[32];
+	struct pci_doe_exchange ex = {
+		.prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
+		.prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
+		.request_pl = &cdat_request_pl,
+		.request_pl_sz = sizeof(cdat_request_pl),
+		.response_pl = cdat_response_pl,
+		.response_pl_sz = sizeof(cdat_response_pl),
+	};
+
+	ssize_t rc;
+
+	rc = pci_doe_exchange_sync(doe_dev, &ex);
+	if (rc < 0)
+		return rc;
+	if (rc < 1)
+		return -EIO;
+
+	*length = cdat_response_pl[1];
+	return 0;
+}
+
+static int cxl_cdat_read_table(struct cxl_dev_state *cxlds,
+			       struct cxl_cdat *cdat)
+{
+	struct pci_doe_dev *doe_dev = cxlds->cdat_doe;
+	size_t length = cdat->length;
+	u32 *data = cdat->table;
+	int entry_handle = 0;
+	int rc;
+
+	do {
+		u32 cdat_request_pl = CDAT_DOE_REQ(entry_handle);
+		u32 cdat_response_pl[32];
+		struct pci_doe_exchange ex = {
+			.prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
+			.prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
+			.request_pl = &cdat_request_pl,
+			.request_pl_sz = sizeof(cdat_request_pl),
+			.response_pl = cdat_response_pl,
+			.response_pl_sz = sizeof(cdat_response_pl),
+		};
+		size_t entry_dw;
+		u32 *entry;
+
+		rc = pci_doe_exchange_sync(doe_dev, &ex);
+		if (rc < 0)
+			return rc;
+
+		entry = cdat_response_pl + 1;
+		entry_dw = rc / sizeof(u32);
+		/* Skip Header */
+		entry_dw -= 1;
+		entry_dw = min(length / 4, entry_dw);
+		memcpy(data, entry, entry_dw * sizeof(u32));
+		length -= entry_dw * sizeof(u32);
+		data += entry_dw;
+		entry_handle = FIELD_GET(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE, cdat_response_pl[0]);
+
+	} while (entry_handle != 0xFFFF);
+
+	return 0;
+}
+
+static void cxl_initialize_cdat_callbacks(struct cxl_dev_state *cxlds)
+{
+	if (!cxlds->cdat_doe)
+		return;
+
+	cxlds->cdat_get_length = cxl_cdat_get_length;
+	cxlds->cdat_read_table = cxl_cdat_read_table;
+}
+
 static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct cxl_register_map map;
@@ -657,6 +742,8 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
+	cxl_initialize_cdat_callbacks(cxlds);
+
 	rc = cxl_dvsec_ranges(cxlds);
 	if (rc)
 		dev_err(&pdev->dev,
-- 
2.31.1



* [PATCH V6 08/10] cxl/cdat: Introduce cdat_hdr_valid()
  2022-02-01  7:19 [PATCH V6 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (6 preceding siblings ...)
  2022-02-01  7:19 ` [PATCH V6 07/10] cxl/mem: Read CDAT table ira.weiny
@ 2022-02-01  7:19 ` ira.weiny
  2022-02-01 18:56   ` Ben Widawsky
  2022-02-01  7:19 ` [PATCH V6 09/10] cxl/mem: Retry reading CDAT on failure ira.weiny
  2022-02-01  7:19 ` [PATCH V6 10/10] cxl/cdat: Parse out DSMAS data from CDAT table ira.weiny
  9 siblings, 1 reply; 49+ messages in thread
From: ira.weiny @ 2022-02-01  7:19 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron, Bjorn Helgaas
  Cc: Alison Schofield, Vishal Verma, Ira Weiny, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

The CDAT data is protected by a checksum which should be checked when
the CDAT is read to ensure it is valid.  In addition the lengths
specified should be checked.

Introduce cdat_hdr_valid() to check the checksum.  While at it check and
store the sequence number.
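
As a rough illustration of the check described above, here is a userspace
sketch that validates a CDAT-style buffer: the reported length must cover
the header and fit in the buffer, and all bytes must sum to zero modulo
256.  The 16-byte header, the little-endian length in the first four bytes
and the name cdat_buf_valid() are assumptions made for this example; it is
not the code added by this patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CDAT_HDR_BYTES 16u	/* assumed header size for this sketch */

/* Read the 32-bit little-endian table length from the first header bytes. */
static uint32_t cdat_hdr_length(const uint8_t *buf)
{
	return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
	       ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
}

/*
 * Return true when the reported length fits inside the buffer and all
 * bytes of the table sum to zero modulo 256.
 */
static bool cdat_buf_valid(const uint8_t *buf, size_t buf_len)
{
	uint32_t length, i;
	uint8_t sum = 0;

	/* The header itself must be present before it can be parsed. */
	if (buf_len < CDAT_HDR_BYTES)
		return false;

	length = cdat_hdr_length(buf);
	if (length < CDAT_HDR_BYTES || length > buf_len)
		return false;

	/* uint8_t arithmetic wraps, which gives the modulo-256 sum. */
	for (i = 0; i < length; i++)
		sum += buf[i];

	return sum == 0;
}
```

Checking the buffer size before touching the header keeps the sketch safe
on a short read; whether the driver needs the same guard depends on how
the cached length is established.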

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V5
	New patch, split out
	Update cdat_hdr_valid()
		Remove revision and cs field parsing
			There is no point in these
		Add seq check and debug print.
---
 drivers/cxl/cdat.h |  2 ++
 drivers/cxl/pci.c  | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
index 4722b6bbbaf0..a7725d26f2d2 100644
--- a/drivers/cxl/cdat.h
+++ b/drivers/cxl/cdat.h
@@ -88,10 +88,12 @@
  *
  * @table: cache of CDAT table
  * @length: length of cached CDAT table
+ * @seq: Last read Sequence number of the CDAT table
  */
 struct cxl_cdat {
 	void *table;
 	size_t length;
+	u32 seq;
 };
 
 #endif /* !__CXL_CDAT_H__ */
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 28b973a9e29e..c362c75feed2 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -586,6 +586,35 @@ static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
 	return 0;
 }
 
+static bool cxl_cdat_hdr_valid(struct device *dev, struct cxl_cdat *cdat)
+{
+	u32 *table = cdat->table;
+	u8 *data8 = cdat->table;
+	u32 length, seq;
+	u8 check;
+	int i;
+
+	length = FIELD_GET(CDAT_HEADER_DW0_LENGTH, table[0]);
+	if (length < CDAT_HEADER_LENGTH_BYTES)
+		return false;
+
+	if (length > cdat->length)
+		return false;
+
+	seq = FIELD_GET(CDAT_HEADER_DW3_SEQUENCE, table[3]);
+
+	/* Store the sequence for now. */
+	if (cdat->seq != seq) {
+		dev_info(dev, "CDAT seq change %x -> %x\n", cdat->seq, seq);
+		cdat->seq = seq;
+	}
+
+	for (check = 0, i = 0; i < length; i++)
+		check += data8[i];
+
+	return check == 0;
+}
+
 #define CDAT_DOE_REQ(entry_handle)					\
 	(FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE,			\
 		    CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) |		\
@@ -658,6 +687,9 @@ static int cxl_cdat_read_table(struct cxl_dev_state *cxlds,
 
 	} while (entry_handle != 0xFFFF);
 
+	if (!cxl_cdat_hdr_valid(cxlds->dev, cdat))
+		return -EIO;
+
 	return 0;
 }
 
-- 
2.31.1



* [PATCH V6 09/10] cxl/mem: Retry reading CDAT on failure
  2022-02-01  7:19 [PATCH V6 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (7 preceding siblings ...)
  2022-02-01  7:19 ` [PATCH V6 08/10] cxl/cdat: Introduce cdat_hdr_valid() ira.weiny
@ 2022-02-01  7:19 ` ira.weiny
  2022-02-01 18:59   ` Ben Widawsky
  2022-02-01  7:19 ` [PATCH V6 10/10] cxl/cdat: Parse out DSMAS data from CDAT table ira.weiny
  9 siblings, 1 reply; 49+ messages in thread
From: ira.weiny @ 2022-02-01  7:19 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron, Bjorn Helgaas
  Cc: Alison Schofield, Vishal Verma, Ira Weiny, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

The CDAT read may fail for a number of reasons, but mainly because a
multi-part read can return pieces of different valid states.  The
checksum in the CDAT table protects against this.

Now that the checksum is validated, issue a retry if the CDAT read fails.
For now 2 retries are implemented.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
NOTE: Is 2 enough?  Should this just be delayed until the time when the
data is actually needed and not there?

Changes from V5:
	New patch -- easy to push off or drop.
---
 drivers/cxl/core/memdev.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index a01068e98333..11d721c56f08 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -356,7 +356,8 @@ static const struct file_operations cxl_memdev_fops = {
 	.llseek = noop_llseek,
 };
 
-static int read_cdat_data(struct cxl_memdev *cxlmd, struct cxl_dev_state *cxlds)
+static int __read_cdat_data(struct cxl_memdev *cxlmd,
+			    struct cxl_dev_state *cxlds)
 {
 	struct device *dev = &cxlmd->dev;
 	size_t cdat_length;
@@ -371,6 +372,20 @@ static int read_cdat_data(struct cxl_memdev *cxlmd, struct cxl_dev_state *cxlds)
 	return cxl_mem_cdat_read_table(cxlds, &cxlmd->cdat);
 }
 
+static int read_cdat_data(struct cxl_memdev *cxlmd,
+			  struct cxl_dev_state *cxlds)
+{
+	int retries = 2;
+	int rc;
+
+	while (--retries) {
+		rc = __read_cdat_data(cxlmd, cxlds);
+		if (!rc)
+			break;
+	}
+	return rc;
+}
+
 struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
 {
 	struct cxl_memdev *cxlmd;
-- 
2.31.1



* [PATCH V6 10/10] cxl/cdat: Parse out DSMAS data from CDAT table
  2022-02-01  7:19 [PATCH V6 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
                   ` (8 preceding siblings ...)
  2022-02-01  7:19 ` [PATCH V6 09/10] cxl/mem: Retry reading CDAT on failure ira.weiny
@ 2022-02-01  7:19 ` ira.weiny
  2022-02-01 19:05   ` Ben Widawsky
  2022-02-04 13:40   ` Jonathan Cameron
  9 siblings, 2 replies; 49+ messages in thread
From: ira.weiny @ 2022-02-01  7:19 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron, Bjorn Helgaas
  Cc: Alison Schofield, Vishal Verma, Ira Weiny, Ben Widawsky,
	linux-kernel, linux-cxl, linux-pci

From: Ira Weiny <ira.weiny@intel.com>

CXL memory devices need the information in the Device Scoped Memory
Affinity Structure (DSMAS).  This information is contained within the
CDAT table buffer which is already read and cached.

Parse and cache DSMAS data from the CDAT table.  Store this data in
unmarshaled struct cxl_dsmas data structures for ease of use.
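
The record walk can be sketched in userspace roughly as follows.  The
layout constants (16-byte table header, 4-byte record header with type in
byte 0 and a little-endian length in bytes 2-3, 24-byte DSMAS records with
flags in byte 5, DPA base at byte 8 and DPA length at byte 16, non-volatile
as flag bit 2) and the names dsmas_entry and parse_dsmas_buf() are
assumptions for this illustration, not definitions taken from cdat.h.  The
sketch also rejects a record whose length is smaller than its own header,
so a zero length cannot stall the walk.

```c
#include <stddef.h>
#include <stdint.h>

#define CDAT_HDR_BYTES   16u	/* assumed table header size */
#define CDAT_TYPE_DSMAS  0u	/* assumed DSMAS structure type */
#define DSMAS_REC_BYTES  24u	/* assumed DSMAS record size */
#define DSMAS_FLAG_NV    0x04u	/* assumed non-volatile flag bit */

struct dsmas_entry {
	uint64_t dpa_base;
	uint64_t dpa_length;
	int non_volatile;
};

/* Assemble an nbytes-wide little-endian value starting at p. */
static uint64_t get_le(const uint8_t *p, unsigned int nbytes)
{
	uint64_t v = 0;

	while (nbytes--)
		v = (v << 8) | p[nbytes];
	return v;
}

/*
 * Walk the records that follow the header and copy out DSMAS entries.
 * Returns the number of entries stored, or -1 on a malformed record.
 */
static int parse_dsmas_buf(const uint8_t *buf, size_t len,
			   struct dsmas_entry *out, int max_out)
{
	size_t off = CDAT_HDR_BYTES;
	int n = 0;

	while (off + 4 <= len) {
		uint8_t type = buf[off];
		size_t rec_len = (size_t)get_le(buf + off + 2, 2);

		/* Shorter than its own header, or past the end: malformed. */
		if (rec_len < 4 || rec_len > len - off)
			return -1;

		if (type == CDAT_TYPE_DSMAS) {
			if (rec_len < DSMAS_REC_BYTES)
				return -1;
			if (n < max_out) {
				out[n].dpa_base = get_le(buf + off + 8, 8);
				out[n].dpa_length = get_le(buf + off + 16, 8);
				out[n].non_volatile =
					!!(buf[off + 5] & DSMAS_FLAG_NV);
				n++;
			}
		}
		off += rec_len;
	}
	return n;
}
```

Records of any other type are skipped by their length field alone, which
is why validating that field matters more than validating the type.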

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V5
	Fix up sparse warnings
	Split out cdat_hdr_valid()
	Update cdat_hdr_valid()
		Remove revision and cs field parsing
			There is no point in these
		Add seq check and debug print.
	From Jonathan
		Add spaces around '+' and '/'
		use devm_krealloc() for dsmas_ary

Changes from V4
	New patch
---
 drivers/cxl/cdat.h        | 21 ++++++++++++
 drivers/cxl/core/memdev.c | 70 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+)

diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
index a7725d26f2d2..f8c126190d18 100644
--- a/drivers/cxl/cdat.h
+++ b/drivers/cxl/cdat.h
@@ -83,17 +83,38 @@
 #define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
 #define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
 
+/**
+ * struct cxl_dsmas - host unmarshaled version of DSMAS data
+ *
+ * As defined in the Coherent Device Attribute Table (CDAT) specification this
+ * represents a single DSMAS entry in that table.
+ *
+ * @dpa_base: The lowest DPA address associated with this DSMAD
+ * @dpa_length: Length in bytes of this DSMAD
+ * @non_volatile: If set, the memory region represents Non-Volatile memory
+ */
+struct cxl_dsmas {
+	u64 dpa_base;
+	u64 dpa_length;
+	/* Flags */
+	u8 non_volatile:1;
+};
+
 /**
  * struct cxl_cdat - CXL CDAT data
  *
  * @table: cache of CDAT table
  * @length: length of cached CDAT table
  * @seq: Last read Sequence number of the CDAT table
+ * @dsmas_ary: Array of DSMAS entries as parsed from the CDAT table
+ * @nr_dsmas: Number of entries in dsmas_ary
  */
 struct cxl_cdat {
 	void *table;
 	size_t length;
 	u32 seq;
+	struct cxl_dsmas *dsmas_ary;
+	int nr_dsmas;
 };
 
 #endif /* !__CXL_CDAT_H__ */
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 11d721c56f08..32342a15e991 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -6,6 +6,7 @@
 #include <linux/idr.h>
 #include <linux/pci.h>
 #include <cxlmem.h>
+#include "cdat.h"
 #include "core.h"
 
 static DECLARE_RWSEM(cxl_memdev_rwsem);
@@ -386,6 +387,71 @@ static int read_cdat_data(struct cxl_memdev *cxlmd,
 	return rc;
 }
 
+static int parse_dsmas(struct cxl_memdev *cxlmd)
+{
+	struct cxl_dsmas *dsmas_ary = NULL;
+	u32 *data = cxlmd->cdat.table;
+	int bytes_left = cxlmd->cdat.length;
+	int nr_dsmas = 0;
+
+	if (!data)
+		return -ENXIO;
+
+	/* Skip header */
+	data += CDAT_HEADER_LENGTH_DW;
+	bytes_left -= CDAT_HEADER_LENGTH_BYTES;
+
+	while (bytes_left > 0) {
+		u32 *cur_rec = data;
+		u8 type = FIELD_GET(CDAT_STRUCTURE_DW0_TYPE, cur_rec[0]);
+		u16 length = FIELD_GET(CDAT_STRUCTURE_DW0_LENGTH, cur_rec[0]);
+
+		if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {
+			struct cxl_dsmas *new_ary;
+			u8 flags;
+
+			new_ary = devm_krealloc(&cxlmd->dev, dsmas_ary,
+					   sizeof(*dsmas_ary) * (nr_dsmas + 1),
+					   GFP_KERNEL);
+			if (!new_ary) {
+				dev_err(&cxlmd->dev,
+					"Failed to allocate memory for DSMAS data\n");
+				return -ENOMEM;
+			}
+			dsmas_ary = new_ary;
+
+			flags = FIELD_GET(CDAT_DSMAS_DW1_FLAGS, cur_rec[1]);
+
+			dsmas_ary[nr_dsmas].dpa_base = CDAT_DSMAS_DPA_OFFSET(cur_rec);
+			dsmas_ary[nr_dsmas].dpa_length = CDAT_DSMAS_DPA_LEN(cur_rec);
+			dsmas_ary[nr_dsmas].non_volatile = CDAT_DSMAS_NON_VOLATILE(flags);
+
+			dev_dbg(&cxlmd->dev, "DSMAS %d: %llx:%llx %s\n",
+				nr_dsmas,
+				dsmas_ary[nr_dsmas].dpa_base,
+				dsmas_ary[nr_dsmas].dpa_base +
+					dsmas_ary[nr_dsmas].dpa_length,
+				(dsmas_ary[nr_dsmas].non_volatile ?
+					"Persistent" : "Volatile")
+				);
+
+			nr_dsmas++;
+		}
+
+		data += (length / sizeof(u32));
+		bytes_left -= length;
+	}
+
+	if (nr_dsmas == 0)
+		return -ENXIO;
+
+	dev_dbg(&cxlmd->dev, "Found %d DSMAS entries\n", nr_dsmas);
+	cxlmd->cdat.dsmas_ary = dsmas_ary;
+	cxlmd->cdat.nr_dsmas = nr_dsmas;
+
+	return 0;
+}
+
 struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
 {
 	struct cxl_memdev *cxlmd;
@@ -407,6 +473,10 @@ struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
 	if (rc)
 		goto err;
 
+	rc = parse_dsmas(cxlmd);
+	if (rc)
+		dev_warn(dev, "No DSMAS data found: %d\n", rc);
+
 	/*
 	 * Activate ioctl operations, no cxl_memdev_rwsem manipulation
 	 * needed as this is ordered with cdev_add() publishing the device.
-- 
2.31.1



* Re: [PATCH V6 06/10] cxl/pci: Find the DOE mailbox which supports CDAT
  2022-02-01  7:19 ` [PATCH V6 06/10] cxl/pci: Find the DOE mailbox which supports CDAT ira.weiny
@ 2022-02-01 18:49   ` Ben Widawsky
  2022-02-01 22:18     ` Ira Weiny
  0 siblings, 1 reply; 49+ messages in thread
From: Ben Widawsky @ 2022-02-01 18:49 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, linux-kernel, linux-cxl, linux-pci

On 22-01-31 23:19:48, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Memory devices need the CDAT data from the device.  This data is read
> from a DOE mailbox which supports the CDAT protocol.
> 
> Search the DOE auxiliary devices for the one which supports the CDAT
> protocol.  Cache that device to be used for future queries.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  drivers/cxl/cxl.h    |  3 +++
>  drivers/cxl/cxlmem.h |  2 ++
>  drivers/cxl/pci.c    | 43 ++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 962629c5775f..7169101db553 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -90,6 +90,9 @@ static inline int cxl_hdm_decoder_count(u32 cap_hdr)
>  #define CXLDEV_MBOX_BG_CMD_STATUS_OFFSET 0x18
>  #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
>  
> +#define CXL_DOE_PROTOCOL_COMPLIANCE 0
> +#define CXL_DOE_PROTOCOL_TABLE_ACCESS 2
> +
>  /*
>   * Using struct_group() allows for per register-block-type helper routines,
>   * without requiring block-type agnostic code to include the prefix.
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 5d33ce24fe09..0fefe43951e3 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -117,6 +117,7 @@ struct cxl_endpoint_dvsec_info {
>   * Currently only memory devices are represented.
>   *
>   * @dev: The device associated with this CXL state
> + * @cdat_doe: Auxiliary DOE device capable of reading CDAT
>   * @regs: Parsed register blocks
>   * @cxl_dvsec: Offset to the PCIe device DVSEC
>   * @payload_size: Size of space for payload
> @@ -149,6 +150,7 @@ struct cxl_endpoint_dvsec_info {
>  struct cxl_dev_state {
>  	struct device *dev;
>  
> +	struct pci_doe_dev *cdat_doe;
>  	struct cxl_regs regs;
>  	int cxl_dvsec;
>  
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index d4ae79b62a14..dcc55c4efd85 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -536,12 +536,53 @@ static int cxl_dvsec_ranges(struct cxl_dev_state *cxlds)
>  	return rc;
>  }
>  
> +static int cxl_match_cdat_doe_device(struct device *dev, const void *data)
> +{
> +	const struct cxl_dev_state *cxlds = data;
> +	struct auxiliary_device *adev;
> +	struct pci_doe_dev *doe_dev;
> +
> +	/* First determine if this auxiliary device belongs to the cxlds */
> +	if (cxlds->dev != dev->parent)
> +		return 0;

I don't understand auxiliary bus but I'm wondering why it's checking the parent
of the device?

> +
> +	adev = to_auxiliary_dev(dev);
> +	doe_dev = container_of(adev, struct pci_doe_dev, adev);
> +
> +	/* If it is one of ours check for the CDAT protocol */
> +	if (pci_doe_supports_prot(doe_dev, PCI_DVSEC_VENDOR_ID_CXL,
> +				  CXL_DOE_PROTOCOL_TABLE_ACCESS))
> +		return 1;
> +
> +	return 0;
> +}
> +
>  static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
>  {
>  	struct device *dev = cxlds->dev;
>  	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct auxiliary_device *adev;
> +	int rc;
>  
> -	return pci_doe_create_doe_devices(pdev);
> +	rc = pci_doe_create_doe_devices(pdev);
> +	if (rc)
> +		return rc;
> +
> +	adev = auxiliary_find_device(NULL, cxlds, &cxl_match_cdat_doe_device);
> +
> +	if (adev) {
> +		struct pci_doe_dev *doe_dev = container_of(adev,
> +							   struct pci_doe_dev,
> +							   adev);
> +
> +		/*
> +		 * No reference need be taken.  The DOE device lifetime is
> +		 * longer than the CXL device state lifetime
> +		 */

You're holding a reference to the adev here. Did you mean to drop it?

> +		cxlds->cdat_doe = doe_dev;
> +	}
> +
> +	return 0;
>  }
>  
>  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> -- 
> 2.31.1
> 


* Re: [PATCH V6 08/10] cxl/cdat: Introduce cdat_hdr_valid()
  2022-02-01  7:19 ` [PATCH V6 08/10] cxl/cdat: Introduce cdat_hdr_valid() ira.weiny
@ 2022-02-01 18:56   ` Ben Widawsky
  2022-02-01 22:29     ` Ira Weiny
  0 siblings, 1 reply; 49+ messages in thread
From: Ben Widawsky @ 2022-02-01 18:56 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, linux-kernel, linux-cxl, linux-pci

On 22-01-31 23:19:50, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> The CDAT data is protected by a checksum which should be checked when
> the CDAT is read to ensure it is valid.  In addition the lengths
> specified should be checked.
> 
> Introduce cdat_hdr_valid() to check the checksum.  While at it check and
> store the sequence number.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from V5
> 	New patch, split out
> 	Update cdat_hdr_valid()
> 		Remove revision and cs field parsing
> 			There is no point in these
> 		Add seq check and debug print.
> ---
>  drivers/cxl/cdat.h |  2 ++
>  drivers/cxl/pci.c  | 32 ++++++++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
> index 4722b6bbbaf0..a7725d26f2d2 100644
> --- a/drivers/cxl/cdat.h
> +++ b/drivers/cxl/cdat.h
> @@ -88,10 +88,12 @@
>   *
>   * @table: cache of CDAT table
>   * @length: length of cached CDAT table
> + * @seq: Last read Sequence number of the CDAT table
>   */
>  struct cxl_cdat {
>  	void *table;
>  	size_t length;
> +	u32 seq;
>  };
>  
>  #endif /* !__CXL_CDAT_H__ */
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 28b973a9e29e..c362c75feed2 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -586,6 +586,35 @@ static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
>  	return 0;
>  }
>  
> +static bool cxl_cdat_hdr_valid(struct device *dev, struct cxl_cdat *cdat)
> +{
> +	u32 *table = cdat->table;
> +	u8 *data8 = cdat->table;
> +	u32 length, seq;
> +	u8 check;
> +	int i;
> +
> +	length = FIELD_GET(CDAT_HEADER_DW0_LENGTH, table[0]);
> +	if (length < CDAT_HEADER_LENGTH_BYTES)
> +		return false;
> +
> +	if (length > cdat->length)
> +		return false;
> +
> +	seq = FIELD_GET(CDAT_HEADER_DW3_SEQUENCE, table[3]);
> +
> +	/* Store the sequence for now. */
> +	if (cdat->seq != seq) {
> +		dev_info(dev, "CDAT seq change %x -> %x\n", cdat->seq, seq);
> +		cdat->seq = seq;
> +	}

If sequence hasn't changed you could short-circuit the checksum.

> +
> +	for (check = 0, i = 0; i < length; i++)
> +		check += data8[i];
> +
> +	return check == 0;
> +}
> +
>  #define CDAT_DOE_REQ(entry_handle)					\
>  	(FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE,			\
>  		    CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) |		\
> @@ -658,6 +687,9 @@ static int cxl_cdat_read_table(struct cxl_dev_state *cxlds,
>  
>  	} while (entry_handle != 0xFFFF);
>  
> +	if (!cxl_cdat_hdr_valid(cxlds->dev, cdat))
> +		return -EIO;
> +
>  	return 0;
>  }
>  
> -- 
> 2.31.1
> 


* Re: [PATCH V6 09/10] cxl/mem: Retry reading CDAT on failure
  2022-02-01  7:19 ` [PATCH V6 09/10] cxl/mem: Retry reading CDAT on failure ira.weiny
@ 2022-02-01 18:59   ` Ben Widawsky
  2022-02-01 22:31     ` Ira Weiny
  0 siblings, 1 reply; 49+ messages in thread
From: Ben Widawsky @ 2022-02-01 18:59 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, linux-kernel, linux-cxl, linux-pci

On 22-01-31 23:19:51, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> The CDAT read may fail for a number of reasons, but mainly because a
> multi-part read can return pieces of different valid states.  The
> checksum in the CDAT table protects against this.
> 
> Now that the checksum is validated, issue a retry if the CDAT read fails.
> For now 2 retries are implemented.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> NOTE: Is 2 enough?  Should this just be delayed until the time when the
> data is actually needed and not there?

I can't speak to retries at all, but one small issue below. If we keep this, it
might make sense to make it a modparam.

> 
> Changes from V5:
> 	New patch -- easy to push off or drop.
> ---
>  drivers/cxl/core/memdev.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index a01068e98333..11d721c56f08 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -356,7 +356,8 @@ static const struct file_operations cxl_memdev_fops = {
>  	.llseek = noop_llseek,
>  };
>  
> -static int read_cdat_data(struct cxl_memdev *cxlmd, struct cxl_dev_state *cxlds)
> +static int __read_cdat_data(struct cxl_memdev *cxlmd,
> +			    struct cxl_dev_state *cxlds)
>  {
>  	struct device *dev = &cxlmd->dev;
>  	size_t cdat_length;
> @@ -371,6 +372,20 @@ static int read_cdat_data(struct cxl_memdev *cxlmd, struct cxl_dev_state *cxlds)
>  	return cxl_mem_cdat_read_table(cxlds, &cxlmd->cdat);
>  }
>  
> +static int read_cdat_data(struct cxl_memdev *cxlmd,
> +			  struct cxl_dev_state *cxlds)
> +{
> +	int retries = 2;
> +	int rc;
> +
> +	while (--retries) {

You either want retries--, or retries = 3...
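
To make the off-by-one concrete, the following userspace sketch counts how
many times each loop form calls a read routine that always fails.  It is
not taken from the patch; always_fail() and both helper names are made up
for this illustration.

```c
static int attempts;

/* Stand-in for a read that fails every time; counts its invocations. */
static int always_fail(void)
{
	attempts++;
	return -5;	/* arbitrary negative error, -EIO in kernel terms */
}

/* Loop as posted: pre-decrement, so retries = 2 yields one attempt. */
static int attempts_pre_decrement(int retries)
{
	attempts = 0;
	while (--retries) {
		if (!always_fail())
			break;
	}
	return attempts;
}

/* Post-decrement variant: retries = 2 yields two attempts. */
static int attempts_post_decrement(int retries)
{
	attempts = 0;
	while (retries--) {
		if (!always_fail())
			break;
	}
	return attempts;
}
```

With the pre-decrement form, a starting value of 1 would skip the loop
body entirely, so the posted wrapper would return rc without ever having
assigned it.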

> +		rc = __read_cdat_data(cxlmd, cxlds);
> +		if (!rc)
> +			break;
> +	}
> +	return rc;
> +}
> +
>  struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
>  {
>  	struct cxl_memdev *cxlmd;
> -- 
> 2.31.1
> 


* Re: [PATCH V6 10/10] cxl/cdat: Parse out DSMAS data from CDAT table
  2022-02-01  7:19 ` [PATCH V6 10/10] cxl/cdat: Parse out DSMAS data from CDAT table ira.weiny
@ 2022-02-01 19:05   ` Ben Widawsky
  2022-02-01 22:37     ` Ira Weiny
  2022-02-04 13:40   ` Jonathan Cameron
  1 sibling, 1 reply; 49+ messages in thread
From: Ben Widawsky @ 2022-02-01 19:05 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, linux-kernel, linux-cxl, linux-pci

On 22-01-31 23:19:52, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> CXL memory devices need the information in the Device Scoped Memory
> Affinity Structure (DSMAS).  This information is contained within the
> CDAT table buffer which is already read and cached.
> 
> Parse and cache DSMAS data from the CDAT table.  Store this data in
> unmarshaled struct cxl_dsmas data structures for ease of use.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from V5
> 	Fix up sparse warnings
> 	Split out cdat_hdr_valid()
> 	Update cdat_hdr_valid()
> 		Remove revision and cs field parsing
> 			There is no point in these
> 		Add seq check and debug print.
> 	From Jonathan
> 		Add spaces around '+' and '/'
> 		use devm_krealloc() for dsmas_ary
> 
> Changes from V4
> 	New patch
> ---
>  drivers/cxl/cdat.h        | 21 ++++++++++++
>  drivers/cxl/core/memdev.c | 70 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 91 insertions(+)
> 
> diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
> index a7725d26f2d2..f8c126190d18 100644
> --- a/drivers/cxl/cdat.h
> +++ b/drivers/cxl/cdat.h
> @@ -83,17 +83,38 @@
>  #define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
>  #define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
>  
> +/**
> + * struct cxl_dsmas - host unmarshaled version of DSMAS data
> + *
> + * As defined in the Coherent Device Attribute Table (CDAT) specification this
> + * represents a single DSMAS entry in that table.
> + *
> + * @dpa_base: The lowest DPA address associated with this DSMAD
> + * @dpa_length: Length in bytes of this DSMAD
> + * @non_volatile: If set, the memory region represents Non-Volatile memory
> + */
> +struct cxl_dsmas {
> +	u64 dpa_base;
> +	u64 dpa_length;
> +	/* Flags */
> +	u8 non_volatile:1;
> +};
> +
>  /**
>   * struct cxl_cdat - CXL CDAT data
>   *
>   * @table: cache of CDAT table
>   * @length: length of cached CDAT table
>   * @seq: Last read Sequence number of the CDAT table
> + * @dsmas_ary: Array of DSMAS entries as parsed from the CDAT table
> + * @nr_dsmas: Number of entries in dsmas_ary
>   */
>  struct cxl_cdat {
>  	void *table;
>  	size_t length;
>  	u32 seq;
> +	struct cxl_dsmas *dsmas_ary;
> +	int nr_dsmas;
>  };
>  
>  #endif /* !__CXL_CDAT_H__ */
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 11d721c56f08..32342a15e991 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -6,6 +6,7 @@
>  #include <linux/idr.h>
>  #include <linux/pci.h>
>  #include <cxlmem.h>
> +#include "cdat.h"
>  #include "core.h"
>  
>  static DECLARE_RWSEM(cxl_memdev_rwsem);
> @@ -386,6 +387,71 @@ static int read_cdat_data(struct cxl_memdev *cxlmd,
>  	return rc;
>  }
>  
> +static int parse_dsmas(struct cxl_memdev *cxlmd)
> +{
> +	struct cxl_dsmas *dsmas_ary = NULL;
> +	u32 *data = cxlmd->cdat.table;
> +	int bytes_left = cxlmd->cdat.length;
> +	int nr_dsmas = 0;
> +
> +	if (!data)
> +		return -ENXIO;
> +
> +	/* Skip header */
> +	data += CDAT_HEADER_LENGTH_DW;
> +	bytes_left -= CDAT_HEADER_LENGTH_BYTES;
> +
> +	while (bytes_left > 0) {
> +		u32 *cur_rec = data;
> +		u8 type = FIELD_GET(CDAT_STRUCTURE_DW0_TYPE, cur_rec[0]);
> +		u16 length = FIELD_GET(CDAT_STRUCTURE_DW0_LENGTH, cur_rec[0]);
> +
> +		if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {
> +			struct cxl_dsmas *new_ary;
> +			u8 flags;
> +
> +			new_ary = devm_krealloc(&cxlmd->dev, dsmas_ary,
> +					   sizeof(*dsmas_ary) * (nr_dsmas + 1),
> +					   GFP_KERNEL);
> +			if (!new_ary) {
> +				dev_err(&cxlmd->dev,
> +					"Failed to allocate memory for DSMAS data\n");
> +				return -ENOMEM;
> +			}

One thought here - it looks like there are at most 256 DSMAS entries. You could
allocate the full 256 up front, and then realloc *down* to the actual number.
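
A userspace sketch of that allocate-then-shrink idea, using malloc() and
realloc() in place of the devm_* helpers.  The 256 bound comes from the
comment above; dsmas_entry and collect_dsmas() are hypothetical names for
this example, and the source array stands in for the parsed CDAT records.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_DSMAS 256	/* upper bound suggested in the review comment */

struct dsmas_entry {
	uint64_t dpa_base;
	uint64_t dpa_length;
};

/*
 * Copy up to MAX_DSMAS entries from src into one allocation made up
 * front, then shrink that allocation to the number actually used.
 * Returns NULL (and *nr_out == 0) if nothing was stored.
 */
static struct dsmas_entry *collect_dsmas(const struct dsmas_entry *src,
					 size_t nr_src, size_t *nr_out)
{
	struct dsmas_entry *ary, *shrunk;
	size_t i, n = 0;

	*nr_out = 0;
	ary = malloc(MAX_DSMAS * sizeof(*ary));
	if (!ary)
		return NULL;

	for (i = 0; i < nr_src && n < MAX_DSMAS; i++)
		ary[n++] = src[i];

	if (n == 0) {
		free(ary);
		return NULL;
	}

	/* Shrink to the used size; keep the original block if that fails. */
	shrunk = realloc(ary, n * sizeof(*ary));
	if (shrunk)
		ary = shrunk;

	*nr_out = n;
	return ary;
}
```

This trades a larger temporary allocation for a single allocation failure
point ahead of the loop, instead of one possible failure per DSMAS record.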

> +			dsmas_ary = new_ary;
> +
> +			flags = FIELD_GET(CDAT_DSMAS_DW1_FLAGS, cur_rec[1]);
> +
> +			dsmas_ary[nr_dsmas].dpa_base = CDAT_DSMAS_DPA_OFFSET(cur_rec);
> +			dsmas_ary[nr_dsmas].dpa_length = CDAT_DSMAS_DPA_LEN(cur_rec);
> +			dsmas_ary[nr_dsmas].non_volatile = CDAT_DSMAS_NON_VOLATILE(flags);
> +
> +			dev_dbg(&cxlmd->dev, "DSMAS %d: %llx:%llx %s\n",
> +				nr_dsmas,
> +				dsmas_ary[nr_dsmas].dpa_base,
> +				dsmas_ary[nr_dsmas].dpa_base +
> +					dsmas_ary[nr_dsmas].dpa_length,
> +				(dsmas_ary[nr_dsmas].non_volatile ?
> +					"Persistent" : "Volatile")
> +				);
> +
> +			nr_dsmas++;
> +		}
> +
> +		data += (length / sizeof(u32));
> +		bytes_left -= length;
> +	}
> +
> +	if (nr_dsmas == 0)
> +		return -ENXIO;

Hmm is there documentation that suggests a DSMAS must be implemented? Could this
just return 0? I'd put maybe dev_dbg here if it's unexpected but not a failure
and return success.

> +
> +	dev_dbg(&cxlmd->dev, "Found %d DSMAS entries\n", nr_dsmas);
> +	cxlmd->cdat.dsmas_ary = dsmas_ary;
> +	cxlmd->cdat.nr_dsmas = nr_dsmas;
> +
> +	return 0;
> +}
> +
>  struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
>  {
>  	struct cxl_memdev *cxlmd;
> @@ -407,6 +473,10 @@ struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
>  	if (rc)
>  		goto err;
>  
> +	rc = parse_dsmas(cxlmd);
> +	if (rc)
> +		dev_warn(dev, "No DSMAS data found: %d\n", rc);
> +
>  	/*
>  	 * Activate ioctl operations, no cxl_memdev_rwsem manipulation
>  	 * needed as this is ordered with cdev_add() publishing the device.
> -- 
> 2.31.1
> 


* Re: [PATCH V6 06/10] cxl/pci: Find the DOE mailbox which supports CDAT
  2022-02-01 18:49   ` Ben Widawsky
@ 2022-02-01 22:18     ` Ira Weiny
  2022-02-04 14:04       ` Jonathan Cameron
  0 siblings, 1 reply; 49+ messages in thread
From: Ira Weiny @ 2022-02-01 22:18 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, linux-kernel, linux-cxl, linux-pci

On Tue, Feb 01, 2022 at 10:49:47AM -0800, Widawsky, Ben wrote:
> On 22-01-31 23:19:48, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Memory devices need the CDAT data from the device.  This data is read
> > from a DOE mailbox which supports the CDAT protocol.
> > 
> > Search the DOE auxiliary devices for the one which supports the CDAT
> > protocol.  Cache that device to be used for future queries.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>

[snip]

> >  
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index d4ae79b62a14..dcc55c4efd85 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -536,12 +536,53 @@ static int cxl_dvsec_ranges(struct cxl_dev_state *cxlds)
> >  	return rc;
> >  }
> >  
> > +static int cxl_match_cdat_doe_device(struct device *dev, const void *data)
> > +{
> > +	const struct cxl_dev_state *cxlds = data;
> > +	struct auxiliary_device *adev;
> > +	struct pci_doe_dev *doe_dev;
> > +
> > +	/* First determine if this auxiliary device belongs to the cxlds */
> > +	if (cxlds->dev != dev->parent)
> > +		return 0;
> 
> I don't understand auxiliary bus but I'm wondering why it's checking the parent
> of the device?

auxiliary_find_device() iterates all the auxiliary devices in the system.  This
check was a way for the match function to know if the auxiliary device belongs
to the cxlds we are interested in...

But now that I think about it we could have other auxiliary devices attached
which are not DOE...  :-/  So this check is not complete.

FWIW I'm not thrilled with the way auxiliary_find_device() is defined.  And now
that I look at it I think the only user of it currently is wrong.  They too
have a check like this but it is after another check...  :-/

I was hoping to avoid having a list of DOE devices in the cxlds and simply let
the auxiliary bus infrastructure do that somehow.  IIRC Jonathan was thinking
along the same lines.  I think he actually suggested auxiliary_find_device()...

It would be nice if I could have an aux_find_child() or something which
iterated the auxiliary devices attached to a particular parent device.  I've
just not figured out exactly how to implement that better than what I did here.
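
To make the aux_find_child() idea concrete, here is a minimal user-space
sketch, not kernel code.  Every name in it (model_device, model_find_child,
model_match_cdat_doe) is hypothetical and only models the shape of such a
helper: the iterator filters on the parent, and the match callback checks the
device type before it looks at anything DOE specific.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these exist in the kernel. */
enum model_kind { MODEL_KIND_DOE, MODEL_KIND_OTHER };

struct model_device {
	const struct model_device *parent;
	enum model_kind kind;
	int supports_cdat;	/* only meaningful for MODEL_KIND_DOE */
};

typedef int (*model_match_fn)(const struct model_device *dev,
			      const void *data);

/* Visit only the devices attached to @parent; return the first match. */
static const struct model_device *
model_find_child(const struct model_device *devs, size_t nr,
		 const struct model_device *parent,
		 const void *data, model_match_fn match)
{
	size_t i;

	assert(match != NULL);

	for (i = 0; i < nr; i++) {
		if (devs[i].parent != parent)
			continue;
		if (match(&devs[i], data))
			return &devs[i];
	}
	return NULL;
}

/* Type check first, so a non-DOE sibling is never treated as a DOE device. */
static int model_match_cdat_doe(const struct model_device *dev,
				const void *data)
{
	(void)data;
	return dev->kind == MODEL_KIND_DOE && dev->supports_cdat;
}
```

With the parent filter in the iterator, the match callback no longer needs the
cxlds->dev comparison, and the kind check covers the "other auxiliary devices"
case mentioned above.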

> 
> > +
> > +	adev = to_auxiliary_dev(dev);
> > +	doe_dev = container_of(adev, struct pci_doe_dev, adev);
> > +
> > +	/* If it is one of ours check for the CDAT protocol */
> > +	if (pci_doe_supports_prot(doe_dev, PCI_DVSEC_VENDOR_ID_CXL,
> > +				  CXL_DOE_PROTOCOL_TABLE_ACCESS))
> > +		return 1;
> > +
> > +	return 0;
> > +}
> > +
> >  static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
> >  {
> >  	struct device *dev = cxlds->dev;
> >  	struct pci_dev *pdev = to_pci_dev(dev);
> > +	struct auxiliary_device *adev;
> > +	int rc;
> >  
> > -	return pci_doe_create_doe_devices(pdev);
> > +	rc = pci_doe_create_doe_devices(pdev);
> > +	if (rc)
> > +		return rc;
> > +
> > +	adev = auxiliary_find_device(NULL, cxlds, &cxl_match_cdat_doe_device);
> > +
> > +	if (adev) {
> > +		struct pci_doe_dev *doe_dev = container_of(adev,
> > +							   struct pci_doe_dev,
> > +							   adev);
> > +
> > +		/*
> > +		 * No reference need be taken.  The DOE device lifetime is
> > +		 * longer than the CXL device state lifetime
> > +		 */
> 
> You're holding a reference to the adev here. Did you mean to drop it?

Does find device get a reference? ...  Ah shoot I did not see that.

Yes, the reference should be dropped somewhere.
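
For reference, auxiliary_find_device() hands back the matching device with a
reference held, and the caller is expected to release it with put_device().
The following is only a user-space model of that balance; model_dev,
model_find, model_setup and model_teardown are hypothetical names, and holding
the reference for as long as the pointer is cached is one possible choice, not
necessarily what the next version will do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted struct device. */
struct model_dev {
	int refcount;
	int is_cdat_doe;
};

struct model_state {
	struct model_dev *cdat_doe;	/* cached pointer, reference held */
};

static struct model_dev *model_get(struct model_dev *dev)
{
	dev->refcount++;
	return dev;
}

static void model_put(struct model_dev *dev)
{
	assert(dev->refcount > 0);
	dev->refcount--;
}

/* Like a bus "find" helper: the match is returned with a reference taken. */
static struct model_dev *model_find(struct model_dev *devs, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (devs[i].is_cdat_doe)
			return model_get(&devs[i]);
	return NULL;
}

/* Keep the reference for as long as the pointer stays cached. */
static void model_setup(struct model_state *st, struct model_dev *devs,
			size_t nr)
{
	st->cdat_doe = model_find(devs, nr);
}

/* Drop the reference exactly once when the cached pointer goes away. */
static void model_teardown(struct model_state *st)
{
	if (st->cdat_doe) {
		model_put(st->cdat_doe);
		st->cdat_doe = NULL;
	}
}
```

The alternative is to call the put right after the lookup, which is only safe
if something else guarantees that the DOE device outlives the cached pointer.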

Thanks,
Ira

> 
> > +		cxlds->cdat_doe = doe_dev;
> > +	}
> > +
> > +	return 0;
> >  }
> >  
> >  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > -- 
> > 2.31.1
> > 

* Re: [PATCH V6 08/10] cxl/cdat: Introduce cdat_hdr_valid()
  2022-02-01 18:56   ` Ben Widawsky
@ 2022-02-01 22:29     ` Ira Weiny
  2022-02-04 13:17       ` Jonathan Cameron
  0 siblings, 1 reply; 49+ messages in thread
From: Ira Weiny @ 2022-02-01 22:29 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, linux-kernel, linux-cxl, linux-pci

On Tue, Feb 01, 2022 at 10:56:40AM -0800, Widawsky, Ben wrote:
> On 22-01-31 23:19:50, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > The CDAT data is protected by a checksum which should be checked when
> > the CDAT is read to ensure it is valid.  In addition the lengths
> > specified should be checked.
> > 
> > Introduce cdat_hdr_valid() to check the checksum.  While at it check and
> > store the sequence number.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > Changes from V5
> > 	New patch, split out
> > 	Update cdat_hdr_valid()
> > 		Remove revision and cs field parsing
> > 			There is no point in these
> > 		Add seq check and debug print.
> > ---
> >  drivers/cxl/cdat.h |  2 ++
> >  drivers/cxl/pci.c  | 32 ++++++++++++++++++++++++++++++++
> >  2 files changed, 34 insertions(+)
> > 
> > diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
> > index 4722b6bbbaf0..a7725d26f2d2 100644
> > --- a/drivers/cxl/cdat.h
> > +++ b/drivers/cxl/cdat.h
> > @@ -88,10 +88,12 @@
> >   *
> >   * @table: cache of CDAT table
> >   * @length: length of cached CDAT table
> > + * @seq: Last read Sequence number of the CDAT table
> >   */
> >  struct cxl_cdat {
> >  	void *table;
> >  	size_t length;
> > +	u32 seq;
> >  };
> >  
> >  #endif /* !__CXL_CDAT_H__ */
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index 28b973a9e29e..c362c75feed2 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -586,6 +586,35 @@ static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
> >  	return 0;
> >  }
> >  
> > +static bool cxl_cdat_hdr_valid(struct device *dev, struct cxl_cdat *cdat)
> > +{
> > +	u32 *table = cdat->table;
> > +	u8 *data8 = cdat->table;
> > +	u32 length, seq;
> > +	u8 check;
> > +	int i;
> > +
> > +	length = FIELD_GET(CDAT_HEADER_DW0_LENGTH, table[0]);
> > +	if (length < CDAT_HEADER_LENGTH_BYTES)
> > +		return false;
> > +
> > +	if (length > cdat->length)
> > +		return false;
> > +
> > +	seq = FIELD_GET(CDAT_HEADER_DW3_SEQUENCE, table[3]);
> > +
> > +	/* Store the sequence for now. */
> > +	if (cdat->seq != seq) {
> > +		dev_info(dev, "CDAT seq change %x -> %x\n", cdat->seq, seq);
> > +		cdat->seq = seq;
> > +	}
> 
> If sequence hasn't changed you could short-circuit the checksum.

I'm not sure.  Jonathan mentioned that reading may race with updates and that
the correct thing to do is re-read.[1]

But I should probably check the checksum first...

Ira

[1] https://lore.kernel.org/linux-cxl/20211108145239.000010a5@Huawei.com/
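
A rough user-space sketch of that ordering: validate the lengths, then the
checksum, and only record the sequence number once the table is known to be
consistent.  The 16-byte header with the length at byte 0 and the sequence at
byte 12 is an assumption taken from how the patch indexes table[0] and
table[3]; the model_* names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_CDAT_HDR_LEN 16u	/* assumed header size in bytes */

struct model_cdat {
	const uint8_t *table;	/* bytes read from the device */
	size_t length;		/* number of bytes actually read */
	uint32_t seq;		/* sequence number of the last valid read */
};

/* Read a little-endian 32-bit value without alignment assumptions. */
static uint32_t model_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static bool model_cdat_hdr_valid(struct model_cdat *cdat)
{
	uint32_t length, i;
	uint8_t sum = 0;

	/* Extra guard, not in the patch: need a full header to parse. */
	if (cdat->length < MODEL_CDAT_HDR_LEN)
		return false;

	length = model_le32(cdat->table);
	if (length < MODEL_CDAT_HDR_LEN || length > cdat->length)
		return false;

	/* All bytes of a valid table sum to zero modulo 256. */
	for (i = 0; i < length; i++)
		sum += cdat->table[i];
	if (sum != 0)
		return false;

	/* Only now is the sequence field trustworthy. */
	cdat->seq = model_le32(cdat->table + 12);
	return true;
}
```

A torn read that mixes two table versions fails the checksum and leaves
cdat->seq untouched, so the caller can simply re-read.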

> 
> > +
> > +	for (check = 0, i = 0; i < length; i++)
> > +		check += data8[i];
> > +
> > +	return check == 0;
> > +}
> > +
> >  #define CDAT_DOE_REQ(entry_handle)					\
> >  	(FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE,			\
> >  		    CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) |		\
> > @@ -658,6 +687,9 @@ static int cxl_cdat_read_table(struct cxl_dev_state *cxlds,
> >  
> >  	} while (entry_handle != 0xFFFF);
> >  
> > +	if (!cxl_cdat_hdr_valid(cxlds->dev, cdat))
> > +		return -EIO;
> > +
> >  	return 0;
> >  }
> >  
> > -- 
> > 2.31.1
> > 

* Re: [PATCH V6 09/10] cxl/mem: Retry reading CDAT on failure
  2022-02-01 18:59   ` Ben Widawsky
@ 2022-02-01 22:31     ` Ira Weiny
  2022-02-04 13:20       ` Jonathan Cameron
  0 siblings, 1 reply; 49+ messages in thread
From: Ira Weiny @ 2022-02-01 22:31 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, linux-kernel, linux-cxl, linux-pci

On Tue, Feb 01, 2022 at 10:59:28AM -0800, Widawsky, Ben wrote:
> On 22-01-31 23:19:51, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > The CDAT read may fail for a number of reasons but mainly it is possible
> > to get different parts of a valid state.  The checksum in the CDAT table
> > protects against this.
> > 
> > Now that the checksum is validated issue a retry if the CDAT read fails.
> > For now 2 retries are implemented.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > NOTE: Is 2 enough?  Should this just be delayed until the time when the
> > data is actually needed and not there?
> 
> I can't speak to retries at all, but one small issue below. It might make sense
> if we keep this to make it a modparam.

Not a bad idea.

> 
> > 
> > Changes from V5:
> > 	New patch -- easy to push off or drop.
> > ---
> >  drivers/cxl/core/memdev.c | 17 ++++++++++++++++-
> >  1 file changed, 16 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> > index a01068e98333..11d721c56f08 100644
> > --- a/drivers/cxl/core/memdev.c
> > +++ b/drivers/cxl/core/memdev.c
> > @@ -356,7 +356,8 @@ static const struct file_operations cxl_memdev_fops = {
> >  	.llseek = noop_llseek,
> >  };
> >  
> > -static int read_cdat_data(struct cxl_memdev *cxlmd, struct cxl_dev_state *cxlds)
> > +static int __read_cdat_data(struct cxl_memdev *cxlmd,
> > +			    struct cxl_dev_state *cxlds)
> >  {
> >  	struct device *dev = &cxlmd->dev;
> >  	size_t cdat_length;
> > @@ -371,6 +372,20 @@ static int read_cdat_data(struct cxl_memdev *cxlmd, struct cxl_dev_state *cxlds)
> >  	return cxl_mem_cdat_read_table(cxlds, &cxlmd->cdat);
> >  }
> >  
> > +static int read_cdat_data(struct cxl_memdev *cxlmd,
> > +			  struct cxl_dev_state *cxlds)
> > +{
> > +	int retries = 2;
> > +	int rc;
> > +
> > +	while (--retries) {
> 
> You either want retries--, or retries = 3...

Oops, yes.
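
To spell out the off-by-one, here is a small stand-alone sketch that counts
how many attempts each loop form makes when every attempt fails.  The function
names are made up for illustration.

```c
/* Number of loop bodies executed with "while (--retries)". */
static int attempts_pre_decrement(int retries)
{
	int attempts = 0;

	while (--retries)	/* tests the value after the decrement */
		attempts++;
	return attempts;
}

/* Number of loop bodies executed with "while (retries--)". */
static int attempts_post_decrement(int retries)
{
	int attempts = 0;

	while (retries--)	/* tests the value before the decrement */
		attempts++;
	return attempts;
}
```

Starting from retries = 2, the pre-decrement form runs the body once and the
post-decrement form runs it twice, which is why either retries-- or
retries = 3 gives the intended two attempts.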

Thanks,
Ira

> 
> > +		rc = __read_cdat_data(cxlmd, cxlds);
> > +		if (!rc)
> > +			break;
> > +	}
> > +	return rc;
> > +}
> > +
> >  struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
> >  {
> >  	struct cxl_memdev *cxlmd;
> > -- 
> > 2.31.1
> > 

* Re: [PATCH V6 10/10] cxl/cdat: Parse out DSMAS data from CDAT table
  2022-02-01 19:05   ` Ben Widawsky
@ 2022-02-01 22:37     ` Ira Weiny
  2022-02-04 13:33       ` Jonathan Cameron
  2022-02-04 13:41       ` Jonathan Cameron
  0 siblings, 2 replies; 49+ messages in thread
From: Ira Weiny @ 2022-02-01 22:37 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, linux-kernel, linux-cxl, linux-pci

On Tue, Feb 01, 2022 at 11:05:32AM -0800, Widawsky, Ben wrote:
> On 22-01-31 23:19:52, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > CXL memory devices need the information in the Device Scoped Memory
> > Affinity Structure (DSMAS).  This information is contained within the
> > CDAT table buffer which is already read and cached.
> > 
> > Parse and cache DSMAS data from the CDAT table.  Store this data in
> > unmarshaled struct dsmas data structures for ease of use.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > Changes from V5
> > 	Fix up sparse warnings
> > 	Split out cdat_hdr_valid()
> > 	Update cdat_hdr_valid()
> > 		Remove revision and cs field parsing
> > 			There is no point in these
> > 		Add seq check and debug print.
> > 	From Jonathan
> > 		Add spaces around '+' and '/'
> > 		use devm_krealloc() for dsmas_ary
> > 
> > Changes from V4
> > 	New patch
> > ---
> >  drivers/cxl/cdat.h        | 21 ++++++++++++
> >  drivers/cxl/core/memdev.c | 70 +++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 91 insertions(+)
> > 
> > diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
> > index a7725d26f2d2..f8c126190d18 100644
> > --- a/drivers/cxl/cdat.h
> > +++ b/drivers/cxl/cdat.h
> > @@ -83,17 +83,38 @@
> >  #define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
> >  #define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
> >  
> > +/**
> > + * struct cxl_dsmas - host unmarshaled version of DSMAS data
> > + *
> > + * As defined in the Coherent Device Attribute Table (CDAT) specification this
> > + * represents a single DSMAS entry in that table.
> > + *
> > + * @dpa_base: The lowest DPA address associated with this DSMAD
> > + * @dpa_length: Length in bytes of this DSMAD
> > + * @non_volatile: If set, the memory region represents Non-Volatile memory
> > + */
> > +struct cxl_dsmas {
> > +	u64 dpa_base;
> > +	u64 dpa_length;
> > +	/* Flags */
> > +	u8 non_volatile:1;
> > +};
> > +
> >  /**
> >   * struct cxl_cdat - CXL CDAT data
> >   *
> >   * @table: cache of CDAT table
> >   * @length: length of cached CDAT table
> >   * @seq: Last read Sequence number of the CDAT table
> > + * @dsmas_ary: Array of DSMAS entries as parsed from the CDAT table
> > + * @nr_dsmas: Number of entries in dsmas_ary
> >   */
> >  struct cxl_cdat {
> >  	void *table;
> >  	size_t length;
> >  	u32 seq;
> > +	struct cxl_dsmas *dsmas_ary;
> > +	int nr_dsmas;
> >  };
> >  
> >  #endif /* !__CXL_CDAT_H__ */
> > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> > index 11d721c56f08..32342a15e991 100644
> > --- a/drivers/cxl/core/memdev.c
> > +++ b/drivers/cxl/core/memdev.c
> > @@ -6,6 +6,7 @@
> >  #include <linux/idr.h>
> >  #include <linux/pci.h>
> >  #include <cxlmem.h>
> > +#include "cdat.h"
> >  #include "core.h"
> >  
> >  static DECLARE_RWSEM(cxl_memdev_rwsem);
> > @@ -386,6 +387,71 @@ static int read_cdat_data(struct cxl_memdev *cxlmd,
> >  	return rc;
> >  }
> >  
> > +static int parse_dsmas(struct cxl_memdev *cxlmd)
> > +{
> > +	struct cxl_dsmas *dsmas_ary = NULL;
> > +	u32 *data = cxlmd->cdat.table;
> > +	int bytes_left = cxlmd->cdat.length;
> > +	int nr_dsmas = 0;
> > +
> > +	if (!data)
> > +		return -ENXIO;
> > +
> > +	/* Skip header */
> > +	data += CDAT_HEADER_LENGTH_DW;
> > +	bytes_left -= CDAT_HEADER_LENGTH_BYTES;
> > +
> > +	while (bytes_left > 0) {
> > +		u32 *cur_rec = data;
> > +		u8 type = FIELD_GET(CDAT_STRUCTURE_DW0_TYPE, cur_rec[0]);
> > +		u16 length = FIELD_GET(CDAT_STRUCTURE_DW0_LENGTH, cur_rec[0]);
> > +
> > +		if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {
> > +			struct cxl_dsmas *new_ary;
> > +			u8 flags;
> > +
> > +			new_ary = devm_krealloc(&cxlmd->dev, dsmas_ary,
> > +					   sizeof(*dsmas_ary) * (nr_dsmas + 1),
> > +					   GFP_KERNEL);
> > +			if (!new_ary) {
> > +				dev_err(&cxlmd->dev,
> > +					"Failed to allocate memory for DSMAS data\n");
> > +				return -ENOMEM;
> > +			}
> 
> One thought here - it looks like there are at most 256 DSMAS entries. You could
> allocate the full 256 up front, and then realloc *down* to the actual number.
> 
> > +			dsmas_ary = new_ary;
> > +
> > +			flags = FIELD_GET(CDAT_DSMAS_DW1_FLAGS, cur_rec[1]);
> > +
> > +			dsmas_ary[nr_dsmas].dpa_base = CDAT_DSMAS_DPA_OFFSET(cur_rec);
> > +			dsmas_ary[nr_dsmas].dpa_length = CDAT_DSMAS_DPA_LEN(cur_rec);
> > +			dsmas_ary[nr_dsmas].non_volatile = CDAT_DSMAS_NON_VOLATILE(flags);
> > +
> > +			dev_dbg(&cxlmd->dev, "DSMAS %d: %llx:%llx %s\n",
> > +				nr_dsmas,
> > +				dsmas_ary[nr_dsmas].dpa_base,
> > +				dsmas_ary[nr_dsmas].dpa_base +
> > +					dsmas_ary[nr_dsmas].dpa_length,
> > +				(dsmas_ary[nr_dsmas].non_volatile ?
> > +					"Persistent" : "Volatile")
> > +				);
> > +
> > +			nr_dsmas++;
> > +		}
> > +
> > +		data += (length / sizeof(u32));
> > +		bytes_left -= length;
> > +	}
> > +
> > +	if (nr_dsmas == 0)
> > +		return -ENXIO;
> 
> Hmm is there documentation that suggests a DSMAS must be implemented? Could this
> just return 0? I'd put maybe dev_dbg here if it's unexpected but not a failure
> and return success.

For this call I was not envisioning this as an error.  I wanted to leave it up
to the caller.

I think it would make more sense to return the number of DSMAS entries found
or a negative errno on failure...

I'll clean it up.  Including below...
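
Something along these lines, as a user-space sketch only.  The record layout
(1-byte type, 1 reserved byte, 16-bit little-endian length, DSMAS as type 0)
and the 16-byte table header are assumptions, and model_count_dsmas is a
hypothetical name.  The bounds checks on the record length are an addition
that the posted patch does not have.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_CDAT_HDR_LEN	16u	/* assumed table header size */
#define MODEL_REC_HDR_LEN	4u	/* assumed: type, reserved, length */
#define MODEL_TYPE_DSMAS	0u	/* assumed type code for DSMAS */

/* Return the number of DSMAS records found, or a negative errno. */
static int model_count_dsmas(const uint8_t *table, size_t length)
{
	size_t off = MODEL_CDAT_HDR_LEN;
	int nr_dsmas = 0;

	if (!table || length < MODEL_CDAT_HDR_LEN)
		return -ENXIO;

	while (off < length) {
		size_t rec_len;

		if (length - off < MODEL_REC_HDR_LEN)
			return -EIO;	/* truncated record header */

		rec_len = (size_t)table[off + 2] |
			  ((size_t)table[off + 3] << 8);

		/* A zero or short length would otherwise loop forever. */
		if (rec_len < MODEL_REC_HDR_LEN || rec_len > length - off)
			return -EIO;

		if (table[off] == MODEL_TYPE_DSMAS)
			nr_dsmas++;

		off += rec_len;
	}

	return nr_dsmas;
}
```

A return of 0 then means "table parsed, no DSMAS present", and the caller
decides whether that deserves a dev_warn(), a dev_dbg(), or nothing.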

> 
> > +
> > +	dev_dbg(&cxlmd->dev, "Found %d DSMAS entries\n", nr_dsmas);
> > +	cxlmd->cdat.dsmas_ary = dsmas_ary;
> > +	cxlmd->cdat.nr_dsmas = nr_dsmas;
> > +
> > +	return 0;
> > +}
> > +
> >  struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
> >  {
> >  	struct cxl_memdev *cxlmd;
> > @@ -407,6 +473,10 @@ struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
> >  	if (rc)
> >  		goto err;
> >  
> > +	rc = parse_dsmas(cxlmd);
> > +	if (rc)
> > +		dev_warn(dev, "No DSMAS data found: %d\n", rc);
> > +

This was changed to dev_warn() because I think we do expect DSMAS data here,
don't we?

Thanks,
Ira

> >  	/*
> >  	 * Activate ioctl operations, no cxl_memdev_rwsem manipulation
> >  	 * needed as this is ordered with cdev_add() publishing the device.
> > -- 
> > 2.31.1
> > 

* Re: [PATCH V6 01/10] PCI: Add vendor ID for the PCI SIG
  2022-02-01  7:19 ` [PATCH V6 01/10] PCI: Add vendor ID for the PCI SIG ira.weiny
@ 2022-02-03 17:11   ` Bjorn Helgaas
  2022-02-03 20:28     ` Ira Weiny
  0 siblings, 1 reply; 49+ messages in thread
From: Bjorn Helgaas @ 2022-02-03 17:11 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Mon, Jan 31, 2022 at 11:19:43PM -0800, ira.weiny@intel.com wrote:
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> This ID is used in DOE headers to identify protocols that are defined
> within the PCI Express Base Specification.
> 
> Specified in Table 7-x2 of the Data Object Exchange ECN (approved 12 March
> 2020) available from https://members.pcisig.com/wg/PCI-SIG/document/14143

Please update this citation to PCIe r6.0, sec 6.30.1.1, table 6-32.

> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  include/linux/pci_ids.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index 011f2f1ea5bb..849f514cd7db 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -149,6 +149,7 @@
>  #define PCI_CLASS_OTHERS		0xff
>  
>  /* Vendors and devices.  Sort key: vendor first, device next. */
> +#define PCI_VENDOR_ID_PCI_SIG		0x0001
>  
>  #define PCI_VENDOR_ID_LOONGSON		0x0014
>  
> -- 
> 2.31.1
> 

* Re: [PATCH V6 01/10] PCI: Add vendor ID for the PCI SIG
  2022-02-03 17:11   ` Bjorn Helgaas
@ 2022-02-03 20:28     ` Ira Weiny
  0 siblings, 0 replies; 49+ messages in thread
From: Ira Weiny @ 2022-02-03 20:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, Feb 03, 2022 at 11:11:12AM -0600, Bjorn Helgaas wrote:
> On Mon, Jan 31, 2022 at 11:19:43PM -0800, ira.weiny@intel.com wrote:
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 
> > This ID is used in DOE headers to identify protocols that are defined
> > within the PCI Express Base Specification.
> > 
> > Specified in Table 7-x2 of the Data Object Exchange ECN (approved 12 March
> > 2020) available from https://members.pcisig.com/wg/PCI-SIG/document/14143
> 
> Please update this citation to PCIe r6.0, sec 6.30.1.1, table 6-32.

Done.

Thanks,
Ira

> 
> > Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> > Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > ---
> >  include/linux/pci_ids.h | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> > index 011f2f1ea5bb..849f514cd7db 100644
> > --- a/include/linux/pci_ids.h
> > +++ b/include/linux/pci_ids.h
> > @@ -149,6 +149,7 @@
> >  #define PCI_CLASS_OTHERS		0xff
> >  
> >  /* Vendors and devices.  Sort key: vendor first, device next. */
> > +#define PCI_VENDOR_ID_PCI_SIG		0x0001
> >  
> >  #define PCI_VENDOR_ID_LOONGSON		0x0014
> >  
> > -- 
> > 2.31.1
> > 

* Re: [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver
  2022-02-01  7:19 ` [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver ira.weiny
@ 2022-02-03 22:40   ` Bjorn Helgaas
  2022-03-15 21:48     ` Ira Weiny
  2022-02-09  0:59   ` Dan Williams
  1 sibling, 1 reply; 49+ messages in thread
From: Bjorn Helgaas @ 2022-02-03 22:40 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Mon, Jan 31, 2022 at 11:19:45PM -0800, ira.weiny@intel.com wrote:
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Introduced in a PCI ECN [1], DOE provides a config space based mailbox
> with standard protocol discovery.  Each mailbox is accessed through a
> DOE Extended Capability.
> 
> Define an auxiliary device driver which controls DOE auxiliary devices
> registered on the auxiliary bus.
> 
> A DOE mailbox is allowed to support any number of protocols while some
> DOE protocol specifications apply additional restrictions.
> 
> The protocols supported are queried and cached.  pci_doe_supports_prot()
> can be used to determine if the DOE device supports the protocol
> specified.
> 
> A synchronous interface is provided in pci_doe_exchange_sync() to
> perform a single query / response exchange from the driver through the
> device specified.
> 
> Testing was conducted against QEMU using:
> 
> https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/
> 
> This code is based on Jonathan's V4 series here:
> 
> https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/

Details like references to previous versions can go below the "---"
so they are omitted from the merged commit.  Many/most maintainers now
include a Link: tag that facilitates tracing back from a commit to the
mailing list history.

> [1] https://members.pcisig.com/wg/PCI-SIG/document/14143
>     Data Object Exchange (DOE) - Approved 12 March 2020

Please update the "PCI ECN" text above and this citation to PCIe r6.0,
sec 6.30.  No need to reference the ECN now that it's part of the
published spec.

> +config PCI_DOE_DRIVER
> +	tristate "PCI Data Object Exchange (DOE) driver"
> +	select AUXILIARY_BUS
> +	help
> +	  Driver for DOE auxiliary devices.
> +
> +	  DOE provides a simple mailbox in PCI config space that is used by a
> +	  number of different protocols.  DOE is defined in the Data Object
> +	  Exchange ECN to the PCIe r5.0 spec.

Not sure this is relevant in Kconfig help, but if it is, update the
citation to PCIe r6.0, sec 6.30.

> +obj-$(CONFIG_PCI_DOE_DRIVER)	+= pci-doe.o
>  obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
>  
> +pci-doe-y := doe.o

Why do we need this doe.o to pci-doe.o dance?  Why not just rename
doe.c to pci-doe.c?  It looks like that's what we do with pci-stub.c
and pci-pf-stub.c, which are also tristate.

> +++ b/drivers/pci/doe.c
> @@ -0,0 +1,675 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Data Object Exchange ECN
> + * https://members.pcisig.com/wg/PCI-SIG/document/14143

Update citation.  Maybe copyright dates, too.

> + * Copyright (C) 2021 Huawei

> +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */

Update citation.

> +/**
> + * struct pci_doe - A single DOE mailbox driver
> + *
> + * @doe_dev: The DOE Auxiliary device being driven
> + * @abort_c: Completion used for initial abort handling
> + * @irq: Interrupt used for signaling DOE ready or abort
> + * @irq_name: Name used to identify the irq for a particular DOE

s/ irq / IRQ /

> +static int pci_doe_cache_protocols(struct pci_doe *doe)
> +{
> +	u8 index = 0;
> +	int num_prots;
> +	int rc;
> +
> +	/* Discovery protocol must always be supported and must report itself */
> +	num_prots = 1;
> +	doe->prots = devm_kcalloc(&doe->doe_dev->adev.dev, num_prots,
> +				  sizeof(*doe->prots), GFP_KERNEL);
> +	if (doe->prots == NULL)

More idiomatic (and as you did below):

  if (!doe->prots)

> +		return -ENOMEM;
> +
> +	do {
> +		struct pci_doe_protocol *prot;
> +
> +		prot = &doe->prots[num_prots - 1];
> +		rc = pci_doe_discovery(doe, &index, &prot->vid, &prot->type);
> +		if (rc)
> +			return rc;
> +
> +		if (index) {
> +			struct pci_doe_protocol *prot_new;
> +
> +			num_prots++;
> +			prot_new = devm_krealloc(&doe->doe_dev->adev.dev,
> +						 doe->prots,
> +						 sizeof(*doe->prots) *
> +							num_prots,
> +						 GFP_KERNEL);
> +			if (prot_new == NULL)

Ditto.

> +				return -ENOMEM;
> +			doe->prots = prot_new;
> +		}
> +	} while (index);
> +
> +	doe->num_prots = num_prots;
> +	return 0;
> +}

> +static int pci_doe_reg_irq(struct pci_doe *doe)
> +{
> +	struct pci_dev *pdev = doe->doe_dev->pdev;
> +	bool poll = !pci_dev_msi_enabled(pdev);
> +	int offset = doe->doe_dev->cap_offset;
> +	int rc, irq;
> +	u32 val;
> +

  if (poll)
    return 0;

or maybe just:

  if (!pci_dev_msi_enabled(pdev))
    return 0;

No need to read PCI_DOE_CAP or indent all this code.

> +	pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
> +
> +	if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) {
> +		irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val));
> +		if (irq < 0)
> +			return irq;
> +
> +		doe->irq_name = devm_kasprintf(&doe->doe_dev->adev.dev,
> +						GFP_KERNEL,
> +						"DOE[%s]",

Fill line.

> +						doe->doe_dev->adev.name);
> +		if (!doe->irq_name)
> +			return -ENOMEM;
> +
> +		rc = devm_request_irq(&pdev->dev, irq, pci_doe_irq, 0,
> +				      doe->irq_name, doe);
> +		if (rc)
> +			return rc;
> +
> +		doe->irq = irq;
> +		pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
> +				       PCI_DOE_CTRL_INT_EN);
> +	}
> +
> +	return 0;
> +}

> +static int pci_doe_probe(struct auxiliary_device *aux_dev,
> +			 const struct auxiliary_device_id *id)
> +{
> +	struct pci_doe_dev *doe_dev = container_of(aux_dev,
> +					struct pci_doe_dev,
> +					adev);

Fill line.

> +	struct pci_doe *doe;
> +	int rc;
> +
> +	doe = devm_kzalloc(&aux_dev->dev, sizeof(*doe), GFP_KERNEL);
> +	if (!doe)
> +		return -ENOMEM;
> +
> +	mutex_init(&doe->state_lock);
> +	init_completion(&doe->abort_c);
> +	doe->doe_dev = doe_dev;
> +	init_waitqueue_head(&doe->wq);
> +	INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work);
> +	dev_set_drvdata(&aux_dev->dev, doe);
> +
> +	rc = pci_doe_reg_irq(doe);

"request_irq" or "setup_irq" or something?  "reg" is a little
ambiguous.

> +	if (rc)
> +		return rc;
> +
> +	/* Reset the mailbox by issuing an abort */
> +	rc = pci_doe_abort(doe);
> +	if (rc)
> +		return rc;
> +
> +	rc = pci_doe_cache_protocols(doe);
> +	if (rc)
> +		return rc;
> +
> +	return 0;

Same as:

  return pci_doe_cache_protocols(doe);

> +static int __init pci_doe_init_module(void)
> +{
> +	int ret;
> +
> +	ret = auxiliary_driver_register(&pci_doe_auxiliary_drv);
> +	if (ret) {
> +		pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n",
> +		       ret);
> +		return ret;
> +	}
> +
> +	return 0;

Same as:

  if (ret)
    pr_err(...);

  return ret;

> +++ b/include/linux/pci-doe.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec.

Update citation.

> +struct pci_doe_dev {
> +	struct auxiliary_device adev;
> +	struct pci_dev *pdev;
> +	int cap_offset;

Can you name this "doe_cap", in the style of "msi_cap", "msix_cap",
etc?

Bjorn

* Re: [PATCH V6 04/10] PCI/DOE: Introduce pci_doe_create_doe_devices
  2022-02-01  7:19 ` [PATCH V6 04/10] PCI/DOE: Introduce pci_doe_create_doe_devices ira.weiny
@ 2022-02-03 22:44   ` Bjorn Helgaas
  2022-02-04 14:51     ` Jonathan Cameron
  2022-03-24  0:26     ` Ira Weiny
  0 siblings, 2 replies; 49+ messages in thread
From: Bjorn Helgaas @ 2022-02-03 22:44 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Mon, Jan 31, 2022 at 11:19:46PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> CXL and/or PCI devices can define DOE mailboxes.  

In concrete terms, "DOE mailbox" refers to a DOE Capability, right?
PCIe devices are allowed to implement several instances of the DOE
Capability, of course.  I'm kind of partial to concreteness because it
makes it easier to map between the code and the spec.

> Normally the kernel will want to maintain control of all of these
> mailboxes.  However, under a limited number of use cases users may
> want to allow user space access to some of these mailboxes while the
> kernel retains control of the rest.

Is there something in this patch related to user-space vs kernel
control of things?  To me this patch looks like "for every DOE
Capability on a device, create an auxiliary device and try to attach
an auxiliary device driver to it."

If part of creating the auxiliary devices is adding things in sysfs, I
think it would be useful to mention that here.

> An example of this is for CXL Compliance Testing (see CXL 2.0
> 14.16.4 Compliance Mode DOE) which offers a mechanism to set
> different test modes for a device.

Not sure exactly what this contributes here.  I guess you're saying
you might want user-space access to this, but I don't see anything in
this patch related to that.

> Rather than re-invent the wheel the architecture creates auxiliary
> devices for each DOE mailbox which can then be driven by a generic
> DOE mailbox driver.  If access to an individual mailbox is required
> by user space the driver for that mailbox can be unloaded and access
> handed to user space.

IIUC a device can have several DOE Capabilities, and each Capability
can support several protocols.  So I would think the granularity might
be "protocol" rather than "mailbox" (DOE Capability).

But either way this text seems like it would go with a different patch
since this patch has nothing to specify a particular protocol or even
a particular mailbox/DOE Capability.

> Create the helper pci_doe_create_doe_devices() which iterates each DOE
> mailbox found in the device and creates a DOE auxiliary device on the
> auxiliary bus.  While doing so ensure that the auxiliary DOE driver
> loads to drive that device.

Here's a case where "iterating over DOE mailboxes found in the device"
is slightly abstract.  The code obviously iterates over DOE
*Capabilities* (PCI_EXT_CAP_ID_DOE), and that's something I can easily
find in the spec.

Knowing that this is a PCIe Capability is useful because it puts it in
the context of other capabilities ("optional things that live in
config space") and the mechanisms for synchronization and user-space
access.

> +/**
> + * pci_doe_create_doe_devices - Create auxiliary DOE devices for all DOE
> + *                              mailboxes found
> + * @pci_dev: The PCI device to scan for DOE mailboxes
> + *
> + * There is no corresponding destroy of these devices.  This function associates
> + * the DOE auxiliary devices created with the pci_dev passed in.  That
> + * association is device managed (devm_*) such that the DOE auxiliary device
> + * lifetime is always greater than or equal to the lifetime of the pci_dev.

This seems backwards.  What does it mean if the DOE aux dev lifetime
is *greater* than that of the pci_dev?  Surely you can't access a PCI
DOE Capability if the pci_dev is gone?

> + * RETURNS: 0 on success -ERRNO on failure.
> + */
> +int pci_doe_create_doe_devices(struct pci_dev *pdev)
> +{
> +	struct device *dev = &pdev->dev;
> +	int irqs, rc;
> +	u16 pos = 0;
> +
> +	/*
> +	 * An implementation may support an unknown number of interrupts.
> +	 * Assume that number is not that large and request them all.

This doesn't really inspire confidence :)  Playing devil's advocate,
since pdev is an arbitrary device, I would assume the number *is*
large.

> +	irqs = pci_msix_vec_count(pdev);
> +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);

pci_msix_vec_count() is apparently sort of discouraged; see
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/PCI/msi-howto.rst?id=v5.16#n179

A DOE Capability may be implemented by any device, e.g., a NIC or
storage HBA, etc.  I'm a little queasy about IRQ alloc happening both
here and in the driver for the device's primary functionality.  Can
you reassure me that this is actually OK and safe?

Sorry if I've asked this before.  If I have, perhaps a comment would
be useful.

> +	if (rc != irqs) {
> +		/* No interrupt available - carry on */
> +		pci_dbg(pdev, "No interrupts available for DOE\n");
> +	} else {
> +		/*
> +		 * Enabling bus mastering is require for MSI/MSIx.  It could be

s/require/required/
s/MSIx/MSI-X/ to match spec usage.

But I think you only support MSI-X, since you passed "PCI_IRQ_MSIX", not
"PCI_IRQ_MSI | PCI_IRQ_MSIX" above?

> +		 * done later within the DOE initialization, but as it
> +		 * potentially has other impacts keep it here when setting up
> +		 * the IRQ's.

s/IRQ's/IRQs/

"Potentially has other impacts" is too vague, and this doesn't explain
why bus mastering should be enabled here rather than later.  The
device should not issue an MSI-X until DOE Interrupt Enable is set, so
near there seems like a logical place.

> +		 */
> +		pci_set_master(pdev);
> +		rc = devm_add_action_or_reset(dev,
> +					      pci_doe_free_irq_vectors,
> +					      pdev);
> +		if (rc)
> +			return rc;
> +	}

> +++ b/include/linux/pci-doe.h
> @@ -13,6 +13,8 @@
>  #ifndef LINUX_PCI_DOE_H
>  #define LINUX_PCI_DOE_H
>  
> +#define DOE_DEV_NAME "doe"

This is only used once, above.  Why not just use the string there
directly and skip the #define?  If it's needed elsewhere eventually,
we can add a #define then.

Bjorn


* Re: [PATCH V6 08/10] cxl/cdat: Introduce cdat_hdr_valid()
  2022-02-01 22:29     ` Ira Weiny
@ 2022-02-04 13:17       ` Jonathan Cameron
  0 siblings, 0 replies; 49+ messages in thread
From: Jonathan Cameron @ 2022-02-04 13:17 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Ben Widawsky, Dan Williams, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, linux-kernel, linux-cxl, linux-pci

On Tue, 1 Feb 2022 14:29:03 -0800
Ira Weiny <ira.weiny@intel.com> wrote:

> On Tue, Feb 01, 2022 at 10:56:40AM -0800, Widawsky, Ben wrote:
> > On 22-01-31 23:19:50, ira.weiny@intel.com wrote:  
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > The CDAT data is protected by a checksum which should be checked when
> > > the CDAT is read to ensure it is valid.  In addition the lengths
> > > specified should be checked.
> > > 
> > > Introduce cdat_hdr_valid() to check the checksum.  While at it check and
> > > store the sequence number.
> > > 
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > ---
> > > Changes from V5
> > > 	New patch, split out
> > > 	Update cdat_hdr_valid()
> > > 		Remove revision and cs field parsing
> > > 			There is no point in these
> > > 		Add seq check and debug print.
> > > ---
> > >  drivers/cxl/cdat.h |  2 ++
> > >  drivers/cxl/pci.c  | 32 ++++++++++++++++++++++++++++++++
> > >  2 files changed, 34 insertions(+)
> > > 
> > > diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
> > > index 4722b6bbbaf0..a7725d26f2d2 100644
> > > --- a/drivers/cxl/cdat.h
> > > +++ b/drivers/cxl/cdat.h
> > > @@ -88,10 +88,12 @@
> > >   *
> > >   * @table: cache of CDAT table
> > >   * @length: length of cached CDAT table
> > > + * @seq: Last read Sequence number of the CDAT table
> > >   */
> > >  struct cxl_cdat {
> > >  	void *table;
> > >  	size_t length;
> > > +	u32 seq;
> > >  };
> > >  
> > >  #endif /* !__CXL_CDAT_H__ */
> > > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > > index 28b973a9e29e..c362c75feed2 100644
> > > --- a/drivers/cxl/pci.c
> > > +++ b/drivers/cxl/pci.c
> > > @@ -586,6 +586,35 @@ static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
> > >  	return 0;
> > >  }
> > >  
> > > +static bool cxl_cdat_hdr_valid(struct device *dev, struct cxl_cdat *cdat)
> > > +{
> > > +	u32 *table = cdat->table;
> > > +	u8 *data8 = cdat->table;
> > > +	u32 length, seq;
> > > +	u8 check;
> > > +	int i;
> > > +
> > > +	length = FIELD_GET(CDAT_HEADER_DW0_LENGTH, table[0]);
> > > +	if (length < CDAT_HEADER_LENGTH_BYTES)
> > > +		return false;
> > > +
> > > +	if (length > cdat->length)
> > > +		return false;
> > > +
> > > +	seq = FIELD_GET(CDAT_HEADER_DW3_SEQUENCE, table[3]);
> > > +
> > > +	/* Store the sequence for now. */
> > > +	if (cdat->seq != seq) {
> > > +		dev_info(dev, "CDAT seq change %x -> %x\n", cdat->seq, seq);
> > > +		cdat->seq = seq;
> > > +	}  
> > 
> > If sequence hasn't changed you could short-circuit the checksum.  
> 
> I'm not sure.  Jonathan mentioned that reading may race with updates and that
> the correct thing to do is re-read.[1]

As things stand I 'think' the sequence number is stored even if the checksum
fails, so a matching sequence number doesn't prove the previous read was valid.

We only call this once at the moment, so that doesn't matter yet.

If each call re-reads after a race until it gets a good checksum / sequence
number, and the sequence number is not stored when validation fails, then we
could indeed use the sequence check alone to skip the checksum validation.
Mind you this isn't a hot path... Do we really care?
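
To make the ordering concrete, here is a minimal userspace sketch (not the
kernel code, and not a proposal for the exact implementation).  It checks the
lengths and the checksum first and only stores the sequence number once the
table has validated.  The names struct cdat_cache, get_le32() and
cdat_table_valid() are made up for illustration; the 16 byte header layout
(length, revision/checksum, reserved, sequence) follows the defines in cdat.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CDAT_HDR_BYTES 16u /* four dwords: length, rev/checksum, reserved, sequence */

/* Hypothetical stand-in for struct cxl_cdat, for illustration only. */
struct cdat_cache {
	const uint8_t *table; /* raw bytes as read from the device */
	size_t length;        /* number of bytes actually read */
	uint32_t seq;         /* last sequence number that passed validation */
};

/* Read a little-endian u32 without assuming host endianness. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Returns true if the table is valid; updates cache->seq only on success. */
static bool cdat_table_valid(struct cdat_cache *cache)
{
	uint32_t length, i;
	uint8_t sum = 0;

	assert(cache && cache->table);

	/* The header itself must have been read completely. */
	if (cache->length < CDAT_HDR_BYTES)
		return false;

	/* The advertised length must cover the header and fit the buffer. */
	length = get_le32(cache->table);
	if (length < CDAT_HDR_BYTES || length > cache->length)
		return false;

	/* All bytes of a valid table sum to zero modulo 256. */
	for (i = 0; i < length; i++)
		sum += cache->table[i];
	if (sum != 0)
		return false;

	/* Only now is the sequence number trustworthy enough to cache. */
	cache->seq = get_le32(cache->table + 12);
	return true;
}
```

With this ordering a later "sequence unchanged" shortcut would only ever
compare against a sequence number taken from a table that validated.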


> 
> But I should probably check the CS first...
> 
> Ira
> 
> [1] https://lore.kernel.org/linux-cxl/20211108145239.000010a5@Huawei.com/
> 
> >   
> > > +
> > > +	for (check = 0, i = 0; i < length; i++)
> > > +		check += data8[i];
> > > +
> > > +	return check == 0;
> > > +}
> > > +
> > >  #define CDAT_DOE_REQ(entry_handle)					\
> > >  	(FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE,			\
> > >  		    CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) |		\
> > > @@ -658,6 +687,9 @@ static int cxl_cdat_read_table(struct cxl_dev_state *cxlds,
> > >  
> > >  	} while (entry_handle != 0xFFFF);
> > >  
> > > +	if (!cxl_cdat_hdr_valid(cxlds->dev, cdat))
> > > +		return -EIO;
> > > +
> > >  	return 0;
> > >  }
> > >  
> > > -- 
> > > 2.31.1
> > >   



* Re: [PATCH V6 09/10] cxl/mem: Retry reading CDAT on failure
  2022-02-01 22:31     ` Ira Weiny
@ 2022-02-04 13:20       ` Jonathan Cameron
  0 siblings, 0 replies; 49+ messages in thread
From: Jonathan Cameron @ 2022-02-04 13:20 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Ben Widawsky, Dan Williams, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, linux-kernel, linux-cxl, linux-pci

On Tue, 1 Feb 2022 14:31:59 -0800
Ira Weiny <ira.weiny@intel.com> wrote:

> On Tue, Feb 01, 2022 at 10:59:28AM -0800, Widawsky, Ben wrote:
> > On 22-01-31 23:19:51, ira.weiny@intel.com wrote:  
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > The CDAT read may fail for a number of reasons but mainly it is possible
> > > to get different parts of a valid state.  The checksum in the CDAT table
> > > protects against this.
> > > 
> > > Now that the checksum is validated issue a retry if the CDAT read fails.
> > > For now 2 retries are implemented.
> > > 
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > ---
> > > NOTE: Is 2 enough?  Should this just be delayed until the time when the
> > > data is actually needed and not there?  
> > 
> > I can't speak to retries at all, but one small issue below. It might make sense
> > if we keep this to make it a modparam.  
> 
> Not a bad idea.

Ah. Here is the retry - I should have read the rest of the thread :)

This whole cycle isn't in a hot path and is fairly cheap. I'd just
go with c. 5 and assume that is enough for anyone.  If we need a module
parameter later because this race turns out to be something that
actually happens then it is easy enough to add then.
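
For reference, a small userspace sketch of a bounded retry loop that counts
attempts explicitly, which avoids the off-by-one Ben points out below
(while (--retries) with retries = 2 runs the body only once).  Everything here
is hypothetical: cdat_read_fn, read_with_retries() and flaky_read() are
illustrative names, and CDAT_READ_ATTEMPTS = 5 is just the "c. 5" figure
above, not a tested value.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical read callback: returns 0 on success or a negative errno. */
typedef int (*cdat_read_fn)(void *ctx);

#define CDAT_READ_ATTEMPTS 5 /* assumed value, see lead-in */

/* Call read_once() up to 'attempts' times, stopping at the first success. */
static int read_with_retries(cdat_read_fn read_once, void *ctx, int attempts)
{
	int rc = -EIO; /* well-defined result even if attempts <= 0 */
	int i;

	for (i = 0; i < attempts; i++) {
		rc = read_once(ctx);
		if (!rc)
			break;
	}
	return rc;
}

/* Example callback: fails until *remaining_failures reaches zero. */
static int flaky_read(void *ctx)
{
	int *remaining_failures = ctx;

	if (*remaining_failures > 0) {
		(*remaining_failures)--;
		return -EIO;
	}
	return 0;
}
```

Initializing rc also means the function never returns an uninitialized value
if the loop body does not run.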




> 
> >   
> > > 
> > > Changes from V5:
> > > 	New patch -- easy to push off or drop.
> > > ---
> > >  drivers/cxl/core/memdev.c | 17 ++++++++++++++++-
> > >  1 file changed, 16 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> > > index a01068e98333..11d721c56f08 100644
> > > --- a/drivers/cxl/core/memdev.c
> > > +++ b/drivers/cxl/core/memdev.c
> > > @@ -356,7 +356,8 @@ static const struct file_operations cxl_memdev_fops = {
> > >  	.llseek = noop_llseek,
> > >  };
> > >  
> > > -static int read_cdat_data(struct cxl_memdev *cxlmd, struct cxl_dev_state *cxlds)
> > > +static int __read_cdat_data(struct cxl_memdev *cxlmd,
> > > +			    struct cxl_dev_state *cxlds)
> > >  {
> > >  	struct device *dev = &cxlmd->dev;
> > >  	size_t cdat_length;
> > > @@ -371,6 +372,20 @@ static int read_cdat_data(struct cxl_memdev *cxlmd, struct cxl_dev_state *cxlds)
> > >  	return cxl_mem_cdat_read_table(cxlds, &cxlmd->cdat);
> > >  }
> > >  
> > > +static int read_cdat_data(struct cxl_memdev *cxlmd,
> > > +			  struct cxl_dev_state *cxlds)
> > > +{
> > > +	int retries = 2;
> > > +	int rc;
> > > +
> > > +	while (--retries) {  
> > 
> > You either want retries--, or retries = 3...  
> 
> Opps yea.
> 
> Thanks,
> Ira
> 
> >   
> > > +		rc = __read_cdat_data(cxlmd, cxlds);
> > > +		if (!rc)
> > > +			break;
> > > +	}
> > > +	return rc;
> > > +}
> > > +
> > >  struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
> > >  {
> > >  	struct cxl_memdev *cxlmd;
> > > -- 
> > > 2.31.1
> > >   



* Re: [PATCH V6 10/10] cxl/cdat: Parse out DSMAS data from CDAT table
  2022-02-01 22:37     ` Ira Weiny
@ 2022-02-04 13:33       ` Jonathan Cameron
  2022-02-04 13:41       ` Jonathan Cameron
  1 sibling, 0 replies; 49+ messages in thread
From: Jonathan Cameron @ 2022-02-04 13:33 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Ben Widawsky, Dan Williams, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, linux-kernel, linux-cxl, linux-pci

On Tue, 1 Feb 2022 14:37:17 -0800
Ira Weiny <ira.weiny@intel.com> wrote:

> On Tue, Feb 01, 2022 at 11:05:32AM -0800, Widawsky, Ben wrote:
> > On 22-01-31 23:19:52, ira.weiny@intel.com wrote:  
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > CXL memory devices need the information in the Device Scoped Memory
> > > Affinity Structure (DSMAS).  This information is contained within the
> > > CDAT table buffer which is already read and cached.
> > > 
> > > Parse and cache DSMAS data from the CDAT table.  Store this data in
> > > unmarshaled struct dsmas data structures for ease of use.
> > > 
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > ---
> > > Changes from V5
> > > 	Fix up sparse warnings
> > > 	Split out cdat_hdr_valid()
> > > 	Update cdat_hdr_valid()
> > > 		Remove revision and cs field parsing
> > > 			There is no point in these
> > > 		Add seq check and debug print.
> > > 	From Jonathan
> > > 		Add spaces around '+' and '/'
> > > 		use devm_krealloc() for dmas_ary
> > > 
> > > Changes from V4
> > > 	New patch
> > > ---
> > >  drivers/cxl/cdat.h        | 21 ++++++++++++
> > >  drivers/cxl/core/memdev.c | 70 +++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 91 insertions(+)
> > > 
> > > diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
> > > index a7725d26f2d2..f8c126190d18 100644
> > > --- a/drivers/cxl/cdat.h
> > > +++ b/drivers/cxl/cdat.h
> > > @@ -83,17 +83,38 @@
> > >  #define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
> > >  #define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
> > >  
> > > +/**
> > > + * struct cxl_dsmas - host unmarshaled version of DSMAS data
> > > + *
> > > + * As defined in the Coherent Device Attribute Table (CDAT) specification this
> > > + * represents a single DSMAS entry in that table.
> > > + *
> > > + * @dpa_base: The lowest DPA address associated with this DSMAD
> > > + * @dpa_length: Length in bytes of this DSMAD
> > > + * @non_volatile: If set, the memory region represents Non-Volatile memory
> > > + */
> > > +struct cxl_dsmas {
> > > +	u64 dpa_base;
> > > +	u64 dpa_length;
> > > +	/* Flags */
> > > +	u8 non_volatile:1;
> > > +};
> > > +
> > >  /**
> > >   * struct cxl_cdat - CXL CDAT data
> > >   *
> > >   * @table: cache of CDAT table
> > >   * @length: length of cached CDAT table
> > >   * @seq: Last read Sequence number of the CDAT table
> > > + * @dsmas_ary: Array of DSMAS entries as parsed from the CDAT table
> > > + * @nr_dsmas: Number of entries in dsmas_ary
> > >   */
> > >  struct cxl_cdat {
> > >  	void *table;
> > >  	size_t length;
> > >  	u32 seq;
> > > +	struct cxl_dsmas *dsmas_ary;
> > > +	int nr_dsmas;
> > >  };
> > >  
> > >  #endif /* !__CXL_CDAT_H__ */
> > > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> > > index 11d721c56f08..32342a15e991 100644
> > > --- a/drivers/cxl/core/memdev.c
> > > +++ b/drivers/cxl/core/memdev.c
> > > @@ -6,6 +6,7 @@
> > >  #include <linux/idr.h>
> > >  #include <linux/pci.h>
> > >  #include <cxlmem.h>
> > > +#include "cdat.h"
> > >  #include "core.h"
> > >  
> > >  static DECLARE_RWSEM(cxl_memdev_rwsem);
> > > @@ -386,6 +387,71 @@ static int read_cdat_data(struct cxl_memdev *cxlmd,
> > >  	return rc;
> > >  }
> > >  
> > > +static int parse_dsmas(struct cxl_memdev *cxlmd)
> > > +{
> > > +	struct cxl_dsmas *dsmas_ary = NULL;
> > > +	u32 *data = cxlmd->cdat.table;
> > > +	int bytes_left = cxlmd->cdat.length;
> > > +	int nr_dsmas = 0;
> > > +
> > > +	if (!data)
> > > +		return -ENXIO;
> > > +
> > > +	/* Skip header */
> > > +	data += CDAT_HEADER_LENGTH_DW;
> > > +	bytes_left -= CDAT_HEADER_LENGTH_BYTES;
> > > +
> > > +	while (bytes_left > 0) {
> > > +		u32 *cur_rec = data;
> > > +		u8 type = FIELD_GET(CDAT_STRUCTURE_DW0_TYPE, cur_rec[0]);
> > > +		u16 length = FIELD_GET(CDAT_STRUCTURE_DW0_LENGTH, cur_rec[0]);
> > > +
> > > +		if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {
> > > +			struct cxl_dsmas *new_ary;
> > > +			u8 flags;
> > > +
> > > +			new_ary = devm_krealloc(&cxlmd->dev, dsmas_ary,
> > > +					   sizeof(*dsmas_ary) * (nr_dsmas + 1),
> > > +					   GFP_KERNEL);
> > > +			if (!new_ary) {
> > > +				dev_err(&cxlmd->dev,
> > > +					"Failed to allocate memory for DSMAS data\n");
> > > +				return -ENOMEM;
> > > +			}  
> > 
> > One thought here - it looks like there are at most 256 DSMAS entries. You could
> > allocate the full 256 up front, and then realloc *down* to the actual number.
> >   
> > > +			dsmas_ary = new_ary;
> > > +
> > > +			flags = FIELD_GET(CDAT_DSMAS_DW1_FLAGS, cur_rec[1]);
> > > +
> > > +			dsmas_ary[nr_dsmas].dpa_base = CDAT_DSMAS_DPA_OFFSET(cur_rec);
> > > +			dsmas_ary[nr_dsmas].dpa_length = CDAT_DSMAS_DPA_LEN(cur_rec);
> > > +			dsmas_ary[nr_dsmas].non_volatile = CDAT_DSMAS_NON_VOLATILE(flags);
> > > +
> > > +			dev_dbg(&cxlmd->dev, "DSMAS %d: %llx:%llx %s\n",
> > > +				nr_dsmas,
> > > +				dsmas_ary[nr_dsmas].dpa_base,
> > > +				dsmas_ary[nr_dsmas].dpa_base +
> > > +					dsmas_ary[nr_dsmas].dpa_length,
> > > +				(dsmas_ary[nr_dsmas].non_volatile ?
> > > +					"Persistent" : "Volatile")
> > > +				);
> > > +
> > > +			nr_dsmas++;
> > > +		}
> > > +
> > > +		data += (length / sizeof(u32));
> > > +		bytes_left -= length;
> > > +	}
> > > +
> > > +	if (nr_dsmas == 0)
> > > +		return -ENXIO;  
> > 
> > Hmm is there documentation that suggests a DSMAS must be implemented? Could this
> > just return 0? I'd put maybe dev_dbg here if it's unexpected but not a failure
> > and return success.  
> 
> For this call I was not envisioning this as an error.  I wanted to leave it up
> to the caller.
> 
> I think it would make more sense to return the number of DSMAS' found or
> negative errno on failure...
> 
> I'll clean it up.  Including below...
> 
> >   
> > > +
> > > +	dev_dbg(&cxlmd->dev, "Found %d DSMAS entries\n", nr_dsmas);
> > > +	cxlmd->cdat.dsmas_ary = dsmas_ary;
> > > +	cxlmd->cdat.nr_dsmas = nr_dsmas;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
> > >  {
> > >  	struct cxl_memdev *cxlmd;
> > > @@ -407,6 +473,10 @@ struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
> > >  	if (rc)
> > >  		goto err;
> > >  
> > > +	rc = parse_dsmas(cxlmd);
> > > +	if (rc)
> > > +		dev_warn(dev, "No DSMAS data found: %d\n", rc);
> > > +  
> 
> This was changed to dev_warn() because I think here we do expect dsmas data?
> Don't we?

There are flags in the CXL Range registers that specify that some stuff
is communicated via CDAT.  That includes one for whether it is nonvolatile
or not which is a DSMAS flag.  So if that is set we definitely expect them.

We would also expect them if QTG _DSM is in use as that has specific references
to DSMAS regions.

More generally a switch implementing CDAT wouldn't have DSMAS but
we are fairly safe that if a memory device has CDAT at all, DSMAS is expected
because most of the other structures reference the DSMAS handle.

The message is in general wrong though, as it could be a memory allocation
failure in parse_dsmas(), so you need to check the actual error code.
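
As a rough userspace sketch of what checking the error code could look like,
assuming the convention Ira suggests above (return the number of DSMAS
entries found, or a negative errno).  The enum, the function name and the
message strings are all hypothetical, and treating -ENOMEM as an error rather
than a warning is an assumption, not something decided in this thread.

```c
#include <assert.h>
#include <errno.h>

enum log_level { LOG_NONE, LOG_WARN, LOG_ERR };

/*
 * rc: number of DSMAS entries found (>= 0) or a negative errno.
 * Picks a log level and message so "no entries" and "parse failed"
 * are not reported with the same text.
 */
static enum log_level classify_dsmas_result(int rc, const char **msg)
{
	if (rc > 0) {
		*msg = "";
		return LOG_NONE;
	}
	if (rc == 0) {
		/* Table parsed fine but held no DSMAS structures. */
		*msg = "No DSMAS data found";
		return LOG_WARN;
	}
	if (rc == -ENOMEM) {
		*msg = "Out of memory parsing DSMAS";
		return LOG_ERR;
	}
	*msg = "Failed to parse DSMAS";
	return LOG_ERR;
}
```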

J

> 
> Thanks,
> Ira
> 
> > >  	/*
> > >  	 * Activate ioctl operations, no cxl_memdev_rwsem manipulation
> > >  	 * needed as this is ordered with cdev_add() publishing the device.
> > > -- 
> > > 2.31.1
> > >   



* Re: [PATCH V6 10/10] cxl/cdat: Parse out DSMAS data from CDAT table
  2022-02-01  7:19 ` [PATCH V6 10/10] cxl/cdat: Parse out DSMAS data from CDAT table ira.weiny
  2022-02-01 19:05   ` Ben Widawsky
@ 2022-02-04 13:40   ` Jonathan Cameron
  1 sibling, 0 replies; 49+ messages in thread
From: Jonathan Cameron @ 2022-02-04 13:40 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Mon, 31 Jan 2022 23:19:52 -0800
ira.weiny@intel.com wrote:

> From: Ira Weiny <ira.weiny@intel.com>
> 
> CXL memory devices need the information in the Device Scoped Memory
> Affinity Structure (DSMAS).  This information is contained within the
> CDAT table buffer which is already read and cached.
> 
> Parse and cache DSMAS data from the CDAT table.  Store this data in
> unmarshaled struct dsmas data structures for ease of use.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

A few suggestions inline but this basically looks good to me.
I'll hold off on tags until we resolve the warn or not question
Ben raised.

> 
> ---
> Changes from V5
> 	Fix up sparse warnings
> 	Split out cdat_hdr_valid()
> 	Update cdat_hdr_valid()
> 		Remove revision and cs field parsing
> 			There is no point in these
> 		Add seq check and debug print.
> 	From Jonathan
> 		Add spaces around '+' and '/'
> 		use devm_krealloc() for dmas_ary
> 
> Changes from V4
> 	New patch
> ---
>  drivers/cxl/cdat.h        | 21 ++++++++++++
>  drivers/cxl/core/memdev.c | 70 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 91 insertions(+)
> 
> diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
> index a7725d26f2d2..f8c126190d18 100644
> --- a/drivers/cxl/cdat.h
> +++ b/drivers/cxl/cdat.h
> @@ -83,17 +83,38 @@
>  #define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
>  #define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
>  
> +/**
> + * struct cxl_dsmas - host unmarshaled version of DSMAS data
> + *
> + * As defined in the Coherent Device Attribute Table (CDAT) specification this
> + * represents a single DSMAS entry in that table.
> + *
> + * @dpa_base: The lowest DPA address associated with this DSMAD
> + * @dpa_length: Length in bytes of this DSMAD
> + * @non_volatile: If set, the memory region represents Non-Volatile memory
> + */
> +struct cxl_dsmas {
> +	u64 dpa_base;
> +	u64 dpa_length;
> +	/* Flags */
> +	u8 non_volatile:1;
> +};
> +
>  /**
>   * struct cxl_cdat - CXL CDAT data
>   *
>   * @table: cache of CDAT table
>   * @length: length of cached CDAT table
>   * @seq: Last read Sequence number of the CDAT table
> + * @dsmas_ary: Array of DSMAS entries as parsed from the CDAT table
> + * @nr_dsmas: Number of entries in dsmas_ary
>   */
>  struct cxl_cdat {
>  	void *table;
>  	size_t length;
>  	u32 seq;
> +	struct cxl_dsmas *dsmas_ary;
> +	int nr_dsmas;
>  };
>  
>  #endif /* !__CXL_CDAT_H__ */
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 11d721c56f08..32342a15e991 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -6,6 +6,7 @@
>  #include <linux/idr.h>
>  #include <linux/pci.h>
>  #include <cxlmem.h>
> +#include "cdat.h"
>  #include "core.h"
>  
>  static DECLARE_RWSEM(cxl_memdev_rwsem);
> @@ -386,6 +387,71 @@ static int read_cdat_data(struct cxl_memdev *cxlmd,
>  	return rc;
>  }
>  
> +static int parse_dsmas(struct cxl_memdev *cxlmd)

Looking forwards, it's more than possible this code might be of
use for type2 devices.  As such, maybe it should avoid
taking the cxlmd as a parameter?  It really only needs a dev for the
allocations, and the cdat structure.

Could fix that up once someone else wants it of course.

> +{
> +	struct cxl_dsmas *dsmas_ary = NULL;
> +	u32 *data = cxlmd->cdat.table;
> +	int bytes_left = cxlmd->cdat.length;
> +	int nr_dsmas = 0;
> +
> +	if (!data)
> +		return -ENXIO;
> +
> +	/* Skip header */
> +	data += CDAT_HEADER_LENGTH_DW;
> +	bytes_left -= CDAT_HEADER_LENGTH_BYTES;
> +
> +	while (bytes_left > 0) {
> +		u32 *cur_rec = data;
> +		u8 type = FIELD_GET(CDAT_STRUCTURE_DW0_TYPE, cur_rec[0]);
> +		u16 length = FIELD_GET(CDAT_STRUCTURE_DW0_LENGTH, cur_rec[0]);
> +
> +		if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {

Again, maybe something to do later, but a
for_each_cdat_struct() loop (possibly with a type specified) would
give us something we are sure to want later when doing switches etc.
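
Something along these lines, as a userspace sketch only.  All of the names
(struct cdat_iter, cdat_iter_first(), cdat_iter_ok(), cdat_iter_next(),
count_cdat_type()) are hypothetical, and the byte offsets assume the common
first dword layout from cdat.h (type in byte 0, length in bytes 2-3).  It also
stops on a zero or oversized length field, which the open-coded loop below
would not (a zero length would leave bytes_left unchanged).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CDAT_HDR_BYTES    16u
#define CDAT_SUBHDR_BYTES 4u  /* type (1), reserved (1), length (2) */
#define CDAT_TYPE_DSMAS   0u

/* Hypothetical cursor over the structures that follow the CDAT header. */
struct cdat_iter {
	const uint8_t *pos; /* start of the current structure */
	size_t left;        /* bytes remaining from pos to the end of the table */
};

static uint16_t cdat_rec_len(const uint8_t *rec)
{
	return (uint16_t)(rec[2] | rec[3] << 8);
}

/* True if a complete structure starts at it->pos. */
static int cdat_iter_ok(const struct cdat_iter *it)
{
	uint16_t len;

	if (it->left < CDAT_SUBHDR_BYTES)
		return 0;
	len = cdat_rec_len(it->pos);
	return len >= CDAT_SUBHDR_BYTES && len <= it->left;
}

static struct cdat_iter cdat_iter_first(const uint8_t *table, size_t length)
{
	struct cdat_iter it = { table, 0 };

	if (table && length >= CDAT_HDR_BYTES) {
		it.pos = table + CDAT_HDR_BYTES;
		it.left = length - CDAT_HDR_BYTES;
	}
	return it;
}

/* Only called after cdat_iter_ok() returned true, so len <= it->left. */
static void cdat_iter_next(struct cdat_iter *it)
{
	uint16_t len = cdat_rec_len(it->pos);

	it->pos += len;
	it->left -= len;
}

#define for_each_cdat_struct(it, table, length)			\
	for ((it) = cdat_iter_first((table), (length));		\
	     cdat_iter_ok(&(it));				\
	     cdat_iter_next(&(it)))

/* Example user: count the structures of one type. */
static int count_cdat_type(const uint8_t *table, size_t length, uint8_t type)
{
	struct cdat_iter it;
	int n = 0;

	for_each_cdat_struct(it, table, length)
		if (it.pos[0] == type)
			n++;
	return n;
}
```

A type-filtered variant would just fold the it.pos[0] check into the macro.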

> +			struct cxl_dsmas *new_ary;
> +			u8 flags;
> +
> +			new_ary = devm_krealloc(&cxlmd->dev, dsmas_ary,
> +					   sizeof(*dsmas_ary) * (nr_dsmas + 1),
> +					   GFP_KERNEL);
> +			if (!new_ary) {
> +				dev_err(&cxlmd->dev,
> +					"Failed to allocate memory for DSMAS data\n");
> +				return -ENOMEM;
> +			}
> +			dsmas_ary = new_ary;
> +
> +			flags = FIELD_GET(CDAT_DSMAS_DW1_FLAGS, cur_rec[1]);
> +
> +			dsmas_ary[nr_dsmas].dpa_base = CDAT_DSMAS_DPA_OFFSET(cur_rec);
> +			dsmas_ary[nr_dsmas].dpa_length = CDAT_DSMAS_DPA_LEN(cur_rec);
> +			dsmas_ary[nr_dsmas].non_volatile = CDAT_DSMAS_NON_VOLATILE(flags);
> +
> +			dev_dbg(&cxlmd->dev, "DSMAS %d: %llx:%llx %s\n",
> +				nr_dsmas,
> +				dsmas_ary[nr_dsmas].dpa_base,
> +				dsmas_ary[nr_dsmas].dpa_base +
> +					dsmas_ary[nr_dsmas].dpa_length,
> +				(dsmas_ary[nr_dsmas].non_volatile ?
> +					"Persistent" : "Volatile")
> +				);
> +
> +			nr_dsmas++;
> +		}
> +
> +		data += (length / sizeof(u32));
> +		bytes_left -= length;
> +	}
> +
> +	if (nr_dsmas == 0)
> +		return -ENXIO;
> +
> +	dev_dbg(&cxlmd->dev, "Found %d DSMAS entries\n", nr_dsmas);
> +	cxlmd->cdat.dsmas_ary = dsmas_ary;
> +	cxlmd->cdat.nr_dsmas = nr_dsmas;
> +
> +	return 0;
> +}
> +


* Re: [PATCH V6 10/10] cxl/cdat: Parse out DSMAS data from CDAT table
  2022-02-01 22:37     ` Ira Weiny
  2022-02-04 13:33       ` Jonathan Cameron
@ 2022-02-04 13:41       ` Jonathan Cameron
  1 sibling, 0 replies; 49+ messages in thread
From: Jonathan Cameron @ 2022-02-04 13:41 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Ben Widawsky, Dan Williams, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, linux-kernel, linux-cxl, linux-pci

On Tue, 1 Feb 2022 14:37:17 -0800
Ira Weiny <ira.weiny@intel.com> wrote:

> On Tue, Feb 01, 2022 at 11:05:32AM -0800, Widawsky, Ben wrote:
> > On 22-01-31 23:19:52, ira.weiny@intel.com wrote:  
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > CXL memory devices need the information in the Device Scoped Memory
> > > Affinity Structure (DSMAS).  This information is contained within the
> > > CDAT table buffer which is already read and cached.
> > > 
> > > Parse and cache DSMAS data from the CDAT table.  Store this data in
> > > unmarshaled struct dsmas data structures for ease of use.
> > > 
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > ---
> > > Changes from V5
> > > 	Fix up sparse warnings
> > > 	Split out cdat_hdr_valid()
> > > 	Update cdat_hdr_valid()
> > > 		Remove revision and cs field parsing
> > > 			There is no point in these
> > > 		Add seq check and debug print.
> > > 	From Jonathan
> > > 		Add spaces around '+' and '/'
> > > 		use devm_krealloc() for dmas_ary
> > > 
> > > Changes from V4
> > > 	New patch
> > > ---
> > >  drivers/cxl/cdat.h        | 21 ++++++++++++
> > >  drivers/cxl/core/memdev.c | 70 +++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 91 insertions(+)
> > > 
> > > diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
> > > index a7725d26f2d2..f8c126190d18 100644
> > > --- a/drivers/cxl/cdat.h
> > > +++ b/drivers/cxl/cdat.h
> > > @@ -83,17 +83,38 @@
> > >  #define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
> > >  #define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
> > >  
> > > +/**
> > > + * struct cxl_dsmas - host unmarshaled version of DSMAS data
> > > + *
> > > + * As defined in the Coherent Device Attribute Table (CDAT) specification this
> > > + * represents a single DSMAS entry in that table.
> > > + *
> > > + * @dpa_base: The lowest DPA address associated with this DSMAD
> > > + * @dpa_length: Length in bytes of this DSMAD
> > > + * @non_volatile: If set, the memory region represents Non-Volatile memory
> > > + */
> > > +struct cxl_dsmas {
> > > +	u64 dpa_base;
> > > +	u64 dpa_length;
> > > +	/* Flags */
> > > +	u8 non_volatile:1;
> > > +};
> > > +
> > >  /**
> > >   * struct cxl_cdat - CXL CDAT data
> > >   *
> > >   * @table: cache of CDAT table
> > >   * @length: length of cached CDAT table
> > >   * @seq: Last read Sequence number of the CDAT table
> > > + * @dsmas_ary: Array of DSMAS entries as parsed from the CDAT table
> > > + * @nr_dsmas: Number of entries in dsmas_ary
> > >   */
> > >  struct cxl_cdat {
> > >  	void *table;
> > >  	size_t length;
> > >  	u32 seq;
> > > +	struct cxl_dsmas *dsmas_ary;
> > > +	int nr_dsmas;
> > >  };
> > >  
> > >  #endif /* !__CXL_CDAT_H__ */
> > > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> > > index 11d721c56f08..32342a15e991 100644
> > > --- a/drivers/cxl/core/memdev.c
> > > +++ b/drivers/cxl/core/memdev.c
> > > @@ -6,6 +6,7 @@
> > >  #include <linux/idr.h>
> > >  #include <linux/pci.h>
> > >  #include <cxlmem.h>
> > > +#include "cdat.h"
> > >  #include "core.h"
> > >  
> > >  static DECLARE_RWSEM(cxl_memdev_rwsem);
> > > @@ -386,6 +387,71 @@ static int read_cdat_data(struct cxl_memdev *cxlmd,
> > >  	return rc;
> > >  }
> > >  
> > > +static int parse_dsmas(struct cxl_memdev *cxlmd)
> > > +{
> > > +	struct cxl_dsmas *dsmas_ary = NULL;
> > > +	u32 *data = cxlmd->cdat.table;
> > > +	int bytes_left = cxlmd->cdat.length;
> > > +	int nr_dsmas = 0;
> > > +
> > > +	if (!data)
> > > +		return -ENXIO;
> > > +
> > > +	/* Skip header */
> > > +	data += CDAT_HEADER_LENGTH_DW;
> > > +	bytes_left -= CDAT_HEADER_LENGTH_BYTES;
> > > +
> > > +	while (bytes_left > 0) {
> > > +		u32 *cur_rec = data;
> > > +		u8 type = FIELD_GET(CDAT_STRUCTURE_DW0_TYPE, cur_rec[0]);
> > > +		u16 length = FIELD_GET(CDAT_STRUCTURE_DW0_LENGTH, cur_rec[0]);
> > > +
> > > +		if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {
> > > +			struct cxl_dsmas *new_ary;
> > > +			u8 flags;
> > > +
> > > +			new_ary = devm_krealloc(&cxlmd->dev, dsmas_ary,
> > > +					   sizeof(*dsmas_ary) * (nr_dsmas + 1),
> > > +					   GFP_KERNEL);
> > > +			if (!new_ary) {
> > > +				dev_err(&cxlmd->dev,
> > > +					"Failed to allocate memory for DSMAS data\n");
> > > +				return -ENOMEM;
> > > +			}  
> > 
> > One thought here - it looks like there are at most 256 DSMAS entries. You could
> > allocate the full 256 up front, and then realloc *down* to the actual number.

Gut feeling is there will be 1 or 2 on a typical device, so not sure it is worth
a large allocation and shrink.
Plus this is not a hot path and the current code is easy to follow.


* Re: [PATCH V6 07/10] cxl/mem: Read CDAT table
  2022-02-01  7:19 ` [PATCH V6 07/10] cxl/mem: Read CDAT table ira.weiny
@ 2022-02-04 13:46   ` Jonathan Cameron
  0 siblings, 0 replies; 49+ messages in thread
From: Jonathan Cameron @ 2022-02-04 13:46 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Mon, 31 Jan 2022 23:19:49 -0800
ira.weiny@intel.com wrote:

> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> The OS will need CDAT data from the CXL devices to properly set up
> interleave sets.
> 
> Search the DOE driver/devices attached to the CXL device for one which
> supports the CDAT protocol.  If found, read the CDAT data from that
> mailbox.
> 
> Currently this is only supported by a PCI CXL object through a DOE
> mailbox which supports CDAT.  But any cxl_mem type object can provide
> this data later if need be.  For example for testing.
> 
> Cache this data for later parsing.  Provide a sysfs binary attribute to
> allow dumping of the CDAT.
> 
> Binary dumping is modeled on /sys/firmware/ACPI/tables/
> 
> The ability to dump this table will be very useful for emulation of real
> devices once they become available as QEMU CXL type 3 device emulation will
> be able to load this file in.
> 
> This does not support table updates at runtime. It will always provide
> whatever was there when first cached. Handling of table updates can be
> implemented later.
> 
> Once there are more users, this code can move out to driver/cxl/cdat.c
> or similar.
> 
> Finally create a complete list of DOE defines within cdat.h for anyone
> wishing to decode the CDAT table.
> 
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
Changes look fine to me.

Thanks,

Jonathan

> ---
> Changes from V5:
> 	Add proper guards around cdat.h
> 	Split out finding the CDAT DOE mailbox
> 	Use cxl_cdat to group CDAT data together
> 	Adjust to use auxiliary_find_device() to find the DOE device
> 		which supplies the CDAT protocol.
> 	Rebased to latest
> 	Remove dev_dbg(length)
> 	Remove unneeded DOE Table access defines
> 	Move CXL_DOE_PROTOCOL_TABLE_ACCESS define into this patch where
> 		it is used
> 
> Changes from V4:
> 	Split this into it's own patch
> 	Rearchitect this such that the memdev driver calls into the DOE
> 	driver via the cxl_mem state object.  This allows CDAT data to
> 	come from any type of cxl_mem object not just PCI DOE.
> 	Rebase on new struct cxl_dev_state
> ---
>  drivers/cxl/cdat.h        | 97 +++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/memdev.c | 56 ++++++++++++++++++++++
>  drivers/cxl/cxlmem.h      | 25 ++++++++++
>  drivers/cxl/pci.c         | 87 +++++++++++++++++++++++++++++++++++
>  4 files changed, 265 insertions(+)
>  create mode 100644 drivers/cxl/cdat.h
> 
> diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
> new file mode 100644
> index 000000000000..4722b6bbbaf0
> --- /dev/null
> +++ b/drivers/cxl/cdat.h
> @@ -0,0 +1,97 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __CXL_CDAT_H__
> +#define __CXL_CDAT_H__
> +
> +/*
> + * Coherent Device Attribute table (CDAT)
> + *
> + * Specification available from UEFI.org
> + *
> + * Whilst CDAT is defined as a single table, the access via DOE maiboxes is
> + * done one entry at a time, where the first entry is the header.
> + */
> +
> +#define CXL_DOE_TABLE_ACCESS_REQ_CODE		0x000000ff
> +#define   CXL_DOE_TABLE_ACCESS_REQ_CODE_READ	0
> +#define CXL_DOE_TABLE_ACCESS_TABLE_TYPE		0x0000ff00
> +#define   CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA	0
> +#define CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE	0xffff0000
> +
> +/*
> + * CDAT entries are little endian and are read from PCI config space which
> + * is also little endian.
> + * As such, on a big endian system these will have been reversed.
> + * This prevents us from making easy use of packed structures.
> + * Style form pci_regs.h
> + */
> +
> +#define CDAT_HEADER_LENGTH_DW 4
> +#define CDAT_HEADER_LENGTH_BYTES (CDAT_HEADER_LENGTH_DW * sizeof(u32))
> +#define CDAT_HEADER_DW0_LENGTH		0xffffffff
> +#define CDAT_HEADER_DW1_REVISION	0x000000ff
> +#define CDAT_HEADER_DW1_CHECKSUM	0x0000ff00
> +/* CDAT_HEADER_DW2_RESERVED	*/
> +#define CDAT_HEADER_DW3_SEQUENCE	0xffffffff
> +
> +/* All structures have a common first DW */
> +#define CDAT_STRUCTURE_DW0_TYPE		0x000000ff
> +#define   CDAT_STRUCTURE_DW0_TYPE_DSMAS 0
> +#define   CDAT_STRUCTURE_DW0_TYPE_DSLBIS 1
> +#define   CDAT_STRUCTURE_DW0_TYPE_DSMSCIS 2
> +#define   CDAT_STRUCTURE_DW0_TYPE_DSIS 3
> +#define   CDAT_STRUCTURE_DW0_TYPE_DSEMTS 4
> +#define   CDAT_STRUCTURE_DW0_TYPE_SSLBIS 5
> +
> +#define CDAT_STRUCTURE_DW0_LENGTH	0xffff0000
> +
> +/* Device Scoped Memory Affinity Structure */
> +#define CDAT_DSMAS_DW1_DSMAD_HANDLE	0x000000ff
> +#define CDAT_DSMAS_DW1_FLAGS		0x0000ff00
> +#define CDAT_DSMAS_DPA_OFFSET(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
> +#define CDAT_DSMAS_DPA_LEN(entry) ((u64)((entry)[5]) << 32 | (entry)[4])
> +#define CDAT_DSMAS_NON_VOLATILE(flags)  ((flags & 0x04) >> 2)
> +
> +/* Device Scoped Latency and Bandwidth Information Structure */
> +#define CDAT_DSLBIS_DW1_HANDLE		0x000000ff
> +#define CDAT_DSLBIS_DW1_FLAGS		0x0000ff00
> +#define CDAT_DSLBIS_DW1_DATA_TYPE	0x00ff0000
> +#define CDAT_DSLBIS_BASE_UNIT(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
> +#define CDAT_DSLBIS_DW4_ENTRY_0		0x0000ffff
> +#define CDAT_DSLBIS_DW4_ENTRY_1		0xffff0000
> +#define CDAT_DSLBIS_DW5_ENTRY_2		0x0000ffff
> +
> +/* Device Scoped Memory Side Cache Information Structure */
> +#define CDAT_DSMSCIS_DW1_HANDLE		0x000000ff
> +#define CDAT_DSMSCIS_MEMORY_SIDE_CACHE_SIZE(entry) \
> +	((u64)((entry)[3]) << 32 | (entry)[2])
> +#define CDAT_DSMSCIS_DW4_MEMORY_SIDE_CACHE_ATTRS 0xffffffff
> +
> +/* Device Scoped Initiator Structure */
> +#define CDAT_DSIS_DW1_FLAGS		0x000000ff
> +#define CDAT_DSIS_DW1_HANDLE		0x0000ff00
> +
> +/* Device Scoped EFI Memory Type Structure */
> +#define CDAT_DSEMTS_DW1_HANDLE		0x000000ff
> +#define CDAT_DSEMTS_DW1_EFI_MEMORY_TYPE_ATTR	0x0000ff00
> +#define CDAT_DSEMTS_DPA_OFFSET(entry)	((u64)((entry)[3]) << 32 | (entry)[2])
> +#define CDAT_DSEMTS_DPA_LENGTH(entry)	((u64)((entry)[5]) << 32 | (entry)[4])
> +
> +/* Switch Scoped Latency and Bandwidth Information Structure */
> +#define CDAT_SSLBIS_DW1_DATA_TYPE	0x000000ff
> +#define CDAT_SSLBIS_BASE_UNIT(entry)	((u64)((entry)[3]) << 32 | (entry)[2])
> +#define CDAT_SSLBIS_ENTRY_PORT_X(entry, i) ((entry)[4 + (i) * 2] & 0x0000ffff)
> +#define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
> +#define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
> +
> +/**
> + * struct cxl_cdat - CXL CDAT data
> + *
> + * @table: cache of CDAT table
> + * @length: length of cached CDAT table
> + */
> +struct cxl_cdat {
> +	void *table;
> +	size_t length;
> +};
> +
> +#endif /* !__CXL_CDAT_H__ */
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index ee0156419d06..a01068e98333 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -86,6 +86,35 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
>  	return sysfs_emit(buf, "%#llx\n", len);
>  }
>  
> +static ssize_t CDAT_read(struct file *filp, struct kobject *kobj,
> +			 struct bin_attribute *bin_attr, char *buf,
> +			 loff_t offset, size_t count)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +
> +	if (!cxlmd->cdat.table)
> +		return 0;
> +
> +	return memory_read_from_buffer(buf, count, &offset,
> +				       cxlmd->cdat.table,
> +				       cxlmd->cdat.length);
> +}
> +
> +static BIN_ATTR_RO(CDAT, 0);
> +
> +static umode_t cxl_memdev_bin_attr_is_visible(struct kobject *kobj,
> +					      struct bin_attribute *attr, int i)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +
> +	if ((attr == &bin_attr_CDAT) && cxlmd->cdat.table)
> +		return 0400;
> +
> +	return 0;
> +}
> +
>  static struct device_attribute dev_attr_pmem_size =
>  	__ATTR(size, 0444, pmem_size_show, NULL);
>  
> @@ -115,6 +144,11 @@ static struct attribute *cxl_memdev_attributes[] = {
>  	NULL,
>  };
>  
> +static struct bin_attribute *cxl_memdev_bin_attributes[] = {
> +	&bin_attr_CDAT,
> +	NULL,
> +};
> +
>  static struct attribute *cxl_memdev_pmem_attributes[] = {
>  	&dev_attr_pmem_size.attr,
>  	NULL,
> @@ -136,6 +170,8 @@ static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
>  static struct attribute_group cxl_memdev_attribute_group = {
>  	.attrs = cxl_memdev_attributes,
>  	.is_visible = cxl_memdev_visible,
> +	.bin_attrs = cxl_memdev_bin_attributes,
> +	.is_bin_visible = cxl_memdev_bin_attr_is_visible,
>  };
>  
>  static struct attribute_group cxl_memdev_ram_attribute_group = {
> @@ -320,6 +356,21 @@ static const struct file_operations cxl_memdev_fops = {
>  	.llseek = noop_llseek,
>  };
>  
> +static int read_cdat_data(struct cxl_memdev *cxlmd, struct cxl_dev_state *cxlds)
> +{
> +	struct device *dev = &cxlmd->dev;
> +	size_t cdat_length;
> +
> +	if (cxl_mem_cdat_get_length(cxlds, &cdat_length))
> +		return 0;
> +
> +	cxlmd->cdat.table = devm_kzalloc(dev, cdat_length, GFP_KERNEL);
> +	if (!cxlmd->cdat.table)
> +		return -ENOMEM;
> +	cxlmd->cdat.length = cdat_length;
> +	return cxl_mem_cdat_read_table(cxlds, &cxlmd->cdat);
> +}
> +
>  struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
>  {
>  	struct cxl_memdev *cxlmd;
> @@ -336,6 +387,11 @@ struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
>  	if (rc)
>  		goto err;
>  
> +	/* Cache the data early to ensure is_visible() works */
> +	rc = read_cdat_data(cxlmd, cxlds);
> +	if (rc)
> +		goto err;
> +
>  	/*
>  	 * Activate ioctl operations, no cxl_memdev_rwsem manipulation
>  	 * needed as this is ordered with cdev_add() publishing the device.
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 0fefe43951e3..15c653b20f37 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -5,6 +5,7 @@
>  #include <uapi/linux/cxl_mem.h>
>  #include <linux/cdev.h>
>  #include "cxl.h"
> +#include "cdat.h"
>  
>  /* CXL 2.0 8.2.8.5.1.1 Memory Device Status Register */
>  #define CXLMDEV_STATUS_OFFSET 0x0
> @@ -41,6 +42,7 @@ struct cxl_memdev {
>  	struct device dev;
>  	struct cdev cdev;
>  	struct cxl_dev_state *cxlds;
> +	struct cxl_cdat cdat;
>  	struct work_struct detach_work;
>  	int id;
>  };
> @@ -143,6 +145,10 @@ struct cxl_endpoint_dvsec_info {
>   * @serial: PCIe Device Serial Number
>   * @mbox_send: @dev specific transport for transmitting mailbox commands
>   * @wait_media_ready: @dev specific method to await media ready
> + * @cdat_get_length: @dev specific function for reading the CDAT table length
> + *                   returns -errno if CDAT not supported on this device
> + * @cdat_read_table: @dev specific function for reading the table
> + *                   returns -errno if CDAT not supported on this device
>   *
>   * See section 8.2.9.5.2 Capacity Configuration and Label Storage for
>   * details on capacity parameters.
> @@ -179,6 +185,9 @@ struct cxl_dev_state {
>  
>  	int (*mbox_send)(struct cxl_dev_state *cxlds, struct cxl_mbox_cmd *cmd);
>  	int (*wait_media_ready)(struct cxl_dev_state *cxlds);
> +	int (*cdat_get_length)(struct cxl_dev_state *cxlds, size_t *length);
> +	int (*cdat_read_table)(struct cxl_dev_state *cxlds,
> +			       struct cxl_cdat *cdat);
>  };
>  
>  enum cxl_opcode {
> @@ -305,4 +314,20 @@ struct cxl_hdm {
>  	unsigned int interleave_mask;
>  	struct cxl_port *port;
>  };
> +
> +static inline int cxl_mem_cdat_get_length(struct cxl_dev_state *cxlds, size_t *length)
> +{
> +	if (cxlds->cdat_get_length)
> +		return cxlds->cdat_get_length(cxlds, length);
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline int cxl_mem_cdat_read_table(struct cxl_dev_state *cxlds,
> +					  struct cxl_cdat *cdat)
> +{
> +	if (cxlds->cdat_read_table)
> +		return cxlds->cdat_read_table(cxlds, cdat);
> +	return -EOPNOTSUPP;
> +}
> +
>  #endif /* __CXL_MEM_H__ */
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index dcc55c4efd85..28b973a9e29e 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -13,6 +13,7 @@
>  #include "cxlmem.h"
>  #include "cxlpci.h"
>  #include "cxl.h"
> +#include "cdat.h"
>  
>  /**
>   * DOC: cxl pci
> @@ -585,6 +586,90 @@ static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
>  	return 0;
>  }
>  
> +#define CDAT_DOE_REQ(entry_handle)					\
> +	(FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE,			\
> +		    CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) |		\
> +	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_TABLE_TYPE,			\
> +		    CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA) |		\
> +	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE, (entry_handle)))
> +
> +static int cxl_cdat_get_length(struct cxl_dev_state *cxlds, size_t *length)
> +{
> +	struct pci_doe_dev *doe_dev = cxlds->cdat_doe;
> +	u32 cdat_request_pl = CDAT_DOE_REQ(0);
> +	u32 cdat_response_pl[32];
> +	struct pci_doe_exchange ex = {
> +		.prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
> +		.prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
> +		.request_pl = &cdat_request_pl,
> +		.request_pl_sz = sizeof(cdat_request_pl),
> +		.response_pl = cdat_response_pl,
> +		.response_pl_sz = sizeof(cdat_response_pl),
> +	};
> +
> +	ssize_t rc;
> +
> +	rc = pci_doe_exchange_sync(doe_dev, &ex);
> +	if (rc < 0)
> +		return rc;
> +	if (rc < 1)
> +		return -EIO;
> +
> +	*length = cdat_response_pl[1];
> +	return 0;
> +}
> +
> +static int cxl_cdat_read_table(struct cxl_dev_state *cxlds,
> +			       struct cxl_cdat *cdat)
> +{
> +	struct pci_doe_dev *doe_dev = cxlds->cdat_doe;
> +	size_t length = cdat->length;
> +	u32 *data = cdat->table;
> +	int entry_handle = 0;
> +	int rc;
> +
> +	do {
> +		u32 cdat_request_pl = CDAT_DOE_REQ(entry_handle);
> +		u32 cdat_response_pl[32];
> +		struct pci_doe_exchange ex = {
> +			.prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
> +			.prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
> +			.request_pl = &cdat_request_pl,
> +			.request_pl_sz = sizeof(cdat_request_pl),
> +			.response_pl = cdat_response_pl,
> +			.response_pl_sz = sizeof(cdat_response_pl),
> +		};
> +		size_t entry_dw;
> +		u32 *entry;
> +
> +		rc = pci_doe_exchange_sync(doe_dev, &ex);
> +		if (rc < 0)
> +			return rc;
> +
> +		entry = cdat_response_pl + 1;
> +		entry_dw = rc / sizeof(u32);
> +		/* Skip Header */
> +		entry_dw -= 1;
> +		entry_dw = min(length / 4, entry_dw);
> +		memcpy(data, entry, entry_dw * sizeof(u32));
> +		length -= entry_dw * sizeof(u32);
> +		data += entry_dw;
> +		entry_handle = FIELD_GET(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE, cdat_response_pl[0]);
> +
> +	} while (entry_handle != 0xFFFF);
> +
> +	return 0;
> +}
> +
> +static void cxl_initialize_cdat_callbacks(struct cxl_dev_state *cxlds)
> +{
> +	if (!cxlds->cdat_doe)
> +		return;
> +
> +	cxlds->cdat_get_length = cxl_cdat_get_length;
> +	cxlds->cdat_read_table = cxl_cdat_read_table;
> +}
> +
>  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  {
>  	struct cxl_register_map map;
> @@ -657,6 +742,8 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>  
> +	cxl_initialize_cdat_callbacks(cxlds);
> +
>  	rc = cxl_dvsec_ranges(cxlds);
>  	if (rc)
>  		dev_err(&pdev->dev,
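For anyone following the quoted cxl_cdat_read_table() loop, here is a rough userspace sketch of the same entry-handle walk.  It is only a model under stated assumptions: mock_exchange() and mock_table are hypothetical stand-ins for pci_doe_exchange_sync() and a real device, and the request code mask (bits 7:0) is assumed because it is not visible in the quoted hunk.  Unlike the quoted loop, the sketch rejects a response shorter than one header DW before subtracting the header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Masks follow the quoted cdat.h; REQ_CODE_MASK is an assumption. */
#define REQ_CODE_MASK		0x000000ffu
#define TABLE_TYPE_MASK		0x0000ff00u
#define ENTRY_HANDLE_MASK	0xffff0000u
#define ENTRY_HANDLE_LAST	0xffffu

/* Build a "read CDAT entry <handle>" request DW (req code 0, table type 0). */
static uint32_t cdat_req(uint16_t handle)
{
	uint32_t req_code = 0;		/* read */
	uint32_t table_type = 0;	/* CDAT */

	return (req_code & REQ_CODE_MASK) |
	       ((table_type << 8) & TABLE_TYPE_MASK) |
	       (((uint32_t)handle << 16) & ENTRY_HANDLE_MASK);
}

/* Hypothetical device contents: three entries of 4, 2 and 1 DW. */
struct mock_entry {
	size_t ndw;
	uint32_t dw[4];
};

static const struct mock_entry mock_table[] = {
	{ 4, { 0x0000001c, 0x00000001, 0, 0 } },
	{ 2, { 0x00080000, 0x000000aa } },
	{ 1, { 0xdeadbeef } },
};

/* Hypothetical transport: returns response size in bytes, or -1 on error. */
static int mock_exchange(uint32_t req, uint32_t *resp, size_t resp_dw)
{
	size_t n = sizeof(mock_table) / sizeof(mock_table[0]);
	uint16_t handle = (uint16_t)((req & ENTRY_HANDLE_MASK) >> 16);
	const struct mock_entry *e;
	uint32_t next;

	if (handle >= n)
		return -1;
	e = &mock_table[handle];
	if (resp_dw < e->ndw + 1)
		return -1;

	/* DW0 of the response carries the handle of the next entry. */
	next = ((size_t)handle + 1 < n) ? (uint32_t)handle + 1 : ENTRY_HANDLE_LAST;
	resp[0] = (req & ~ENTRY_HANDLE_MASK) | (next << 16);
	memcpy(&resp[1], e->dw, e->ndw * sizeof(uint32_t));
	return (int)((e->ndw + 1) * sizeof(uint32_t));
}

/* Walk the table until the device reports ENTRY_HANDLE_LAST. */
static int read_table(uint32_t *out, size_t out_bytes)
{
	uint16_t handle = 0;
	size_t copied = 0;

	do {
		uint32_t resp[32];
		size_t dw, room;
		int rc = mock_exchange(cdat_req(handle), resp, 32);

		/* Need at least the header DW, otherwise dw would underflow. */
		if (rc < (int)sizeof(uint32_t))
			return -1;

		dw = (size_t)rc / sizeof(uint32_t) - 1;
		room = (out_bytes - copied) / sizeof(uint32_t);
		if (dw > room)
			dw = room;	/* never write past the cache */
		memcpy((uint8_t *)out + copied, &resp[1], dw * sizeof(uint32_t));
		copied += dw * sizeof(uint32_t);
		handle = (uint16_t)((resp[0] & ENTRY_HANDLE_MASK) >> 16);
	} while (handle != ENTRY_HANDLE_LAST);

	return (int)copied;
}
```

Calling read_table() with a buffer smaller than the table keeps walking the handles but copies nothing further once the buffer is full, which mirrors the min() clamp in the quoted code.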

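The mask-and-shift style of the quoted cdat.h can be illustrated with a small userspace decoder for a DSMAS entry.  This is a sketch, not the patch's code: struct dsmas, field_get() and dsmas_decode() are hypothetical names, and the input is assumed to already be an array of CPU-endian DWs as returned by the DOE exchange.  The helper parenthesizes its arguments, which the quoted CDAT_DSMAS_NON_VOLATILE() macro does not.

```c
#include <stdint.h>

/* Masks copied from the quoted cdat.h. */
#define CDAT_STRUCTURE_DW0_TYPE		0x000000ffu
#define CDAT_STRUCTURE_DW0_LENGTH	0xffff0000u
#define CDAT_DSMAS_DW1_DSMAD_HANDLE	0x000000ffu
#define CDAT_DSMAS_DW1_FLAGS		0x0000ff00u

/* Hypothetical decoded view of one DSMAS entry. */
struct dsmas {
	uint8_t type;
	uint16_t length;
	uint8_t handle;
	uint8_t flags;
	int non_volatile;
	uint64_t dpa_base;
	uint64_t dpa_len;
};

/* Extract a field: mask it, then divide by the mask's lowest set bit. */
static uint32_t field_get(uint32_t mask, uint32_t val)
{
	return (val & mask) / (mask & (~mask + 1u));
}

/* Decode six DWs; 64-bit values are split low DW first, high DW second. */
static struct dsmas dsmas_decode(const uint32_t *entry)
{
	struct dsmas d;

	d.type = (uint8_t)field_get(CDAT_STRUCTURE_DW0_TYPE, entry[0]);
	d.length = (uint16_t)field_get(CDAT_STRUCTURE_DW0_LENGTH, entry[0]);
	d.handle = (uint8_t)field_get(CDAT_DSMAS_DW1_DSMAD_HANDLE, entry[1]);
	d.flags = (uint8_t)field_get(CDAT_DSMAS_DW1_FLAGS, entry[1]);
	d.non_volatile = (d.flags & 0x04) >> 2;
	d.dpa_base = ((uint64_t)entry[3] << 32) | entry[2];
	d.dpa_len = ((uint64_t)entry[5] << 32) | entry[4];
	return d;
}
```

Working on DWs rather than a packed struct is what lets the same code run unchanged on big-endian hosts, which is the point of the comment at the top of the quoted header.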


* Re: [PATCH V6 06/10] cxl/pci: Find the DOE mailbox which supports CDAT
  2022-02-01 22:18     ` Ira Weiny
@ 2022-02-04 14:04       ` Jonathan Cameron
  0 siblings, 0 replies; 49+ messages in thread
From: Jonathan Cameron @ 2022-02-04 14:04 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Ben Widawsky, Dan Williams, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, linux-kernel, linux-cxl, linux-pci

On Tue, 1 Feb 2022 14:18:41 -0800
Ira Weiny <ira.weiny@intel.com> wrote:

> On Tue, Feb 01, 2022 at 10:49:47AM -0800, Widawsky, Ben wrote:
> > On 22-01-31 23:19:48, ira.weiny@intel.com wrote:  
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > Memory devices need the CDAT data from the device.  This data is read
> > > from a DOE mailbox which supports the CDAT protocol.
> > > 
> > > Search the DOE auxiliary devices for the one which supports the CDAT
> > > protocol.  Cache that device to be used for future queries.
> > > 
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>  
> 
> [snip]
> 
> > >  
> > > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > > index d4ae79b62a14..dcc55c4efd85 100644
> > > --- a/drivers/cxl/pci.c
> > > +++ b/drivers/cxl/pci.c
> > > @@ -536,12 +536,53 @@ static int cxl_dvsec_ranges(struct cxl_dev_state *cxlds)
> > >  	return rc;
> > >  }
> > >  
> > > +static int cxl_match_cdat_doe_device(struct device *dev, const void *data)
> > > +{
> > > +	const struct cxl_dev_state *cxlds = data;
> > > +	struct auxiliary_device *adev;
> > > +	struct pci_doe_dev *doe_dev;
> > > +
> > > +	/* First determine if this auxiliary device belongs to the cxlds */
> > > +	if (cxlds->dev != dev->parent)
> > > +		return 0;  
> > 
> > I don't understand auxiliary bus but I'm wondering why it's checking the parent
> > of the device?  
> 
> auxiliary_find_device() iterates all the auxiliary devices in the system.  This
> check was a way for the match function to know if the auxiliary device belongs
> to the cxlds we are interested in...
> 
> But now that I think about it we could have other auxiliary devices attached
> which are not DOE...  :-/  So this check is not complete.
> 
> FWIW I'm not thrilled with the way auxiliary_find_device() is defined.  And now
> that I look at it I think the only user of it currently is wrong.  They too
> have a check like this but it is after another check...  :-/
> 
> I was hoping to avoid having a list of DOE devices in the cxlds and simply let
> the auxiliary bus infrastructure do that somehow.  IIRC Jonathan was thinking
> along the same lines.  I think he actually suggested auxiliary_find_device()...

Ah.. I think I'd been thinking it was scoped to a single parent rather than
all devices in the system.  Definitely rather horrible.
Can we do something with device_for_each_child() instead, with a match on
bus type to check it's an auxiliary bus device and then, I guess, a name-based
check on whether that is a DOE, etc.?
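To make the idea concrete, here is a rough userspace model of a parent-scoped match.  It does not use the real driver core: struct model_device, struct model_bus and the "pci_doe.doe." name prefix are all hypothetical, chosen only to mirror the auxiliary bus "<module>.<name>.<id>" naming scheme.

```c
#include <stddef.h>
#include <string.h>

/* Minimal stand-ins for the kernel types; not the real definitions. */
struct model_bus {
	const char *name;
};

struct model_device {
	const char *name;
	const struct model_bus *bus;
	const struct model_device *parent;
};

static const struct model_bus model_auxiliary_bus = { "auxiliary" };

/* Hypothetical prefix, assuming a "pci_doe" module and a "doe" device name. */
#define MODEL_DOE_PREFIX "pci_doe.doe."

/* Match only children of @parent that sit on the auxiliary bus and are DOEs. */
static int model_is_doe_child(const struct model_device *dev,
			      const struct model_device *parent)
{
	if (dev->parent != parent)
		return 0;
	if (dev->bus != &model_auxiliary_bus)
		return 0;
	return strncmp(dev->name, MODEL_DOE_PREFIX,
		       strlen(MODEL_DOE_PREFIX)) == 0;
}

/* Return the first matching device in @devs, or NULL if there is none. */
static const struct model_device *
model_find_doe_child(const struct model_device *parent,
		     const struct model_device *devs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (model_is_doe_child(&devs[i], parent))
			return &devs[i];
	return NULL;
}
```

The bus check comes before the container_of()-style cast would happen in real code, so a non-DOE auxiliary device under the same parent is skipped rather than misinterpreted.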


> 
> It would be nice if I could have an aux_find_child() or something which
> iterated the auxiliary devices attached to a particular parent device.  I've
> just not figured out exactly how to implement that better than what I did here.
> 
> >   
> > > +
> > > +	adev = to_auxiliary_dev(dev);
> > > +	doe_dev = container_of(adev, struct pci_doe_dev, adev);
> > > +
> > > +	/* If it is one of ours check for the CDAT protocol */
> > > +	if (pci_doe_supports_prot(doe_dev, PCI_DVSEC_VENDOR_ID_CXL,
> > > +				  CXL_DOE_PROTOCOL_TABLE_ACCESS))
> > > +		return 1;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  static int cxl_setup_doe_devices(struct cxl_dev_state *cxlds)
> > >  {
> > >  	struct device *dev = cxlds->dev;
> > >  	struct pci_dev *pdev = to_pci_dev(dev);
> > > +	struct auxiliary_device *adev;
> > > +	int rc;
> > >  
> > > -	return pci_doe_create_doe_devices(pdev);
> > > +	rc = pci_doe_create_doe_devices(pdev);
> > > +	if (rc)
> > > +		return rc;
> > > +
> > > +	adev = auxiliary_find_device(NULL, cxlds, &cxl_match_cdat_doe_device);
> > > +
> > > +	if (adev) {
> > > +		struct pci_doe_dev *doe_dev = container_of(adev,
> > > +							   struct pci_doe_dev,
> > > +							   adev);
> > > +
> > > +		/*
> > > +		 * No reference need be taken.  The DOE device lifetime is
> > > +		 * longer that the CXL device state lifetime
> > > +		 */  
> > 
> > You're holding a reference to the adev here. Did you mean to drop it?  
> 
> Does find device get a reference? ...  Ah shoot I did not see that.
> 
> Yea the reference should be dropped somewhere.
> 
> Thanks,
> Ira
> 
> >   
> > > +		cxlds->cdat_doe = doe_dev;
> > > +	}
> > > +
> > > +	return 0;
> > >  }
> > >  
> > >  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > > -- 
> > > 2.31.1
> > >   



* Re: [PATCH V6 04/10] PCI/DOE: Introduce pci_doe_create_doe_devices
  2022-02-03 22:44   ` Bjorn Helgaas
@ 2022-02-04 14:51     ` Jonathan Cameron
  2022-02-04 16:27       ` Bjorn Helgaas
  2022-03-24  0:26     ` Ira Weiny
  1 sibling, 1 reply; 49+ messages in thread
From: Jonathan Cameron @ 2022-02-04 14:51 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: ira.weiny, Dan Williams, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, 3 Feb 2022 16:44:37 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Mon, Jan 31, 2022 at 11:19:46PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > CXL and/or PCI devices can define DOE mailboxes.    
> 
> In concrete terms, "DOE mailbox" refers to a DOE Capability, right?
> PCIe devices are allowed to implement several instances of the DOE
> Capability, of course.  I'm kind of partial to concreteness because it
> makes it easier to map between the code and the spec.

I'll throw my opinions in here, though a lot of this is really
about the justification for representing this via an Aux bus.
Personally I wouldn't mind if we decided the kernel was always in
charge of DOEs.  Others have a different opinion :)

Changing references to be the DOE Capability makes sense to me,
though it may be worth a one-line statement that each DOE capability
provides a single mailbox interface.

> 
> > Normally the kernel will want to maintain control of all of these
> > mailboxes.  However, under a limited number of use cases users may
> > want to allow user space access to some of these mailboxes while the
> > kernel retains control of the rest.  
> 
> Is there something in this patch related to user-space vs kernel
> control of things?  To me this patch looks like "for every DOE
> Capability on a device, create an auxiliary device and try to attach
> an auxiliary device driver to it."

Only the 'unbind' and access it directly (setpci etc) route.
This section of commentary was introduced because some people felt
we needed a standard way to do that unbind - that's one of the main
arguments in favour of using an auxbus as I understand it.

> 
> If part of creating the auxiliary devices is adding things in sysfs, I
> think it would be useful to mention that here.

The sysfs side of things is protocol specific. We dropped the plan
to have a generic interface. It will add a device though, with basic
controls, so good to call that out here (it's mentioned in a later
patch description).

> 
> > An example of this is for CXL Compliance Testing (see CXL 2.0
> > 14.16.4 Compliance Mode DOE) which offers a mechanism to set
> > different test modes for a device.  
> 
> Not sure exactly what this contributes here.  I guess you're saying
> you might want user-space access to this, but I don't see anything in
> this patch related to that.

Again, this was part of the requirement raised to allow unbinding of the
DOE driver. We don't need to mention specifics here though, so I'd be
fine with dropping this text.

> 
> > Rather than re-invent the wheel the architecture creates auxiliary
> > devices for each DOE mailbox which can then be driven by a generic
> > DOE mailbox driver.  If access to an individual mailbox is required
> > by user space the driver for that mailbox can be unloaded and access
> > handed to user space.  
> 
> IIUC a device can have several DOE Capabilities, and each Capability
> can support several protocols.  So I would think the granularity might
> be "protocol" rather than "mailbox" (DOE Capability).

The granularity of whether to have control from a driver or not has
to be at the capability level, because we need to mediate access
to the capability, not the protocol (that may need additional mediation,
but that's a problem for whatever controls the capability).
So either the capability is under the control of userspace, or it is under
the control of the kernel.
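A small userspace model may help show why ownership sits at the capability level while protocol support is only a lookup within it.  All names here (struct model_mailbox, model_supports_prot(), model_kernel_can_use()) are hypothetical, and the vendor and protocol numbers used with them are illustrative values, not taken from this series.

```c
#include <stddef.h>
#include <stdint.h>

/* One (vendor ID, protocol type) pair advertised by a mailbox. */
struct model_prot {
	uint16_t vid;
	uint8_t type;
};

/* One DOE capability: a single mailbox that may carry several protocols. */
struct model_mailbox {
	const struct model_prot *prots;
	size_t nr_prots;
	int kernel_bound;	/* 1: kernel driver bound, 0: left to user space */
};

/* Protocol support is a per-mailbox lookup over the cached protocol list. */
static int model_supports_prot(const struct model_mailbox *mb,
			       uint16_t vid, uint8_t type)
{
	size_t i;

	for (i = 0; i < mb->nr_prots; i++)
		if (mb->prots[i].vid == vid && mb->prots[i].type == type)
			return 1;
	return 0;
}

/*
 * Ownership is decided for the whole mailbox: if the kernel driver is not
 * bound, none of the mailbox's protocols are available to the kernel.
 */
static int model_kernel_can_use(const struct model_mailbox *mb,
				uint16_t vid, uint8_t type)
{
	return mb->kernel_bound && model_supports_prot(mb, vid, type);
}
```

Unbinding flips one flag per mailbox, so a mailbox that carries both a kernel-relevant and a user-relevant protocol cannot be shared at protocol granularity in this model.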

> 
> But either way this text seems like it would go with a different patch
> since this patch has nothing to specify a particular protocol or even
> a particular mailbox/DOE Capability.
> 
> > Create the helper pci_doe_create_doe_devices() which iterates each DOE
> > mailbox found in the device and creates a DOE auxiliary device on the
> > auxiliary bus.  While doing so ensure that the auxiliary DOE driver
> > loads to drive that device.  
> 
> Here's a case where "iterating over DOE mailboxes found in the device"
> is slightly abstract.  The code obviously iterates over DOE
> *Capabilities* (PCI_EXT_CAP_ID_DOE), and that's something I can easily
> find in the spec.
> 
> Knowing that this is a PCIe Capability is useful because it puts it in
> the context of other capabilities ("optional things that live in
> config space") and the mechanisms for synchronization and user-space
> access.

Agreed.

> 
> > +/**
> > + * pci_doe_create_doe_devices - Create auxiliary DOE devices for all DOE
> > + *                              mailboxes found
> > + * @pci_dev: The PCI device to scan for DOE mailboxes
> > + *
> > + * There is no coresponding destroy of these devices.  This function associates
> > + * the DOE auxiliary devices created with the pci_dev passed in.  That
> > + * association is device managed (devm_*) such that the DOE auxiliary device
> > + * lifetime is always greater than or equal to the lifetime of the pci_dev.  
> 
> This seems backwards.  What does it mean if the DOE aux dev lifetime
> is *greater* than that of the pci_dev?  Surely you can't access a PCI
> DOE Capability if the pci_dev is gone?

I think the description is inaccurate - the end of life is the same
as that of the PCI driver binding to the pci_dev.  It'll get cleared
up if that is unbound etc.


> 
> > + * RETURNS: 0 on success -ERRNO on failure.
> > + */
> > +int pci_doe_create_doe_devices(struct pci_dev *pdev)
> > +{
> > +	struct device *dev = &pdev->dev;
> > +	int irqs, rc;
> > +	u16 pos = 0;
> > +
> > +	/*
> > +	 * An implementation may support an unknown number of interrupts.
> > +	 * Assume that number is not that large and request them all.  
> 
> This doesn't really inspire confidence :)  Playing devil's advocate,
> since pdev is an arbitrary device, I would assume the number *is*
> large.

Thomas's recent series on expanding MSI-X vectors after MSI-X has been
enabled may be useful.

https://lore.kernel.org/linux-arm-kernel/20211210221642.869015045@linutronix.de/

But as I understand it that is still a work in progress and only applies
to MSI-X anyway.

Perhaps we are better off just making this the caller's problem and
documenting that MSI or MSI-X must be enabled and vectors requested
before we call this function.

> 
> > +	irqs = pci_msix_vec_count(pdev);
> > +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);  
> 
> pci_msix_vec_count() is apparently sort of discouraged; see
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/PCI/msi-howto.rst?id=v5.16#n179

We can just pass in INT_MAX then if we want the maximum available.

> 
> A DOE Capability may be implemented by any device, e.g., a NIC or
> storage HBA, etc.  I'm a little queasy about IRQ alloc happening both
> here and in the driver for the device's primary functionality.  Can
> you reassure me that this is actually OK and safe?

> Sorry if I've asked this before.  If I have, perhaps a comment would
> be useful.

I went looking and earlier versions (pre Aux bus) made these calls in
the main driver before calling the DOE specific element.  So I don't
think this came up in previous reviews.

> 
> > +	if (rc != irqs) {
> > +		/* No interrupt available - carry on */
> > +		pci_dbg(pdev, "No interrupts available for DOE\n");
> > +	} else {
> > +		/*
> > +		 * Enabling bus mastering is require for MSI/MSIx.  It could be  
> 
> s/require/required/
> s/MSIx/MSI-X/ to match spec usage.
> 
> But I think you only support MSI-X, since you passed "PCI_IRQ_MSIX", not
> "PCI_IRQ_MSI | PCI_IRQ_MSIX" above?

Good point. I think it should be both.

> 
> > +		 * done later within the DOE initialization, but as it
> > +		 * potentially has other impacts keep it here when setting up
> > +		 * the IRQ's.  
> 
> s/IRQ's/IRQs/
> 
> "Potentially has other impacts" is too vague, and this doesn't explain
> why bus mastering should be enabled here rather than later.  The
> device should not issue an MSI-X until DOE Interrupt Enable is set, so
> near there seems like a logical place.

I can't remember what led to that comment so hopefully moving to just
before the enable would be fine - if there was somewhere to do it.
I'm not sure there is, as the IRQ enable is in the Auxiliary
Bus driver.  If we pull the pci_alloc_irq_vectors() out of here
into the caller, then the pci_set_master() should go with it.


> 
> > +		 */
> > +		pci_set_master(pdev);
> > +		rc = devm_add_action_or_reset(dev,
> > +					      pci_doe_free_irq_vectors,
> > +					      pdev);
> > +		if (rc)
> > +			return rc;
> > +	}  
> 
> > +++ b/include/linux/pci-doe.h
> > @@ -13,6 +13,8 @@
> >  #ifndef LINUX_PCI_DOE_H
> >  #define LINUX_PCI_DOE_H
> >  
> > +#define DOE_DEV_NAME "doe"  
> 
> This is only used once, above.  Why not just use the string there
> directly and skip the #define?  If it's needed elsewhere eventually,
> we can add a #define then.
> 
> Bjorn



* Re: [PATCH V6 04/10] PCI/DOE: Introduce pci_doe_create_doe_devices
  2022-02-04 14:51     ` Jonathan Cameron
@ 2022-02-04 16:27       ` Bjorn Helgaas
  2022-02-11  2:54         ` Dan Williams
  0 siblings, 1 reply; 49+ messages in thread
From: Bjorn Helgaas @ 2022-02-04 16:27 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: ira.weiny, Dan Williams, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Fri, Feb 04, 2022 at 02:51:16PM +0000, Jonathan Cameron wrote:
> On Thu, 3 Feb 2022 16:44:37 -0600
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Mon, Jan 31, 2022 at 11:19:46PM -0800, ira.weiny@intel.com wrote:

> > > + * pci_doe_create_doe_devices - Create auxiliary DOE devices for all DOE
> > > + *                              mailboxes found
> > > + * @pci_dev: The PCI device to scan for DOE mailboxes
> > > + *
> > > + * There is no coresponding destroy of these devices.  This function associates
> > > + * the DOE auxiliary devices created with the pci_dev passed in.  That
> > > + * association is device managed (devm_*) such that the DOE auxiliary device
> > > + * lifetime is always greater than or equal to the lifetime of the pci_dev.  
> > 
> > This seems backwards.  What does it mean if the DOE aux dev
> > lifetime is *greater* than that of the pci_dev?  Surely you can't
> > access a PCI DOE Capability if the pci_dev is gone?
> 
> I think the description is inaccurate - the end of life is the same
> as that of the PCI driver binding to the pci_dev.  It'll get cleared
> up if that is unbound etc.

I don't know much about devm, but I *think* the devm things get
released by devres_release_all(), which is called by
__device_release_driver() after it calls the bus or driver's .remove()
method (pci_device_remove(), in this case).

So in this case, I think the aux dev is created after the pci_dev and
released after the PCI driver and the PCI core are done with the
pci_dev.  I assume some refcounting prevents the pci_dev from actually
being deallocated until the aux dev is done with it.

I'm not confident that this is a robust situation.
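The ordering described above can be modelled in userspace as follows.  This is only a sketch of the sequence as I understand it (driver .remove() first, then managed actions released in reverse order of registration); struct model_dev and the model_* helpers are hypothetical and do not reproduce the real devres implementation.

```c
#include <stddef.h>

#define MAX_ACTIONS	8
#define MAX_LOG		16

typedef void (*action_fn)(void *data);

/* Hypothetical device with a stack of managed release actions. */
struct model_dev {
	action_fn action[MAX_ACTIONS];
	void *data[MAX_ACTIONS];
	size_t nr_actions;
	void (*remove)(struct model_dev *dev);
};

/* Event log so the teardown order can be inspected afterwards. */
static const char *event_log[MAX_LOG];
static size_t nr_events;

static void log_event(const char *what)
{
	if (nr_events < MAX_LOG)
		event_log[nr_events++] = what;
}

/* Register a managed action; returns 0 on success, -1 if the stack is full. */
static int model_add_action(struct model_dev *dev, action_fn fn, void *data)
{
	if (dev->nr_actions == MAX_ACTIONS)
		return -1;
	dev->action[dev->nr_actions] = fn;
	dev->data[dev->nr_actions] = data;
	dev->nr_actions++;
	return 0;
}

/* Unbind: call .remove() first, then run managed actions newest-first. */
static void model_release_driver(struct model_dev *dev)
{
	if (dev->remove)
		dev->remove(dev);
	while (dev->nr_actions) {
		dev->nr_actions--;
		dev->action[dev->nr_actions](dev->data[dev->nr_actions]);
	}
}

static void free_irq_vectors(void *data)
{
	(void)data;
	log_event("free irq vectors");
}

static void destroy_doe_aux_dev(void *data)
{
	(void)data;
	log_event("destroy DOE aux device");
}

static void driver_remove(struct model_dev *dev)
{
	(void)dev;
	log_event("driver remove");
}
```

Registering free_irq_vectors and then destroy_doe_aux_dev, followed by model_release_driver(), logs the driver remove first and the IRQ vector release last, so in this model the aux device outlives the driver's own teardown.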

> > > +		 * done later within the DOE initialization, but as it
> > > +		 * potentially has other impacts keep it here when setting up
> > > +		 * the IRQ's.  
> > 
> > s/IRQ's/IRQs/
> > 
> > "Potentially has other impacts" is too vague, and this doesn't
> > explain why bus mastering should be enabled here rather than
> > later.  The device should not issue an MSI-X until DOE Interrupt
> > Enable is set, so near there seems like a logical place.
> 
> I can't remember what led to that comment so hopefully moving to
> just before the enable would be fine - if there was somewhere to do
> it.  I'm not sure there is, as the IRQ enable is in the Auxiliary
> Bus driver.  If we pull the pci_alloc_irq_vectors() out of here into
> the caller, then the pci_set_master() should go with it.

I think pci_set_master() is tied to setting PCI_DOE_CTRL_INT_EN, not
to pci_alloc_irq_vectors().

Bjorn


* Re: [PATCH V6 02/10] PCI: Replace magic constant for PCI Sig Vendor ID
  2022-02-01  7:19 ` [PATCH V6 02/10] PCI: Replace magic constant for PCI Sig Vendor ID ira.weiny
@ 2022-02-04 21:16   ` Dan Williams
  2022-02-04 21:49   ` Bjorn Helgaas
  1 sibling, 0 replies; 49+ messages in thread
From: Dan Williams @ 2022-02-04 21:16 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Jonathan Cameron, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, Linux Kernel Mailing List, linux-cxl, Linux PCI

On Mon, Jan 31, 2022 at 11:20 PM <ira.weiny@intel.com> wrote:
>
> From: Ira Weiny <ira.weiny@intel.com>
>
> Based on Bjorn's suggestion[1], now that the PCI Sig Vendor ID is
> defined the define should be used in pci_bus_crs_vendor_id() rather than
> the hard coded magic value.
>
> Replace the magic value in pci_bus_crs_vendor_id() with
> PCI_VENDOR_ID_PCI_SIG.
>
> [1] https://lore.kernel.org/linux-cxl/20211117215044.GA1777828@bhelgaas/
>
> Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  drivers/pci/probe.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 087d3658f75c..d92dbb136fc9 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2318,7 +2318,7 @@ EXPORT_SYMBOL(pci_alloc_dev);
>
>  static bool pci_bus_crs_vendor_id(u32 l)
>  {
> -       return (l & 0xffff) == 0x0001;
> +       return (l & 0xffff) == PCI_VENDOR_ID_PCI_SIG;
>  }

Looks good to me:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>


* Re: [PATCH V6 02/10] PCI: Replace magic constant for PCI Sig Vendor ID
  2022-02-01  7:19 ` [PATCH V6 02/10] PCI: Replace magic constant for PCI Sig Vendor ID ira.weiny
  2022-02-04 21:16   ` Dan Williams
@ 2022-02-04 21:49   ` Bjorn Helgaas
  2022-03-15 21:48     ` Ira Weiny
  1 sibling, 1 reply; 49+ messages in thread
From: Bjorn Helgaas @ 2022-02-04 21:49 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Mon, Jan 31, 2022 at 11:19:44PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Based on Bjorn's suggestion[1], now that the PCI Sig Vendor ID is
> defined the define should be used in pci_bus_crs_vendor_id() rather than
> the hard coded magic value.
> 
> Replace the magic value in pci_bus_crs_vendor_id() with
> PCI_VENDOR_ID_PCI_SIG.
 
This sentence is plenty; no attribution or link needed.  I appreciate
the acknowledgement, but replacing a magic value isn't a better idea
simply because *I* suggested it ;)

> [1] https://lore.kernel.org/linux-cxl/20211117215044.GA1777828@bhelgaas/
> 
> Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/probe.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 087d3658f75c..d92dbb136fc9 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2318,7 +2318,7 @@ EXPORT_SYMBOL(pci_alloc_dev);
>  
>  static bool pci_bus_crs_vendor_id(u32 l)
>  {
> -	return (l & 0xffff) == 0x0001;
> +	return (l & 0xffff) == PCI_VENDOR_ID_PCI_SIG;
>  }
>  
>  static bool pci_bus_wait_crs(struct pci_bus *bus, int devfn, u32 *l,
> -- 
> 2.31.1
> 


* Re: [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver
  2022-02-01  7:19 ` [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver ira.weiny
  2022-02-03 22:40   ` Bjorn Helgaas
@ 2022-02-09  0:59   ` Dan Williams
  2022-02-09 10:13     ` Jonathan Cameron
  2022-03-16 22:50     ` Ira Weiny
  1 sibling, 2 replies; 49+ messages in thread
From: Dan Williams @ 2022-02-09  0:59 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Jonathan Cameron, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, Linux Kernel Mailing List, linux-cxl, Linux PCI

On Mon, Jan 31, 2022 at 11:20 PM <ira.weiny@intel.com> wrote:
>
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>
> Introduced in a PCI ECN [1], DOE provides a config space based mailbox
> with standard protocol discovery.  Each mailbox is accessed through a
> DOE Extended Capability.
>
> Define an auxiliary device driver which control DOE auxiliary devices

s/control/controls/

> registered on the auxiliary bus.
>
> A DOE mailbox is allowed to support any number of protocols while some
> DOE protocol specifications apply additional restrictions.
>
> The protocols supported are queried and cached.  pci_doe_supports_prot()
> can be used to determine if the DOE device supports the protocol
> specified.
>
> A synchronous interface is provided in pci_doe_exchange_sync() to
> perform a single query / response exchange from the driver through the
> device specified.
>
> Testing was conducted against QEMU using:
>
> https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/
>
> This code is based on Jonathan's V4 series here:
>
> https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/
>
> [1] https://members.pcisig.com/wg/PCI-SIG/document/14143
>     Data Object Exchange (DOE) - Approved 12 March 2020
>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>
> ---
> NOTE: Bjorn mentioned that the signed off by's are backwards but
>         checkpatch complains no matter what I do.  Either the
>         co-developed by is wrong or the signed off by is wrong because
>         Jonathan is the original author.  The above order is acceptable
>         to checkpatch so I left it that way.
>
> Changes from V5
>         From Bjorn
>                 s/pci_WARN/pci_warn
>                         Add timeout period to print
>                 Trim to 80 chars
>                 Use Tabs for DOE define spacing
>                 Use %#x for clarity
>         From Jonathan
>                 Addresses concerns about the order of unwinding stuff
>                 s/doe/doe_dev in pci_doe_exchange_sync
>                 Correct kernel Doc comment
>                 Move pci_doe_task_complete() down in the file.
>                 Rework pci_doe_irq()
>                         process STATUS_ERROR first
>                         Return IRQ_NONE if the irq is not processed
>                         Use PCI_DOE_STATUS_INT_STATUS explicitly to
>                                 clear the irq
>         Clean up goto label s/err_free_irqs/err_free_irq
>         use devm_kzalloc for doe struct
>         clean up error paths in pci_doe_probe
>         s/pci_doe_drv/pci_doe
>         remove include mutex.h
>         remove device name and define, move it in the next patch which uses it
>         use devm_kasprintf() for irq_name
>         use devm_request_irq()
>         remove pci_doe_unregister()
>                 [get/put]_device() were unneeded and with the use of
>                 devm_* this function can be removed completely.
>         refactor pci_doe_register and s/pci_doe_register/pci_doe_reg_irq
>                 make this function just a registration of the irq and
>                 move pci_doe_abort() into pci_doe_probe()
>         use devm_* to allocate the protocol array
>
> Changes from Jonathan's V4
>         Move the DOE MB code into the DOE auxiliary driver
>         Remove Task List in favor of a wait queue
>
> Changes from Ben
>         remove CXL references
>         propagate rc from pci functions on error
> ---
>  drivers/pci/Kconfig           |  10 +
>  drivers/pci/Makefile          |   3 +
>  drivers/pci/doe.c             | 675 ++++++++++++++++++++++++++++++++++
>  include/linux/pci-doe.h       |  60 +++
>  include/uapi/linux/pci_regs.h |  29 +-
>  5 files changed, 776 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/pci/doe.c
>  create mode 100644 include/linux/pci-doe.h
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 43e615aa12ff..8de51b64067c 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -118,6 +118,16 @@ config XEN_PCIDEV_FRONTEND
>           The PCI device frontend driver allows the kernel to import arbitrary
>           PCI devices from a PCI backend to support PCI driver domains.
>
> +config PCI_DOE_DRIVER
> +       tristate "PCI Data Object Exchange (DOE) driver"
> +       select AUXILIARY_BUS

See the comment below about the odd usage of MODULE_DEVICE_TABLE.
Perhaps the auxiliary device / driver should be registered by the
client of this core code, not the core itself.

> +       help
> +         Driver for DOE auxiliary devices.
> +
> +         DOE provides a simple mailbox in PCI config space that is used by a
> +         number of different protocols.  DOE is defined in the Data Object
> +         Exchange ECN to the PCIe r5.0 spec.
> +
>  config PCI_ATS
>         bool
>
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index d62c4ac4ae1b..afd9d7bd2b82 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -28,8 +28,11 @@ obj-$(CONFIG_PCI_STUB)               += pci-stub.o
>  obj-$(CONFIG_PCI_PF_STUB)      += pci-pf-stub.o
>  obj-$(CONFIG_PCI_ECAM)         += ecam.o
>  obj-$(CONFIG_PCI_P2PDMA)       += p2pdma.o
> +obj-$(CONFIG_PCI_DOE_DRIVER)   += pci-doe.o
>  obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
>
> +pci-doe-y := doe.o
> +
>  # Endpoint library must be initialized before its users
>  obj-$(CONFIG_PCI_ENDPOINT)     += endpoint/
>
> diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> new file mode 100644
> index 000000000000..4ff54bade8ec
> --- /dev/null
> +++ b/drivers/pci/doe.c
> @@ -0,0 +1,675 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Data Object Exchange ECN
> + * https://members.pcisig.com/wg/PCI-SIG/document/14143
> + *
> + * Copyright (C) 2021 Huawei
> + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> + */
> +
> +#include <linux/bitfield.h>
> +#include <linux/delay.h>
> +#include <linux/jiffies.h>
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pci-doe.h>
> +#include <linux/workqueue.h>
> +#include <linux/module.h>
> +
> +#define PCI_DOE_PROTOCOL_DISCOVERY 0
> +
> +#define PCI_DOE_BUSY_MAX_RETRIES 16
> +#define PCI_DOE_POLL_INTERVAL (HZ / 128)
> +
> +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */
> +#define PCI_DOE_TIMEOUT HZ
> +
> +enum pci_doe_state {
> +       DOE_IDLE,
> +       DOE_WAIT_RESP,
> +       DOE_WAIT_ABORT,
> +       DOE_WAIT_ABORT_ON_ERR,
> +};
> +
> +/**
> + * struct pci_doe_task - description of a query / response task
> + * @ex: The details of the task to be done
> + * @rv: Return value.  Length of received response or error
> + * @cb: Callback for completion of task
> + * @private: Private data passed to callback on completion
> + */
> +struct pci_doe_task {
> +       struct pci_doe_exchange *ex;
> +       int rv;
> +       void (*cb)(void *private);

s/cb/end_task/?

Why does this need to abandon all semblance of type safety?

I would expect:

void (*end_task)(struct pci_doe_task *task);

...and let the caller attach any follow on data to task->private if necessary.
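
To make the suggestion concrete, here is a small userspace model of a
typed completion hook. It is only a sketch: struct demo_ctx,
demo_end_task() and doe_finish_task() are hypothetical names, and the
task struct is trimmed to the fields that matter for the callback.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model; field names follow the suggestion, not the patch. */
struct pci_doe_task {
	int rv;		/* response length or negative errno */
	void (*end_task)(struct pci_doe_task *task);	/* typed hook */
	void *private;	/* caller-owned follow-on data */
};

/* Hypothetical caller context, reached through task->private. */
struct demo_ctx {
	int done;
	int last_rv;
};

/* The callback sees the whole task, so rv needs no side channel. */
static void demo_end_task(struct pci_doe_task *task)
{
	struct demo_ctx *ctx = task->private;

	ctx->last_rv = task->rv;
	ctx->done = 1;
}

/* Stand-in for the state machine finishing a task. */
static void doe_finish_task(struct pci_doe_task *task, int rv)
{
	task->rv = rv;
	task->end_task(task);
}
```

The compiler now checks the callback signature, and the only untyped
pointer left is the one the caller chose to attach.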

> +       void *private;
> +};
> +
> +/**
> + * struct pci_doe - A single DOE mailbox driver

This is driver *state*, right? I.e. not something that wraps "struct
device_driver" which is what I would expect something claiming to be a
"driver" would do.

> + *
> + * @doe_dev: The DOE Auxiliary device being driven
> + * @abort_c: Completion used for initial abort handling
> + * @irq: Interrupt used for signaling DOE ready or abort
> + * @irq_name: Name used to identify the irq for a particular DOE
> + * @prots: Array of identifiers for protocols supported

"prot" already has a meaning in the kernel, just spell out
"protocols". This also looks like something that can be allocated
inline rather than out of line i.e.:

struct pci_doe {
...
    int nr_protocols;
    struct pci_doe_protocol protocols[];
};

...and then use struct_size() to allocate it.
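
As a rough illustration of the inline layout, here is a userspace
sketch. doe_struct_size() is a hypothetical stand-in for the kernel's
struct_size() helper, calloc() stands in for the devm allocation, and
the struct is cut down to the two fields in question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct pci_doe_protocol {
	uint16_t vid;
	uint8_t type;
};

/* Trimmed-down model of the state struct with the table stored inline. */
struct pci_doe {
	int nr_protocols;
	struct pci_doe_protocol protocols[];	/* flexible array member */
};

/*
 * Userspace stand-in for struct_size(): header plus n trailing
 * elements. Returns 0 if the multiplication would overflow.
 */
static size_t doe_struct_size(size_t n)
{
	size_t elem = sizeof(struct pci_doe_protocol);

	if (n > (SIZE_MAX - sizeof(struct pci_doe)) / elem)
		return 0;
	return sizeof(struct pci_doe) + n * elem;
}

/* One allocation covers both the header and the protocol table. */
static struct pci_doe *doe_alloc(int nr_protocols)
{
	struct pci_doe *doe;
	size_t sz;

	if (nr_protocols < 0)
		return NULL;
	sz = doe_struct_size((size_t)nr_protocols);
	if (!sz)
		return NULL;
	doe = calloc(1, sz);
	if (doe)
		doe->nr_protocols = nr_protocols;
	return doe;
}
```

Note that this shape needs the protocol count before the allocation,
so discovery would have to count entries first or the struct would
have to be reallocated as entries arrive.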

> + * @num_prots: Size of prots array
> + * @cur_task: Current task the state machine is working on
> + * @wq: Wait queue to wait on if a query is in progress
> + * @state_lock: Protect the state of cur_task, abort, and dead
> + * @statemachine: Work item for the DOE state machine
> + * @state: Current state of this DOE
> + * @timeout_jiffies: 1 second after GO set
> + * @busy_retries: Count of retry attempts
> + * @abort: Request a manual abort (e.g. on init)
> + * @dead: Used to mark a DOE for which an ABORT has timed out. Further messages
> + *        will immediately be aborted with error
> + */
> +struct pci_doe {
> +       struct pci_doe_dev *doe_dev;
> +       struct completion abort_c;
> +       int irq;
> +       char *irq_name;
> +       struct pci_doe_protocol *prots;
> +       int num_prots;
> +
> +       struct pci_doe_task *cur_task;
> +       wait_queue_head_t wq;
> +       struct mutex state_lock;
> +       struct delayed_work statemachine;
> +       enum pci_doe_state state;
> +       unsigned long timeout_jiffies;
> +       unsigned int busy_retries;
> +       unsigned int abort:1;
> +       unsigned int dead:1;
> +};
> +
> +static irqreturn_t pci_doe_irq(int irq, void *data)
> +{
> +       struct pci_doe *doe = data;
> +       struct pci_dev *pdev = doe->doe_dev->pdev;
> +       int offset = doe->doe_dev->cap_offset;
> +       u32 val;
> +
> +       pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +
> +       /* Leave the error case to be handled outside IRQ */
> +       if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> +               mod_delayed_work(system_wq, &doe->statemachine, 0);
> +               return IRQ_HANDLED;
> +       }
> +
> +       if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> +               pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> +                                       PCI_DOE_STATUS_INT_STATUS);
> +               mod_delayed_work(system_wq, &doe->statemachine, 0);
> +               return IRQ_HANDLED;
> +       }
> +
> +       return IRQ_NONE;
> +}
> +
> +/*
> + * Only call when safe to directly access the DOE, either because no tasks yet
> + * queued, or called from doe_statemachine_work() which has exclusive access to
> + * the DOE config space.

It doesn't have exclusive access unless the patch to lock out
userspace config writes is revived. Instead, I like Bjorn's idea [1] of
tracking and warning / tainting, but not blocking conflicting
userspace access to sensitive configuration registers.

Yes, it was somewhat of a throw-away comment from Bjorn in that
thread, "(and IMO should taint the kernel)", but DOE can do so much
subtle damage (compliance test modes, link-encryption / disruption,
vendor private who-knows-what...) that I think it behooves us as
kernel developers to know when we are debugging system behavior that
may be the result of non-kernel mitigated DOE access. The proposal is
that when kernel lockdown is not enabled, use the approach from the
exclusive config access patch [2] to trap, warn (once per device?),
and taint when userspace writes to DOE registers that have been
claimed by the kernel. This lets strict environments use
kernel-lockdown to block userspace DOE access altogether; in
non-strict environments it discourages userspace from clobbering DOE
driver state, and it allows a warn-free path if userspace at least
unbinds the kernel DOE driver before running userspace DOE cycles.

[1]: https://lore.kernel.org/r/20211203235617.GA3036259@bhelgaas
[2]: https://lore.kernel.org/all/161663543465.1867664.5674061943008380442.stgit@dwillia2-desk3.amr.corp.intel.com/
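
A rough userspace model of that trap / warn / taint policy, to make
the cases concrete. Every name here (struct cfg_claim,
user_cfg_write_check(), the verdict enum) is made up for illustration
and is not taken from the exclusive config access patch [2];
kernel_tainted merely stands in for a real taint call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* All names are hypothetical; this only models the proposed policy. */
enum cfg_write_verdict {
	CFG_WRITE_ALLOW,	/* range not claimed by a kernel driver */
	CFG_WRITE_ALLOW_TAINT,	/* claimed range: allowed, warned, tainted */
	CFG_WRITE_DENY,		/* claimed range under kernel lockdown */
};

struct cfg_claim {
	int start;	/* first claimed config space offset */
	int len;	/* claimed length in bytes, 0 when unbound */
	bool warned;	/* warn only once per device */
};

static bool kernel_tainted;	/* stand-in for tainting the kernel */

static enum cfg_write_verdict
user_cfg_write_check(struct cfg_claim *claim, int off, int size,
		     bool lockdown)
{
	/* No overlap with the claimed DOE registers: nothing to police. */
	if (!claim->len || off + size <= claim->start ||
	    off >= claim->start + claim->len)
		return CFG_WRITE_ALLOW;

	/* Strict environments block the write outright. */
	if (lockdown)
		return CFG_WRITE_DENY;

	if (!claim->warned) {
		fprintf(stderr,
			"userspace write to claimed DOE registers at %#x\n",
			(unsigned int)off);
		claim->warned = true;
	}
	kernel_tainted = true;
	return CFG_WRITE_ALLOW_TAINT;
}
```

Unbinding the DOE driver corresponds to dropping the claim (len = 0),
which gives userspace the warn-free path.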

> + */
> +static void pci_doe_abort_start(struct pci_doe *doe)
> +{
> +       struct pci_dev *pdev = doe->doe_dev->pdev;
> +       int offset = doe->doe_dev->cap_offset;
> +       u32 val;
> +
> +       val = PCI_DOE_CTRL_ABORT;
> +       if (doe->irq)
> +               val |= PCI_DOE_CTRL_INT_EN;
> +       pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> +
> +       doe->timeout_jiffies = jiffies + HZ;
> +       schedule_delayed_work(&doe->statemachine, HZ);

Given the spec timeout is 1 second and the device clock might be
slightly off from the host clock, how about making this a more generous
1.5 or 2 seconds?

> +}
> +
> +static int pci_doe_send_req(struct pci_doe *doe, struct pci_doe_exchange *ex)

The relationship between tasks, requests, responses, and exchanges is
not immediately clear to me. For example, can this helper be renamed
in terms of its relationship to a task? A theory of operation document
would help, but it seems there is also room for the implementation to
be more self documenting.

> +{
> +       struct pci_dev *pdev = doe->doe_dev->pdev;
> +       int offset = doe->doe_dev->cap_offset;
> +       u32 val;
> +       int i;
> +
> +       /*
> +        * Check the DOE busy bit is not set. If it is set, this could indicate
> +        * someone other than Linux (e.g. firmware) is using the mailbox. Note
> +        * it is expected that firmware and OS will negotiate access rights via
> +        * an, as yet to be defined method.
> +        */
> +       pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +       if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
> +               return -EBUSY;
> +
> +       if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> +               return -EIO;
> +
> +       /* Write DOE Header */
> +       val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, ex->prot.vid) |
> +               FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, ex->prot.type);
> +       pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val);
> +       /* Length is 2 DW of header + length of payload in DW */
> +       pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> +                              FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH,
> +                                         2 + ex->request_pl_sz /
> +                                               sizeof(u32)));
> +       for (i = 0; i < ex->request_pl_sz / sizeof(u32); i++)
> +               pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> +                                      ex->request_pl[i]);
> +
> +       val = PCI_DOE_CTRL_GO;
> +       if (doe->irq)
> +               val |= PCI_DOE_CTRL_INT_EN;
> +
> +       pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> +       /* Request is sent - now wait for poll or IRQ */
> +       return 0;
> +}
> +
> +static int pci_doe_recv_resp(struct pci_doe *doe, struct pci_doe_exchange *ex)
> +{
> +       struct pci_dev *pdev = doe->doe_dev->pdev;
> +       int offset = doe->doe_dev->cap_offset;
> +       size_t length;
> +       u32 val;
> +       int i;
> +
> +       /* Read the first dword to get the protocol */
> +       pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> +       if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != ex->prot.vid) ||
> +           (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != ex->prot.type)) {
> +               pci_err(pdev,
> +                       "Expected [VID, Protocol] = [%#x, %#x], got [%#x, %#x]\n",
> +                       ex->prot.vid, ex->prot.type,
> +                       FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
> +                       FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
> +               return -EIO;
> +       }
> +
> +       pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +       /* Read the second dword to get the length */
> +       pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> +       pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +
> +       length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> +       if (length > SZ_1M || length < 2)
> +               return -EIO;
> +
> +       /* First 2 dwords have already been read */
> +       length -= 2;
> +       /* Read the rest of the response payload */
> +       for (i = 0; i < min(length, ex->response_pl_sz / sizeof(u32)); i++) {
> +               pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> +                                     &ex->response_pl[i]);
> +               pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +       }
> +
> +       /* Flush excess length */
> +       for (; i < length; i++) {
> +               pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> +               pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +       }
> +       /* Final error check to pick up on any since Data Object Ready */
> +       pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +       if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> +               return -EIO;
> +
> +       return min(length, ex->response_pl_sz / sizeof(u32)) * sizeof(u32);
> +}
> +
> +static void doe_statemachine_work(struct work_struct *work)
> +{
> +       struct delayed_work *w = to_delayed_work(work);
> +       struct pci_doe *doe = container_of(w, struct pci_doe, statemachine);
> +       struct pci_dev *pdev = doe->doe_dev->pdev;
> +       int offset = doe->doe_dev->cap_offset;
> +       struct pci_doe_task *task;
> +       bool abort;
> +       u32 val;
> +       int rc;
> +
> +       mutex_lock(&doe->state_lock);
> +       task = doe->cur_task;
> +       abort = doe->abort;
> +       doe->abort = false;
> +       mutex_unlock(&doe->state_lock);
> +
> +       if (abort) {
> +               /*
> +                * Currently only used during init - care needed if
> +                * pci_doe_abort() is generally exposed as it would impact
> +                * queries in flight.
> +                */
> +               WARN_ON(task);

Why is it worth potentially crashing the kernel here? Is this purely a
situation that will only happen during development and refactoring of
the driver? Otherwise I would expect the error to be handled without a
WARN.


> +               doe->state = DOE_WAIT_ABORT;
> +               pci_doe_abort_start(doe);
> +               return;
> +       }
> +
> +       switch (doe->state) {
> +       case DOE_IDLE:
> +               if (task == NULL)
> +                       return;
> +
> +               /* Nothing currently in flight so queue a task */
> +               rc = pci_doe_send_req(doe, task->ex);
> +               /*
> +                * The specification does not provide any guidance on how long
> +                * some other entity could keep the DOE busy, so try for 1
> +                * second then fail. Busy handling is best effort only, because
> +                * there is no way of avoiding racing against another user of
> +                * the DOE.
> +                */
> +               if (rc == -EBUSY) {
> +                       doe->busy_retries++;
> +                       if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
> +                               /* Long enough, fail this request */
> +                               pci_warn(pdev,
> +                                       "DOE busy for too long (> 1 sec)\n");
> +                               doe->busy_retries = 0;
> +                               goto err_busy;
> +                       }
> +                       schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES);
> +                       return;
> +               }
> +               if (rc)
> +                       goto err_abort;
> +               doe->busy_retries = 0;
> +
> +               doe->state = DOE_WAIT_RESP;
> +               doe->timeout_jiffies = jiffies + HZ;
> +               /* Now poll or wait for IRQ with timeout */
> +               if (doe->irq > 0)
> +                       schedule_delayed_work(w, PCI_DOE_TIMEOUT);
> +               else
> +                       schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> +               return;
> +
> +       case DOE_WAIT_RESP:
> +               /* Not possible to get here with NULL task */
> +               pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +               if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> +                       rc = -EIO;
> +                       goto err_abort;
> +               }
> +
> +               if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
> +                       /* If not yet at timeout reschedule otherwise abort */
> +                       if (time_after(jiffies, doe->timeout_jiffies)) {
> +                               rc = -ETIMEDOUT;
> +                               goto err_abort;
> +                       }
> +                       schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> +                       return;
> +               }
> +
> +               rc  = pci_doe_recv_resp(doe, task->ex);
> +               if (rc < 0)
> +                       goto err_abort;
> +
> +               doe->state = DOE_IDLE;
> +
> +               mutex_lock(&doe->state_lock);
> +               doe->cur_task = NULL;
> +               mutex_unlock(&doe->state_lock);
> +               wake_up_interruptible(&doe->wq);
> +
> +               /* Set the return value to the length of received payload */
> +               task->rv = rc;
> +               task->cb(task->private);
> +
> +               return;
> +
> +       case DOE_WAIT_ABORT:
> +       case DOE_WAIT_ABORT_ON_ERR:
> +               pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +
> +               if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> +                   !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> +                       /* Back to normal state - carry on */
> +                       mutex_lock(&doe->state_lock);
> +                       doe->cur_task = NULL;
> +                       mutex_unlock(&doe->state_lock);
> +                       wake_up_interruptible(&doe->wq);
> +
> +                       /*
> +                        * For deliberately triggered abort, someone is
> +                        * waiting.
> +                        */
> +                       if (doe->state == DOE_WAIT_ABORT)
> +                               complete(&doe->abort_c);

Why are both a completion and a waitqueue needed? I.e. a waiter could
simply look for an abort completion flag to be set instead.


> +
> +                       doe->state = DOE_IDLE;
> +                       return;
> +               }
> +               if (time_after(jiffies, doe->timeout_jiffies)) {
> +                       /* Task has timed out and is dead - abort */
> +                       pci_err(pdev, "DOE ABORT timed out\n");
> +                       mutex_lock(&doe->state_lock);
> +                       doe->dead = true;
> +                       doe->cur_task = NULL;
> +                       mutex_unlock(&doe->state_lock);
> +                       wake_up_interruptible(&doe->wq);
> +
> +                       if (doe->state == DOE_WAIT_ABORT)
> +                               complete(&doe->abort_c);
> +               }
> +               return;
> +       }
> +
> +err_abort:
> +       doe->state = DOE_WAIT_ABORT_ON_ERR;
> +       pci_doe_abort_start(doe);
> +err_busy:
> +       task->rv = rc;
> +       task->cb(task->private);
> +       /* If here via err_busy, signal the task done. */
> +       if (doe->state == DOE_IDLE) {
> +               mutex_lock(&doe->state_lock);
> +               doe->cur_task = NULL;
> +               mutex_unlock(&doe->state_lock);
> +               wake_up_interruptible(&doe->wq);
> +       }
> +}
> +
> +static void pci_doe_task_complete(void *private)
> +{
> +       complete(private);
> +}
> +
> +/**
> + * pci_doe_exchange_sync() - Send a request, then wait for and receive a
> + *                          response
> + * @doe_dev: DOE mailbox state structure
> + * @ex: Description of the buffers and Vendor ID + type used in this
> + *      request/response pair
> + *
> + * Excess data will be discarded.
> + *
> + * RETURNS: payload in bytes on success, < 0 on error
> + */
> +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev,
> +                         struct pci_doe_exchange *ex)
> +{
> +       struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> +       struct pci_doe_task task;
> +       DECLARE_COMPLETION_ONSTACK(c);
> +
> +       if (!doe)
> +               return -EAGAIN;
> +
> +       /* DOE requests must be a whole number of DW */
> +       if (ex->request_pl_sz % sizeof(u32))
> +               return -EINVAL;
> +
> +       task.ex = ex;
> +       task.cb = pci_doe_task_complete;
> +       task.private = &c;
> +
> +again:
> +       mutex_lock(&doe->state_lock);
> +       if (doe->cur_task) {
> +               mutex_unlock(&doe->state_lock);
> +               wait_event_interruptible(doe->wq, doe->cur_task == NULL);
> +               goto again;
> +       }
> +
> +       if (doe->dead) {
> +               mutex_unlock(&doe->state_lock);
> +               return -EIO;
> +       }
> +       doe->cur_task = &task;
> +       schedule_delayed_work(&doe->statemachine, 0);
> +       mutex_unlock(&doe->state_lock);
> +
> +       wait_for_completion(&c);

I would expect that the caller of this routine would want to specify
the task and end_task() callback and use that as the completion
signal. It may also want "no wait" behavior where it is prepared for
the DOE result to come back sometime later. With that change the
exchange fields can move into the task directly.
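
Something along these lines is what I have in mind, sketched as a
userspace model. The names doe_submit_task(), doe_complete_cur(),
struct doe_mbox and count_done() are hypothetical, and the busy case
simply returns -EBUSY here where real code would queue or sleep.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Task with the exchange fields folded in, owned by the caller. */
struct pci_doe_task {
	uint16_t vid;
	uint8_t type;
	const uint32_t *request_pl;
	size_t request_pl_sz;
	uint32_t *response_pl;
	size_t response_pl_sz;
	int rv;		/* response length or negative errno */
	void (*end_task)(struct pci_doe_task *task);
	void *private;
};

/* Minimal mailbox model: at most one task in flight. */
struct doe_mbox {
	struct pci_doe_task *cur_task;
};

/* Example caller callback: counts completions via task->private. */
static void count_done(struct pci_doe_task *task)
{
	(*(int *)task->private)++;
}

/* Queue a task and return at once; end_task() signals completion. */
static int doe_submit_task(struct doe_mbox *mb, struct pci_doe_task *task)
{
	/* DOE requests must be a whole number of DW */
	if (task->request_pl_sz % sizeof(uint32_t))
		return -EINVAL;
	if (mb->cur_task)
		return -EBUSY;
	mb->cur_task = task;
	return 0;
}

/* Stand-in for the state machine completing the current task. */
static void doe_complete_cur(struct doe_mbox *mb, int rv)
{
	struct pci_doe_task *task = mb->cur_task;

	if (!task)
		return;
	mb->cur_task = NULL;
	task->rv = rv;
	task->end_task(task);
}
```

A synchronous wrapper then becomes a caller that puts a completion in
task->private and waits on it, while a "no wait" caller just returns
after the submit.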

> +
> +       return task.rv;
> +}
> +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync);
> +
> +/**
> + * pci_doe_supports_prot() - Return if the DOE instance supports the given
> + *                          protocol
> + * @pdev: Device on which to find the DOE instance
> + * @vid: Protocol Vendor ID
> + * @type: protocol type
> + *
> + * This device can then be passed to pci_doe_exchange_sync() to execute a
> + * mailbox exchange through that DOE mailbox.
> + *
> + * RETURNS: True if the DOE device supports the protocol specified
> + */
> +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type)
> +{
> +       struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> +       int i;
> +
> +       if (!doe)
> +               return false;
> +
> +       for (i = 0; i < doe->num_prots; i++)
> +               if ((doe->prots[i].vid == vid) &&
> +                   (doe->prots[i].type == type))
> +                       return true;
> +
> +       return false;
> +}
> +EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
> +
> +static int pci_doe_discovery(struct pci_doe *doe, u8 *index, u16 *vid,
> +                            u8 *protocol)
> +{
> +       u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX,
> +                                   *index);
> +       u32 response_pl;
> +       struct pci_doe_exchange ex = {
> +               .prot.vid = PCI_VENDOR_ID_PCI_SIG,
> +               .prot.type = PCI_DOE_PROTOCOL_DISCOVERY,
> +               .request_pl = &request_pl,
> +               .request_pl_sz = sizeof(request_pl),
> +               .response_pl = &response_pl,
> +               .response_pl_sz = sizeof(response_pl),
> +       };
> +       int ret;
> +
> +       ret = pci_doe_exchange_sync(doe->doe_dev, &ex);
> +       if (ret < 0)
> +               return ret;
> +
> +       if (ret != sizeof(response_pl))
> +               return -EIO;
> +
> +       *vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> +       *protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> +                             response_pl);
> +       *index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> +                          response_pl);
> +
> +       return 0;
> +}
> +
> +static int pci_doe_cache_protocols(struct pci_doe *doe)
> +{
> +       u8 index = 0;
> +       int num_prots;
> +       int rc;
> +
> +       /* Discovery protocol must always be supported and must report itself */
> +       num_prots = 1;
> +       doe->prots = devm_kcalloc(&doe->doe_dev->adev.dev, num_prots,
> +                                 sizeof(*doe->prots), GFP_KERNEL);
> +       if (doe->prots == NULL)
> +               return -ENOMEM;
> +
> +       do {
> +               struct pci_doe_protocol *prot;
> +
> +               prot = &doe->prots[num_prots - 1];
> +               rc = pci_doe_discovery(doe, &index, &prot->vid, &prot->type);
> +               if (rc)
> +                       return rc;
> +
> +               if (index) {
> +                       struct pci_doe_protocol *prot_new;
> +
> +                       num_prots++;
> +                       prot_new = devm_krealloc(&doe->doe_dev->adev.dev,
> +                                                doe->prots,
> +                                                sizeof(*doe->prots) *
> +                                                       num_prots,
> +                                                GFP_KERNEL);
> +                       if (prot_new == NULL)
> +                               return -ENOMEM;
> +                       doe->prots = prot_new;
> +               }
> +       } while (index);
> +
> +       doe->num_prots = num_prots;
> +       return 0;
> +}
> +
> +static int pci_doe_abort(struct pci_doe *doe)
> +{
> +       reinit_completion(&doe->abort_c);
> +       mutex_lock(&doe->state_lock);
> +       doe->abort = true;

Why not a flags field where atomic bitops can be used without the need
for a mutex?

> +       mutex_unlock(&doe->state_lock);
> +       schedule_delayed_work(&doe->statemachine, 0);
> +       wait_for_completion(&doe->abort_c);
> +
> +       if (doe->dead)

dead could also be another atomic flag.
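
For illustration, a userspace sketch of abort and dead as bits in one
atomic word. C11 atomics stand in for the kernel's set_bit(),
test_and_clear_bit() and test_bit(); the flag names and the doe_*_flag()
helpers are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers within a single flags word. */
enum {
	DOE_FLAG_ABORT,
	DOE_FLAG_DEAD,
};

struct doe_flags {
	atomic_ulong bits;
};

/* Models set_bit(): atomically set one flag. */
static void doe_set_flag(struct doe_flags *f, int nr)
{
	atomic_fetch_or(&f->bits, 1UL << nr);
}

/* Models test_and_clear_bit(): return the old value and clear it. */
static bool doe_test_and_clear_flag(struct doe_flags *f, int nr)
{
	unsigned long mask = 1UL << nr;

	return atomic_fetch_and(&f->bits, ~mask) & mask;
}

/* Models test_bit(): read one flag without modifying it. */
static bool doe_test_flag(struct doe_flags *f, int nr)
{
	return atomic_load(&f->bits) & (1UL << nr);
}
```

The state machine's "read abort, then clear it" pair collapses into a
single test-and-clear, so that step no longer needs state_lock.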

> +               return -EIO;
> +
> +       return 0;
> +}
> +
> +static int pci_doe_reg_irq(struct pci_doe *doe)
> +{
> +       struct pci_dev *pdev = doe->doe_dev->pdev;
> +       bool poll = !pci_dev_msi_enabled(pdev);
> +       int offset = doe->doe_dev->cap_offset;
> +       int rc, irq;
> +       u32 val;
> +
> +       pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
> +
> +       if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) {
> +               irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val));
> +               if (irq < 0)
> +                       return irq;
> +
> +               doe->irq_name = devm_kasprintf(&doe->doe_dev->adev.dev,
> +                                               GFP_KERNEL,
> +                                               "DOE[%s]",
> +                                               doe->doe_dev->adev.name);
> +               if (!doe->irq_name)
> +                       return -ENOMEM;
> +
> +               rc = devm_request_irq(&pdev->dev, irq, pci_doe_irq, 0,
> +                                     doe->irq_name, doe);
> +               if (rc)
> +                       return rc;
> +
> +               doe->irq = irq;
> +               pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
> +                                      PCI_DOE_CTRL_INT_EN);
> +       }
> +
> +       return 0;
> +}
> +
> +/*
> + * pci_doe_probe() - Set up the Mailbox
> + * @aux_dev: Auxiliary Device
> + * @id: Auxiliary device ID
> + *
> + * Probe the mailbox found for all protocols and set up the Mailbox
> + *
> + * RETURNS: 0 on success, < 0 on error
> + */
> +static int pci_doe_probe(struct auxiliary_device *aux_dev,
> +                        const struct auxiliary_device_id *id)
> +{
> +       struct pci_doe_dev *doe_dev = container_of(aux_dev,
> +                                       struct pci_doe_dev,
> +                                       adev);
> +       struct pci_doe *doe;
> +       int rc;
> +
> +       doe = devm_kzalloc(&aux_dev->dev, sizeof(*doe), GFP_KERNEL);
> +       if (!doe)
> +               return -ENOMEM;
> +
> +       mutex_init(&doe->state_lock);
> +       init_completion(&doe->abort_c);
> +       doe->doe_dev = doe_dev;
> +       init_waitqueue_head(&doe->wq);
> +       INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work);
> +       dev_set_drvdata(&aux_dev->dev, doe);
> +
> +       rc = pci_doe_reg_irq(doe);
> +       if (rc)
> +               return rc;
> +
> +       /* Reset the mailbox by issuing an abort */
> +       rc = pci_doe_abort(doe);
> +       if (rc)
> +               return rc;
> +
> +       rc = pci_doe_cache_protocols(doe);
> +       if (rc)
> +               return rc;

This can just be:

 return pci_doe_cache_protocols(doe);

> +
> +       return 0;
> +}
> +
> +static void pci_doe_remove(struct auxiliary_device *aux_dev)
> +{
> +       struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev);
> +
> +       /* First halt the state machine */
> +       cancel_delayed_work_sync(&doe->statemachine);
> +}
> +
> +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = {
> +       {},
> +};
> +
> +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table);

Why is this empty table here?

> +
> +struct auxiliary_driver pci_doe_auxiliary_drv = {
> +       .name = "pci_doe",
> +       .id_table = pci_doe_auxiliary_id_table,
> +       .probe = pci_doe_probe,
> +       .remove = pci_doe_remove
> +};

I expect that these helpers would be provided by the PCI core, but
then a subsystem like CXL would have code to register their auxiliary
devices and drivers that mostly just wrap the PCI core DOE
implementation.

> +
> +static int __init pci_doe_init_module(void)
> +{
> +       int ret;
> +
> +       ret = auxiliary_driver_register(&pci_doe_auxiliary_drv);
> +       if (ret) {
> +               pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n",
> +                      ret);
> +               return ret;
> +       }
> +
> +       return 0;
> +}
> +
> +static void __exit pci_doe_exit_module(void)
> +{
> +       auxiliary_driver_unregister(&pci_doe_auxiliary_drv);
> +}
> +
> +module_init(pci_doe_init_module);
> +module_exit(pci_doe_exit_module);
> +MODULE_LICENSE("GPL v2");
> diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
> new file mode 100644
> index 000000000000..2f52b31c6f32
> --- /dev/null
> +++ b/include/linux/pci-doe.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec.
> + *
> + * Copyright (C) 2021 Huawei
> + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> + */
> +
> +#include <linux/completion.h>
> +#include <linux/list.h>
> +#include <linux/auxiliary_bus.h>
> +
> +#ifndef LINUX_PCI_DOE_H
> +#define LINUX_PCI_DOE_H
> +
> +struct pci_doe_protocol {
> +       u16 vid;
> +       u8 type;
> +};
> +
> +/**
> + * struct pci_doe_exchange - represents a single query/response
> + *
> + * @prot: DOE Protocol
> + * @request_pl: The request payload
> + * @request_pl_sz: Size of the request payload
> + * @response_pl: The response payload
> + * @response_pl_sz: Size of the response payload
> + */
> +struct pci_doe_exchange {
> +       struct pci_doe_protocol prot;
> +       u32 *request_pl;
> +       size_t request_pl_sz;
> +       u32 *response_pl;
> +       size_t response_pl_sz;
> +};
> +
> +/**
> + * struct pci_doe_dev - DOE mailbox device
> + *
> + * @adev: Auxiliary Device data
> + * @pdev: PCI device this belongs to
> + * @cap_offset: Capability offset
> + *
> + * This represents a single DOE mailbox device.  Devices should create this
> + * device and register it on the Auxiliary bus for the DOE driver to maintain.
> + *
> + */
> +struct pci_doe_dev {
> +       struct auxiliary_device adev;
> +       struct pci_dev *pdev;
> +       int cap_offset;
> +};
> +
> +/* Library operations */
> +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev,
> +                                struct pci_doe_exchange *ex);
> +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type);
> +
> +#endif
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index ff6ccbc6efe9..c04aad391669 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -736,7 +736,8 @@
>  #define PCI_EXT_CAP_ID_DVSEC   0x23    /* Designated Vendor-Specific */
>  #define PCI_EXT_CAP_ID_DLF     0x25    /* Data Link Feature */
>  #define PCI_EXT_CAP_ID_PL_16GT 0x26    /* Physical Layer 16.0 GT/s */
> -#define PCI_EXT_CAP_ID_MAX     PCI_EXT_CAP_ID_PL_16GT
> +#define PCI_EXT_CAP_ID_DOE     0x2E    /* Data Object Exchange */
> +#define PCI_EXT_CAP_ID_MAX     PCI_EXT_CAP_ID_DOE
>
>  #define PCI_EXT_CAP_DSN_SIZEOF 12
>  #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
> @@ -1098,4 +1099,30 @@
>  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK                0x000000F0
>  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT       4
>
> +/* Data Object Exchange */
> +#define PCI_DOE_CAP            0x04    /* DOE Capabilities Register */
> +#define  PCI_DOE_CAP_INT                       0x00000001  /* Interrupt Support */
> +#define  PCI_DOE_CAP_IRQ                       0x00000ffe  /* Interrupt Message Number */
> +#define PCI_DOE_CTRL           0x08    /* DOE Control Register */
> +#define  PCI_DOE_CTRL_ABORT                    0x00000001  /* DOE Abort */
> +#define  PCI_DOE_CTRL_INT_EN                   0x00000002  /* DOE Interrupt Enable */
> +#define  PCI_DOE_CTRL_GO                       0x80000000  /* DOE Go */
> +#define PCI_DOE_STATUS         0x0c    /* DOE Status Register */
> +#define  PCI_DOE_STATUS_BUSY                   0x00000001  /* DOE Busy */
> +#define  PCI_DOE_STATUS_INT_STATUS             0x00000002  /* DOE Interrupt Status */
> +#define  PCI_DOE_STATUS_ERROR                  0x00000004  /* DOE Error */
> +#define  PCI_DOE_STATUS_DATA_OBJECT_READY      0x80000000  /* Data Object Ready */
> +#define PCI_DOE_WRITE          0x10    /* DOE Write Data Mailbox Register */
> +#define PCI_DOE_READ           0x14    /* DOE Read Data Mailbox Register */
> +
> +/* DOE Data Object - note not actually registers */
> +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID               0x0000ffff
> +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE              0x00ff0000
> +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH            0x0003ffff
> +
> +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX           0x000000ff
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID             0x0000ffff
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL                0x00ff0000
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX      0xff000000
> +
>  #endif /* LINUX_PCI_REGS_H */
> --
> 2.31.1
>

* Re: [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver
  2022-02-09  0:59   ` Dan Williams
@ 2022-02-09 10:13     ` Jonathan Cameron
  2022-02-09 16:26       ` Dan Williams
  2022-03-16 22:50     ` Ira Weiny
  1 sibling, 1 reply; 49+ messages in thread
From: Jonathan Cameron @ 2022-02-09 10:13 UTC (permalink / raw)
  To: Dan Williams
  Cc: Weiny, Ira, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, Linux Kernel Mailing List, linux-cxl, Linux PCI

On Tue, 8 Feb 2022 16:59:39 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> On Mon, Jan 31, 2022 at 11:20 PM <ira.weiny@intel.com> wrote:
> >
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >
> > Introduced in a PCI ECN [1], DOE provides a config space based mailbox
> > with standard protocol discovery.  Each mailbox is accessed through a
> > DOE Extended Capability.
> >
> > Define an auxiliary device driver which control DOE auxiliary devices  
> 
> s/control/controls/
> 
> > registered on the auxiliary bus.
> >
> > A DOE mailbox is allowed to support any number of protocols while some
> > DOE protocol specifications apply additional restrictions.
> >
> > The protocols supported are queried and cached.  pci_doe_supports_prot()
> > can be used to determine if the DOE device supports the protocol
> > specified.
> >
> > A synchronous interface is provided in pci_doe_exchange_sync() to
> > perform a single query / response exchange from the driver through the
> > device specified.
> >
> > Testing was conducted against QEMU using:
> >
> > https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/
> >
> > This code is based on Jonathan's V4 series here:
> >
> > https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/
> >
> > [1] https://members.pcisig.com/wg/PCI-SIG/document/14143
> >     Data Object Exchange (DOE) - Approved 12 March 2020
> >
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Replies to some comments inline.

> >
> > ---
> > NOTE: Bjorn mentioned that the signed off by's are backwards but
> >         checkpatch complains no matter what I do.  Either the
> >         co-developed by is wrong or the signed off by is wrong because
> >         Jonathan is the original author.  The above order is acceptable
> >         to checkpatch so I left it that way.
> >
> > Changes from V5
> >         From Bjorn
> >                 s/pci_WARN/pci_warn
> >                         Add timeout period to print
> >                 Trim to 80 chars
> >                 Use Tabs for DOE define spacing
> >                 Use %#x for clarity
> >         From Jonathan
> >                 Addresses concerns about the order of unwinding stuff
> >                 s/doe/doe_dev in pci_doe_exchange_sync
> >                 Correct kernel Doc comment
> >                 Move pci_doe_task_complete() down in the file.
> >                 Rework pci_doe_irq()
> >                         process STATUS_ERROR first
> >                         Return IRQ_NONE if the irq is not processed
> >                         Use PCI_DOE_STATUS_INT_STATUS explicitly to
> >                                 clear the irq
> >         Clean up goto label s/err_free_irqs/err_free_irq
> >         use devm_kzalloc for doe struct
> >         clean up error paths in pci_doe_probe
> >         s/pci_doe_drv/pci_doe
> >         remove include mutex.h
> >         remove device name and define, move it in the next patch which uses it
> >         use devm_kasprintf() for irq_name
> >         use devm_request_irq()
> >         remove pci_doe_unregister()
> >                 [get/put]_device() were unneeded and with the use of
> >                 devm_* this function can be removed completely.
> >         refactor pci_doe_register and s/pci_doe_register/pci_doe_reg_irq
> >                 make this function just a registration of the irq and
> >                 move pci_doe_abort() into pci_doe_probe()
> >         use devm_* to allocate the protocol array
> >
> > Changes from Jonathan's V4
> >         Move the DOE MB code into the DOE auxiliary driver
> >         Remove Task List in favor of a wait queue
> >
> > Changes from Ben
> >         remove CXL references
> >         propagate rc from pci functions on error
> > ---
> >  drivers/pci/Kconfig           |  10 +
> >  drivers/pci/Makefile          |   3 +
> >  drivers/pci/doe.c             | 675 ++++++++++++++++++++++++++++++++++
> >  include/linux/pci-doe.h       |  60 +++
> >  include/uapi/linux/pci_regs.h |  29 +-
> >  5 files changed, 776 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/pci/doe.c
> >  create mode 100644 include/linux/pci-doe.h
> >
> > diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> > index 43e615aa12ff..8de51b64067c 100644
> > --- a/drivers/pci/Kconfig
> > +++ b/drivers/pci/Kconfig
> > @@ -118,6 +118,16 @@ config XEN_PCIDEV_FRONTEND
> >           The PCI device frontend driver allows the kernel to import arbitrary
> >           PCI devices from a PCI backend to support PCI driver domains.
> >
> > +config PCI_DOE_DRIVER
> > +       tristate "PCI Data Object Exchange (DOE) driver"
> > +       select AUXILIARY_BUS  
> 
> See below near the comment about the odd usage of MODULE_DEVICE_TABLE,
> perhaps the auxiliary device / driver should be registered by the
> client of this core code, not the core itself.
> 
> > +       help
> > +         Driver for DOE auxiliary devices.
> > +
> > +         DOE provides a simple mailbox in PCI config space that is used by a
> > +         number of different protocols.  DOE is defined in the Data Object
> > +         Exchange ECN to the PCIe r5.0 spec.
> > +
> >  config PCI_ATS
> >         bool
> >
> > diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> > index d62c4ac4ae1b..afd9d7bd2b82 100644
> > --- a/drivers/pci/Makefile
> > +++ b/drivers/pci/Makefile
> > @@ -28,8 +28,11 @@ obj-$(CONFIG_PCI_STUB)               += pci-stub.o
> >  obj-$(CONFIG_PCI_PF_STUB)      += pci-pf-stub.o
> >  obj-$(CONFIG_PCI_ECAM)         += ecam.o
> >  obj-$(CONFIG_PCI_P2PDMA)       += p2pdma.o
> > +obj-$(CONFIG_PCI_DOE_DRIVER)   += pci-doe.o
> >  obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
> >
> > +pci-doe-y := doe.o
> > +
> >  # Endpoint library must be initialized before its users
> >  obj-$(CONFIG_PCI_ENDPOINT)     += endpoint/
> >
> > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> > new file mode 100644
> > index 000000000000..4ff54bade8ec
> > --- /dev/null
> > +++ b/drivers/pci/doe.c
> > @@ -0,0 +1,675 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Data Object Exchange ECN
> > + * https://members.pcisig.com/wg/PCI-SIG/document/14143
> > + *
> > + * Copyright (C) 2021 Huawei
> > + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > + */
> > +
> > +#include <linux/bitfield.h>
> > +#include <linux/delay.h>
> > +#include <linux/jiffies.h>
> > +#include <linux/list.h>
> > +#include <linux/mutex.h>
> > +#include <linux/pci.h>
> > +#include <linux/pci-doe.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/module.h>
> > +
> > +#define PCI_DOE_PROTOCOL_DISCOVERY 0
> > +
> > +#define PCI_DOE_BUSY_MAX_RETRIES 16
> > +#define PCI_DOE_POLL_INTERVAL (HZ / 128)
> > +
> > +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */
> > +#define PCI_DOE_TIMEOUT HZ
> > +
> > +enum pci_doe_state {
> > +       DOE_IDLE,
> > +       DOE_WAIT_RESP,
> > +       DOE_WAIT_ABORT,
> > +       DOE_WAIT_ABORT_ON_ERR,
> > +};
> > +
> > +/**
> > + * struct pci_doe_task - description of a query / response task
> > + * @ex: The details of the task to be done
> > + * @rv: Return value.  Length of received response or error
> > + * @cb: Callback for completion of task
> > + * @private: Private data passed to callback on completion
> > + */
> > +struct pci_doe_task {
> > +       struct pci_doe_exchange *ex;
> > +       int rv;
> > +       void (*cb)(void *private);  
> 
> s/cb/end_task/?
> 
> Why does this need to abandon all semblance of type safety?
> 
> I would expect:
> 
> void (*end_task)(struct pci_doe_task *task);
> 
> ...and let the caller attach any follow on data to task->private if necessary.
> 

Sure, can do that.
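
For illustration only, a minimal userspace sketch of the typed callback
shape suggested above.  The names doe_task, caller_state and
doe_finish_task() are hypothetical stand-ins chosen for this example,
not code from the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the task structure, not the kernel one. */
struct doe_task {
	int rv;					/* result of the exchange */
	void (*end_task)(struct doe_task *task);	/* typed callback */
	void *private;				/* caller-owned data */
};

/* Hypothetical caller-side state attached through task->private. */
struct caller_state {
	int done;
	int result;
};

/* Caller's callback: it gets the task itself, so rv is reachable. */
static void caller_end_task(struct doe_task *task)
{
	struct caller_state *cs = task->private;

	cs->result = task->rv;
	cs->done = 1;
}

/* What a state machine would do once the exchange has finished. */
static void doe_finish_task(struct doe_task *task, int rv)
{
	assert(task && task->end_task);
	task->rv = rv;
	task->end_task(task);
}
```

The callback receives the task rather than a bare void pointer, so the
compiler checks the argument type and any follow-on data stays behind
task->private.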

> > +       void *private;
> > +};
> > +
> > +/**
> > + * struct pci_doe - A single DOE mailbox driver  
> 
> This is driver *state*, right? I.e. not something that wraps "struct
> device_driver" which is what I would expect something claiming to be a
> "driver" would do.

Agreed.

> 
> > + *
> > + * @doe_dev: The DOE Auxiliary device being driven
> > + * @abort_c: Completion used for initial abort handling
> > + * @irq: Interrupt used for signaling DOE ready or abort
> > + * @irq_name: Name used to identify the irq for a particular DOE
> > + * @prots: Array of identifiers for protocols supported  
> 
> "prot" already has a meaning in the kernel, just spell out
> "protocols". This also looks like something that can be allocated
> inline rather than out of line i.e.:
> 
> struct pci_doe {
> ...
>     int nr_protocols
>     struct pci_doe_protocol protocols[];
> }
> 
> ...and then use struct_size() to allocate it.

Can't do that. The size isn't known when we first start using
this structure: we need to use it to query which protocols are
supported.  It's initially set to 1 to cover the discovery
protocol, and then we realloc to expand it as we discover more
protocols.
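
Roughly, the growth pattern looks like the userspace sketch below.  This
is an illustration under assumptions, not the driver code: doe_state,
doe_state_init(), doe_state_add() and doe_state_supports() are
hypothetical names, malloc()/realloc() stand in for the devm_*
allocators, and the vendor ID and type values are example inputs only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct doe_protocol {
	uint16_t vid;
	uint8_t type;
};

/* Hypothetical stand-in for the mailbox state. */
struct doe_state {
	struct doe_protocol *protocols;
	size_t nr_protocols;
};

/* Start with a single entry for the discovery protocol. */
static int doe_state_init(struct doe_state *s)
{
	s->protocols = malloc(sizeof(*s->protocols));
	if (!s->protocols)
		return -1;
	s->protocols[0].vid = 0x0001;	/* assumed PCI-SIG vendor ID */
	s->protocols[0].type = 0;	/* discovery */
	s->nr_protocols = 1;
	return 0;
}

/* Grow the array by one entry each time discovery reports a protocol. */
static int doe_state_add(struct doe_state *s, uint16_t vid, uint8_t type)
{
	struct doe_protocol *grown;

	grown = realloc(s->protocols, (s->nr_protocols + 1) * sizeof(*grown));
	if (!grown)
		return -1;	/* the old array is still valid */
	grown[s->nr_protocols].vid = vid;
	grown[s->nr_protocols].type = type;
	s->protocols = grown;
	s->nr_protocols++;
	return 0;
}

/* Linear search over the cached protocols. */
static int doe_state_supports(const struct doe_state *s, uint16_t vid,
			      uint8_t type)
{
	size_t i;

	assert(s);
	for (i = 0; i < s->nr_protocols; i++)
		if (s->protocols[i].vid == vid && s->protocols[i].type == type)
			return 1;
	return 0;
}
```

The point is that the structure is already in use, with one entry, before
the final count is known, which is why a trailing flexible array sized at
allocation time does not fit.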

> 
> > + * @num_prots: Size of prots array
> > + * @cur_task: Current task the state machine is working on
> > + * @wq: Wait queue to wait on if a query is in progress
> > + * @state_lock: Protect the state of cur_task, abort, and dead
> > + * @statemachine: Work item for the DOE state machine
> > + * @state: Current state of this DOE
> > + * @timeout_jiffies: 1 second after GO set
> > + * @busy_retries: Count of retry attempts
> > + * @abort: Request a manual abort (e.g. on init)
> > + * @dead: Used to mark a DOE for which an ABORT has timed out. Further messages
> > + *        will immediately be aborted with error
> > + */
> > +struct pci_doe {
> > +       struct pci_doe_dev *doe_dev;
> > +       struct completion abort_c;
> > +       int irq;
> > +       char *irq_name;
> > +       struct pci_doe_protocol *prots;
> > +       int num_prots;
> > +
> > +       struct pci_doe_task *cur_task;
> > +       wait_queue_head_t wq;
> > +       struct mutex state_lock;
> > +       struct delayed_work statemachine;
> > +       enum pci_doe_state state;
> > +       unsigned long timeout_jiffies;
> > +       unsigned int busy_retries;
> > +       unsigned int abort:1;
> > +       unsigned int dead:1;
> > +};
> > +
> > +static irqreturn_t pci_doe_irq(int irq, void *data)
> > +{
> > +       struct pci_doe *doe = data;
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       int offset = doe->doe_dev->cap_offset;
> > +       u32 val;
> > +
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +       /* Leave the error case to be handled outside IRQ */
> > +       if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > +               mod_delayed_work(system_wq, &doe->statemachine, 0);
> > +               return IRQ_HANDLED;
> > +       }
> > +
> > +       if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > +               pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > +                                       PCI_DOE_STATUS_INT_STATUS);
> > +               mod_delayed_work(system_wq, &doe->statemachine, 0);
> > +               return IRQ_HANDLED;
> > +       }
> > +
> > +       return IRQ_NONE;
> > +}
> > +
> > +/*
> > + * Only call when safe to directly access the DOE, either because no tasks yet
> > + * queued, or called from doe_statemachine_work() which has exclusive access to
> > + * the DOE config space.  
> 
> It doesn't have exclusive access unless the patch to lock out
> userspace config writes are revived. Instead, I like Bjorn's idea of
> tracking and warning / tainting, but not blocking conflicting
> userspace access to sensitive configuration registers.
> 
> Yes, it was somewhat of a throw-away comment from Bjorn in that
> thread, "(and IMO should taint the kernel)", but DOE can do so much
> subtle damage (compliance test modes, link-encryption / disruption,
> vendor private who-knows-what...) that I think it behooves us as
> kernel developers to know when we are debugging system behavior that
> may be the result of non-kernel mitigated DOE access. The proposal is
> that when kernel lockdown is not enabled, use the approach from the
> exclusive config access patch [2] to trap, warn (once per device?),
> and taint when userspace writes to DOE registers that have been
> claimed by the kernel. This lets strict environments use
> kernel-lockdown to block userspace DOE access altogether, in
> non-strict environment it discourages userspace from clobbering DOE
> driver state, and it allows a warn-free path if userspace takes the
> step of at least unbinding the kernel DOE driver before running
> userspace DOE cycles.
> 
> [1]: https://lore.kernel.org/r/20211203235617.GA3036259@bhelgaas
> [2]: https://lore.kernel.org/all/161663543465.1867664.5674061943008380442.stgit@dwillia2-desk3.amr.corp.intel.com/

Good info. I'd missed some of the subtle parts of that discussion.

> 
> > + */
> > +static void pci_doe_abort_start(struct pci_doe *doe)
> > +{
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       int offset = doe->doe_dev->cap_offset;
> > +       u32 val;
> > +
> > +       val = PCI_DOE_CTRL_ABORT;
> > +       if (doe->irq)
> > +               val |= PCI_DOE_CTRL_INT_EN;
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> > +
> > +       doe->timeout_jiffies = jiffies + HZ;
> > +       schedule_delayed_work(&doe->statemachine, HZ);  
> 
> Given the spec timeout is 1 second and the device clock might be
> slightly off from the host clock how about make this a more generous
> 1.5 or 2 seconds?

Makes sense. Though if a clock is bad enough that we need 2 seconds,
that is pretty awful! :)

> 
> > +}
> > +
> > +static int pci_doe_send_req(struct pci_doe *doe, struct pci_doe_exchange *ex)  
> 
> The relationship between tasks, requests, responses, and exchanges is
> not immediately clear to me. For example, can this helper be renamed
> in terms of its relationship to a task? A theory of operation document
> would help, but it seems there is also room for the implementation to
> be more self documenting.

Not totally sure what such naming would be.

A task is the management wrapper around an exchange, which is a request
+ response pair.  That is, you queue a task, which will carry out
an exchange by sending a request and receiving a response.

Could rename this pci_doe_start_exchange(), but that then obscures
that we mean sending the request to the hardware, and removes the
resemblance to the terminology I recall the specification using.
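
To make that layering concrete, here is a small userspace sketch.  It is
an assumption-laden illustration rather than the driver: the structures
are trimmed copies of the ones in the patch, and fake_mailbox_exchange()
and doe_run_task() are hypothetical helpers that replace the real
hardware access and state machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* An exchange is one request payload plus one response buffer. */
struct doe_exchange {
	const uint32_t *request_pl;
	size_t request_pl_sz;		/* bytes */
	uint32_t *response_pl;
	size_t response_pl_sz;		/* bytes */
};

/* A task is the management wrapper that carries an exchange. */
struct doe_task {
	struct doe_exchange *ex;
	int rv;				/* bytes received, or < 0 on error */
};

/*
 * Hypothetical fake mailbox: answers each request dword with that
 * dword plus one, and drops whatever does not fit in the response
 * buffer.
 */
static int fake_mailbox_exchange(const struct doe_exchange *ex)
{
	size_t req_dw = ex->request_pl_sz / sizeof(uint32_t);
	size_t rsp_dw = ex->response_pl_sz / sizeof(uint32_t);
	size_t n = req_dw < rsp_dw ? req_dw : rsp_dw;
	size_t i;

	for (i = 0; i < n; i++)
		ex->response_pl[i] = ex->request_pl[i] + 1;

	return (int)(n * sizeof(uint32_t));
}

/* Running a task means sending its request and receiving its response. */
static void doe_run_task(struct doe_task *task)
{
	assert(task && task->ex);

	/* Requests must be a whole number of dwords. */
	if (task->ex->request_pl_sz % sizeof(uint32_t)) {
		task->rv = -1;
		return;
	}
	task->rv = fake_mailbox_exchange(task->ex);
}
```

So the caller fills in an exchange, wraps it in a task, and reads the
outcome back from task->rv once the task has run.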

> 
> > +{
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       int offset = doe->doe_dev->cap_offset;
> > +       u32 val;
> > +       int i;
> > +
> > +       /*
> > +        * Check the DOE busy bit is not set. If it is set, this could indicate
> > +        * someone other than Linux (e.g. firmware) is using the mailbox. Note
> > +        * it is expected that firmware and OS will negotiate access rights via
> > +        * an, as yet to be defined method.
> > +        */
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +       if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
> > +               return -EBUSY;
> > +
> > +       if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> > +               return -EIO;
> > +
> > +       /* Write DOE Header */
> > +       val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, ex->prot.vid) |
> > +               FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, ex->prot.type);
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val);
> > +       /* Length is 2 DW of header + length of payload in DW */
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> > +                              FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH,
> > +                                         2 + ex->request_pl_sz /
> > +                                               sizeof(u32)));
> > +       for (i = 0; i < ex->request_pl_sz / sizeof(u32); i++)
> > +               pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> > +                                      ex->request_pl[i]);
> > +
> > +       val = PCI_DOE_CTRL_GO;
> > +       if (doe->irq)
> > +               val |= PCI_DOE_CTRL_INT_EN;
> > +
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> > +       /* Request is sent - now wait for poll or IRQ */
> > +       return 0;
> > +}
> > +
> > +static int pci_doe_recv_resp(struct pci_doe *doe, struct pci_doe_exchange *ex)
> > +{
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       int offset = doe->doe_dev->cap_offset;
> > +       size_t length;
> > +       u32 val;
> > +       int i;
> > +
> > +       /* Read the first dword to get the protocol */
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +       if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != ex->prot.vid) ||
> > +           (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != ex->prot.type)) {
> > +               pci_err(pdev,
> > +                       "Expected [VID, Protocol] = [%#x, %#x], got [%#x, %#x]\n",
> > +                       ex->prot.vid, ex->prot.type,
> > +                       FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
> > +                       FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
> > +               return -EIO;
> > +       }
> > +
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +       /* Read the second dword to get the length */
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +
> > +       length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> > +       if (length > SZ_1M || length < 2)
> > +               return -EIO;
> > +
> > +       /* First 2 dwords have already been read */
> > +       length -= 2;
> > +       /* Read the rest of the response payload */
> > +       for (i = 0; i < min(length, ex->response_pl_sz / sizeof(u32)); i++) {
> > +               pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > +                                     &ex->response_pl[i]);
> > +               pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +       }
> > +
> > +       /* Flush excess length */
> > +       for (; i < length; i++) {
> > +               pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +               pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +       }
> > +       /* Final error check to pick up on any since Data Object Ready */
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +       if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> > +               return -EIO;
> > +
> > +       return min(length, ex->response_pl_sz / sizeof(u32)) * sizeof(u32);
> > +}
> > +
> > +static void doe_statemachine_work(struct work_struct *work)
> > +{
> > +       struct delayed_work *w = to_delayed_work(work);
> > +       struct pci_doe *doe = container_of(w, struct pci_doe, statemachine);
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       int offset = doe->doe_dev->cap_offset;
> > +       struct pci_doe_task *task;
> > +       bool abort;
> > +       u32 val;
> > +       int rc;
> > +
> > +       mutex_lock(&doe->state_lock);
> > +       task = doe->cur_task;
> > +       abort = doe->abort;
> > +       doe->abort = false;
> > +       mutex_unlock(&doe->state_lock);
> > +
> > +       if (abort) {
> > +               /*
> > +                * Currently only used during init - care needed if
> > +                * pci_doe_abort() is generally exposed as it would impact
> > +                * queries in flight.
> > +                */
> > +               WARN_ON(task);  
> 
> Why is it worth potentially crashing the kernel here? Is this purely a
> situation that will only happen during development  and refactoring of
> the driver? Otherwise I would expect handling the error without WARN.

From what I recall, this can only be hit by a driver bug.

> 
> 
> > +               doe->state = DOE_WAIT_ABORT;
> > +               pci_doe_abort_start(doe);
> > +               return;
> > +       }
> > +
> > +       switch (doe->state) {
> > +       case DOE_IDLE:
> > +               if (task == NULL)
> > +                       return;
> > +
> > +               /* Nothing currently in flight so queue a task */
> > +               rc = pci_doe_send_req(doe, task->ex);
> > +               /*
> > +                * The specification does not provide any guidance on how long
> > +                * some other entity could keep the DOE busy, so try for 1
> > +                * second then fail. Busy handling is best effort only, because
> > +                * there is no way of avoiding racing against another user of
> > +                * the DOE.
> > +                */
> > +               if (rc == -EBUSY) {
> > +                       doe->busy_retries++;
> > +                       if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
> > +                               /* Long enough, fail this request */
> > +                               pci_warn(pdev,
> > +                                       "DOE busy for too long (> 1 sec)\n");
> > +                               doe->busy_retries = 0;
> > +                               goto err_busy;
> > +                       }
> > +                       schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES);
> > +                       return;
> > +               }
> > +               if (rc)
> > +                       goto err_abort;
> > +               doe->busy_retries = 0;
> > +
> > +               doe->state = DOE_WAIT_RESP;
> > +               doe->timeout_jiffies = jiffies + HZ;
> > +               /* Now poll or wait for IRQ with timeout */
> > +               if (doe->irq > 0)
> > +                       schedule_delayed_work(w, PCI_DOE_TIMEOUT);
> > +               else
> > +                       schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> > +               return;
> > +
> > +       case DOE_WAIT_RESP:
> > +               /* Not possible to get here with NULL task */
> > +               pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +               if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > +                       rc = -EIO;
> > +                       goto err_abort;
> > +               }
> > +
> > +               if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
> > +                       /* If not yet at timeout reschedule otherwise abort */
> > +                       if (time_after(jiffies, doe->timeout_jiffies)) {
> > +                               rc = -ETIMEDOUT;
> > +                               goto err_abort;
> > +                       }
> > +                       schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> > +                       return;
> > +               }
> > +
> > +               rc  = pci_doe_recv_resp(doe, task->ex);
> > +               if (rc < 0)
> > +                       goto err_abort;
> > +
> > +               doe->state = DOE_IDLE;
> > +
> > +               mutex_lock(&doe->state_lock);
> > +               doe->cur_task = NULL;
> > +               mutex_unlock(&doe->state_lock);
> > +               wake_up_interruptible(&doe->wq);
> > +
> > +               /* Set the return value to the length of received payload */
> > +               task->rv = rc;
> > +               task->cb(task->private);
> > +
> > +               return;
> > +
> > +       case DOE_WAIT_ABORT:
> > +       case DOE_WAIT_ABORT_ON_ERR:
> > +               pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +               if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> > +                   !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> > +                       /* Back to normal state - carry on */
> > +                       mutex_lock(&doe->state_lock);
> > +                       doe->cur_task = NULL;
> > +                       mutex_unlock(&doe->state_lock);
> > +                       wake_up_interruptible(&doe->wq);
> > +
> > +                       /*
> > +                        * For deliberately triggered abort, someone is
> > +                        * waiting.
> > +                        */
> > +                       if (doe->state == DOE_WAIT_ABORT)
> > +                               complete(&doe->abort_c);  
> 
> Why is a completion and waitqueue needed? I.e. a waiter could simply
> look for an abort completion flag to be set instead.

You mean use the main completion (the one for the non abort case)
and a flag? 

Or a wait_event() with appropriate check?

Could do that but I'm not sure I understand why we care either way?

> 
> 
> > +
> > +                       doe->state = DOE_IDLE;
> > +                       return;
> > +               }
> > +               if (time_after(jiffies, doe->timeout_jiffies)) {
> > +                       /* Task has timed out and is dead - abort */
> > +                       pci_err(pdev, "DOE ABORT timed out\n");
> > +                       mutex_lock(&doe->state_lock);
> > +                       doe->dead = true;
> > +                       doe->cur_task = NULL;
> > +                       mutex_unlock(&doe->state_lock);
> > +                       wake_up_interruptible(&doe->wq);
> > +
> > +                       if (doe->state == DOE_WAIT_ABORT)
> > +                               complete(&doe->abort_c);
> > +               }
> > +               return;
> > +       }
> > +
> > +err_abort:
> > +       doe->state = DOE_WAIT_ABORT_ON_ERR;
> > +       pci_doe_abort_start(doe);
> > +err_busy:
> > +       task->rv = rc;
> > +       task->cb(task->private);
> > +       /* If here via err_busy, signal the task done. */
> > +       if (doe->state == DOE_IDLE) {
> > +               mutex_lock(&doe->state_lock);
> > +               doe->cur_task = NULL;
> > +               mutex_unlock(&doe->state_lock);
> > +               wake_up_interruptible(&doe->wq);
> > +       }
> > +}
> > +
> > +static void pci_doe_task_complete(void *private)
> > +{
> > +       complete(private);
> > +}
> > +
> > +/**
> > + * pci_doe_exchange_sync() - Send a request, then wait for and receive a
> > + *                          response
> > + * @doe_dev: DOE mailbox state structure
> > + * @ex: Description of the buffers and Vendor ID + type used in this
> > + *      request/response pair
> > + *
> > + * Excess data will be discarded.
> > + *
> > + * RETURNS: payload in bytes on success, < 0 on error
> > + */
> > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev,
> > +                         struct pci_doe_exchange *ex)
> > +{
> > +       struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> > +       struct pci_doe_task task;
> > +       DECLARE_COMPLETION_ONSTACK(c);
> > +
> > +       if (!doe)
> > +               return -EAGAIN;
> > +
> > +       /* DOE requests must be a whole number of DW */
> > +       if (ex->request_pl_sz % sizeof(u32))
> > +               return -EINVAL;
> > +
> > +       task.ex = ex;
> > +       task.cb = pci_doe_task_complete;
> > +       task.private = &c;
> > +
> > +again:
> > +       mutex_lock(&doe->state_lock);
> > +       if (doe->cur_task) {
> > +               mutex_unlock(&doe->state_lock);
> > +               wait_event_interruptible(doe->wq, doe->cur_task == NULL);
> > +               goto again;
> > +       }
> > +
> > +       if (doe->dead) {
> > +               mutex_unlock(&doe->state_lock);
> > +               return -EIO;
> > +       }
> > +       doe->cur_task = &task;
> > +       schedule_delayed_work(&doe->statemachine, 0);
> > +       mutex_unlock(&doe->state_lock);
> > +
> > +       wait_for_completion(&c);  
> 
> I would expect that the caller of this routine would want to specify
> the task and end_task() callback and use that as the completion
> signal. It may also want "no wait" behavior where it is prepared for
> the DOE result to come back sometime later. With that change the
> exchange fields can move into the task directly.

This is the simple synchronous wrapper around an async core.
If we want an async path at some point in the future, once we have
someone using it, then sure, we can add an async version that
takes the callback.

> 
> > +
> > +       return task.rv;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync);
> > +
> > +/**
> > + * pci_doe_supports_prot() - Return if the DOE instance supports the given
> > + *                          protocol
> > + * @doe_dev: DOE mailbox device to query
> > + * @vid: Protocol Vendor ID
> > + * @type: protocol type
> > + *
> > + * This device can then be passed to pci_doe_exchange_sync() to execute a
> > + * mailbox exchange through that DOE mailbox.
> > + *
> > + * RETURNS: True if the DOE device supports the protocol specified
> > + */
> > +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type)
> > +{
> > +       struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> > +       int i;
> > +
> > +       if (!doe)
> > +               return false;
> > +
> > +       for (i = 0; i < doe->num_prots; i++)
> > +               if ((doe->prots[i].vid == vid) &&
> > +                   (doe->prots[i].type == type))
> > +                       return true;
> > +
> > +       return false;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
> > +
> > +static int pci_doe_discovery(struct pci_doe *doe, u8 *index, u16 *vid,
> > +                            u8 *protocol)
> > +{
> > +       u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX,
> > +                                   *index);
> > +       u32 response_pl;
> > +       struct pci_doe_exchange ex = {
> > +               .prot.vid = PCI_VENDOR_ID_PCI_SIG,
> > +               .prot.type = PCI_DOE_PROTOCOL_DISCOVERY,
> > +               .request_pl = &request_pl,
> > +               .request_pl_sz = sizeof(request_pl),
> > +               .response_pl = &response_pl,
> > +               .response_pl_sz = sizeof(response_pl),
> > +       };
> > +       int ret;
> > +
> > +       ret = pci_doe_exchange_sync(doe->doe_dev, &ex);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       if (ret != sizeof(response_pl))
> > +               return -EIO;
> > +
> > +       *vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> > +       *protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> > +                             response_pl);
> > +       *index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> > +                          response_pl);
> > +
> > +       return 0;
> > +}
> > +
> > +static int pci_doe_cache_protocols(struct pci_doe *doe)
> > +{
> > +       u8 index = 0;
> > +       int num_prots;
> > +       int rc;
> > +
> > +       /* Discovery protocol must always be supported and must report itself */
> > +       num_prots = 1;
> > +       doe->prots = devm_kcalloc(&doe->doe_dev->adev.dev, num_prots,
> > +                                 sizeof(*doe->prots), GFP_KERNEL);
> > +       if (doe->prots == NULL)
> > +               return -ENOMEM;
> > +
> > +       do {
> > +               struct pci_doe_protocol *prot;
> > +
> > +               prot = &doe->prots[num_prots - 1];
> > +               rc = pci_doe_discovery(doe, &index, &prot->vid, &prot->type);
> > +               if (rc)
> > +                       return rc;
> > +
> > +               if (index) {
> > +                       struct pci_doe_protocol *prot_new;
> > +
> > +                       num_prots++;
> > +                       prot_new = devm_krealloc(&doe->doe_dev->adev.dev,
> > +                                                doe->prots,
> > +                                                sizeof(*doe->prots) *
> > +                                                       num_prots,
> > +                                                GFP_KERNEL);
> > +                       if (prot_new == NULL)
> > +                               return -ENOMEM;
> > +                       doe->prots = prot_new;
> > +               }
> > +       } while (index);
> > +
> > +       doe->num_prots = num_prots;
> > +       return 0;
> > +}
> > +
> > +static int pci_doe_abort(struct pci_doe *doe)
> > +{
> > +       reinit_completion(&doe->abort_c);
> > +       mutex_lock(&doe->state_lock);
> > +       doe->abort = true;  
> 
> Why not a flags field where atomic bitops can be used without need for a mutex.

I'll go the other way: why bother with atomics when this isn't a
high-performance path or something expected to happen often?

> 
> > +       mutex_unlock(&doe->state_lock);
> > +       schedule_delayed_work(&doe->statemachine, 0);
> > +       wait_for_completion(&doe->abort_c);
> > +
> > +       if (doe->dead)  
> 
> dead could also be another atomic flag.

Probably true, but I'm really not getting why that would be a beneficial
thing to do.

> 
> > +               return -EIO;
> > +
> > +       return 0;
> > +}
> > +
> > +static int pci_doe_reg_irq(struct pci_doe *doe)
> > +{
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       bool poll = !pci_dev_msi_enabled(pdev);
> > +       int offset = doe->doe_dev->cap_offset;
> > +       int rc, irq;
> > +       u32 val;
> > +
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
> > +
> > +       if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) {
> > +               irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val));
> > +               if (irq < 0)
> > +                       return irq;
> > +
> > +               doe->irq_name = devm_kasprintf(&doe->doe_dev->adev.dev,
> > +                                               GFP_KERNEL,
> > +                                               "DOE[%s]",
> > +                                               doe->doe_dev->adev.name);
> > +               if (!doe->irq_name)
> > +                       return -ENOMEM;
> > +
> > +               rc = devm_request_irq(&pdev->dev, irq, pci_doe_irq, 0,
> > +                                     doe->irq_name, doe);
> > +               if (rc)
> > +                       return rc;
> > +
> > +               doe->irq = irq;
> > +               pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
> > +                                      PCI_DOE_CTRL_INT_EN);
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +/*
> > + * pci_doe_probe() - Set up the Mailbox
> > + * @aux_dev: Auxiliary Device
> > + * @id: Auxiliary device ID
> > + *
> > + * Probe the mailbox found for all protocols and set up the Mailbox
> > + *
> > + * RETURNS: 0 on success, < 0 on error
> > + */
> > +static int pci_doe_probe(struct auxiliary_device *aux_dev,
> > +                        const struct auxiliary_device_id *id)
> > +{
> > +       struct pci_doe_dev *doe_dev = container_of(aux_dev,
> > +                                       struct pci_doe_dev,
> > +                                       adev);
> > +       struct pci_doe *doe;
> > +       int rc;
> > +
> > +       doe = devm_kzalloc(&aux_dev->dev, sizeof(*doe), GFP_KERNEL);
> > +       if (!doe)
> > +               return -ENOMEM;
> > +
> > +       mutex_init(&doe->state_lock);
> > +       init_completion(&doe->abort_c);
> > +       doe->doe_dev = doe_dev;
> > +       init_waitqueue_head(&doe->wq);
> > +       INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work);
> > +       dev_set_drvdata(&aux_dev->dev, doe);
> > +
> > +       rc = pci_doe_reg_irq(doe);
> > +       if (rc)
> > +               return rc;
> > +
> > +       /* Reset the mailbox by issuing an abort */
> > +       rc = pci_doe_abort(doe);
> > +       if (rc)
> > +               return rc;
> > +
> > +       rc = pci_doe_cache_protocols(doe);
> > +       if (rc)
> > +               return rc;  
> 
> This can just be:
> 
>  return pci_doe_cache_protocols(doe);
> 
> > +
> > +       return 0;
> > +}
> > +
> > +static void pci_doe_remove(struct auxiliary_device *aux_dev)
> > +{
> > +       struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev);
> > +
> > +       /* First halt the state machine */
> > +       cancel_delayed_work_sync(&doe->statemachine);
> > +}
> > +
> > +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = {
> > +       {},
> > +};
> > +
> > +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table);  
> 
> Why is this empty table here?
> 
> > +
> > +struct auxiliary_driver pci_doe_auxiliary_drv = {
> > +       .name = "pci_doe",
> > +       .id_table = pci_doe_auxiliary_id_table,
> > +       .probe = pci_doe_probe,
> > +       .remove = pci_doe_remove
> > +};  
> 
> I expect that these helpers would be provided by the PCI core, but
> then a subsystem like CXL would have code to register their auxiliary
> devices and drivers that mostly just wrap the PCI core DOE
> implementation.
> 
> > +
> > +static int __init pci_doe_init_module(void)
> > +{
> > +       int ret;
> > +
> > +       ret = auxiliary_driver_register(&pci_doe_auxiliary_drv);
> > +       if (ret) {
> > +               pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n",
> > +                      ret);
> > +               return ret;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static void __exit pci_doe_exit_module(void)
> > +{
> > +       auxiliary_driver_unregister(&pci_doe_auxiliary_drv);
> > +}
> > +
> > +module_init(pci_doe_init_module);
> > +module_exit(pci_doe_exit_module);
> > +MODULE_LICENSE("GPL v2");
> > diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
> > new file mode 100644
> > index 000000000000..2f52b31c6f32
> > --- /dev/null
> > +++ b/include/linux/pci-doe.h
> > @@ -0,0 +1,60 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec.
> > + *
> > + * Copyright (C) 2021 Huawei
> > + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > + */
> > +
> > +#include <linux/completion.h>
> > +#include <linux/list.h>
> > +#include <linux/auxiliary_bus.h>
> > +
> > +#ifndef LINUX_PCI_DOE_H
> > +#define LINUX_PCI_DOE_H
> > +
> > +struct pci_doe_protocol {
> > +       u16 vid;
> > +       u8 type;
> > +};
> > +
> > +/**
> > + * struct pci_doe_exchange - represents a single query/response
> > + *
> > + * @prot: DOE Protocol
> > + * @request_pl: The request payload
> > + * @request_pl_sz: Size of the request payload
> > + * @response_pl: The response payload
> > + * @response_pl_sz: Size of the response payload
> > + */
> > +struct pci_doe_exchange {
> > +       struct pci_doe_protocol prot;
> > +       u32 *request_pl;
> > +       size_t request_pl_sz;
> > +       u32 *response_pl;
> > +       size_t response_pl_sz;
> > +};
> > +
> > +/**
> > + * struct pci_doe_dev - DOE mailbox device
> > + *
> > + * @adev: Auxiliary device data
> > + * @pdev: PCI device this belongs to
> > + * @cap_offset: Capability offset
> > + *
> > + * This represents a single DOE mailbox device.  Devices should create this
> > + * device and register it on the Auxiliary bus for the DOE driver to maintain.
> > + *
> > + */
> > +struct pci_doe_dev {
> > +       struct auxiliary_device adev;
> > +       struct pci_dev *pdev;
> > +       int cap_offset;
> > +};
> > +
> > +/* Library operations */
> > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev,
> > +                                struct pci_doe_exchange *ex);
> > +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type);
> > +
> > +#endif
> > diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> > index ff6ccbc6efe9..c04aad391669 100644
> > --- a/include/uapi/linux/pci_regs.h
> > +++ b/include/uapi/linux/pci_regs.h
> > @@ -736,7 +736,8 @@
> >  #define PCI_EXT_CAP_ID_DVSEC   0x23    /* Designated Vendor-Specific */
> >  #define PCI_EXT_CAP_ID_DLF     0x25    /* Data Link Feature */
> >  #define PCI_EXT_CAP_ID_PL_16GT 0x26    /* Physical Layer 16.0 GT/s */
> > -#define PCI_EXT_CAP_ID_MAX     PCI_EXT_CAP_ID_PL_16GT
> > +#define PCI_EXT_CAP_ID_DOE     0x2E    /* Data Object Exchange */
> > +#define PCI_EXT_CAP_ID_MAX     PCI_EXT_CAP_ID_DOE
> >
> >  #define PCI_EXT_CAP_DSN_SIZEOF 12
> >  #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
> > @@ -1098,4 +1099,30 @@
> >  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK                0x000000F0
> >  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT       4
> >
> > +/* Data Object Exchange */
> > +#define PCI_DOE_CAP            0x04    /* DOE Capabilities Register */
> > +#define  PCI_DOE_CAP_INT                       0x00000001  /* Interrupt Support */
> > +#define  PCI_DOE_CAP_IRQ                       0x00000ffe  /* Interrupt Message Number */
> > +#define PCI_DOE_CTRL           0x08    /* DOE Control Register */
> > +#define  PCI_DOE_CTRL_ABORT                    0x00000001  /* DOE Abort */
> > +#define  PCI_DOE_CTRL_INT_EN                   0x00000002  /* DOE Interrupt Enable */
> > +#define  PCI_DOE_CTRL_GO                       0x80000000  /* DOE Go */
> > +#define PCI_DOE_STATUS         0x0c    /* DOE Status Register */
> > +#define  PCI_DOE_STATUS_BUSY                   0x00000001  /* DOE Busy */
> > +#define  PCI_DOE_STATUS_INT_STATUS             0x00000002  /* DOE Interrupt Status */
> > +#define  PCI_DOE_STATUS_ERROR                  0x00000004  /* DOE Error */
> > +#define  PCI_DOE_STATUS_DATA_OBJECT_READY      0x80000000  /* Data Object Ready */
> > +#define PCI_DOE_WRITE          0x10    /* DOE Write Data Mailbox Register */
> > +#define PCI_DOE_READ           0x14    /* DOE Read Data Mailbox Register */
> > +
> > +/* DOE Data Object - note not actually registers */
> > +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID               0x0000ffff
> > +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE              0x00ff0000
> > +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH            0x0003ffff
> > +
> > +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX           0x000000ff
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID             0x0000ffff
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL                0x00ff0000
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX      0xff000000
> > +
> >  #endif /* LINUX_PCI_REGS_H */
> > --
> > 2.31.1
> >  


* Re: [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver
  2022-02-09 10:13     ` Jonathan Cameron
@ 2022-02-09 16:26       ` Dan Williams
  2022-02-09 16:57         ` Jonathan Cameron
  0 siblings, 1 reply; 49+ messages in thread
From: Dan Williams @ 2022-02-09 16:26 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Weiny, Ira, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, Linux Kernel Mailing List, linux-cxl, Linux PCI

On Wed, Feb 9, 2022 at 2:13 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
[..]
> > > + *
> > > + * @doe_dev: The DOE Auxiliary device being driven
> > > + * @abort_c: Completion used for initial abort handling
> > > + * @irq: Interrupt used for signaling DOE ready or abort
> > > + * @irq_name: Name used to identify the irq for a particular DOE
> > > + * @prots: Array of identifiers for protocols supported
> >
> > "prot" already has a meaning in the kernel, just spell out
> > "protocols". This also looks like something that can be allocated
> > inline rather than out of line i.e.:
> >
> > struct pci_doe {
> > ...
> >     int nr_protocols
> >     struct pci_doe_protocol protocols[];
> > }
> >
> > ...and then use struct_size() to allocate it.
>
> Can't do that. The size isn't known when we first start using
> this structure - We need to use it to query what protocols are
> supported.  It's initially set to 1 to cover the discovery
> protocol and then we realloc to expand it as we discover more
> protocols.

Ok.
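
For reference, the grow-as-you-discover pattern described above can be
sketched in plain userspace C. This is only an illustration, not the
patch's code: struct proto, fake_discovery() and cache_protocols() are
hypothetical stand-ins, the table contents are made-up values, and
calloc()/realloc() take the place of devm_kcalloc()/devm_krealloc().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct pci_doe_protocol. */
struct proto {
	uint16_t vid;
	uint8_t type;
};

/*
 * Hypothetical stand-in for one discovery exchange against a fake device
 * exposing three protocols (illustrative values only). Copies the entry at
 * *index to *out and stores the next index back into *index; a next index
 * of 0 means there are no more entries.
 */
static int fake_discovery(uint8_t *index, struct proto *out)
{
	static const struct proto table[] = {
		{ 0x0001, 0x00 },	/* the discovery protocol itself */
		{ 0x0001, 0x01 },
		{ 0x1e98, 0x02 },
	};
	const uint8_t n = sizeof(table) / sizeof(table[0]);

	if (*index >= n)
		return -1;
	*out = table[*index];
	*index = (uint8_t)((*index + 1) % n);	/* wraps to 0 after the last entry */
	return 0;
}

/* Query one entry at a time, growing the array only when more follow. */
static int cache_protocols(struct proto **out, int *count)
{
	uint8_t index = 0;
	int num = 1;	/* the discovery protocol must always be present */
	struct proto *prots;

	assert(out && count);
	prots = calloc(num, sizeof(*prots));
	if (!prots)
		return -1;

	do {
		if (fake_discovery(&index, &prots[num - 1])) {
			free(prots);
			return -1;
		}
		if (index) {
			struct proto *grown;

			grown = realloc(prots, sizeof(*prots) * (num + 1));
			if (!grown) {
				free(prots);
				return -1;
			}
			prots = grown;
			num++;
		}
	} while (index);

	*out = prots;
	*count = num;
	return 0;
}
```

The final size is only known once the device reports a next index of 0,
which is why a flexible array member sized at allocation time does not
fit this flow.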

>
> >
> > > + * @num_prots: Size of prots array
> > > + * @cur_task: Current task the state machine is working on
> > > + * @wq: Wait queue to wait on if a query is in progress
> > > + * @state_lock: Protect the state of cur_task, abort, and dead
> > > + * @statemachine: Work item for the DOE state machine
> > > + * @state: Current state of this DOE
> > > + * @timeout_jiffies: 1 second after GO set
> > > + * @busy_retries: Count of retry attempts
> > > + * @abort: Request a manual abort (e.g. on init)
> > > + * @dead: Used to mark a DOE for which an ABORT has timed out. Further messages
> > > + *        will immediately be aborted with error
> > > + */
> > > +struct pci_doe {
> > > +       struct pci_doe_dev *doe_dev;
> > > +       struct completion abort_c;
> > > +       int irq;
> > > +       char *irq_name;
> > > +       struct pci_doe_protocol *prots;
> > > +       int num_prots;
> > > +
> > > +       struct pci_doe_task *cur_task;
> > > +       wait_queue_head_t wq;
> > > +       struct mutex state_lock;
> > > +       struct delayed_work statemachine;
> > > +       enum pci_doe_state state;
> > > +       unsigned long timeout_jiffies;
> > > +       unsigned int busy_retries;
> > > +       unsigned int abort:1;
> > > +       unsigned int dead:1;
> > > +};
> > > +
> > > +static irqreturn_t pci_doe_irq(int irq, void *data)
> > > +{
> > > +       struct pci_doe *doe = data;
> > > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > > +       int offset = doe->doe_dev->cap_offset;
> > > +       u32 val;
> > > +
> > > +       pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > +
> > > +       /* Leave the error case to be handled outside IRQ */
> > > +       if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > +               mod_delayed_work(system_wq, &doe->statemachine, 0);
> > > +               return IRQ_HANDLED;
> > > +       }
> > > +
> > > +       if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > +               pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > +                                       PCI_DOE_STATUS_INT_STATUS);
> > > +               mod_delayed_work(system_wq, &doe->statemachine, 0);
> > > +               return IRQ_HANDLED;
> > > +       }
> > > +
> > > +       return IRQ_NONE;
> > > +}
> > > +
> > > +/*
> > > + * Only call when safe to directly access the DOE, either because no tasks yet
> > > + * queued, or called from doe_statemachine_work() which has exclusive access to
> > > + * the DOE config space.
> >
> > It doesn't have exclusive access unless the patch to lock out
> > userspace config writes are revived. Instead, I like Bjorn's idea of
> > tracking and warning / tainting, but not blocking conflicting
> > userspace access to sensitive configuration registers.
> >
> > Yes, it was somewhat of a throw-away comment from Bjorn in that
> > thread, "(and IMO should taint the kernel)", but DOE can do so much
> > subtle damage (compliance test modes, link-encryption / disruption,
> > vendor private who-knows-what...) that I think it behooves us as
> > kernel developers to know when we are debugging system behavior that
> > may be the result of non-kernel mitigated DOE access. The proposal is
> > that when kernel lockdown is not enabled, use the approach from the
> > exclusive config access patch [2] to trap, warn (once per device?),
> > and taint when userspace writes to DOE registers that have been
> > claimed by the kernel. This lets strict environments use
> > kernel-lockdown to block userspace DOE access altogether, in
> > non-strict environment it discourages userspace from clobbering DOE
> > driver state, and it allows a warn-free path if userspace takes the
> > step of at least unbinding the kernel DOE driver before running
> > userspace DOE cycles.
> >
> > [1]: https://lore.kernel.org/r/20211203235617.GA3036259@bhelgaas
> > [2]: https://lore.kernel.org/all/161663543465.1867664.5674061943008380442.stgit@dwillia2-desk3.amr.corp.intel.com/
>
> Good info. I'd missed some of the subtle parts of that discussion.
>
> >
> > > + */
> > > +static void pci_doe_abort_start(struct pci_doe *doe)
> > > +{
> > > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > > +       int offset = doe->doe_dev->cap_offset;
> > > +       u32 val;
> > > +
> > > +       val = PCI_DOE_CTRL_ABORT;
> > > +       if (doe->irq)
> > > +               val |= PCI_DOE_CTRL_INT_EN;
> > > +       pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> > > +
> > > +       doe->timeout_jiffies = jiffies + HZ;
> > > +       schedule_delayed_work(&doe->statemachine, HZ);
> >
> > Given the spec timeout is 1 second and the device clock might be
> > slightly off from the host clock how about make this a more generous
> > 1.5 or 2 seconds?
>
> Makes sense. Though if a clock is bad enough we need 2 seconds that
> is pretty awful! :)

Perhaps, though DOE is not a fast path, and block I/O defaults to a 30
second timeout, so I don't think anyone would blink at 2 seconds for
DOE.

>
> >
> > > +}
> > > +
> > > +static int pci_doe_send_req(struct pci_doe *doe, struct pci_doe_exchange *ex)
> >
> > The relationship between tasks, requests, responses, and exchanges is
> > not immediately clear to me. For example, can this helper be renamed
> > in terms of its relationship to a task? A theory of operation document
> > would help, but it seems there is also room for the implementation to
> > be more self documenting.
>
> Not totally sure what such naming would be.
>
> A task is the management wrapper around an exchange, which is a request
> + response pair.  That is, you queue a task, which will carry out
> an exchange by sending a request and receiving a response.
>
> Could rename this pci_doe_start_exchange() but that then obscures
> that we mean send the request to the hardware and removes the resemblance
> to what I recall the specification uses.

I'm not a big fan of copying spec names *if* Linux has a more
idiomatic name for the concept. I am mainly reviewing this from the
perspective that 'struct bio' and 'struct request' naming /
organization is idiomatic for Linux driver transaction flows. Up to
this point in the review I was mapping tasks to bios and exchanges to
requests but then the usage of "req" in this function name threw off
my ontology. At a minimum a decoder ring style comment, like your
reply, about the relationship between these terms would help avoid
this exercise again.

> > > +       case DOE_WAIT_ABORT:
> > > +       case DOE_WAIT_ABORT_ON_ERR:
> > > +               pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > +
> > > +               if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> > > +                   !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> > > +                       /* Back to normal state - carry on */
> > > +                       mutex_lock(&doe->state_lock);
> > > +                       doe->cur_task = NULL;
> > > +                       mutex_unlock(&doe->state_lock);
> > > +                       wake_up_interruptible(&doe->wq);
> > > +
> > > +                       /*
> > > +                        * For deliberately triggered abort, someone is
> > > +                        * waiting.
> > > +                        */
> > > +                       if (doe->state == DOE_WAIT_ABORT)
> > > +                               complete(&doe->abort_c);
> >
> > Why is a completion and waitqueue needed? I.e. a waiter could simply
> > look for an abort completion flag to be set instead.
>
> You mean use the main completion (the one for the non abort case)
> and a flag?
>
> Or a wait_event() with appropriate check?
>
> Could do that but I'm not sure I understand why we care either way?

Just reduction in machinery that needs to be maintained /
comprehended. 2 wait primitives when one will do will always be a
tempting cleanup target.
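
A rough userspace sketch of the single-primitive idea, assuming pthreads
as a stand-in for the kernel wait queue: one lock and one condition
variable, with the abort waiter checking a plain flag. struct mbox,
abort_worker(), wait_abort_done() and abort_and_wait() are hypothetical
names for illustration, not anything from the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mailbox state: one lock, one wait queue, plain flags. */
struct mbox {
	pthread_mutex_t lock;
	pthread_cond_t wq;
	void *cur_task;
	bool abort_done;
};

/* State machine side: finish the abort and wake every waiter. */
static void *abort_worker(void *arg)
{
	struct mbox *mb = arg;

	assert(mb);
	pthread_mutex_lock(&mb->lock);
	mb->cur_task = NULL;
	mb->abort_done = true;
	pthread_cond_broadcast(&mb->wq);
	pthread_mutex_unlock(&mb->lock);
	return NULL;
}

/* Abort waiter: same wait queue as submitters, different predicate. */
static void wait_abort_done(struct mbox *mb)
{
	pthread_mutex_lock(&mb->lock);
	while (!mb->abort_done)
		pthread_cond_wait(&mb->wq, &mb->lock);
	pthread_mutex_unlock(&mb->lock);
}

/* Returns true if the mailbox is idle once the abort has been observed. */
static bool abort_and_wait(void)
{
	struct mbox mb;
	pthread_t worker;
	int in_flight = 0;	/* dummy object standing in for a task */
	bool idle;

	pthread_mutex_init(&mb.lock, NULL);
	pthread_cond_init(&mb.wq, NULL);
	mb.cur_task = &in_flight;
	mb.abort_done = false;

	if (pthread_create(&worker, NULL, abort_worker, &mb))
		return false;
	wait_abort_done(&mb);
	pthread_join(worker, NULL);

	idle = (mb.cur_task == NULL);
	pthread_cond_destroy(&mb.wq);
	pthread_mutex_destroy(&mb.lock);
	return idle;
}
```

A submitter waiting for cur_task to become NULL would sleep on the same
queue with its own predicate, so no separate completion is involved.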

[..]
> > > +/**
> > > + * pci_doe_exchange_sync() - Send a request, then wait for and receive a
> > > + *                          response
> > > + * @doe_dev: DOE mailbox state structure
> > > + * @ex: Description of the buffers and Vendor ID + type used in this
> > > + *      request/response pair
> > > + *
> > > + * Excess data will be discarded.
> > > + *
> > > + * RETURNS: payload in bytes on success, < 0 on error
> > > + */
> > > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev,
> > > +                         struct pci_doe_exchange *ex)
> > > +{
> > > +       struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> > > +       struct pci_doe_task task;
> > > +       DECLARE_COMPLETION_ONSTACK(c);
> > > +
> > > +       if (!doe)
> > > +               return -EAGAIN;
> > > +
> > > +       /* DOE requests must be a whole number of DW */
> > > +       if (ex->request_pl_sz % sizeof(u32))
> > > +               return -EINVAL;
> > > +
> > > +       task.ex = ex;
> > > +       task.cb = pci_doe_task_complete;
> > > +       task.private = &c;
> > > +
> > > +again:
> > > +       mutex_lock(&doe->state_lock);
> > > +       if (doe->cur_task) {
> > > +               mutex_unlock(&doe->state_lock);
> > > +               wait_event_interruptible(doe->wq, doe->cur_task == NULL);
> > > +               goto again;
> > > +       }
> > > +
> > > +       if (doe->dead) {
> > > +               mutex_unlock(&doe->state_lock);
> > > +               return -EIO;
> > > +       }
> > > +       doe->cur_task = &task;
> > > +       schedule_delayed_work(&doe->statemachine, 0);
> > > +       mutex_unlock(&doe->state_lock);
> > > +
> > > +       wait_for_completion(&c);
> >
> > I would expect that the caller of this routine would want to specify
> > the task and end_task() callback and use that as the completion
> > signal. It may also want "no wait" behavior where it is prepared for
> > the DOE result to come back sometime later. With that change the
> > exchange fields can move into the task directly.
>
> This is the simple synchronous wrapper around an async core.
> If we want an async path at some point in the future where we have
> someone using it then sure, we can have an async version that
> takes the callback.

It just seems an unnecessary hunk of code for the core to carry when
it's trivial for a client of the core to do:

task->private = &completion;
task->end_task = complete_completion;
submit_task(task);
wait_for_completion(&completion);
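
Fleshed out as a userspace sketch, assuming pthreads in place of the
kernel completion API and a worker thread in place of the DOE state
machine. struct doe_task, submit_task(), core_worker() and
exchange_sync() are hypothetical names, and the return value of 4 is a
made-up payload size:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct completion. */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

static void complete(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Hypothetical task: the core fills in rv, then calls end_task(private). */
struct doe_task {
	int rv;
	void (*end_task)(void *private);
	void *private;
};

/* Hypothetical async core: handles the task on another thread. */
static void *core_worker(void *arg)
{
	struct doe_task *task = arg;

	assert(task->end_task);
	task->rv = 4;	/* pretend a 4-byte response payload arrived */
	task->end_task(task->private);
	return NULL;
}

static int submit_task(struct doe_task *task, pthread_t *worker)
{
	return pthread_create(worker, NULL, core_worker, task);
}

static void complete_completion(void *private)
{
	complete(private);
}

/* Client-side synchronous wrapper built on the async submit path. */
static int exchange_sync(void)
{
	struct completion c;
	struct doe_task task = {
		.end_task = complete_completion,
		.private = &c,
	};
	pthread_t worker;

	init_completion(&c);
	if (submit_task(&task, &worker))
		return -1;
	wait_for_completion(&c);
	pthread_join(worker, NULL);
	pthread_cond_destroy(&c.cond);
	pthread_mutex_destroy(&c.lock);
	return task.rv;
}
```

A "no wait" client would pass its own end_task() and skip the
wait_for_completion() call entirely.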

> > > +       return task.rv;
> > > +}
> > > +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync);
> > > +
> > > +/**
> > > + * pci_doe_supports_prot() - Return if the DOE instance supports the given
> > > + *                          protocol
> > > + * @doe_dev: DOE mailbox device to query
> > > + * @vid: Protocol Vendor ID
> > > + * @type: protocol type
> > > + *
> > > + * This device can then be passed to pci_doe_exchange_sync() to execute a
> > > + * mailbox exchange through that DOE mailbox.
> > > + *
> > > + * RETURNS: True if the DOE device supports the protocol specified
> > > + */
> > > +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type)
> > > +{
> > > +       struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> > > +       int i;
> > > +
> > > +       if (!doe)
> > > +               return false;
> > > +
> > > +       for (i = 0; i < doe->num_prots; i++)
> > > +               if ((doe->prots[i].vid == vid) &&
> > > +                   (doe->prots[i].type == type))
> > > +                       return true;
> > > +
> > > +       return false;
> > > +}
> > > +EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
> > > +
> > > +static int pci_doe_discovery(struct pci_doe *doe, u8 *index, u16 *vid,
> > > +                            u8 *protocol)
> > > +{
> > > +       u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX,
> > > +                                   *index);
> > > +       u32 response_pl;
> > > +       struct pci_doe_exchange ex = {
> > > +               .prot.vid = PCI_VENDOR_ID_PCI_SIG,
> > > +               .prot.type = PCI_DOE_PROTOCOL_DISCOVERY,
> > > +               .request_pl = &request_pl,
> > > +               .request_pl_sz = sizeof(request_pl),
> > > +               .response_pl = &response_pl,
> > > +               .response_pl_sz = sizeof(response_pl),
> > > +       };
> > > +       int ret;
> > > +
> > > +       ret = pci_doe_exchange_sync(doe->doe_dev, &ex);
> > > +       if (ret < 0)
> > > +               return ret;
> > > +
> > > +       if (ret != sizeof(response_pl))
> > > +               return -EIO;
> > > +
> > > +       *vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> > > +       *protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> > > +                             response_pl);
> > > +       *index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> > > +                          response_pl);
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static int pci_doe_cache_protocols(struct pci_doe *doe)
> > > +{
> > > +       u8 index = 0;
> > > +       int num_prots;
> > > +       int rc;
> > > +
> > > +       /* Discovery protocol must always be supported and must report itself */
> > > +       num_prots = 1;
> > > +       doe->prots = devm_kcalloc(&doe->doe_dev->adev.dev, num_prots,
> > > +                                 sizeof(*doe->prots), GFP_KERNEL);
> > > +       if (doe->prots == NULL)
> > > +               return -ENOMEM;
> > > +
> > > +       do {
> > > +               struct pci_doe_protocol *prot;
> > > +
> > > +               prot = &doe->prots[num_prots - 1];
> > > +               rc = pci_doe_discovery(doe, &index, &prot->vid, &prot->type);
> > > +               if (rc)
> > > +                       return rc;
> > > +
> > > +               if (index) {
> > > +                       struct pci_doe_protocol *prot_new;
> > > +
> > > +                       num_prots++;
> > > +                       prot_new = devm_krealloc(&doe->doe_dev->adev.dev,
> > > +                                                doe->prots,
> > > +                                                sizeof(*doe->prots) *
> > > +                                                       num_prots,
> > > +                                                GFP_KERNEL);
> > > +                       if (prot_new == NULL)
> > > +                               return -ENOMEM;
> > > +                       doe->prots = prot_new;
> > > +               }
> > > +       } while (index);
> > > +
> > > +       doe->num_prots = num_prots;
> > > +       return 0;
> > > +}
> > > +
> > > +static int pci_doe_abort(struct pci_doe *doe)
> > > +{
> > > +       reinit_completion(&doe->abort_c);
> > > +       mutex_lock(&doe->state_lock);
> > > +       doe->abort = true;
> >
> > Why not a flags field where atomic bitops can be used without need for a mutex?
>
> I'll go the other way, why bother with atomics when this isn't a high performance
> path or something expected to happen often?

It obfuscates what the lock is protecting if it's used for state
management and atomic flag management, but I am not holding the pen
here, so I can let this arbitrary trade-off go.
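
As a rough illustration of the flags-field alternative raised above, here is a
userspace sketch that uses C11 atomics in place of the kernel's atomic bitops.
The flag names and the doe_model structure are hypothetical and not taken from
the patch, which keeps separate fields under state_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits; the patch uses separate fields instead. */
enum {
	DOE_FLAG_ABORT = 1u << 0,
	DOE_FLAG_DEAD  = 1u << 1,
};

struct doe_model {
	atomic_uint flags;	/* updated without taking any lock */
};

/* Set a flag; returns true if it was already set. */
static bool doe_test_and_set_flag(struct doe_model *doe, unsigned int flag)
{
	return atomic_fetch_or(&doe->flags, flag) & flag;
}

/* Clear a flag; returns true if it was set beforehand. */
static bool doe_test_and_clear_flag(struct doe_model *doe, unsigned int flag)
{
	return atomic_fetch_and(&doe->flags, ~flag) & flag;
}

/* Read a flag without modifying it. */
static bool doe_test_flag(struct doe_model *doe, unsigned int flag)
{
	return atomic_load(&doe->flags) & flag;
}
```

With this shape the mutex would be left protecting only the state machine
fields, which is the separation of concerns the comment above argues for.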


* Re: [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver
  2022-02-09 16:26       ` Dan Williams
@ 2022-02-09 16:57         ` Jonathan Cameron
  2022-02-09 19:57           ` Dan Williams
  0 siblings, 1 reply; 49+ messages in thread
From: Jonathan Cameron @ 2022-02-09 16:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Weiny, Ira, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, Linux Kernel Mailing List, linux-cxl, Linux PCI

On Wed, 9 Feb 2022 08:26:43 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> On Wed, Feb 9, 2022 at 2:13 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> [..]

...

> 
> >  
> > >  
> > > > +}
> > > > +
> > > > +static int pci_doe_send_req(struct pci_doe *doe, struct pci_doe_exchange *ex)  
> > >
> > > The relationship between tasks, requests, responses, and exchanges is
> > > not immediately clear to me. For example, can this helper be renamed
> > > in terms of its relationship to a task? A theory of operation document
> > > would help, but it seems there is also room for the implementation to
> > > be more self documenting.  
> >
> > Not totally sure what such naming would be.
> >
> > A task is the management wrapper around an exchange which is a request
> > + response pair.  In the sense you queue a task which will carry out
> > an exchange by sending a request and receiving a response.
> >
> > Could rename this pci_doe_start_exchange() but that then obscures
> > that we mean send the request to the hardware and removes the resemblance
> > to what I recall the specification uses.  
> 
> I'm not a big fan of copying spec names *if* Linux has a more
> idiomatic name for the concept. I am mainly reviewing this from the
> perspective that 'struct bio' and 'struct request' naming /
> organization is idiomatic for Linux driver transaction flows. Up to
> this point in the review I was mapping tasks to bios and exchanges to
> requests but then the usage of "req" in this function name threw off
> my ontology. At a minimum a decoder ring style comment, like your
> reply, about the relationship between these terms would help avoid
> this exercise again.

OK. So up to Ira, but my suggestion is go with a comment unless
someone comes up with clearer naming.

Mind you, if we are now exposing the doe_exchange to callers anyway,
we could just squash the structure into the doe_task one and drop
the separation.

The intent before was that doe_exchange held all the stuff related to the
protocol (so buffers etc.) whereas task was about the implementation, but
if we expose struct doe_task anyway that separation becomes a bit pointless.
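
To make the terminology concrete, here is a minimal userspace sketch of what a
squashed structure might look like. The field names follow the ones quoted in
this thread, but the merged layout, the typed end_task callback and the
doe_task_validate() helper are assumptions for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Protocol identifier: Vendor ID plus protocol type. */
struct doe_protocol {
	uint16_t vid;
	uint8_t type;
};

/*
 * Hypothetical merged task: the exchange (one request plus its response)
 * together with the bookkeeping the patch keeps in a separate structure.
 */
struct doe_task {
	/* exchange: what goes over the mailbox */
	struct doe_protocol prot;
	const uint32_t *request_pl;
	size_t request_pl_sz;
	uint32_t *response_pl;
	size_t response_pl_sz;

	/* task: how the result is handed back */
	int rv;			/* response length or negative error */
	void (*end_task)(struct doe_task *task);
	void *private;
};

/* Mirrors the quoted check: requests must be a whole number of DW. */
static int doe_task_validate(const struct doe_task *task)
{
	if (task->request_pl_sz % sizeof(uint32_t))
		return -EINVAL;
	return 0;
}
```

Read this way, a task is what gets queued, and the exchange is the
request/response pair it carries.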

> 
> > > > +       case DOE_WAIT_ABORT:
> > > > +       case DOE_WAIT_ABORT_ON_ERR:
> > > > +               pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > +
> > > > +               if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> > > > +                   !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> > > > +                       /* Back to normal state - carry on */
> > > > +                       mutex_lock(&doe->state_lock);
> > > > +                       doe->cur_task = NULL;
> > > > +                       mutex_unlock(&doe->state_lock);
> > > > +                       wake_up_interruptible(&doe->wq);
> > > > +
> > > > +                       /*
> > > > +                        * For deliberately triggered abort, someone is
> > > > +                        * waiting.
> > > > +                        */
> > > > +                       if (doe->state == DOE_WAIT_ABORT)
> > > > +                               complete(&doe->abort_c);  
> > >
> > > Why is a completion and waitqueue needed? I.e. a waiter could simply
> > > look for an abort completion flag to be set instead.  
> >
> > You mean use the main completion (the one for the non abort case)
> > and a flag?
> >
> > Or a wait_event() with appropriate check?
> >
> > Could do that but I'm not sure I understand why we care either way?  
> 
> Just reduction in machinery that needs to be maintained /
> comprehended. 2 wait primitives when one will do will always be a
> tempting cleanup target.

Ah. Fair enough - it looks like using the same completion won't
be a huge addition in complexity.

> 
> [..]
> > > > +/**
> > > > + * pci_doe_exchange_sync() - Send a request, then wait for and receive a
> > > > + *                          response
> > > > + * @doe_dev: DOE mailbox state structure
> > > > + * @ex: Description of the buffers and Vendor ID + type used in this
> > > > + *      request/response pair
> > > > + *
> > > > + * Excess data will be discarded.
> > > > + *
> > > > + * RETURNS: payload in bytes on success, < 0 on error
> > > > + */
> > > > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev,
> > > > +                         struct pci_doe_exchange *ex)
> > > > +{
> > > > +       struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> > > > +       struct pci_doe_task task;
> > > > +       DECLARE_COMPLETION_ONSTACK(c);
> > > > +
> > > > +       if (!doe)
> > > > +               return -EAGAIN;
> > > > +
> > > > +       /* DOE requests must be a whole number of DW */
> > > > +       if (ex->request_pl_sz % sizeof(u32))
> > > > +               return -EINVAL;
> > > > +
> > > > +       task.ex = ex;
> > > > +       task.cb = pci_doe_task_complete;
> > > > +       task.private = &c;
> > > > +
> > > > +again:
> > > > +       mutex_lock(&doe->state_lock);
> > > > +       if (doe->cur_task) {
> > > > +               mutex_unlock(&doe->state_lock);
> > > > +               wait_event_interruptible(doe->wq, doe->cur_task == NULL);
> > > > +               goto again;
> > > > +       }
> > > > +
> > > > +       if (doe->dead) {
> > > > +               mutex_unlock(&doe->state_lock);
> > > > +               return -EIO;
> > > > +       }
> > > > +       doe->cur_task = &task;
> > > > +       schedule_delayed_work(&doe->statemachine, 0);
> > > > +       mutex_unlock(&doe->state_lock);
> > > > +
> > > > +       wait_for_completion(&c);  
> > >
> > > I would expect that the caller of this routine would want to specify
> > > the task and end_task() callback and use that as the completion
> > > signal. It may also want "no wait" behavior where it is prepared for
> > > the DOE result to come back sometime later. With that change the
> > > exchange fields can move into the task directly.  
> >
> > This is the simple synchronous wrapper around an async core.
> > If we want an async path at some point in the future where we have
> > someone using it then sure, we can have an async version that
> > takes the callback.  
> 
> It just seems an unnecessary hunk of code for the core to carry when
> it's trivial for a client of the core to do:
> 
> task->private = &completion;
> task->end_task = complete_completion;
> submit_task()
> wait_for_completion(&completion);

OK, we can move this to the callers, though the function will obviously
also need renaming - I guess to pci_doe_exchange() - and will now need to
take a task rather than the exchange.

I personally slightly prefer the layered approach, but don't care that
strongly.
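
For reference, the two layerings being compared can be modelled in userspace
roughly as below. This is only a sketch: doe_submit_task(), doe_core_run() and
the one-slot queue are hypothetical stand-ins for the core, and a bool flag
plays the role of the struct completion that a real caller would wait on with
wait_for_completion().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct doe_task {
	int rv;						/* result filled in by the core */
	void (*end_task)(struct doe_task *task);	/* completion callback */
	void *private;					/* caller-owned context */
};

/* Stand-in for the core: a one-slot queue drained by doe_core_run(). */
static struct doe_task *pending;

static int doe_submit_task(struct doe_task *task)
{
	if (pending)
		return -EBUSY;	/* the quoted driver waits here instead */
	pending = task;
	return 0;
}

/* Stand-in for the state machine work item finishing an exchange. */
static void doe_core_run(int result)
{
	struct doe_task *task = pending;

	if (!task)
		return;
	pending = NULL;
	task->rv = result;
	task->end_task(task);
}

/* Caller side: setting the flag models complete(). */
static void doe_signal_done(struct doe_task *task)
{
	*(bool *)task->private = true;
}

/* The synchronous wrapper, written entirely in terms of the async core. */
static int doe_exchange_sync(int simulated_result)
{
	bool done = false;
	struct doe_task task = {
		.end_task = doe_signal_done,
		.private = &done,
	};
	int rc;

	rc = doe_submit_task(&task);
	if (rc)
		return rc;

	/* Models wait_for_completion(); here the loop drives the core. */
	while (!done)
		doe_core_run(simulated_result);

	return task.rv;
}
```

Whether doe_exchange_sync() lives in the core or in each client is exactly the
layering choice under discussion; the core-facing interface is the same either
way.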

> 
> > > > +       return task.rv;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync);
> > > > +
> > > > +/**
> > > > + * pci_doe_supports_prot() - Return if the DOE instance supports the given
> > > > + *                          protocol
> > > > + * @pdev: Device on which to find the DOE instance
> > > > + * @vid: Protocol Vendor ID
> > > > + * @type: protocol type
> > > > + *
> > > > + * This device can then be passed to pci_doe_exchange_sync() to execute a
> > > > + * mailbox exchange through that DOE mailbox.
> > > > + *
> > > > + * RETURNS: True if the DOE device supports the protocol specified
> > > > + */
> > > > +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type)
> > > > +{
> > > > +       struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> > > > +       int i;
> > > > +
> > > > +       if (!doe)
> > > > +               return false;
> > > > +
> > > > +       for (i = 0; i < doe->num_prots; i++)
> > > > +               if ((doe->prots[i].vid == vid) &&
> > > > +                   (doe->prots[i].type == type))
> > > > +                       return true;
> > > > +
> > > > +       return false;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
> > > > +
> > > > +static int pci_doe_discovery(struct pci_doe *doe, u8 *index, u16 *vid,
> > > > +                            u8 *protocol)
> > > > +{
> > > > +       u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX,
> > > > +                                   *index);
> > > > +       u32 response_pl;
> > > > +       struct pci_doe_exchange ex = {
> > > > +               .prot.vid = PCI_VENDOR_ID_PCI_SIG,
> > > > +               .prot.type = PCI_DOE_PROTOCOL_DISCOVERY,
> > > > +               .request_pl = &request_pl,
> > > > +               .request_pl_sz = sizeof(request_pl),
> > > > +               .response_pl = &response_pl,
> > > > +               .response_pl_sz = sizeof(response_pl),
> > > > +       };
> > > > +       int ret;
> > > > +
> > > > +       ret = pci_doe_exchange_sync(doe->doe_dev, &ex);
> > > > +       if (ret < 0)
> > > > +               return ret;
> > > > +
> > > > +       if (ret != sizeof(response_pl))
> > > > +               return -EIO;
> > > > +
> > > > +       *vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> > > > +       *protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> > > > +                             response_pl);
> > > > +       *index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> > > > +                          response_pl);
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +static int pci_doe_cache_protocols(struct pci_doe *doe)
> > > > +{
> > > > +       u8 index = 0;
> > > > +       int num_prots;
> > > > +       int rc;
> > > > +
> > > > +       /* Discovery protocol must always be supported and must report itself */
> > > > +       num_prots = 1;
> > > > +       doe->prots = devm_kcalloc(&doe->doe_dev->adev.dev, num_prots,
> > > > +                                 sizeof(*doe->prots), GFP_KERNEL);
> > > > +       if (doe->prots == NULL)
> > > > +               return -ENOMEM;
> > > > +
> > > > +       do {
> > > > +               struct pci_doe_protocol *prot;
> > > > +
> > > > +               prot = &doe->prots[num_prots - 1];
> > > > +               rc = pci_doe_discovery(doe, &index, &prot->vid, &prot->type);
> > > > +               if (rc)
> > > > +                       return rc;
> > > > +
> > > > +               if (index) {
> > > > +                       struct pci_doe_protocol *prot_new;
> > > > +
> > > > +                       num_prots++;
> > > > +                       prot_new = devm_krealloc(&doe->doe_dev->adev.dev,
> > > > +                                                doe->prots,
> > > > +                                                sizeof(*doe->prots) *
> > > > +                                                       num_prots,
> > > > +                                                GFP_KERNEL);
> > > > +                       if (prot_new == NULL)
> > > > +                               return -ENOMEM;
> > > > +                       doe->prots = prot_new;
> > > > +               }
> > > > +       } while (index);
> > > > +
> > > > +       doe->num_prots = num_prots;
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +static int pci_doe_abort(struct pci_doe *doe)
> > > > +{
> > > > +       reinit_completion(&doe->abort_c);
> > > > +       mutex_lock(&doe->state_lock);
> > > > +       doe->abort = true;  
> > >
> > > Why not a flags field where atomic bitops can be used without need for a mutex?
> >
> > I'll go the other way, why bother with atomics when this isn't a high performance
> > path or something expected to happen often?  
> 
> It obfuscates what the lock is protecting if it's used for state
> management and atomic flag management, but I am not holding the pen
> here, so I can let this arbitrary trade-off go.

Sure, given Ira is now doing the leg work, up to Ira or other reviewers.

Jonathan




* Re: [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver
  2022-02-09 16:57         ` Jonathan Cameron
@ 2022-02-09 19:57           ` Dan Williams
  2022-02-10 21:51             ` Ira Weiny
  0 siblings, 1 reply; 49+ messages in thread
From: Dan Williams @ 2022-02-09 19:57 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Weiny, Ira, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, Linux Kernel Mailing List, linux-cxl, Linux PCI

On Wed, Feb 9, 2022 at 8:58 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
[..]
> > It just seems an unnecessary hunk of code for the core to carry when
> > it's trivial for a client of the core to do:
> >
> > task->private = &completion;
> > task->end_task = complete_completion;
> > submit_task()
> > wait_for_completion(&completion);
>
> OK, we can move this to the callers, though the function will obviously
> also need renaming - I guess to pci_doe_exchange() - and will now need to
> take a task rather than the exchange.
>
> I personally slightly prefer the layered approach, but don't care that
> strongly.

Like I said, you and Ira are holding the pen, so if you decide to keep
the layering, just document the ontology somewhere and I'll let it go.


* Re: [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver
  2022-02-09 19:57           ` Dan Williams
@ 2022-02-10 21:51             ` Ira Weiny
  0 siblings, 0 replies; 49+ messages in thread
From: Ira Weiny @ 2022-02-10 21:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jonathan Cameron, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, Linux Kernel Mailing List, linux-cxl, Linux PCI

On Wed, Feb 09, 2022 at 11:57:38AM -0800, Dan Williams wrote:
> On Wed, Feb 9, 2022 at 8:58 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> [..]
> > > It just seems an unnecessary hunk of code for the core to carry when
> > > it's trivial for a client of the core to do:
> > >
> > > task->private = &completion;
> > > task->end_task = complete_completion;
> > > submit_task()
> > > wait_for_completion(&completion);
> >
> > OK, we can move this to the callers, though the function will obviously
> > also need renaming - I guess to pci_doe_exchange() - and will now need to
> > take a task rather than the exchange.
> >
> > I personally slightly prefer the layered approach, but don't care that
> > strongly.
> 
> Like I said, you and Ira are holding the pen, so if you decide to keep
> the layering, just document the ontology somewhere and I'll let it go.

I'm busy with the PKS series ATM but I should get back to reviewing all these
comments soon.

Ira


* Re: [PATCH V6 04/10] PCI/DOE: Introduce pci_doe_create_doe_devices
  2022-02-04 16:27       ` Bjorn Helgaas
@ 2022-02-11  2:54         ` Dan Williams
  0 siblings, 0 replies; 49+ messages in thread
From: Dan Williams @ 2022-02-11  2:54 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jonathan Cameron, Weiny, Ira, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, Ben Widawsky, Linux Kernel Mailing List, linux-cxl,
	Linux PCI

On Fri, Feb 4, 2022 at 8:28 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Fri, Feb 04, 2022 at 02:51:16PM +0000, Jonathan Cameron wrote:
> > On Thu, 3 Feb 2022 16:44:37 -0600
> > Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Mon, Jan 31, 2022 at 11:19:46PM -0800, ira.weiny@intel.com wrote:
>
> > > > + * pci_doe_create_doe_devices - Create auxiliary DOE devices for all DOE
> > > > + *                              mailboxes found
> > > > + * @pci_dev: The PCI device to scan for DOE mailboxes
> > > > + *
> > > > + * There is no corresponding destroy of these devices.  This function associates
> > > > + * the DOE auxiliary devices created with the pci_dev passed in.  That
> > > > + * association is device managed (devm_*) such that the DOE auxiliary device
> > > > + * lifetime is always greater than or equal to the lifetime of the pci_dev.
> > >
> > > This seems backwards.  What does it mean if the DOE aux dev
> > > lifetime is *greater* than that of the pci_dev?  Surely you can't
> > > access a PCI DOE Capability if the pci_dev is gone?
> >
> > I think the description is inaccurate - the end of life is the same
> > as that of the PCI driver binding to the pci_dev.  It'll get cleared
> > up if that is unbound etc.
>
> I don't know much about devm, but I *think* the devm things get
> released by devres_release_all(), which is called by
> __device_release_driver() after it calls the bus or driver's .remove()
> method (pci_device_remove(), in this case).
>
> So in this case, I think the aux dev is created after the pci_dev and
> released after the PCI driver and the PCI core are done with the
> pci_dev.  I assume some refcounting prevents the pci_dev from actually
> being deallocated until the aux dev is done with it.
>
> I'm not confident that this is a robust situation.

devm is a replacement for hand coding driver ->remove() handlers.
Anything devm allocated at ->probe() will be freed in the proper
reverse order by the driver core after it calls ->remove(). Ideally
for pure devm usage the ->remove() handler can be elided altogether.
I'll go read this patch to make sure it follows the expected pattern
which is:

1/ Parent device driver performs kmalloc(), device_initialize(), and
device_add() of a child device.
2/ Parent registers a devm handler for that child device that will
trigger device_unregister() at remove

During parent device unregister or unbind the devm action will
complete device_unregister() for all children first.

That process is independent of the device lifetime that can be
arbitrarily extended by 3rd party get_device() or
CONFIG_DEBUG_KOBJECT_RELEASE. The device core / kobject hierarchy
guarantees that the parent device is pinned until after child-device
final put event. I.e. final put_device() on a child also triggers a
put_device() on the parent paired with the get_device() taken on the
parent at device_add() time.
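
The release ordering described above can be modelled in plain C. The sketch
below is a userspace toy, not the devres implementation: model_add_action() is
only loosely analogous to devm_add_action(), and every name in it is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ACTIONS 8

typedef void (*release_fn)(void *data);

struct devres_model {
	release_fn fn[MAX_ACTIONS];
	void *data[MAX_ACTIONS];
	size_t count;
};

/* Remember how to undo a setup step performed during probe. */
static int model_add_action(struct devres_model *dr, release_fn fn, void *data)
{
	if (dr->count == MAX_ACTIONS)
		return -1;
	dr->fn[dr->count] = fn;
	dr->data[dr->count] = data;
	dr->count++;
	return 0;
}

/* Run at unbind, after ->remove(): undo in reverse registration order. */
static void model_release_all(struct devres_model *dr)
{
	while (dr->count) {
		dr->count--;
		dr->fn[dr->count](dr->data[dr->count]);
	}
}

/* Helper types so the release order can be observed. */
struct order_log {
	char buf[MAX_ACTIONS + 1];
	size_t len;
};

struct tagged {
	struct order_log *olog;
	char tag;
};

/* Release callback: append this resource's tag to the log. */
static void record_release(void *data)
{
	struct tagged *t = data;

	t->olog->buf[t->olog->len++] = t->tag;
	t->olog->buf[t->olog->len] = '\0';
}
```

Registering A, B and C and then releasing everything visits them as C, B, A,
which is why a child unregister action registered after the parent's own
resources runs before those resources are freed.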


* Re: [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver
  2022-02-03 22:40   ` Bjorn Helgaas
@ 2022-03-15 21:48     ` Ira Weiny
  0 siblings, 0 replies; 49+ messages in thread
From: Ira Weiny @ 2022-03-15 21:48 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, Feb 03, 2022 at 04:40:27PM -0600, Bjorn Helgaas wrote:
> On Mon, Jan 31, 2022 at 11:19:45PM -0800, ira.weiny@intel.com wrote:
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 
> > Introduced in a PCI ECN [1], DOE provides a config space based mailbox
> > with standard protocol discovery.  Each mailbox is accessed through a
> > DOE Extended Capability.
> > 
> > Define an auxiliary device driver which control DOE auxiliary devices
> > registered on the auxiliary bus.
> > 
> > A DOE mailbox is allowed to support any number of protocols while some
> > DOE protocol specifications apply additional restrictions.
> > 
> > The protocols supported are queried and cached.  pci_doe_supports_prot()
> > can be used to determine if the DOE device supports the protocol
> > specified.
> > 
> > A synchronous interface is provided in pci_doe_exchange_sync() to
> > perform a single query / response exchange from the driver through the
> > device specified.
> > 
> > Testing was conducted against QEMU using:
> > 
> > https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/
> > 
> > This code is based on Jonathan's V4 series here:
> > 
> > https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/
> 
> Details like references to previous versions can go below the "---"
> so they are omitted from the merged commit.  Many/most maintainers now
> include a Link: tag that facilitates tracing back from a commit to the
> mailing list history.

Done.

> 
> > [1] https://members.pcisig.com/wg/PCI-SIG/document/14143
> >     Data Object Exchange (DOE) - Approved 12 March 2020
> 
> Please update the "PCI ECN" text above and this citation to PCIe r6.0,
> sec 6.30.  No need to reference the ECN now that it's part of the
> published spec.

Done.

> 
> > +config PCI_DOE_DRIVER
> > +	tristate "PCI Data Object Exchange (DOE) driver"
> > +	select AUXILIARY_BUS
> > +	help
> > +	  Driver for DOE auxiliary devices.
> > +
> > +	  DOE provides a simple mailbox in PCI config space that is used by a
> > +	  number of different protocols.  DOE is defined in the Data Object
> > +	  Exchange ECN to the PCIe r5.0 spec.
> 
> Not sure this is relevant in Kconfig help, but if it is, update the
> citation to PCIe r6.0, sec 6.30.

I removed it.  I agree it will probably not age well.

> 
> > +obj-$(CONFIG_PCI_DOE_DRIVER)	+= pci-doe.o
> >  obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
> >  
> > +pci-doe-y := doe.o
> 
> Why do we need this doe.o to pci-doe.o dance?  Why not just rename
> doe.c to pci-doe.c?  It looks like that's what we do with pci-stub.c
> and pci-pf-stub.c, which are also tristate.

Not sure.  I think I may have just carried that from the cxl side when I moved
it here to pci.

I agree pci-doe is good.  I'll adjust the series as needed.

> 
> > +++ b/drivers/pci/doe.c
> > @@ -0,0 +1,675 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Data Object Exchange ECN
> > + * https://members.pcisig.com/wg/PCI-SIG/document/14143
> 
> Update citation.  Maybe copyright dates, too.



> 
> > + * Copyright (C) 2021 Huawei
> 
> > +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */
> 
> Update citation.

Done.

> 
> > +/**
> > + * struct pci_doe - A single DOE mailbox driver
> > + *
> > + * @doe_dev: The DOE Auxiliary device being driven
> > + * @abort_c: Completion used for initial abort handling
> > + * @irq: Interrupt used for signaling DOE ready or abort
> > + * @irq_name: Name used to identify the irq for a particular DOE
> 
> s/ irq / IRQ /

Done.

> 
> > +static int pci_doe_cache_protocols(struct pci_doe *doe)
> > +{
> > +	u8 index = 0;
> > +	int num_prots;
> > +	int rc;
> > +
> > +	/* Discovery protocol must always be supported and must report itself */
> > +	num_prots = 1;
> > +	doe->prots = devm_kcalloc(&doe->doe_dev->adev.dev, num_prots,
> > +				  sizeof(*doe->prots), GFP_KERNEL);
> > +	if (doe->prots == NULL)
> 
> More idiomatic (and as you did below):
> 
>   if (!doe->prots)

Done.

> 
> > +		return -ENOMEM;
> > +
> > +	do {
> > +		struct pci_doe_protocol *prot;
> > +
> > +		prot = &doe->prots[num_prots - 1];
> > +		rc = pci_doe_discovery(doe, &index, &prot->vid, &prot->type);
> > +		if (rc)
> > +			return rc;
> > +
> > +		if (index) {
> > +			struct pci_doe_protocol *prot_new;
> > +
> > +			num_prots++;
> > +			prot_new = devm_krealloc(&doe->doe_dev->adev.dev,
> > +						 doe->prots,
> > +						 sizeof(*doe->prots) *
> > +							num_prots,
> > +						 GFP_KERNEL);
> > +			if (prot_new == NULL)
> 
> Ditto.

Done.

> 
> > +				return -ENOMEM;
> > +			doe->prots = prot_new;
> > +		}
> > +	} while (index);
> > +
> > +	doe->num_prots = num_prots;
> > +	return 0;
> > +}
> 
> > +static int pci_doe_reg_irq(struct pci_doe *doe)
> > +{
> > +	struct pci_dev *pdev = doe->doe_dev->pdev;
> > +	bool poll = !pci_dev_msi_enabled(pdev);
> > +	int offset = doe->doe_dev->cap_offset;
> > +	int rc, irq;
> > +	u32 val;
> > +
> 
>   if (poll)
>     return 0;
> 
> or maybe just:
> 
>   if (!pci_dev_msi_enabled(pdev))
>     return 0;
> 
> No need to read PCI_DOE_CAP or indent all this code.

Good point. done.

> 
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
> > +
> > +	if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) {
> > +		irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val));
> > +		if (irq < 0)
> > +			return irq;
> > +
> > +		doe->irq_name = devm_kasprintf(&doe->doe_dev->adev.dev,
> > +						GFP_KERNEL,
> > +						"DOE[%s]",
> 
> Fill line.

Done.

> 
> > +						doe->doe_dev->adev.name);
> > +		if (!doe->irq_name)
> > +			return -ENOMEM;
> > +
> > +		rc = devm_request_irq(&pdev->dev, irq, pci_doe_irq, 0,
> > +				      doe->irq_name, doe);
> > +		if (rc)
> > +			return rc;
> > +
> > +		doe->irq = irq;
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
> > +				       PCI_DOE_CTRL_INT_EN);
> > +	}
> > +
> > +	return 0;
> > +}
> 
> > +static int pci_doe_probe(struct auxiliary_device *aux_dev,
> > +			 const struct auxiliary_device_id *id)
> > +{
> > +	struct pci_doe_dev *doe_dev = container_of(aux_dev,
> > +					struct pci_doe_dev,
> > +					adev);
> 
> Fill line.

Done.

> 
> > +	struct pci_doe *doe;
> > +	int rc;
> > +
> > +	doe = devm_kzalloc(&aux_dev->dev, sizeof(*doe), GFP_KERNEL);
> > +	if (!doe)
> > +		return -ENOMEM;
> > +
> > +	mutex_init(&doe->state_lock);
> > +	init_completion(&doe->abort_c);
> > +	doe->doe_dev = doe_dev;
> > +	init_waitqueue_head(&doe->wq);
> > +	INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work);
> > +	dev_set_drvdata(&aux_dev->dev, doe);
> > +
> > +	rc = pci_doe_reg_irq(doe);
> 
> "request_irq" or "setup_irq" or something?  "reg" is a little
> ambiguous.

Ok pci_doe_request_irq() since we are wrapping devm_request_irq()

> 
> > +	if (rc)
> > +		return rc;
> > +
> > +	/* Reset the mailbox by issuing an abort */
> > +	rc = pci_doe_abort(doe);
> > +	if (rc)
> > +		return rc;
> > +
> > +	rc = pci_doe_cache_protocols(doe);
> > +	if (rc)
> > +		return rc;
> > +
> > +	return 0;
> 
> Same as:
> 
>   return pci_doe_cache_protocols(doe);

Done.

> 
> > +static int __init pci_doe_init_module(void)
> > +{
> > +	int ret;
> > +
> > +	ret = auxiliary_driver_register(&pci_doe_auxiliary_drv);
> > +	if (ret) {
> > +		pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n",
> > +		       ret);
> > +		return ret;
> > +	}
> > +
> > +	return 0;
> 
> Same as:
> 
>   if (ret)
>     pr_err(...);
> 
>   return ret;

Done.

> 
> > +++ b/include/linux/pci-doe.h
> > @@ -0,0 +1,60 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec.
> 
> Update citation.

Done.

> 
> > +struct pci_doe_dev {
> > +	struct auxiliary_device adev;
> > +	struct pci_dev *pdev;
> > +	int cap_offset;
> 
> Can you name this "doe_cap", in the style of "msi_cap", "msix_cap",
> etc?

I can.  However, I feel I have to point out that msi[x]_cap both have comments
thusly:

        u8                      msi_cap;        /* MSI capability offset */
        u8                      msix_cap;       /* MSI-X capability offset */

Whereas cap_offset when read in the code is nicely self documenting.

...
        int offset = doe->doe_dev->cap_offset;
...

So I feel like it should be left as cap_offset;

Also there are 2 drivers which use cap_offset as well.


drivers/pci/hotplug/shpchp.h|102| u32 cap_offset;
drivers/pci/hotplug/shpchp_hpc.c|201| u32 cap_offset = ctrl->cap_offset;

drivers/pci/controller/pcie-altera.c|112| u32 cap_offset;               /* PCIe capability structure register offset */
drivers/pci/controller/pcie-altera.c|144| pcie->pcie_data->cap_offset +

And I don't think the comment on the altera device is needed...

So let me know if you feel strongly enough to change it.

Ira


* Re: [PATCH V6 02/10] PCI: Replace magic constant for PCI Sig Vendor ID
  2022-02-04 21:49   ` Bjorn Helgaas
@ 2022-03-15 21:48     ` Ira Weiny
  0 siblings, 0 replies; 49+ messages in thread
From: Ira Weiny @ 2022-03-15 21:48 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Fri, Feb 04, 2022 at 03:49:02PM -0600, Bjorn Helgaas wrote:
> On Mon, Jan 31, 2022 at 11:19:44PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Based on Bjorn's suggestion[1], now that the PCI Sig Vendor ID is
> > defined the define should be used in pci_bus_crs_vendor_id() rather than
> > the hard coded magic value.
> > 
> > Replace the magic value in pci_bus_crs_vendor_id() with
> > PCI_VENDOR_ID_PCI_SIG.
>  
> This sentence is plenty; no attribution or link needed.  I appreciate
> the acknowledgement, but replacing a magic value isn't a better idea
> simply because *I* suggested it ;)

Done.

> 
> > [1] https://lore.kernel.org/linux-cxl/20211117215044.GA1777828@bhelgaas/
> > 
> > Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Thanks,
Ira

> 
> > ---
> >  drivers/pci/probe.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> > index 087d3658f75c..d92dbb136fc9 100644
> > --- a/drivers/pci/probe.c
> > +++ b/drivers/pci/probe.c
> > @@ -2318,7 +2318,7 @@ EXPORT_SYMBOL(pci_alloc_dev);
> >  
> >  static bool pci_bus_crs_vendor_id(u32 l)
> >  {
> > -	return (l & 0xffff) == 0x0001;
> > +	return (l & 0xffff) == PCI_VENDOR_ID_PCI_SIG;
> >  }
> >  
> >  static bool pci_bus_wait_crs(struct pci_bus *bus, int devfn, u32 *l,
> > -- 
> > 2.31.1
> > 


* Re: [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver
  2022-02-09  0:59   ` Dan Williams
  2022-02-09 10:13     ` Jonathan Cameron
@ 2022-03-16 22:50     ` Ira Weiny
  2022-03-17 19:37       ` Ira Weiny
  1 sibling, 1 reply; 49+ messages in thread
From: Ira Weiny @ 2022-03-16 22:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jonathan Cameron, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, Linux Kernel Mailing List, linux-cxl, Linux PCI

On Tue, Feb 08, 2022 at 04:59:39PM -0800, Dan Williams wrote:
> On Mon, Jan 31, 2022 at 11:20 PM <ira.weiny@intel.com> wrote:
> >
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >
> > Introduced in a PCI ECN [1], DOE provides a config space based mailbox
> > with standard protocol discovery.  Each mailbox is accessed through a
> > DOE Extended Capability.
> >
> > Define an auxiliary device driver which control DOE auxiliary devices
> 
> s/control/controls/

Done.

[snip]

> >
> > diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> > index 43e615aa12ff..8de51b64067c 100644
> > --- a/drivers/pci/Kconfig
> > +++ b/drivers/pci/Kconfig
> > @@ -118,6 +118,16 @@ config XEN_PCIDEV_FRONTEND
> >           The PCI device frontend driver allows the kernel to import arbitrary
> >           PCI devices from a PCI backend to support PCI driver domains.
> >
> > +config PCI_DOE_DRIVER
> > +       tristate "PCI Data Object Exchange (DOE) driver"
> > +       select AUXILIARY_BUS
> 
> See below near the comment about the odd usage of MODULE_DEVICE_TABLE,
> perhaps the auxiliary device / driver should be registered by the
> client of this core code, not the core itself.

I'll look into it more.

> > +
> > +/**
> > + * struct pci_doe_task - description of a query / response task
> > + * @ex: The details of the task to be done
> > + * @rv: Return value.  Length of received response or error
> > + * @cb: Callback for completion of task
> > + * @private: Private data passed to callback on completion
> > + */
> > +struct pci_doe_task {
> > +       struct pci_doe_exchange *ex;
> > +       int rv;
> > +       void (*cb)(void *private);
> 
> s/cb/end_task/?

Sounds good.

> 
> Why does this need to abandon all semblance of type safety?
> 
> I would expect:
> 
> void (*end_task)(struct pci_doe_task *task);
> 
> ...and let the caller attach any follow on data to task->private if necessary.

I kind of like this too.  But to be fair the code pattern is likely going
to be the caller only using private in end_task() which remains opaque.

I'll change it.
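
For illustration, a minimal user-space sketch of the typed callback shape
suggested above.  This is not the code from the series: the struct is trimmed
to the fields under discussion, and demo_waiter, demo_end_task() and
demo_complete_task() are hypothetical names made up for the example.

```c
#include <stddef.h>

/* Trimmed-down task: only the fields discussed in this part of the review. */
struct pci_doe_task {
	int rv;                                      /* response length or -errno */
	void (*end_task)(struct pci_doe_task *task); /* typed completion hook */
	void *private;                               /* opaque caller data */
};

/* Hypothetical caller-side state, reached only through task->private. */
struct demo_waiter {
	int done;
	int result;
};

/* The callback receives the task itself, so rv is available type-safely. */
static void demo_end_task(struct pci_doe_task *task)
{
	struct demo_waiter *w = task->private;

	w->result = task->rv;
	w->done = 1;
}

/* Stand-in for the state machine finishing a task (assumption). */
static void demo_complete_task(struct pci_doe_task *task, int rv)
{
	task->rv = rv;
	if (task->end_task)
		task->end_task(task);
}
```

The private pointer stays opaque to the core; only the caller's end_task()
knows what it points at, which matches the usage pattern described above.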

> 
> > +       void *private;
> > +};
> > +
> > +/**
> > + * struct pci_doe - A single DOE mailbox driver
> 
> This is driver *state*, right? I.e. not something that wraps "struct
> device_driver" which is what I would expect something claiming to be a
> "driver" would do.

Yes.

> 
> > + *
> > + * @doe_dev: The DOE Auxiliary device being driven
> > + * @abort_c: Completion used for initial abort handling
> > + * @irq: Interrupt used for signaling DOE ready or abort
> > + * @irq_name: Name used to identify the irq for a particular DOE
> > + * @prots: Array of identifiers for protocols supported
> 
> "prot" already has a meaning in the kernel, just spell out
> "protocols". This also looks like something that can be allocated
> inline rather than out of line i.e.:
> 
> struct pci_doe {
> ...
>     int nr_protocols
>     struct pci_doe_protocol protocols[];
> }
> 
> ...and then use struct_size() to allocate it.

As Jonathan said this needs to be more dynamic than that.  I'll add a comment.

[snip]

> > +
> > +/*
> > + * Only call when safe to directly access the DOE, either because no tasks yet
> > + * queued, or called from doe_statemachine_work() which has exclusive access to
> > + * the DOE config space.
> 
> It doesn't have exclusive access unless the patch to lock out
> userspace config writes are revived. Instead, I like Bjorn's idea of
> tracking and warning / tainting, but not blocking conflicting
> userspace access to sensitive configuration registers.
> 
> Yes, it was somewhat of a throw-away comment from Bjorn in that
> thread, "(and IMO should taint the kernel)", but DOE can do so much
> subtle damage (compliance test modes, link-encryption / disruption,
> vendor private who-knows-what...) that I think it behooves us as
> kernel developers to know when we are debugging system behavior that
> may be the result of non-kernel mitigated DOE access. The proposal is
> that when kernel lockdown is not enabled, use the approach from the
> exclusive config access patch [2] to trap, warn (once per device?),
> and taint when userspace writes to DOE registers that have been
> claimed by the kernel. This lets strict environments use
> kernel-lockdown to block userspace DOE access altogether, in
> non-strict environment it discourages userspace from clobbering DOE
> driver state, and it allows a warn-free path if userspace takes the
> step of at least unbinding the kernel DOE driver before running
> userspace DOE cycles.
> 
> [1]: https://lore.kernel.org/r/20211203235617.GA3036259@bhelgaas
> [2]: https://lore.kernel.org/all/161663543465.1867664.5674061943008380442.stgit@dwillia2-desk3.amr.corp.intel.com/
> 

Reading through the threads there seems to be a number
of ways to deal with this.  Let me try to enumerate them:

1) Use pci_cfg_access_lock() to block user space access while the kernel DOE
   state machine is working
   	issue - how to ensure the kernel does not interrupt an in progress user
	space transaction.

2) Develop a user space interface to marshal the transactions through with the
   kernel transactions
   	issue - requires new user space APIs

3) Add Dan's patch [2] above which allows for exclusive claim of the resource.
   Require user space to unload driver for the device which then allows user
   space access without worry of conflict with kernel space.

I'm going in the direction of #3 via the use of the auxiliary bus.

To that end I'm going to pull [2] into this series and use it within the DOE
driver.
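
As a rough user-space model of the trap / warn once / taint policy described
in the quoted proposal (not the patch in [2] itself): every name below is
hypothetical, the claimed range stands in for the DOE registers owned by the
kernel driver, and the assumption that lockdown rejects all user config writes
is a simplification for the sketch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-device bookkeeping for a kernel-claimed config range. */
struct demo_dev {
	unsigned int claim_start; /* first claimed config offset */
	unsigned int claim_len;   /* 0: nothing claimed (driver unbound) */
	bool warned;              /* warn only once per device */
	bool tainted;             /* stands in for tainting the kernel */
};

enum demo_verdict { DEMO_ALLOW, DEMO_ALLOW_TAINT, DEMO_BLOCK };

static bool demo_overlaps(const struct demo_dev *d, unsigned int off,
			  unsigned int len)
{
	return d->claim_len && off < d->claim_start + d->claim_len &&
	       d->claim_start < off + len;
}

/* Decide what to do with a user space config write of len bytes at off. */
static enum demo_verdict demo_user_cfg_write(struct demo_dev *d,
					     unsigned int off,
					     unsigned int len, bool lockdown)
{
	/* Assumption: strict (lockdown) environments reject the write. */
	if (lockdown)
		return DEMO_BLOCK;

	/* Writes outside the claimed range are not interesting. */
	if (!demo_overlaps(d, off, len))
		return DEMO_ALLOW;

	if (!d->warned) {
		fprintf(stderr, "user write to kernel-claimed config at %#x\n",
			off);
		d->warned = true;
	}
	d->tainted = true;
	return DEMO_ALLOW_TAINT;
}
```

Unbinding the driver corresponds to clearing claim_len, after which writes go
through without a warning, which is the warn-free path mentioned above.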

> > + */
> > +static void pci_doe_abort_start(struct pci_doe *doe)
> > +{
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       int offset = doe->doe_dev->cap_offset;
> > +       u32 val;
> > +
> > +       val = PCI_DOE_CTRL_ABORT;
> > +       if (doe->irq)
> > +               val |= PCI_DOE_CTRL_INT_EN;
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> > +
> > +       doe->timeout_jiffies = jiffies + HZ;
> > +       schedule_delayed_work(&doe->statemachine, HZ);
> 
> Given the spec timeout is 1 second and the device clock might be
> slightly off from the host clock how about make this a more generous
> 1.5 or 2 seconds?

Also I wonder why this was not using the PCI_DOE_TIMEOUT?

> 
> > +}
> > +
> > +static int pci_doe_send_req(struct pci_doe *doe, struct pci_doe_exchange *ex)
> 
> The relationship between tasks, requests, responses, and exchanges is
> not immediately clear to me. For example, can this helper be renamed
> in terms of its relationship to a task? A theory of operation document
> would help, but it seems there is also room for the implementation to
> be more self documenting.

Ok yea.  The difference between exchange and task was rather vague when I
looked at this with fresher eyes...

I've merged pci_doe_exchange and pci_doe_task into pci_doe_task and eliminated
pci_doe_exchange_sync() in favor of a pci_doe_submit_task() call per Dan's
suggestion above.
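
A sketch of what a merged task could look like, together with the header
framing done by the send path quoted below.  This is a user-space
approximation, not the structure from the next version: the demo_ names are
hypothetical, and the field masks are written out by hand on the assumption
that they match the DOE data object header layout (VID in bits 15:0, type in
bits 23:16, length in bits 17:0).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed header field masks, spelled out for the example. */
#define DEMO_HDR1_VID    0x0000ffffu
#define DEMO_HDR1_TYPE   0x00ff0000u
#define DEMO_HDR2_LENGTH 0x0003ffffu

/* Hypothetical merged task: exchange fields folded directly into the task. */
struct demo_doe_task {
	uint16_t vid;             /* protocol vendor ID */
	uint8_t type;             /* protocol type */
	const uint32_t *request_pl;
	size_t request_pl_sz;     /* bytes, must be a whole number of dwords */
	uint32_t *response_pl;
	size_t response_pl_sz;    /* bytes */
	int rv;                   /* response length or error */
	void (*end_task)(struct demo_doe_task *task);
	void *private;
};

/*
 * Build the two header dwords for a request.  Returns 0 on success or -1
 * if the payload is not a whole number of dwords.
 */
static int demo_build_header(const struct demo_doe_task *task, uint32_t hdr[2])
{
	if (task->request_pl_sz % sizeof(uint32_t))
		return -1;

	hdr[0] = ((uint32_t)task->vid & DEMO_HDR1_VID) |
		 (((uint32_t)task->type << 16) & DEMO_HDR1_TYPE);
	/* Length is 2 dwords of header plus the payload length in dwords. */
	hdr[1] = (uint32_t)(2 + task->request_pl_sz / sizeof(uint32_t)) &
		 DEMO_HDR2_LENGTH;
	return 0;
}
```

With everything in one structure a submit call only needs the mailbox and the
task, and the completion path has the buffers, the result and the callback in
one place.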

> 
> > +{
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       int offset = doe->doe_dev->cap_offset;
> > +       u32 val;
> > +       int i;
> > +
> > +       /*
> > +        * Check the DOE busy bit is not set. If it is set, this could indicate
> > +        * someone other than Linux (e.g. firmware) is using the mailbox. Note
> > +        * it is expected that firmware and OS will negotiate access rights via
> > +        * an, as yet to be defined method.
> > +        */
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +       if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
> > +               return -EBUSY;
> > +
> > +       if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> > +               return -EIO;
> > +
> > +       /* Write DOE Header */
> > +       val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, ex->prot.vid) |
> > +               FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, ex->prot.type);
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val);
> > +       /* Length is 2 DW of header + length of payload in DW */
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> > +                              FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH,
> > +                                         2 + ex->request_pl_sz /
> > +                                               sizeof(u32)));
> > +       for (i = 0; i < ex->request_pl_sz / sizeof(u32); i++)
> > +               pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> > +                                      ex->request_pl[i]);
> > +
> > +       val = PCI_DOE_CTRL_GO;
> > +       if (doe->irq)
> > +               val |= PCI_DOE_CTRL_INT_EN;
> > +
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> > +       /* Request is sent - now wait for poll or IRQ */
> > +       return 0;
> > +}
> > +
> > +static int pci_doe_recv_resp(struct pci_doe *doe, struct pci_doe_exchange *ex)
> > +{
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       int offset = doe->doe_dev->cap_offset;
> > +       size_t length;
> > +       u32 val;
> > +       int i;
> > +
> > +       /* Read the first dword to get the protocol */
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +       if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != ex->prot.vid) ||
> > +           (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != ex->prot.type)) {
> > +               pci_err(pdev,
> > +                       "Expected [VID, Protocol] = [%#x, %#x], got [%#x, %#x]\n",
> > +                       ex->prot.vid, ex->prot.type,
> > +                       FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
> > +                       FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
> > +               return -EIO;
> > +       }
> > +
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +       /* Read the second dword to get the length */
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +
> > +       length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> > +       if (length > SZ_1M || length < 2)
> > +               return -EIO;
> > +
> > +       /* First 2 dwords have already been read */
> > +       length -= 2;
> > +       /* Read the rest of the response payload */
> > +       for (i = 0; i < min(length, ex->response_pl_sz / sizeof(u32)); i++) {
> > +               pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > +                                     &ex->response_pl[i]);
> > +               pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +       }
> > +
> > +       /* Flush excess length */
> > +       for (; i < length; i++) {
> > +               pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +               pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +       }
> > +       /* Final error check to pick up on any since Data Object Ready */
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +       if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> > +               return -EIO;
> > +
> > +       return min(length, ex->response_pl_sz / sizeof(u32)) * sizeof(u32);
> > +}
> > +
> > +static void doe_statemachine_work(struct work_struct *work)
> > +{
> > +       struct delayed_work *w = to_delayed_work(work);
> > +       struct pci_doe *doe = container_of(w, struct pci_doe, statemachine);
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       int offset = doe->doe_dev->cap_offset;
> > +       struct pci_doe_task *task;
> > +       bool abort;
> > +       u32 val;
> > +       int rc;
> > +
> > +       mutex_lock(&doe->state_lock);
> > +       task = doe->cur_task;
> > +       abort = doe->abort;
> > +       doe->abort = false;
> > +       mutex_unlock(&doe->state_lock);
> > +
> > +       if (abort) {
> > +               /*
> > +                * Currently only used during init - care needed if
> > +                * pci_doe_abort() is generally exposed as it would impact
> > +                * queries in flight.
> > +                */
> > +               WARN_ON(task);
> 
> Why is it worth potentially crashing the kernel here? Is this purely a
> situation that will only happen during development  and refactoring of
> the driver? Otherwise I would expect handling the error without WARN.

I think currently this is only something which may happen during development.
But I think it would be worth throwing a pr_err() and we need to complete the
task when the state machine finally resets.

As I work on this I'm seeing a number of ways to simplify the state machine.
If Jonathan does not mind I'm going to send a series of small patches to the
state machine directly to him for review before I squash them into this patch.

I think that will make it easier to review the changes I'm making and hopefully
verify correctness.

[snip]

> > +
> > +       case DOE_WAIT_ABORT:
> > +       case DOE_WAIT_ABORT_ON_ERR:
> > +               pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +               if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> > +                   !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> > +                       /* Back to normal state - carry on */
> > +                       mutex_lock(&doe->state_lock);
> > +                       doe->cur_task = NULL;
> > +                       mutex_unlock(&doe->state_lock);
> > +                       wake_up_interruptible(&doe->wq);
> > +
> > +                       /*
> > +                        * For deliberately triggered abort, someone is
> > +                        * waiting.
> > +                        */
> > +                       if (doe->state == DOE_WAIT_ABORT)
> > +                               complete(&doe->abort_c);
> 
> Why is a completion and waitqueue needed? I.e. a waiter could simply
> look for an abort completion flag to be set instead.

I agree with Dan here regarding the reduction of machinery but it seems that a
call to abort could come from someone not waiting on the current task?  Is that
true?

This is why there is a separate completion AFAICS.  But I don't follow the use
case for that?

What is currently done is for the DOE mailbox to signal an abort without a task
running.  Therefore there is nothing waiting on the wait queue.

We could flag a special task called 'abort' but this will not support a thread
calling abort separate from the task queue.  Essentially there would not be a
way to abort a queue of waiters.  But again I don't see that use case
currently.

[snip]

> > +
> > +/**
> > + * pci_doe_exchange_sync() - Send a request, then wait for and receive a
> > + *                          response
> > + * @doe_dev: DOE mailbox state structure
> > + * @ex: Description of the buffers and Vendor ID + type used in this
> > + *      request/response pair
> > + *
> > + * Excess data will be discarded.
> > + *
> > + * RETURNS: payload in bytes on success, < 0 on error
> > + */
> > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev,
> > +                         struct pci_doe_exchange *ex)
> > +{
> > +       struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev);
> > +       struct pci_doe_task task;
> > +       DECLARE_COMPLETION_ONSTACK(c);
> > +
> > +       if (!doe)
> > +               return -EAGAIN;
> > +
> > +       /* DOE requests must be a whole number of DW */
> > +       if (ex->request_pl_sz % sizeof(u32))
> > +               return -EINVAL;
> > +
> > +       task.ex = ex;
> > +       task.cb = pci_doe_task_complete;
> > +       task.private = &c;
> > +
> > +again:
> > +       mutex_lock(&doe->state_lock);
> > +       if (doe->cur_task) {
> > +               mutex_unlock(&doe->state_lock);
> > +               wait_event_interruptible(doe->wq, doe->cur_task == NULL);
> > +               goto again;
> > +       }
> > +
> > +       if (doe->dead) {
> > +               mutex_unlock(&doe->state_lock);
> > +               return -EIO;
> > +       }
> > +       doe->cur_task = &task;
> > +       schedule_delayed_work(&doe->statemachine, 0);
> > +       mutex_unlock(&doe->state_lock);
> > +
> > +       wait_for_completion(&c);
> 
> I would expect that the caller of this routine would want to specify
> the task and end_task() callback and use that as the completion
> signal. It may also want "no wait" behavior where it is prepared for
> the DOE result to come back sometime later. With that change the
> exchange fields can move into the task directly.

Yep done.

[snip]

> > +
> > +static int pci_doe_abort(struct pci_doe *doe)
> > +{
> > +       reinit_completion(&doe->abort_c);
> > +       mutex_lock(&doe->state_lock);
> > +       doe->abort = true;
> 
> Why not a flags field where atomic bitops can be used without need for a mutex.

At first I agreed with Jonathan however, after some refactoring I think the
atomics would be nice.

> 
> > +       mutex_unlock(&doe->state_lock);
> > +       schedule_delayed_work(&doe->statemachine, 0);
> > +       wait_for_completion(&doe->abort_c);
> > +
> > +       if (doe->dead)
> 
> dead could also be another atomic flag.

Yep.

I've also changed state_lock to be task_lock as that clarifies what is being
protected.
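
To illustrate the flags idea, here is a small user-space analogue using C11
atomics.  The kernel code would presumably use set_bit() / test_bit() /
test_and_clear_bit() on an unsigned long instead; the demo_ names and flag
values below are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits replacing the separate abort and dead booleans. */
enum {
	DEMO_FLAG_ABORT = 1u << 0,
	DEMO_FLAG_DEAD  = 1u << 1,
};

struct demo_doe {
	atomic_uint flags;
};

/* Set a flag without taking any lock. */
static void demo_set_flag(struct demo_doe *doe, unsigned int flag)
{
	atomic_fetch_or(&doe->flags, flag);
}

/* Check a flag without modifying it. */
static bool demo_test_flag(struct demo_doe *doe, unsigned int flag)
{
	return atomic_load(&doe->flags) & flag;
}

/*
 * Atomically clear a flag and report whether it was set.  This replaces the
 * lock / read abort / clear abort / unlock sequence in the work function.
 */
static bool demo_test_and_clear_flag(struct demo_doe *doe, unsigned int flag)
{
	return atomic_fetch_and(&doe->flags, ~flag) & flag;
}
```

The lock then only has to cover the current task pointer, which is what the
rename to task_lock is meant to convey.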

[snip]

> > +static int pci_doe_probe(struct auxiliary_device *aux_dev,
> > +                        const struct auxiliary_device_id *id)
> > +{
> > +       struct pci_doe_dev *doe_dev = container_of(aux_dev,
> > +                                       struct pci_doe_dev,
> > +                                       adev);
> > +       struct pci_doe *doe;
> > +       int rc;
> > +
> > +       doe = devm_kzalloc(&aux_dev->dev, sizeof(*doe), GFP_KERNEL);
> > +       if (!doe)
> > +               return -ENOMEM;
> > +
> > +       mutex_init(&doe->state_lock);
> > +       init_completion(&doe->abort_c);
> > +       doe->doe_dev = doe_dev;
> > +       init_waitqueue_head(&doe->wq);
> > +       INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work);
> > +       dev_set_drvdata(&aux_dev->dev, doe);
> > +
> > +       rc = pci_doe_reg_irq(doe);
> > +       if (rc)
> > +               return rc;
> > +
> > +       /* Reset the mailbox by issuing an abort */
> > +       rc = pci_doe_abort(doe);
> > +       if (rc)
> > +               return rc;
> > +
> > +       rc = pci_doe_cache_protocols(doe);
> > +       if (rc)
> > +               return rc;
> 
> This can just be:
> 
>  return pci_doe_cache_protocols(doe);

Already suggested by Bjorn and done.

> 
> > +
> > +       return 0;
> > +}
> > +
> > +static void pci_doe_remove(struct auxiliary_device *aux_dev)
> > +{
> > +       struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev);
> > +
> > +       /* First halt the state machine */
> > +       cancel_delayed_work_sync(&doe->statemachine);
> > +}
> > +
> > +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = {
> > +       {},
> > +};
> > +
> > +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table);
> 
> Why is this empty table here?

Filling the id table was done in the next patch.

The split of the patches may have been a bit arbitrary here.  This patch was
focused on the state machine and probing of the mailboxes.  The next patch
provided the helper function to create all the DOE devices for a given
PCI device; pci_doe_create_doe_devices()

> 
> > +
> > +struct auxiliary_driver pci_doe_auxiliary_drv = {
> > +       .name = "pci_doe",
> > +       .id_table = pci_doe_auxiliary_id_table,
> > +       .probe = pci_doe_probe,
> > +       .remove = pci_doe_remove
> > +};
> 
> I expect that these helpers would be provided by the PCI core, but
> then a subsystem like CXL would have code to register their auxiliary
> devices and drivers that mostly just wrap the PCI core DOE
> implementation.

Ah ok, I think I see what you are saying.  That is not quite as straight
forward a use of the auxiliary bus but I _think_ it will work.  I'll also
attempt to clarify with documentation how the above probe/remove functions are
to be used by those defining their own drivers.

Ira

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver
  2022-03-16 22:50     ` Ira Weiny
@ 2022-03-17 19:37       ` Ira Weiny
  0 siblings, 0 replies; 49+ messages in thread
From: Ira Weiny @ 2022-03-17 19:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jonathan Cameron, Bjorn Helgaas, Alison Schofield, Vishal Verma,
	Ben Widawsky, Linux Kernel Mailing List, linux-cxl, Linux PCI

On Wed, Mar 16, 2022 at 03:50:55PM -0700, Ira Weiny wrote:
> On Tue, Feb 08, 2022 at 04:59:39PM -0800, Dan Williams wrote:
> > On Mon, Jan 31, 2022 at 11:20 PM <ira.weiny@intel.com> wrote:

[snip]

> > 
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static void pci_doe_remove(struct auxiliary_device *aux_dev)
> > > +{
> > > +       struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev);
> > > +
> > > +       /* First halt the state machine */
> > > +       cancel_delayed_work_sync(&doe->statemachine);
> > > +}
> > > +
> > > +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = {
> > > +       {},
> > > +};
> > > +
> > > +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table);
> > 
> > Why is this empty table here?
> 
> Filling the id table was done in the next patch.
> 
> The split of the patches may have been a bit arbitrary here.  This patch was
> focused on the state machine and probing of the mailboxes.  The next patch
> provided the helper function to create all the DOE devices for a given
> PCI device; pci_doe_create_doe_devices()
> 
> > 
> > > +
> > > +struct auxiliary_driver pci_doe_auxiliary_drv = {
> > > +       .name = "pci_doe",
> > > +       .id_table = pci_doe_auxiliary_id_table,
> > > +       .probe = pci_doe_probe,
> > > +       .remove = pci_doe_remove
> > > +};
> > 
> > I expect that these helpers would be provided by the PCI core, but
> > then a subsystem like CXL would have code to register their auxiliary
> > devices and drivers that mostly just wrap the PCI core DOE
> > implementation.
> 
> Ah ok, I think I see what you are saying.  That is not quite as straight
> forward a use of the auxiliary bus but I _think_ it will work.  I'll also
> attempt to clarify with documentation how the above probe/remove functions are
> to be used by those defining their own drivers.

Ok looking at this again today I see why I did things the way I did.

The question is:

Is the DOE driver a PCI driver or a driver defined by the subsystems?

The way I have it now the PCI core defines the driver and a couple of very
small helper functions for the subsystems to use.

What I think you are proposing is the PCI core supplies the helper functions to
drive the protocol but the actual driver is defined as part of the subsystem?
Is that correct?

The implications are subtle but one thing about the way I have things is that
subsystems don't really need to learn about auxiliary bus driver stuff.

OTOH pushing the auxiliary bus code into the subsystem allows for a bit more
flexibility around the use of the DOE protocol code within the PCI core.

I'll keep looking.

Ira

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH V6 04/10] PCI/DOE: Introduce pci_doe_create_doe_devices
  2022-02-03 22:44   ` Bjorn Helgaas
  2022-02-04 14:51     ` Jonathan Cameron
@ 2022-03-24  0:26     ` Ira Weiny
  2022-03-24 14:05       ` Jonathan Cameron
  1 sibling, 1 reply; 49+ messages in thread
From: Ira Weiny @ 2022-03-24  0:26 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Dan Williams, Jonathan Cameron, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, Feb 03, 2022 at 04:44:37PM -0600, Bjorn Helgaas wrote:
> On Mon, Jan 31, 2022 at 11:19:46PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > CXL and/or PCI devices can define DOE mailboxes.  
> 
> In concrete terms, "DOE mailbox" refers to a DOE Capability, right?

Right.

> PCIe devices are allowed to implement several instances of the DOE
> Capability, of course.  I'm kind of partial to concreteness because it
> makes it easier to map between the code and the spec.

I agree.  I'm just not great at remembering the terminology.  I've fixed it up
in this next go round.

> 
> > Normally the kernel will want to maintain control of all of these
> > mailboxes.  However, under a limited number of use cases users may
> > want to allow user space access to some of these mailboxes while the
> > kernel retains control of the rest.
> 
> Is there something in this patch related to user-space vs kernel
> control of things?  To me this patch looks like "for every DOE
> Capability on a device, create an auxiliary device and try to attach
> an auxiliary device driver to it."

That was the way this series worked, yes.

> 
> If part of creating the auxiliary devices is adding things in sysfs, I
> think it would be useful to mention that here.

Nothing was added in sysfs for this series and nothing is planned in the next
series.

That said Dan and I discussed internally and have settled on the PCI layer
being agnostic to the aux bus idea.

The new PCI layer code I have simply provides helpers to create struct
pci_doe_mb (mailbox objects) which control the state machine and accept
pci_doe_tasks as work requests.

The idea of using auxiliary bus devices is now used in the CXL layer as a way
to allow individual mailboxes to be controlled via user space directly by
unlinking that auxiliary device.

I have also added a modified version of Dan's patch from here:

https://lore.kernel.org/all/161663543465.1867664.5674061943008380442.stgit@dwillia2-desk3.amr.corp.intel.com/

The new version taints the kernel if a write occurs but allows any read.

I'm curious if allowing reads is really ok though.  Because allowing a read at
just the right time could allow snooping of DOE protocol data.  Could this be a
potential security issue with protocols like IDE?  It would be a difficult
attack for sure but...  :-/

> 
> > An example of this is for CXL Compliance Testing (see CXL 2.0
> > 14.16.4 Compliance Mode DOE) which offers a mechanism to set
> > different test modes for a device.
> 
> Not sure exactly what this contributes here.  I guess you're saying
> you might want user-space access to this, but I don't see anything in
> this patch related to that.

The detail in this patch was that if user space unlinked a specific DOE device
then user space could control that device directly and without interference
from this kernel code.  As I stated above we will use this mechanism for CXL
but other subsystems could decide to do something else and own each DOE MB
capability directly.

> 
> > Rather than re-invent the wheel the architecture creates auxiliary
> > devices for each DOE mailbox which can then be driven by a generic
> > DOE mailbox driver.  If access to an individual mailbox is required
> > by user space the driver for that mailbox can be unloaded and access
> > handed to user space.
> 
> IIUC a device can have several DOE Capabilities, and each Capability
> can support several protocols.  So I would think the granularity might
> be "protocol" rather than "mailbox" (DOE Capability).

We debated that and decided that was too fine a granularity.

> 
> But either way this text seems like it would go with a different patch
> since this patch has nothing to specify a particular protocol or even
> a particular mailbox/DOE Capability.

Again, this was just more justification for the aux bus architecture.

> 
> > Create the helper pci_doe_create_doe_devices() which iterates each DOE
> > mailbox found in the device and creates a DOE auxiliary device on the
> > auxiliary bus.  While doing so ensure that the auxiliary DOE driver
> > loads to drive that device.
> 
> Here's a case where "iterating over DOE mailboxes found in the device"
> is slightly abstract.  The code obviously iterates over DOE
> *Capabilities* (PCI_EXT_CAP_ID_DOE), and that's something I can easily
> find in the spec.

I've clarified that thanks.  Also in the new version I've created an iterator
to find the capabilities.

	pci_doe_for_each_off()
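
For readers who have not seen the new version, a user-space sketch of what
such an iterator might look like.  This is an assumption about its shape, not
the actual macro: config space is modelled as an array, the demo_ names are
hypothetical, and the walk follows the usual extended capability header layout
(ID in bits 15:0, next offset in bits 31:20, list starting at 0x100).

```c
#include <stdint.h>

#define DEMO_CFG_SPACE_SIZE 4096
#define DEMO_EXT_CAP_START  0x100
#define DEMO_EXT_CAP_ID_DOE 0x2e

/* Simulated extended config space, one entry per dword. */
struct demo_cfg {
	uint32_t dw[DEMO_CFG_SPACE_SIZE / 4];
};

/*
 * Return the offset of the next capability with cap_id after start, or 0.
 * start == 0 begins at the head of the extended capability list.
 */
static uint16_t demo_find_next_ext_cap(const struct demo_cfg *cfg,
				       uint16_t start, uint16_t cap_id)
{
	/* Bound the walk in case the list loops. */
	int ttl = (DEMO_CFG_SPACE_SIZE - DEMO_EXT_CAP_START) / 8;
	uint16_t pos = DEMO_EXT_CAP_START;

	if (start)
		pos = (cfg->dw[start / 4] >> 20) & 0xffc;

	while (ttl-- > 0 && pos >= DEMO_EXT_CAP_START) {
		uint32_t hdr = cfg->dw[pos / 4];

		if (hdr == 0) /* an all-zero header ends the list */
			break;
		if ((hdr & 0xffff) == cap_id)
			return pos;
		pos = (hdr >> 20) & 0xffc;
	}
	return 0;
}

/* Visit the offset of every DOE capability found in cfg. */
#define demo_doe_for_each_off(cfg, off)                                     \
	for ((off) = demo_find_next_ext_cap((cfg), 0, DEMO_EXT_CAP_ID_DOE); \
	     (off);                                                         \
	     (off) = demo_find_next_ext_cap((cfg), (off), DEMO_EXT_CAP_ID_DOE))
```

Counting mailboxes is then a loop whose body is a single increment, and the
same loop can be reused later to create one mailbox object per offset.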

> 
> Knowing that this is a PCIe Capability is useful because it puts it in
> the context of other capabilities ("optional things that live in
> config space") and the mechanisms for synchronization and user-space
> access.

Yes thanks.

> 
> > +/**
> > + * pci_doe_create_doe_devices - Create auxiliary DOE devices for all DOE
> > + *                              mailboxes found
> > + * @pci_dev: The PCI device to scan for DOE mailboxes
> > + *
> > + * There is no coresponding destroy of these devices.  This function associates
> > + * the DOE auxiliary devices created with the pci_dev passed in.  That
> > + * association is device managed (devm_*) such that the DOE auxiliary device
> > + * lifetime is always greater than or equal to the lifetime of the pci_dev.
> 
> This seems backwards.  What does it mean if the DOE aux dev lifetime
> is *greater* than that of the pci_dev?  Surely you can't access a PCI
> DOE Capability if the pci_dev is gone?

No you could not.  Thus the idea that the pci_dev's lifetime was greater than
the lifetime of the auxiliary devices.

Regardless this has all changed away from being part of the core and more tied
to the management of particular devices.  Which I think is much more clear.

> 
> > + * RETURNS: 0 on success -ERRNO on failure.
> > + */
> > +int pci_doe_create_doe_devices(struct pci_dev *pdev)
> > +{
> > +	struct device *dev = &pdev->dev;
> > +	int irqs, rc;
> > +	u16 pos = 0;
> > +
> > +	/*
> > +	 * An implementation may support an unknown number of interrupts.
> > +	 * Assume that number is not that large and request them all.
> 
> This doesn't really inspire confidence :)  Playing devil's advocate,
> since pdev is an arbitrary device, I would assume the number *is*
> large.

I've moved the call to pci_alloc_irq_vectors() to the CXL code which is
managing the pci_dev itself (rather than being buried in this auxiliary device
stuff.)

> 
> > +	irqs = pci_msix_vec_count(pdev);
> > +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);
> 
> pci_msix_vec_count() is apparently sort of discouraged; see
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/PCI/msi-howto.rst?id=v5.16#n179

I've removed pci_msix_vec_count() in favor of counting the DOE capabilities and
allocating the vectors based on that count.

> 
> A DOE Capability may be implemented by any device, e.g., a NIC or
> storage HBA, etc.  I'm a little queasy about IRQ alloc happening both
> here and in the driver for the device's primary functionality.  Can
> you reassure me that this is actually OK and safe?

I think it was perfectly safe for this implementation but it was probably not
a good idea generally.

> 
> Sorry if I've asked this before.  If I have, perhaps a comment would
> be useful.
> 
> > +	if (rc != irqs) {
> > +		/* No interrupt available - carry on */
> > +		pci_dbg(pdev, "No interrupts available for DOE\n");
> > +	} else {
> > +		/*
> > +		 * Enabling bus mastering is require for MSI/MSIx.  It could be
> 
> s/require/required/
> s/MSIx/MSI-X/ to match spec usage.
> 
> But I think you only support MSI-X, since you passed "PCI_IRQ_MSIX", not
> "PCI_IRQ_MSI | PCI_IRQ_MSIX" above?

Done.

> 
> > +		 * done later within the DOE initialization, but as it
> > +		 * potentially has other impacts keep it here when setting up
> > +		 * the IRQ's.
> 
> s/IRQ's/IRQs/
> 
> "Potentially has other impacts" is too vague, and this doesn't explain
> why bus mastering should be enabled here rather than later.  The
> device should not issue an MSI-X until DOE Interrupt Enable is set, so
> near there seems like a logical place.

Is it safe to call pci_set_master() more than once?

The reason I am asking is because I've debated between having the new create
mailbox command [pci_doe_create_mb()] request the irq or not.

The issue is the irq handler is part of the DOE state machine and so
pci_request_irq() needs to pass that handler.  I would rather not make that a
globally visible function.  Nor do I think it is appropriate for the DOE state
machine to trust callers setting the correct handler.

So currently the pci_alloc_irq_vectors() is the responsibility of the consumer
(CXL layer in this series) and pci_{request,free}_irq() is handled in the PCI
layer.

But placing pci_set_master() near the DOE Interrupt Enable would then cause
pci_set_master() to be called for each mailbox create.

For now I have left the pci_set_master() call next to pci_alloc_irq_vectors()
in the CXL layer.  As in Jonathan's original code it gets called if the
allocation gets enough vectors for all mailboxes found.  And the use of irq's
is all or nothing for each CXL device.

Here is the code to be more clear...


drivers/cxl/pci.c:

int cxl_pci_create_doe_devices(struct pci_dev *pdev)
{               
        struct device *dev = &pdev->dev;
        bool use_irq = true;
        int irqs = 0;
        u16 off = 0;         
        int rc;
        
        pci_doe_for_each_off(pdev, off)
                irqs++;
        pci_info(pdev, "Found %d DOE mailboxes\n", irqs);

        /* Allocate enough vectors for the DOEs */
        rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSI |
                                                     PCI_IRQ_MSIX);
        if (rc != irqs) {
                pci_err(pdev, "Not enough interrupts for all the DOEs; use polling\n");
                use_irq = false;
                /* Some got allocated; clean them up */
                if (rc > 0)
                        cxl_pci_free_irq_vectors(pdev); 
        } else {
                /*
                 * Enabling bus mastering is required for MSI/MSI-X.  It could
                 * be done later within the DOE initialization, but as it
                 * potentially has other impacts keep it here when setting up
                 * the IRQs.
                 */
                pci_set_master(pdev);
                rc = devm_add_action_or_reset(dev,
                                              cxl_pci_free_irq_vectors,
                                              pdev);
                if (rc)
                        return rc;
        }

        pci_doe_for_each_off(pdev, off) {
...
		/* Create each auxiliary device which internally calls */
		pci_doe_create_mb(pdev, off, use_irq);
...
	}
...
}


drivers/pci/pci-doe.c:

static irqreturn_t pci_doe_irq_handler(int irq, void *data)
{
...
}

static int pci_doe_request_irq(struct pci_doe_mb *doe_mb)
{
        struct pci_dev *pdev = doe_mb->pdev;
        int offset = doe_mb->cap_offset;
        int doe_irq, rc;
        u32 val;

        pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);

        if (!FIELD_GET(PCI_DOE_CAP_INT, val))
                return -ENOTSUPP;

        doe_irq = FIELD_GET(PCI_DOE_CAP_IRQ, val);
        rc = pci_request_irq(pdev, doe_irq, pci_doe_irq_handler,
                             NULL, doe_mb,
                             "DOE[%d:%s]", doe_irq, pci_name(pdev));
        if (rc) 
                return rc;

        doe_mb->irq = doe_irq;
        pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
                               PCI_DOE_CTRL_INT_EN);
        return 0;
}

struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
                                     bool use_irq)
{
...
        if (use_irq) {
                rc = pci_doe_request_irq(doe_mb);
                if (rc) 
                        pci_err(pdev, "DOE request irq failed for mailbox @ %u : %d\n",
                                cap_offset, rc);
        }
...
}
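
To make the capability decode in pci_doe_request_irq() concrete, here is a
userspace sketch of the same logic.  The mask values (bit 0 for interrupt
support, bits 11:1 for the interrupt message number) are my reading of the DOE
Capabilities register and should be treated as assumptions, as are the model_*
names.  model_field_get() is only a stand-in for the kernel's FIELD_GET().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the DOE Capabilities register, for illustration only. */
#define MODEL_DOE_CAP_INT 0x00000001u	/* interrupt support, bit 0 */
#define MODEL_DOE_CAP_IRQ 0x00000ffeu	/* interrupt message number, bits 11:1 */

/*
 * Userspace stand-in for FIELD_GET(): mask the value, then divide by the
 * lowest set bit of the mask, which is equivalent to shifting the field down.
 */
static uint32_t model_field_get(uint32_t mask, uint32_t val)
{
	assert(mask != 0);
	return (val & mask) / (mask & (~mask + 1u));
}

/* Returns the vector index to request, or -1 if interrupts are unsupported. */
static int model_doe_irq(uint32_t cap)
{
	if (!model_field_get(MODEL_DOE_CAP_INT, cap))
		return -1;

	return (int)model_field_get(MODEL_DOE_CAP_IRQ, cap);
}
```

For example, a capability value of 0x7 decodes to vector index 3, while 0x6
(interrupt support bit clear) yields -1, in which case the mailbox would
presumably be left to polling.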


Does this look reasonable?

> 
> > +		 */
> > +		pci_set_master(pdev);
> > +		rc = devm_add_action_or_reset(dev,
> > +					      pci_doe_free_irq_vectors,
> > +					      pdev);
> > +		if (rc)
> > +			return rc;
> > +	}
> 
> > +++ b/include/linux/pci-doe.h
> > @@ -13,6 +13,8 @@
> >  #ifndef LINUX_PCI_DOE_H
> >  #define LINUX_PCI_DOE_H
> >  
> > +#define DOE_DEV_NAME "doe"
> 
> This is only used once, above.  Why not just use the string there
> directly and skip the #define?  If it's needed elsewhere eventually,
> we can add a #define then.

This is now moved elsewhere in the series.

Thanks for the feedback and sorry it's taken so long to respond.
Ira

> 
> Bjorn


* Re: [PATCH V6 04/10] PCI/DOE: Introduce pci_doe_create_doe_devices
  2022-03-24  0:26     ` Ira Weiny
@ 2022-03-24 14:05       ` Jonathan Cameron
  2022-03-24 23:44         ` Ira Weiny
  0 siblings, 1 reply; 49+ messages in thread
From: Jonathan Cameron @ 2022-03-24 14:05 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Bjorn Helgaas, Dan Williams, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, Ben Widawsky, linux-kernel, linux-cxl, linux-pci

Hi Ira,

> Does this look reasonable?

I'm a little nervous about how we are going to make DOEs on switches work.
Guess I'll do an experiment once your next version is out and check we
can do that reasonably cleanly.  For switches we'll probably have to
check for DOEs on all such ports and end up with infrastructure to
map to all protocols we might see on a switch.

Jonathan

> 


* Re: [PATCH V6 04/10] PCI/DOE: Introduce pci_doe_create_doe_devices
  2022-03-24 14:05       ` Jonathan Cameron
@ 2022-03-24 23:44         ` Ira Weiny
  2022-03-25 12:02           ` Jonathan Cameron
  0 siblings, 1 reply; 49+ messages in thread
From: Ira Weiny @ 2022-03-24 23:44 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Bjorn Helgaas, Dan Williams, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, Mar 24, 2022 at 02:05:39PM +0000, Jonathan Cameron wrote:
> Hi Ira,
> 
> > Does this look reasonable?
> 
> I'm a little nervous about how we are going to make DOEs on switches work.
> Guess I'll do an experiment once your next version is out and check we
> can do that reasonably cleanly.  For switches we'll probably have to
> check for DOEs on all such ports and end up with infrastructure to
> map to all protocols we might see on a switch.

Are the switches not represented as PCI devices in Linux?

If my understanding of switches is correct I think that problem is independent of what
I'm solving here.  In other words the relationship between a port on a switch
and a DOE capability on that switch will have to be established somehow and
nothing I'm doing precludes doing that, but at the same time nothing I'm doing
helps that either.

Ira

> 
> Jonathan
> 
> > 


* Re: [PATCH V6 04/10] PCI/DOE: Introduce pci_doe_create_doe_devices
  2022-03-24 23:44         ` Ira Weiny
@ 2022-03-25 12:02           ` Jonathan Cameron
  0 siblings, 0 replies; 49+ messages in thread
From: Jonathan Cameron @ 2022-03-25 12:02 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Bjorn Helgaas, Dan Williams, Bjorn Helgaas, Alison Schofield,
	Vishal Verma, Ben Widawsky, linux-kernel, linux-cxl, linux-pci

On Thu, 24 Mar 2022 16:44:33 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> On Thu, Mar 24, 2022 at 02:05:39PM +0000, Jonathan Cameron wrote:
> > Hi Ira,
> >   
> > > Does this look reasonable?  
> > 
> > I'm a little nervous about how we are going to make DOEs on switches work.
> > Guess I'll do an experiment once your next version is out and check we
> > can do that reasonably cleanly.  For switches we'll probably have to
> > check for DOEs on all such ports and end up with infrastructure to
> > map to all protocols we might see on a switch.  
> 
> Are the switches not represented as PCI devices in linux?
> 
> If my vision of switches is correct I think that problem is independent of what
> I'm solving here.  In other words the relationship between a port on a switch
> and a DOE capability on that switch will have to be established somehow and
> nothing I'm doing precludes doing that, but at the same time nothing I'm doing
> helps that either.

Sure, I'm just expressing nervousness and would want a PoC of that at least
to check it's not too nasty.  The port drivers are rather 'unusual' in PCI
so touching them always ends up more complex than I expect.

Anyhow, start of cycle so should be plenty of time to do such an RFC
once your code is out there.

Jonathan

> 
> Ira
> 
> > 
> > Jonathan
> >   
> > >   



end of thread, other threads:[~2022-03-25 12:03 UTC | newest]

Thread overview: 49+ messages
2022-02-01  7:19 [PATCH V6 00/10] CXL: Read CDAT and DSMAS data from the device ira.weiny
2022-02-01  7:19 ` [PATCH V6 01/10] PCI: Add vendor ID for the PCI SIG ira.weiny
2022-02-03 17:11   ` Bjorn Helgaas
2022-02-03 20:28     ` Ira Weiny
2022-02-01  7:19 ` [PATCH V6 02/10] PCI: Replace magic constant for PCI Sig Vendor ID ira.weiny
2022-02-04 21:16   ` Dan Williams
2022-02-04 21:49   ` Bjorn Helgaas
2022-03-15 21:48     ` Ira Weiny
2022-02-01  7:19 ` [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver ira.weiny
2022-02-03 22:40   ` Bjorn Helgaas
2022-03-15 21:48     ` Ira Weiny
2022-02-09  0:59   ` Dan Williams
2022-02-09 10:13     ` Jonathan Cameron
2022-02-09 16:26       ` Dan Williams
2022-02-09 16:57         ` Jonathan Cameron
2022-02-09 19:57           ` Dan Williams
2022-02-10 21:51             ` Ira Weiny
2022-03-16 22:50     ` Ira Weiny
2022-03-17 19:37       ` Ira Weiny
2022-02-01  7:19 ` [PATCH V6 04/10] PCI/DOE: Introduce pci_doe_create_doe_devices ira.weiny
2022-02-03 22:44   ` Bjorn Helgaas
2022-02-04 14:51     ` Jonathan Cameron
2022-02-04 16:27       ` Bjorn Helgaas
2022-02-11  2:54         ` Dan Williams
2022-03-24  0:26     ` Ira Weiny
2022-03-24 14:05       ` Jonathan Cameron
2022-03-24 23:44         ` Ira Weiny
2022-03-25 12:02           ` Jonathan Cameron
2022-02-01  7:19 ` [PATCH V6 05/10] cxl/pci: Create DOE auxiliary devices ira.weiny
2022-02-01  7:19 ` [PATCH V6 06/10] cxl/pci: Find the DOE mailbox which supports CDAT ira.weiny
2022-02-01 18:49   ` Ben Widawsky
2022-02-01 22:18     ` Ira Weiny
2022-02-04 14:04       ` Jonathan Cameron
2022-02-01  7:19 ` [PATCH V6 07/10] cxl/mem: Read CDAT table ira.weiny
2022-02-04 13:46   ` Jonathan Cameron
2022-02-01  7:19 ` [PATCH V6 08/10] cxl/cdat: Introduce cdat_hdr_valid() ira.weiny
2022-02-01 18:56   ` Ben Widawsky
2022-02-01 22:29     ` Ira Weiny
2022-02-04 13:17       ` Jonathan Cameron
2022-02-01  7:19 ` [PATCH V6 09/10] cxl/mem: Retry reading CDAT on failure ira.weiny
2022-02-01 18:59   ` Ben Widawsky
2022-02-01 22:31     ` Ira Weiny
2022-02-04 13:20       ` Jonathan Cameron
2022-02-01  7:19 ` [PATCH V6 10/10] cxl/cdat: Parse out DSMAS data from CDAT table ira.weiny
2022-02-01 19:05   ` Ben Widawsky
2022-02-01 22:37     ` Ira Weiny
2022-02-04 13:33       ` Jonathan Cameron
2022-02-04 13:41       ` Jonathan Cameron
2022-02-04 13:40   ` Jonathan Cameron
