dmaengine Archive on lore.kernel.org
* [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator
@ 2019-11-20 21:23 Dave Jiang
  2019-11-20 21:23 ` [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction Dave Jiang
                   ` (13 more replies)
  0 siblings, 14 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:23 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

The patch series breaks down into the following parts:
Patch 1: x86 arch / generic, add a new I/O accessor based on movdir64b
Patches 2,3,5-7,12: dmaengine subsystem additions
Patch 4: mm, moving common allocation code from blk-mq to mm
Patches 8-11,13,14: idxd driver

This patch series implements the first part of the driver for the Intel
Data Streaming Accelerator (DSA), the Intel Data Accelerator driver (idxd).
The Intel DSA replaces the Intel IOAT DMA engine from previous Xeon platforms
on a future processor platform. Many new features are implemented by Intel DSA.
1. Descriptors can be issued directly from kernel, user, and guest via new CPU
   instructions enqcmd, enqcmds, and movdir64b. The descriptor is written to
   an MMIO address, called a portal, in one of the device's PCI BARs.
   New CPU instruction details can be found in the latest Intel Software
   Developer's Manual. [1]
2. Shared workqueues allow multiple users to issue descriptors to the same
   workqueue.
3. Shared Virtual Memory (SVM) support allows using virtual addresses instead of
   requiring the pinned physical addresses that traditional DMA controllers require.
   This simplifies programming and makes it easier for user space to do DMA
   operations. Page faults can be recovered through PCI Address Translation
   Service (ATS) performed by the DMA device.
4. Supports scalable IOV (SIOV) to accelerate virtualization. [2]

The submission will happen in multiple stages depending on availability of
kernel support for Process Address Space ID (PASID), IOMMU, vIOMMU, and
Interrupt Message Storage (IMS).

Stage 1 (this series): idxd driver with only dedicated workqueue support.
	- No PASID support
	- No shared workqueue (requires PASID) support
	- With DMA engine plumbing
	- With char driver for user command portal export.
Stage 2: idxd driver with PASID support and shared workqueue support.
Stage 3: idxd driver with VFIO mediated device (mdev) and with IMS support.

The DSA device defines sub-components called workqueues, groups, and engines.
A group is an abstract container that can have 1 or more workqueues and 1 or
more engines. The number of groups, workqueues, and engines supported by the
device can be detected from the general capabilities register. The workqueues
are where descriptors are queued up before being processed by the engines.

The DSA device also has a memory BAR that contains multiple portals.
Depending on the offset from the BAR, various portals can be used to submit
descriptors with one of the CPU instructions mentioned above. The types of
portals are MSIX limited, MSIX unlimited, IMS limited, and IMS unlimited as
defined by the hardware spec. The MSIX unlimited portals are reserved for
kernel submissions. The limited portals can be exported to user space for
application usages. A limited portal is configured by the workqueue threshold
attribute and can be restricted to have a workqueue size that is smaller than
the actual workqueue size. This allows the kernel to submit command descriptors
to a workqueue and not be blocked by the user application.

There are two types of workqueues that the DSA device supports, dedicated and
shared. A dedicated workqueue receives descriptors via the movdir64b
instruction. This instruction is a posted write and therefore does not wait for
a completion. Because of this, the software must keep track of the number of
descriptors submitted to the workqueue. A full workqueue will drop the
descriptor without notice. A shared workqueue accepts the enqcmds instruction in
the kernel and the enqcmd instruction from user applications. The instruction
sets the zero flag to indicate whether the submission of the descriptor was
successful. The enqcmd(s) instruction is non-posted and waits for the write
completion before returning.
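
To illustrate the bookkeeping that a dedicated workqueue requires, here is a
minimal user-space sketch of an occupancy tracker. It is not code from this
series: struct dwq_tracker and the dwq_*() helpers are hypothetical names, and
the sketch is single-threaded, so a real driver would need atomics or locking
around the counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical software-side occupancy tracker for one dedicated workqueue. */
struct dwq_tracker {
	size_t size;      /* workqueue size as configured on the device */
	size_t in_flight; /* descriptors submitted but not yet completed */
};

static void dwq_init(struct dwq_tracker *t, size_t size)
{
	assert(t != NULL && size > 0);
	t->size = size;
	t->in_flight = 0;
}

/*
 * Reserve a slot before issuing the posted write. Returning false means the
 * queue is full and the descriptor must not be submitted, because the device
 * would drop it without notice.
 */
static bool dwq_try_reserve(struct dwq_tracker *t)
{
	if (t->in_flight >= t->size)
		return false;
	t->in_flight++;
	return true;
}

/* Release a slot once the completion for a descriptor has been observed. */
static void dwq_complete(struct dwq_tracker *t)
{
	assert(t->in_flight > 0);
	t->in_flight--;
}
```

A submitter would call dwq_try_reserve() before each movdir64b submission and
dwq_complete() from its completion handling path.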

Stage 1 of the patch submission provides a base driver that only supports
the dedicated workqueue type without PASID support. The supported source and
destination addresses must be physical. This is similar to traditional
DMA operations where the device receives a descriptor with physical source and
destination addresses for operation. Plumbing to the existing kernel dmaengine
subsystem is added in order to support such usages. A DMA memmove operation can
be tested with the in-kernel dmatest module.

A large part of the base driver is the sysfs component.  There is also
no requirement for DSA to be used during early kernel boot. Configuration
of the device during initramfs should be sufficient.

A bus type (dsa_bus) is defined for a hierarchy of DSA devices and
sub-components to be connected to, /sys/bus/dsa/.
A struct device is created for each DSA device and for each of its
sub-components (workqueues, groups, and engines).  So looking under
/sys/bus/dsa/devices, one would observe entries such as dsa0, dsa1, wq0.0,
wq1.0, group0.0, engine0.0, etc. Each of those has sysfs attributes
underneath that allow the configuration of those parts or report the status or
capabilities of the parts that they represent.

/sys/bus/dsa/devices
├── dsa0 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0
├── engine0.0 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/engine0.0
├── engine0.1 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/engine0.1
├── engine0.2 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/engine0.2
├── engine0.3 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/engine0.3
├── group0.0 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/group0.0
├── group0.1 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/group0.1
├── group0.2 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/group0.2
├── group0.3 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/group0.3
├── wq0.0 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/wq0.0
├── wq0.1 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/wq0.1
├── wq0.2 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/wq0.2
├── wq0.3 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/wq0.3
├── wq0.4 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/wq0.4
├── wq0.5 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/wq0.5
├── wq0.6 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/wq0.6
├── wq0.7 -> ../../../devices/pci0000:00/0000:00:0a.0/dsa0/wq0.7
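
As a small illustration of the naming scheme in the listing above, the
following user-space sketch splits an entry name into its kind, device index,
and sub-component index. struct dsa_name and dsa_parse_name() are hypothetical
helpers, not part of this series, and the "<kind><dev>[.<sub>]" pattern is
inferred only from the example names shown here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed form of an entry under /sys/bus/dsa/devices. */
struct dsa_name {
	char kind[16]; /* "dsa", "wq", "group" or "engine" */
	int dev_id;    /* index of the DSA device */
	int sub_id;    /* sub-component index, or -1 for the device itself */
};

/* Returns 0 on success, -1 if the name does not match the assumed pattern. */
static int dsa_parse_name(const char *name, struct dsa_name *out)
{
	int n = 0;

	assert(name != NULL && out != NULL);

	/* Sub-component form: <kind><dev>.<sub>, e.g. "wq0.3". */
	if (sscanf(name, "%15[a-z]%d.%d%n", out->kind, &out->dev_id,
		   &out->sub_id, &n) == 3 &&
	    name[n] == '\0' && out->dev_id >= 0 && out->sub_id >= 0)
		return 0;

	/* Device form: <kind><dev>, e.g. "dsa0". */
	out->sub_id = -1;
	n = 0;
	if (sscanf(name, "%15[a-z]%d%n", out->kind, &out->dev_id, &n) == 2 &&
	    name[n] == '\0' && out->dev_id >= 0)
		return 0;

	return -1;
}
```

With this sketch, "wq0.3" parses as kind "wq", device 0, sub-component 3, and
"dsa1" parses as kind "dsa", device 1, with no sub-component.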

Under /sys/bus/dsa/drivers/dsa/ there is a bind and an unbind attribute. Those
allow us to enable and disable the device and workqueue components through the
bus probe and remove functions in the driver. By writing the "device" names
(e.g. dsa0, wq0.0) into the bind or unbind attributes we can enable or disable
those components respectively. This is the typical driver-core bind / unbind
behavior.
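
The bind / unbind step can be sketched from user space as a plain write to the
attribute file. This is an illustration only: sysfs_write(), dsa_bind() and
dsa_unbind() are hypothetical helper names, and the only paths assumed are the
bind and unbind attributes named above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a string to an existing attribute file; returns 0 or -errno. */
static int sysfs_write(const char *attr_path, const char *value)
{
	size_t len;
	ssize_t written;
	int fd, err = 0;

	assert(attr_path != NULL && value != NULL);

	fd = open(attr_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	len = strlen(value);
	written = write(fd, value, len);
	if (written < 0)
		err = -errno;
	else if ((size_t)written != len)
		err = -EIO; /* treat a short write as an error */

	close(fd);
	return err;
}

/* Enable a component by name, e.g. "dsa0" or "wq0.0". */
static int dsa_bind(const char *name)
{
	return sysfs_write("/sys/bus/dsa/drivers/dsa/bind", name);
}

/* Disable a component by name. */
static int dsa_unbind(const char *name)
{
	return sysfs_write("/sys/bus/dsa/drivers/dsa/unbind", name);
}
```

For example, dsa_bind("wq0.0") would ask the driver core to probe the
workqueue, and dsa_unbind("wq0.0") would remove it again. A negative return
value carries the errno from open() or write().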

The workqueue device exports two attributes, type and name, to
indicate how the workqueue is being utilized. There are 2 primary types that
the driver recognizes: kernel and user. An additional mdev type will be
available with stage 3 enabling.  The "kernel" type marks the workqueue for
in-kernel usages.
The "user" type surfaces a char device for user application consumption.
The "name" attribute is a string type that marks the workqueue for more
specific usages. For example, for the dmaengine subsystem to claim the
workqueue the name should be "dmaengine". For "user" queue types, the name
can be any valid string useful for identification by the user application.

For the "user" workqueue that surfaces a char device, the char device allows a
limited portal region to be exported to user applications by the mmap() call
once the application opens the char device.  Character device nodes in
/dev/dsa/wqM.N will be made visible for applications to open.
A user application can use the enqcmd CPU instruction to submit
descriptors directly to a workqueue without kernel driver involvement.
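
A minimal sketch of the open() and mmap() sequence described above follows.
wq_portal_map() is a hypothetical helper, not an interface from this series.
The mapping length, the PROT_WRITE protection, and offset 0 are assumptions
made for illustration; the char driver patch defines what the device actually
accepts.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the portal of a "user" workqueue char device such as /dev/dsa/wq0.0.
 * Returns the mapped address, or NULL if the device cannot be opened or
 * mapped. The length passed in is an assumption of this sketch.
 */
static void *wq_portal_map(const char *dev_path, size_t len)
{
	void *portal;
	int fd;

	assert(dev_path != NULL && len > 0);

	fd = open(dev_path, O_RDWR);
	if (fd < 0)
		return NULL;

	/* Portals are only written to, so request a shared writable mapping. */
	portal = mmap(NULL, len, PROT_WRITE, MAP_SHARED, fd, 0);

	/* The mapping keeps its own reference to the open file. */
	close(fd);

	return portal == MAP_FAILED ? NULL : portal;
}
```

A caller might use wq_portal_map("/dev/dsa/wq0.0", (size_t)sysconf(_SC_PAGESIZE))
and then submit descriptors to the returned address with the appropriate CPU
instruction, releasing the mapping with munmap() when done.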

Kernel branch for easy review:
https://github.com/intel/idxd-driver.git idxd-stage1

[1]: https://software.intel.com/en-us/articles/intel-sdm
[2]: https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
[3]: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
[4]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator

---

Dave Jiang (13):
      x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
      dmaengine: break out channel registration
      dmaengine: add new dma device registration
      mm: create common code from request allocation based from blk-mq code
      dmaengine: add dma_request support functions
      dmaengine: add dma request submit and completion path support
      dmaengine: update dmatest to support dma request
      dmaengine: idxd: Init and probe for Intel data accelerators
      dmaengine: idxd: add configuration component of driver
      dmaengine: idxd: add descriptor manipulation routines
      dmaengine: idxd: connect idxd to dmaengine subsystem
      dmaengine: request submit optimization
      dmaengine: idxd: add char driver to expose submission portal to userland

Jing Lin (1):
      dmaengine: idxd: add sysfs ABI for idxd driver


 Documentation/ABI/stable/sysfs-driver-dma-idxd |  171 +++
 MAINTAINERS                                    |    8 
 arch/x86/include/asm/io.h                      |   44 +
 block/blk-mq.c                                 |   94 -
 drivers/dma/Kconfig                            |   19 
 drivers/dma/Makefile                           |    2 
 drivers/dma/dma-request.c                      |   96 ++
 drivers/dma/dmaengine.c                        |  312 +++--
 drivers/dma/dmatest.c                          |  366 ++++--
 drivers/dma/idxd/Makefile                      |    2 
 drivers/dma/idxd/cdev.c                        |  304 +++++
 drivers/dma/idxd/device.c                      |  699 +++++++++++
 drivers/dma/idxd/dma.c                         |  120 ++
 drivers/dma/idxd/idxd.h                        |  307 +++++
 drivers/dma/idxd/init.c                        |  550 +++++++++
 drivers/dma/idxd/irq.c                         |  275 ++++
 drivers/dma/idxd/registers.h                   |  336 +++++
 drivers/dma/idxd/submit.c                      |  178 +++
 drivers/dma/idxd/sysfs.c                       | 1510 ++++++++++++++++++++++++
 include/linux/dmaengine.h                      |  132 ++
 include/linux/idxd.h                           |   23 
 include/linux/io.h                             |   11 
 include/linux/mempool.h                        |    6 
 include/uapi/linux/idxd.h                      |  214 +++
 mm/Makefile                                    |    2 
 mm/request_alloc.c                             |   95 ++
 usr/include/Makefile                           |    1 
 27 files changed, 5591 insertions(+), 286 deletions(-)
 create mode 100644 Documentation/ABI/stable/sysfs-driver-dma-idxd
 create mode 100644 drivers/dma/dma-request.c
 create mode 100644 drivers/dma/idxd/Makefile
 create mode 100644 drivers/dma/idxd/cdev.c
 create mode 100644 drivers/dma/idxd/device.c
 create mode 100644 drivers/dma/idxd/dma.c
 create mode 100644 drivers/dma/idxd/idxd.h
 create mode 100644 drivers/dma/idxd/init.c
 create mode 100644 drivers/dma/idxd/irq.c
 create mode 100644 drivers/dma/idxd/registers.h
 create mode 100644 drivers/dma/idxd/submit.c
 create mode 100644 drivers/dma/idxd/sysfs.c
 create mode 100644 include/linux/idxd.h
 create mode 100644 include/uapi/linux/idxd.h
 create mode 100644 mm/request_alloc.c


* [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
@ 2019-11-20 21:23 ` Dave Jiang
  2019-11-20 21:50   ` Dave Hansen
  2019-11-20 21:53   ` Borislav Petkov
  2019-11-20 21:23 ` [PATCH RFC 02/14] dmaengine: break out channel registration Dave Jiang
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:23 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

With the introduction of the movdir64b instruction, there is now an instruction
that can write 64 bytes of data atomically.

Quoting from Intel SDM:
"There is no atomicity guarantee provided for the 64-byte load operation
from source address, and processor implementations may use multiple
load operations to read the 64-bytes. The 64-byte direct-store issued
by MOVDIR64B guarantees 64-byte write-completion atomicity. This means
that the data arrives at the destination in a single undivided 64-byte
write transaction."

We have identified at least 3 different use cases for this instruction in
the format of func(dst, src, count):
1) Clear poison / Initialize MKTME memory
   Destination is normal memory.
   Source is normal memory. Does not increment. (Copy same line to all
   targets)
   Count (to clear/init multiple lines)
2) Submit command(s) to new devices
   Destination is a special MMIO region for a device. Does not increment.
   Source is normal memory. Increments.
   Count usually is 1, but can be multiple.
3) Copy to iomem in big chunks
   Destination is iomem and increments.
   Source is normal memory and increments.
   Count is number of chunks to copy

This commit adds support for case #2 to support devices that accept
commands via this instruction.
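
The addressing pattern of case #2 (fixed destination, advancing source) can be
modelled in portable C as below. This is only a simulation for illustration:
write512_sim() stands in for the 64-byte movdir64b store using memcpy() and
provides no atomicity, and both *_sim names are hypothetical, not functions
from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CMD_BYTES 64 /* one 512-bit command */

/* Portable stand-in for a single 64-byte store to the portal. */
static void write512_sim(void *dst, const void *src)
{
	memcpy(dst, src, CMD_BYTES);
}

/*
 * Submit count commands from src to the single location dst.
 * The destination does not increment; the source advances by 64 bytes
 * per command. Returns the number of commands written.
 */
static size_t submit_cmds512_sim(void *dst, const void *src, size_t count)
{
	const uint8_t *from = src;
	size_t i;

	assert(dst != NULL && src != NULL);

	for (i = 0; i < count; i++) {
		write512_sim(dst, from);
		from += CMD_BYTES;
	}
	return i;
}
```

Because every command lands on the same address, an ordinary memory buffer
used as the destination ends up holding only the last command, whereas a real
device portal would consume each 64-byte write as it arrives.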

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 arch/x86/include/asm/io.h |   44 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/io.h        |   11 +++++++++++
 2 files changed, 55 insertions(+)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 6bed97ff6db2..3126f6e1d5b8 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -403,5 +403,49 @@ extern bool arch_memremap_can_ram_remap(resource_size_t offset,
 
 extern bool phys_mem_access_encrypted(unsigned long phys_addr,
 				      unsigned long size);
+#include <linux/cpufeature.h>
+static inline bool cpu_has_write512(void)
+{
+	return cpu_feature_enabled(X86_FEATURE_MOVDIR64B);
+}
+
+#define cpu_has_write512 cpu_has_write512
+
+static inline void __iowrite512(void __iomem *__dst, const void *src)
+{
+	volatile struct { char _[64]; } *dst = __dst;
+
+	/* movdir64b [rdx], rax */
+	asm volatile(".byte 0x66, 0x0f, 0x38, 0xf8, 0x02"
+			: "=m" (*dst)
+			: "d" (src), "a" (dst));
+}
+
+/**
+ * iosubmit_cmds512 - copy data to single MMIO location, in 512-bit units
+ * @dst: destination, in MMIO space (must be 512-bit aligned)
+ * @src: source
+ * @count: number of 512 bits quantities to submit
+ *
+ * Submit data from kernel space to MMIO space, in units of 512 bits at a
+ * time.  Order of access is not guaranteed, nor is a memory barrier
+ * performed afterwards.
+ */
+static inline void iosubmit_cmds512(void __iomem *dst, const void *src,
+				    size_t count)
+{
+	const u8 *from = src;
+	const u8 *end = from + count * 64;
+
+	if (!cpu_has_write512())
+		return;
+
+	while (from < end) {
+		__iowrite512(dst, from);
+		from += 64;
+	}
+}
+
+#define iosubmit_cmds512 iosubmit_cmds512
 
 #endif /* _ASM_X86_IO_H */
diff --git a/include/linux/io.h b/include/linux/io.h
index accac822336a..ca14af3e84de 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -20,6 +20,17 @@ __visible void __iowrite32_copy(void __iomem *to, const void *from, size_t count
 void __ioread32_copy(void *to, const void __iomem *from, size_t count);
 void __iowrite64_copy(void __iomem *to, const void *from, size_t count);
 
+#ifndef cpu_has_write512
+#define cpu_has_write512() (0)
+#endif
+
+#ifndef iosubmit_cmds512
+static inline void iosubmit_cmds512(void __iomem *to, const void *from,
+				    size_t count)
+{
+}
+#endif
+
 #ifdef CONFIG_MMU
 int ioremap_page_range(unsigned long addr, unsigned long end,
 		       phys_addr_t phys_addr, pgprot_t prot);



* [PATCH RFC 02/14] dmaengine: break out channel registration
  2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
  2019-11-20 21:23 ` [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction Dave Jiang
@ 2019-11-20 21:23 ` Dave Jiang
  2019-11-20 21:24 ` [PATCH RFC 03/14] dmaengine: add new dma device registration Dave Jiang
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:23 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

In preparation for dynamic channel registration, the code segment that
does the channel registration is broken out to its own function.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/dmaengine.c   |  157 ++++++++++++++++++++++++++++++---------------
 include/linux/dmaengine.h |    4 +
 2 files changed, 107 insertions(+), 54 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 03ac4b96117c..a20ab568b637 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -900,15 +900,109 @@ static int get_dma_id(struct dma_device *device)
 	return 0;
 }
 
+static int __dma_async_device_channel_register(struct dma_device *device,
+					       struct dma_chan *chan,
+					       int chan_id)
+{
+	int rc = -ENOMEM;
+	int chancnt = device->chancnt;
+	atomic_t *idr_ref;
+	struct dma_chan *tchan;
+
+	tchan = list_first_entry_or_null(&device->channels,
+					 struct dma_chan, device_node);
+	if (tchan->dev) {
+		idr_ref = tchan->dev->idr_ref;
+	} else {
+		idr_ref = kmalloc(sizeof(*idr_ref), GFP_KERNEL);
+		if (!idr_ref)
+			return -ENOMEM;
+		atomic_set(idr_ref, 0);
+	}
+
+	chan->local = alloc_percpu(typeof(*chan->local));
+	if (!chan->local)
+		goto err_out;
+	chan->dev = kzalloc(sizeof(*chan->dev), GFP_KERNEL);
+	if (!chan->dev) {
+		free_percpu(chan->local);
+		chan->local = NULL;
+		goto err_out;
+	}
+
+	/*
+	 * When the chan_id is a negative value, we are dynamically adding
+	 * the channel. Otherwise we are static enumerating.
+	 */
+	chan->chan_id = chan_id < 0 ? chancnt : chan_id;
+	chan->dev->device.class = &dma_devclass;
+	chan->dev->device.parent = device->dev;
+	chan->dev->chan = chan;
+	chan->dev->idr_ref = idr_ref;
+	chan->dev->dev_id = device->dev_id;
+	atomic_inc(idr_ref);
+	dev_set_name(&chan->dev->device, "dma%dchan%d",
+		     device->dev_id, chan->chan_id);
+
+	rc = device_register(&chan->dev->device);
+	if (rc)
+		goto err_out;
+	chan->client_count = 0;
+	device->chancnt = chan->chan_id + 1;
+
+	return 0;
+
+ err_out:
+	free_percpu(chan->local);
+	kfree(chan->dev);
+	if (atomic_dec_return(idr_ref) == 0)
+		kfree(idr_ref);
+	return rc;
+}
+
+int dma_async_device_channel_register(struct dma_device *device,
+				      struct dma_chan *chan)
+{
+	int rc;
+
+	rc = __dma_async_device_channel_register(device, chan, -1);
+	if (rc < 0)
+		return rc;
+
+	dma_channel_rebalance();
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dma_async_device_channel_register);
+
+static void __dma_async_device_channel_unregister(struct dma_device *device,
+						  struct dma_chan *chan)
+{
+	WARN_ONCE(chan->client_count,
+		  "%s called while %d clients hold a reference\n",
+		  __func__, chan->client_count);
+	mutex_lock(&dma_list_mutex);
+	chan->dev->chan = NULL;
+	mutex_unlock(&dma_list_mutex);
+	device_unregister(&chan->dev->device);
+	free_percpu(chan->local);
+}
+
+void dma_async_device_channel_unregister(struct dma_device *device,
+					 struct dma_chan *chan)
+{
+	__dma_async_device_channel_unregister(device, chan);
+	dma_channel_rebalance();
+}
+EXPORT_SYMBOL_GPL(dma_async_device_channel_unregister);
+
 /**
  * dma_async_device_register - registers DMA devices found
  * @device: &dma_device
  */
 int dma_async_device_register(struct dma_device *device)
 {
-	int chancnt = 0, rc;
+	int rc, i = 0;
 	struct dma_chan* chan;
-	atomic_t *idr_ref;
 
 	if (!device)
 		return -ENODEV;
@@ -1000,59 +1094,23 @@ int dma_async_device_register(struct dma_device *device)
 	if (device_has_all_tx_types(device))
 		dma_cap_set(DMA_ASYNC_TX, device->cap_mask);
 
-	idr_ref = kmalloc(sizeof(*idr_ref), GFP_KERNEL);
-	if (!idr_ref)
-		return -ENOMEM;
 	rc = get_dma_id(device);
-	if (rc != 0) {
-		kfree(idr_ref);
+	if (rc != 0)
 		return rc;
-	}
-
-	atomic_set(idr_ref, 0);
 
 	/* represent channels in sysfs. Probably want devs too */
 	list_for_each_entry(chan, &device->channels, device_node) {
-		rc = -ENOMEM;
-		chan->local = alloc_percpu(typeof(*chan->local));
-		if (chan->local == NULL)
+		rc = __dma_async_device_channel_register(device, chan, i++);
+		if (rc < 0)
 			goto err_out;
-		chan->dev = kzalloc(sizeof(*chan->dev), GFP_KERNEL);
-		if (chan->dev == NULL) {
-			free_percpu(chan->local);
-			chan->local = NULL;
-			goto err_out;
-		}
-
-		chan->chan_id = chancnt++;
-		chan->dev->device.class = &dma_devclass;
-		chan->dev->device.parent = device->dev;
-		chan->dev->chan = chan;
-		chan->dev->idr_ref = idr_ref;
-		chan->dev->dev_id = device->dev_id;
-		atomic_inc(idr_ref);
-		dev_set_name(&chan->dev->device, "dma%dchan%d",
-			     device->dev_id, chan->chan_id);
-
-		rc = device_register(&chan->dev->device);
-		if (rc) {
-			free_percpu(chan->local);
-			chan->local = NULL;
-			kfree(chan->dev);
-			atomic_dec(idr_ref);
-			goto err_out;
-		}
-		chan->client_count = 0;
 	}
 
-	if (!chancnt) {
+	if (!device->chancnt) {
 		dev_err(device->dev, "%s: device has no channels!\n", __func__);
 		rc = -ENODEV;
 		goto err_out;
 	}
 
-	device->chancnt = chancnt;
-
 	mutex_lock(&dma_list_mutex);
 	/* take references on public channels */
 	if (dmaengine_ref_count && !dma_has_cap(DMA_PRIVATE, device->cap_mask))
@@ -1080,9 +1138,8 @@ int dma_async_device_register(struct dma_device *device)
 
 err_out:
 	/* if we never registered a channel just release the idr */
-	if (atomic_read(idr_ref) == 0) {
+	if (!device->chancnt) {
 		ida_free(&dma_ida, device->dev_id);
-		kfree(idr_ref);
 		return rc;
 	}
 
@@ -1115,16 +1172,8 @@ void dma_async_device_unregister(struct dma_device *device)
 	dma_channel_rebalance();
 	mutex_unlock(&dma_list_mutex);
 
-	list_for_each_entry(chan, &device->channels, device_node) {
-		WARN_ONCE(chan->client_count,
-			  "%s called while %d clients hold a reference\n",
-			  __func__, chan->client_count);
-		mutex_lock(&dma_list_mutex);
-		chan->dev->chan = NULL;
-		mutex_unlock(&dma_list_mutex);
-		device_unregister(&chan->dev->device);
-		free_percpu(chan->local);
-	}
+	list_for_each_entry(chan, &device->channels, device_node)
+		__dma_async_device_channel_unregister(device, chan);
 }
 EXPORT_SYMBOL(dma_async_device_unregister);
 
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 8fcdee1c0cf9..0202d44a17a5 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -1399,6 +1399,10 @@ static inline int dmaengine_desc_free(struct dma_async_tx_descriptor *desc)
 int dma_async_device_register(struct dma_device *device);
 int dmaenginem_async_device_register(struct dma_device *device);
 void dma_async_device_unregister(struct dma_device *device);
+int dma_async_device_channel_register(struct dma_device *device,
+				      struct dma_chan *chan);
+void dma_async_device_channel_unregister(struct dma_device *device,
+					 struct dma_chan *chan);
 void dma_run_dependencies(struct dma_async_tx_descriptor *tx);
 struct dma_chan *dma_get_slave_channel(struct dma_chan *chan);
 struct dma_chan *dma_get_any_slave_channel(struct dma_device *device);



* [PATCH RFC 03/14] dmaengine: add new dma device registration
  2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
  2019-11-20 21:23 ` [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction Dave Jiang
  2019-11-20 21:23 ` [PATCH RFC 02/14] dmaengine: break out channel registration Dave Jiang
@ 2019-11-20 21:24 ` Dave Jiang
  2019-11-20 21:24 ` [PATCH RFC 04/14] mm: create common code from request allocation based from blk-mq code Dave Jiang
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:24 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

Add a new device registration call in order to allow dynamic registration
of channels. __dma_async_device_register() will only register the DMA
device. The channel registration is done separately.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/dmaengine.c |  106 ++++++++++++++++++++++++++++-------------------
 1 file changed, 63 insertions(+), 43 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index a20ab568b637..3c74402f1c34 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -149,10 +149,8 @@ static void chan_dev_release(struct device *dev)
 	struct dma_chan_dev *chan_dev;
 
 	chan_dev = container_of(dev, typeof(*chan_dev), device);
-	if (atomic_dec_and_test(chan_dev->idr_ref)) {
-		ida_free(&dma_ida, chan_dev->dev_id);
+	if (atomic_dec_and_test(chan_dev->idr_ref))
 		kfree(chan_dev->idr_ref);
-	}
 	kfree(chan_dev);
 }
 
@@ -950,8 +948,23 @@ static int __dma_async_device_channel_register(struct dma_device *device,
 	chan->client_count = 0;
 	device->chancnt = chan->chan_id + 1;
 
+	if (dmaengine_ref_count &&
+	    !dma_has_cap(DMA_PRIVATE, device->cap_mask)) {
+		if (dma_chan_get(chan) == -ENODEV) {
+			/*
+			 * Note we can only get here for the first
+			 * channel as the remaining channels are
+			 * guaranteed to get a reference.
+			 */
+			rc = -ENODEV;
+			goto chan_get_err;
+		}
+	}
+
 	return 0;
 
+ chan_get_err:
+	device_unregister(&chan->dev->device);
  err_out:
 	free_percpu(chan->local);
 	kfree(chan->dev);
@@ -981,6 +994,8 @@ static void __dma_async_device_channel_unregister(struct dma_device *device,
 		  "%s called while %d clients hold a reference\n",
 		  __func__, chan->client_count);
 	mutex_lock(&dma_list_mutex);
+	list_del(&chan->device_node);
+	device->chancnt--;
 	chan->dev->chan = NULL;
 	mutex_unlock(&dma_list_mutex);
 	device_unregister(&chan->dev->device);
@@ -995,13 +1010,53 @@ void dma_async_device_channel_unregister(struct dma_device *device,
 }
 EXPORT_SYMBOL_GPL(dma_async_device_channel_unregister);
 
+/**
+ * __dma_async_device_register - registers DMA devices found.
+ * Core function that registers a DMA device.
+ * @device: &dma_device
+ */
+static int __dma_async_device_register(struct dma_device *device)
+{
+	struct dma_chan *chan;
+	int rc, i = 0;
+
+	if (!device)
+		return -ENODEV;
+
+	/* Validate device routines */
+	if (!device->dev) {
+		pr_err("DMA device must have valid dev\n");
+		return -EIO;
+	}
+
+	rc = get_dma_id(device);
+	if (rc != 0)
+		return rc;
+
+	/* represent channels in sysfs. Probably want devs too */
+	list_for_each_entry(chan, &device->channels, device_node) {
+		rc = __dma_async_device_channel_register(device, chan, i++);
+		if (rc < 0)
+			return rc;
+	}
+
+	mutex_lock(&dma_list_mutex);
+	list_add_tail_rcu(&device->global_node, &dma_device_list);
+	if (dma_has_cap(DMA_PRIVATE, device->cap_mask))
+		device->privatecnt++;	/* Always private */
+	dma_channel_rebalance();
+	mutex_unlock(&dma_list_mutex);
+
+	return 0;
+}
+
 /**
  * dma_async_device_register - registers DMA devices found
  * @device: &dma_device
  */
 int dma_async_device_register(struct dma_device *device)
 {
-	int rc, i = 0;
+	int rc;
 	struct dma_chan* chan;
 
 	if (!device)
@@ -1094,45 +1149,9 @@ int dma_async_device_register(struct dma_device *device)
 	if (device_has_all_tx_types(device))
 		dma_cap_set(DMA_ASYNC_TX, device->cap_mask);
 
-	rc = get_dma_id(device);
+	rc = __dma_async_device_register(device);
 	if (rc != 0)
-		return rc;
-
-	/* represent channels in sysfs. Probably want devs too */
-	list_for_each_entry(chan, &device->channels, device_node) {
-		rc = __dma_async_device_channel_register(device, chan, i++);
-		if (rc < 0)
-			goto err_out;
-	}
-
-	if (!device->chancnt) {
-		dev_err(device->dev, "%s: device has no channels!\n", __func__);
-		rc = -ENODEV;
 		goto err_out;
-	}
-
-	mutex_lock(&dma_list_mutex);
-	/* take references on public channels */
-	if (dmaengine_ref_count && !dma_has_cap(DMA_PRIVATE, device->cap_mask))
-		list_for_each_entry(chan, &device->channels, device_node) {
-			/* if clients are already waiting for channels we need
-			 * to take references on their behalf
-			 */
-			if (dma_chan_get(chan) == -ENODEV) {
-				/* note we can only get here for the first
-				 * channel as the remaining channels are
-				 * guaranteed to get a reference
-				 */
-				rc = -ENODEV;
-				mutex_unlock(&dma_list_mutex);
-				goto err_out;
-			}
-		}
-	list_add_tail_rcu(&device->global_node, &dma_device_list);
-	if (dma_has_cap(DMA_PRIVATE, device->cap_mask))
-		device->privatecnt++;	/* Always private */
-	dma_channel_rebalance();
-	mutex_unlock(&dma_list_mutex);
 
 	return 0;
 
@@ -1165,15 +1184,16 @@ EXPORT_SYMBOL(dma_async_device_register);
  */
 void dma_async_device_unregister(struct dma_device *device)
 {
-	struct dma_chan *chan;
+	struct dma_chan *chan, *n;
 
 	mutex_lock(&dma_list_mutex);
 	list_del_rcu(&device->global_node);
 	dma_channel_rebalance();
 	mutex_unlock(&dma_list_mutex);
 
-	list_for_each_entry(chan, &device->channels, device_node)
+	list_for_each_entry_safe(chan, n, &device->channels, device_node)
 		__dma_async_device_channel_unregister(device, chan);
+	ida_free(&dma_ida, device->dev_id);
 }
 EXPORT_SYMBOL(dma_async_device_unregister);
 



* [PATCH RFC 04/14] mm: create common code from request allocation based from blk-mq code
  2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
                   ` (2 preceding siblings ...)
  2019-11-20 21:24 ` [PATCH RFC 03/14] dmaengine: add new dma device registration Dave Jiang
@ 2019-11-20 21:24 ` Dave Jiang
  2019-11-20 21:24 ` [PATCH RFC 05/14] dmaengine: add dma_request support functions Dave Jiang
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:24 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

Move the allocation of requests from compound pages to a common function
to allow use by other callers. Since the routine has more to do with
memory allocation and management, it is exported through mempool.h and
made part of the mm subsystem.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 block/blk-mq.c          |   94 +++++++++++++----------------------------------
 include/linux/mempool.h |    6 +++
 mm/Makefile             |    2 -
 mm/request_alloc.c      |   95 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 128 insertions(+), 69 deletions(-)
 create mode 100644 mm/request_alloc.c

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ec791156e9cc..399dfe7b1d2e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -10,7 +10,6 @@
 #include <linux/backing-dev.h>
 #include <linux/bio.h>
 #include <linux/blkdev.h>
-#include <linux/kmemleak.h>
 #include <linux/mm.h>
 #include <linux/init.h>
 #include <linux/slab.h>
@@ -26,6 +25,7 @@
 #include <linux/delay.h>
 #include <linux/crash_dump.h>
 #include <linux/prefetch.h>
+#include <linux/mempool.h>
 
 #include <trace/events/block.h>
 
@@ -2054,8 +2054,6 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx)
 {
-	struct page *page;
-
 	if (tags->rqs && set->ops->exit_request) {
 		int i;
 
@@ -2069,16 +2067,7 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		}
 	}
 
-	while (!list_empty(&tags->page_list)) {
-		page = list_first_entry(&tags->page_list, struct page, lru);
-		list_del_init(&page->lru);
-		/*
-		 * Remove kmemleak object previously allocated in
-		 * blk_mq_alloc_rqs().
-		 */
-		kmemleak_free(page_address(page));
-		__free_pages(page, page->private);
-	}
+	request_from_pages_free(&tags->page_list);
 }
 
 void blk_mq_free_rq_map(struct blk_mq_tags *tags)
@@ -2128,11 +2117,6 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 	return tags;
 }
 
-static size_t order_to_size(unsigned int order)
-{
-	return (size_t)PAGE_SIZE << order;
-}
-
 static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			       unsigned int hctx_idx, int node)
 {
@@ -2148,12 +2132,20 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 	return 0;
 }
 
+static void blk_mq_assign_request(void *ctx, void *ptr, int idx)
+{
+	struct blk_mq_tags *tags = (struct blk_mq_tags *)ctx;
+	struct request *rq = ptr;
+
+	tags->static_rqs[idx] = rq;
+}
+
 int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx, unsigned int depth)
 {
-	unsigned int i, j, entries_per_page, max_order = 4;
-	size_t rq_size, left;
-	int node;
+	unsigned int i;
+	size_t rq_size;
+	int node, rc;
 
 	node = blk_mq_hw_queue_to_node(&set->map[HCTX_TYPE_DEFAULT], hctx_idx);
 	if (node == NUMA_NO_NODE)
@@ -2167,62 +2159,28 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	 */
 	rq_size = round_up(sizeof(struct request) + set->cmd_size,
 				cache_line_size());
-	left = rq_size * depth;
-
-	for (i = 0; i < depth; ) {
-		int this_order = max_order;
-		struct page *page;
-		int to_do;
-		void *p;
-
-		while (this_order && left < order_to_size(this_order - 1))
-			this_order--;
-
-		do {
-			page = alloc_pages_node(node,
-				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
-				this_order);
-			if (page)
-				break;
-			if (!this_order--)
-				break;
-			if (order_to_size(this_order) < rq_size)
-				break;
-		} while (1);
 
-		if (!page)
-			goto fail;
+	rc = request_from_pages_alloc((void *)tags, depth, rq_size,
+				      &tags->page_list, 4, node,
+				      blk_mq_assign_request);
+	if (rc < 0)
+		goto fail;
 
-		page->private = this_order;
-		list_add_tail(&page->lru, &tags->page_list);
+	for (i = 0; i < rc; i++) {
+		struct request *rq = tags->static_rqs[i];
 
-		p = page_address(page);
-		/*
-		 * Allow kmemleak to scan these pages as they contain pointers
-		 * to additional allocations like via ops->init_request().
-		 */
-		kmemleak_alloc(p, order_to_size(this_order), 1, GFP_NOIO);
-		entries_per_page = order_to_size(this_order) / rq_size;
-		to_do = min(entries_per_page, depth - i);
-		left -= to_do * rq_size;
-		for (j = 0; j < to_do; j++) {
-			struct request *rq = p;
-
-			tags->static_rqs[i] = rq;
-			if (blk_mq_init_request(set, rq, hctx_idx, node)) {
-				tags->static_rqs[i] = NULL;
-				goto fail;
-			}
-
-			p += rq_size;
-			i++;
+		if (blk_mq_init_request(set, rq, hctx_idx, node)) {
+			tags->static_rqs[i] = NULL;
+			rc = -ENOMEM;
+			goto fail;
 		}
 	}
+
 	return 0;
 
 fail:
 	blk_mq_free_rqs(set, tags, hctx_idx);
-	return -ENOMEM;
+	return rc;
 }
 
 /*
diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 0c964ac107c2..5b1f6214c881 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -108,4 +108,10 @@ static inline mempool_t *mempool_create_page_pool(int min_nr, int order)
 			      (void *)(long)order);
 }
 
+int request_from_pages_alloc(void *ctx, unsigned int depth, size_t rq_size,
+			     struct list_head *page_list, int max_order,
+			     int node,
+			     void (*assign)(void *ctx, void *req, int idx));
+void request_from_pages_free(struct list_head *page_list);
+
 #endif /* _LINUX_MEMPOOL_H */
diff --git a/mm/Makefile b/mm/Makefile
index d996846697ef..b122e7ddd1e5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -42,7 +42,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o gup.o $(mmu-y)
+			   debug.o gup.o request_alloc.o $(mmu-y)
 
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
diff --git a/mm/request_alloc.c b/mm/request_alloc.c
new file mode 100644
index 000000000000..01ebea8ccdfc
--- /dev/null
+++ b/mm/request_alloc.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Common function for struct allocation. Moved from blk-mq code
+ *
+ * Copyright (C) 2013-2014 Jens Axboe
+ */
+#include <linux/kernel.h>
+#include <linux/export.h>
+#include <linux/mm_types.h>
+#include <linux/list.h>
+#include <linux/kmemleak.h>
+#include <linux/mm.h>
+
+void request_from_pages_free(struct list_head *page_list)
+{
+	struct page *page, *n;
+
+	list_for_each_entry_safe(page, n, page_list, lru) {
+		list_del_init(&page->lru);
+		/*
+		 * Remove kmemleak object previously allocated in
+		 * request_from_pages_alloc().
+		 */
+		kmemleak_free(page_address(page));
+		__free_pages(page, page->private);
+	}
+}
+EXPORT_SYMBOL_GPL(request_from_pages_free);
+
+static size_t order_to_size(unsigned int order)
+{
+	return (size_t)PAGE_SIZE << order;
+}
+
+int request_from_pages_alloc(void *ctx, unsigned int depth, size_t rq_size,
+			     struct list_head *page_list, int max_order,
+			     int node,
+			     void (*assign)(void *ctx, void *req, int idx))
+{
+	size_t left;
+	unsigned int i, j, entries_per_page;
+
+	left = rq_size * depth;
+
+	for (i = 0; i < depth; ) {
+		int this_order = max_order;
+		struct page *page;
+		int to_do;
+		void *p;
+
+		while (this_order && left < order_to_size(this_order - 1))
+			this_order--;
+
+		do {
+			page = alloc_pages_node(node,
+						GFP_NOIO | __GFP_NOWARN |
+						__GFP_NORETRY | __GFP_ZERO,
+						this_order);
+			if (page)
+				break;
+			if (!this_order--)
+				break;
+			if (order_to_size(this_order) < rq_size)
+				break;
+		} while (1);
+
+		if (!page)
+			goto fail;
+
+		page->private = this_order;
+		list_add_tail(&page->lru, page_list);
+
+		p = page_address(page);
+		/*
+		 * Allow kmemleak to scan these pages as they contain pointers
+		 * to additional allocations like via ops->init_request().
+		 */
+		kmemleak_alloc(p, order_to_size(this_order), 1, GFP_NOIO);
+		entries_per_page = order_to_size(this_order) / rq_size;
+		to_do = min(entries_per_page, depth - i);
+		left -= to_do * rq_size;
+		for (j = 0; j < to_do; j++) {
+			assign((void *)ctx, p, i);
+			p += rq_size;
+			i++;
+		}
+	}
+
+	return i;
+
+fail:
+	request_from_pages_free(page_list);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(request_from_pages_alloc);



* [PATCH RFC 05/14] dmaengine: add dma_request support functions
  2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
                   ` (3 preceding siblings ...)
  2019-11-20 21:24 ` [PATCH RFC 04/14] mm: create common code from request allocation based from blk-mq code Dave Jiang
@ 2019-11-20 21:24 ` Dave Jiang
  2019-11-20 21:24 ` [PATCH RFC 06/14] dmaengine: add dma request submit and completion path support Dave Jiang
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:24 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

In order to provide a lockless submission path, the request context needs
to be pre-allocated rather than pulled from a memory pool.
Use the common request allocation call request_from_pages_alloc() to
accomplish this. The sbitmap code is used to get the next free request
context. This is a simplified version of what blk-mq does (plain sbitmap,
not sbitmap_queue). The config option DMA_ENGINE_REQUEST is added so that
only drivers that support DMA requests enable the code.
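
The idea of pre-allocated contexts handed out through a bitmap can be
sketched in userspace as follows. This is only an illustration under
stated assumptions: a single C11 atomic word replaces the kernel sbitmap
(which spreads bits over several cache lines), the depth is fixed at
compile time, and struct sketch_chan, sketch_get() and sketch_put() are
hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_DEPTH 8

struct sketch_req {
	int id;				/* index of this request's bit */
};

struct sketch_chan {
	atomic_uint busy;		/* bit n set: request n is in use */
	struct sketch_req rqs[SKETCH_DEPTH];	/* pre-allocated contexts */
};

static void sketch_chan_init(struct sketch_chan *c)
{
	atomic_init(&c->busy, 0);
	for (int i = 0; i < SKETCH_DEPTH; i++)
		c->rqs[i].id = i;
}

/* Claim the lowest free request without taking a lock; NULL if all busy. */
static struct sketch_req *sketch_get(struct sketch_chan *c)
{
	unsigned int old = atomic_load(&c->busy);

	for (;;) {
		int nr = -1;

		for (int i = 0; i < SKETCH_DEPTH; i++) {
			if (!(old & (1u << i))) {
				nr = i;
				break;
			}
		}
		if (nr < 0)
			return NULL;
		/* On failure 'old' is refreshed and the search is retried. */
		if (atomic_compare_exchange_weak(&c->busy, &old,
						 old | (1u << nr)))
			return &c->rqs[nr];
	}
}

/* Return a request to the pool by clearing its bit. */
static void sketch_put(struct sketch_chan *c, struct sketch_req *rq)
{
	atomic_fetch_and(&c->busy, ~(1u << rq->id));
}
```

Because the contexts exist for the lifetime of the channel, the hot path
touches only the bitmap; no allocator is involved at submission time.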

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/Kconfig       |    5 ++
 drivers/dma/Makefile      |    1 
 drivers/dma/dma-request.c |   96 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dmaengine.h |   57 +++++++++++++++++++++++++++
 4 files changed, 159 insertions(+)
 create mode 100644 drivers/dma/dma-request.c

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 7af874b69ffb..8885e9d3f363 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -56,6 +56,11 @@ config DMA_OF
 	depends on OF
 	select DMA_ENGINE
 
+config DMA_ENGINE_REQUEST
+	def_bool n
+	depends on DMA_ENGINE
+	select SBITMAP
+
 #devices
 config ALTERA_MSGDMA
 	tristate "Altera / Intel mSGDMA Engine"
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index f5ce8665e944..205f343e39fe 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_DMA_ENGINE) += dmaengine.o
 obj-$(CONFIG_DMA_VIRTUAL_CHANNELS) += virt-dma.o
 obj-$(CONFIG_DMA_ACPI) += acpi-dma.o
 obj-$(CONFIG_DMA_OF) += of-dma.o
+obj-$(CONFIG_DMA_ENGINE_REQUEST) += dma-request.o
 
 #dmatest
 obj-$(CONFIG_DMATEST) += dmatest.o
diff --git a/drivers/dma/dma-request.c b/drivers/dma/dma-request.c
new file mode 100644
index 000000000000..01390f179107
--- /dev/null
+++ b/drivers/dma/dma-request.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Copyright(c) 2019 Intel Corporation. All rights reserved.  */
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/device.h>
+#include <linux/dmaengine.h>
+#include <linux/mempool.h>
+
+struct dma_request *dma_chan_alloc_request(struct dma_chan *chan)
+{
+	int nr;
+	struct dma_request *req;
+
+	nr = sbitmap_get(&chan->sbmap, 0, false);
+	if (nr < 0)
+		return NULL;
+
+	req = chan->rqs[nr];
+	req->rq_private = NULL;
+	req->callback = NULL;
+	memset(&req->result, 0, sizeof(struct dmaengine_result));
+	return req;
+}
+EXPORT_SYMBOL_GPL(dma_chan_alloc_request);
+
+void dma_chan_free_request(struct dma_chan *chan, struct dma_request *rq)
+{
+	sbitmap_clear_bit(&chan->sbmap, rq->id);
+}
+EXPORT_SYMBOL_GPL(dma_chan_free_request);
+
+void dma_chan_free_request_resources(struct dma_chan *chan)
+{
+	request_from_pages_free(&chan->page_list);
+	kfree(chan->rqs);
+}
+EXPORT_SYMBOL_GPL(dma_chan_free_request_resources);
+
+static void dma_chan_assign_request(void *ctx, void *ptr, int idx)
+{
+	struct dma_chan *chan = (struct dma_chan *)ctx;
+	struct dma_request *rq = ptr;
+
+	chan->rqs[idx] = rq;
+}
+
+int dma_chan_alloc_request_resources(struct dma_chan *chan)
+{
+	int i, node, rc, id = 0;
+	size_t rq_size;
+
+	/* Requests are already allocated */
+	if (chan->rqs)
+		return 0;
+
+	node = dev_to_node(chan->device->dev);
+	rc = sbitmap_init_node(&chan->sbmap, chan->depth, -1,
+			       GFP_KERNEL, node);
+	if (rc < 0)
+		return rc;
+
+	chan->rqs = kcalloc_node(chan->depth, sizeof(struct dma_request *),
+				 GFP_KERNEL, node);
+	if (!chan->rqs) {
+		rc = -ENOMEM;
+		goto fail;
+	}
+
+	INIT_LIST_HEAD(&chan->page_list);
+
+	rq_size = round_up(sizeof(struct dma_request) +
+			chan->max_sgs * sizeof(struct scatterlist),
+			cache_line_size());
+
+	rc = request_from_pages_alloc((void *)chan, chan->depth, rq_size,
+				      &chan->page_list, 4, node,
+				      dma_chan_assign_request);
+	if (rc < 0)
+		goto fail;
+
+	for (i = 0; i < rc; i++) {
+		struct dma_request *rq = chan->rqs[i];
+
+		rq->id = id++;
+		rq->chan = chan;
+	}
+
+	return 0;
+
+ fail:
+	sbitmap_free(&chan->sbmap);
+	dma_chan_free_request_resources(chan);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(dma_chan_alloc_request_resources);
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 0202d44a17a5..7bc8c3f8283f 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -12,6 +12,8 @@
 #include <linux/scatterlist.h>
 #include <linux/bitmap.h>
 #include <linux/types.h>
+#include <linux/sbitmap.h>
+#include <linux/bvec.h>
 #include <asm/page.h>
 
 /**
@@ -176,6 +178,8 @@ struct dma_interleaved_template {
  * @DMA_PREP_CMD: tell the driver that the data passed to DMA API is command
  *  data and the descriptor should be in different format from normal
  *  data descriptors.
+ * @DMA_SUBMIT_NONBLOCK: tell the driver not to wait for resources if submit
+ *  is not possible.
  */
 enum dma_ctrl_flags {
 	DMA_PREP_INTERRUPT = (1 << 0),
@@ -186,6 +190,7 @@ enum dma_ctrl_flags {
 	DMA_PREP_FENCE = (1 << 5),
 	DMA_CTRL_REUSE = (1 << 6),
 	DMA_PREP_CMD = (1 << 7),
+	DMA_SUBMIT_NONBLOCK = (1 << 8),
 };
 
 /**
@@ -268,6 +273,13 @@ struct dma_chan {
 	struct dma_router *router;
 	void *route_data;
 
+	/* DMA request */
+	int max_sgs;
+	int depth;
+	struct sbitmap sbmap;
+	struct dma_request **rqs;
+	struct list_head page_list;
+
 	void *private;
 };
 
@@ -511,6 +523,25 @@ struct dma_async_tx_descriptor {
 #endif
 };
 
+struct dma_request {
+	int id;
+	struct dma_chan *chan;
+	enum dma_transaction_type cmd;
+	enum dma_ctrl_flags flags;
+	struct bio_vec bvec;
+	dma_addr_t pg_dma;
+	int sg_nents;
+	void *rq_private;
+
+	/* Set by driver */
+	dma_async_tx_callback_result callback;
+	struct dmaengine_result result;
+	void *callback_param;
+
+	/* Leave as last member for flexible array of scatterlist */
+	struct scatterlist sg[];
+};
+
 #ifdef CONFIG_DMA_ENGINE
 static inline void dma_set_unmap(struct dma_async_tx_descriptor *tx,
 				 struct dmaengine_unmap_data *unmap)
@@ -1359,6 +1390,32 @@ static inline int dma_get_slave_caps(struct dma_chan *chan,
 }
 #endif
 
+#ifdef CONFIG_DMA_ENGINE_REQUEST
+struct dma_request *dma_chan_alloc_request(struct dma_chan *chan);
+void dma_chan_free_request(struct dma_chan *chan, struct dma_request *rq);
+void dma_chan_free_request_resources(struct dma_chan *chan);
+int dma_chan_alloc_request_resources(struct dma_chan *chan);
+#else
+static inline struct dma_request *dma_chan_alloc_request(struct dma_chan *chan)
+{
+	return NULL;
+}
+
+static inline void dma_chan_free_request(struct dma_chan *chan,
+					 struct dma_request *rq)
+{
+}
+
+static inline void dma_chan_free_request_resources(struct dma_chan *chan)
+{
+}
+
+static inline int dma_chan_alloc_request_resources(struct dma_chan *chan)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
 #define dma_request_slave_channel_reason(dev, name) dma_request_chan(dev, name)
 
 static inline int dmaengine_desc_set_reuse(struct dma_async_tx_descriptor *tx)



* [PATCH RFC 06/14] dmaengine: add dma request submit and completion path support
  2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
                   ` (4 preceding siblings ...)
  2019-11-20 21:24 ` [PATCH RFC 05/14] dmaengine: add dma_request support functions Dave Jiang
@ 2019-11-20 21:24 ` Dave Jiang
  2019-11-20 21:24 ` [PATCH RFC 07/14] dmaengine: update dmatest to support dma request Dave Jiang
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:24 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

There are several issues with the existing dmaengine submission APIs.
1. A new function pointer is introduced every time a new operation is
   added.
2. The whole submission path requires locking and multiple API calls
   (prep + submit + start engine).

A new DMA register function for request-based DMA devices is added,
dma_async_request_device_register(). It checks that the parts required for
DMA requests are set up.

A new submission API call is introduced that starts an I/O immediately in a
single call and is lockless. A helper function that submits and waits is
also added for consumers such as dmatest that wait on the completion of the
I/O. Another helper function completes the I/O by either calling complete()
or invoking the callback, depending on how the request was set up at
submission.
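
The completion rule described above (wake a synchronous waiter if there is
one, otherwise invoke the callback) can be illustrated with a small
userspace sketch. It is not the kernel helper: struct sketch_completion is
a plain flag standing in for struct completion, and all sketch_* names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Plain flag standing in for the kernel's struct completion. */
struct sketch_completion {
	int done;
};

struct sketch_result {
	int status;
};

typedef void (*sketch_cb)(void *param, const struct sketch_result *res);

struct sketch_request {
	struct sketch_completion *waiter;	/* set for submit-and-wait */
	sketch_cb callback;			/* set for async completion */
	void *callback_param;
	struct sketch_result result;
};

/* A waiter takes precedence; the callback runs only when none is set. */
static void sketch_request_complete(struct sketch_request *req)
{
	if (req->waiter)
		req->waiter->done = 1;
	else if (req->callback)
		req->callback(req->callback_param, &req->result);
}

/* Example callback: copy the status into an int owned by the caller. */
static void sketch_store_status(void *param, const struct sketch_result *res)
{
	*(int *)param = res->status;
}
```

A driver's completion path then needs a single call per request and does
not have to know which of the two submission styles the consumer chose.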

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/dmaengine.c   |   59 ++++++++++++++++++++++++++++++++++++++++
 include/linux/dmaengine.h |   67 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 126 insertions(+)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 3c74402f1c34..5b053624f9e3 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -1050,6 +1050,37 @@ static int __dma_async_device_register(struct dma_device *device)
 	return 0;
 }
 
+/**
+ * dma_async_request_device_register - registers DMA devices found that
+ *					support DMA requests.
+ * @device: &dma_device
+ */
+int dma_async_request_device_register(struct dma_device *device)
+{
+	int rc;
+
+	if (!device)
+		return -ENODEV;
+
+	/* validate device routines */
+	if (!device->dev) {
+		pr_err("DMA device must have dev\n");
+		return -EIO;
+	}
+
+	if (!device->device_submit_request) {
+		dev_err(device->dev, "Device has no op defined\n");
+		return -EIO;
+	}
+
+	rc = __dma_async_device_register(device);
+	if (rc != 0)
+		return rc;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dma_async_request_device_register);
+
 /**
  * dma_async_device_register - registers DMA devices found
  * @device: &dma_device
@@ -1232,6 +1263,34 @@ int dmaenginem_async_device_register(struct dma_device *device)
 }
 EXPORT_SYMBOL(dmaenginem_async_device_register);
 
+/**
+ * dmaenginem_async_request_device_register - registers DMA devices found
+ *					that support DMA requests
+ * @device: &dma_device
+ *
+ * The operation is managed and will be undone on driver detach.
+ */
+int dmaenginem_async_request_device_register(struct dma_device *device)
+{
+	void *p;
+	int ret;
+
+	p = devres_alloc(dmam_device_release, sizeof(void *), GFP_KERNEL);
+	if (!p)
+		return -ENOMEM;
+
+	ret = dma_async_request_device_register(device);
+	if (!ret) {
+		*(struct dma_device **)p = device;
+		devres_add(device->dev, p);
+	} else {
+		devres_free(p);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(dmaenginem_async_request_device_register);
+
 struct dmaengine_unmap_pool {
 	struct kmem_cache *cache;
 	const char *name;
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 7bc8c3f8283f..220d241d71ed 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -461,6 +461,7 @@ enum dmaengine_tx_result {
 	DMA_TRANS_NOERROR = 0,		/* SUCCESS */
 	DMA_TRANS_READ_FAILED,		/* Source DMA read failed */
 	DMA_TRANS_WRITE_FAILED,		/* Destination DMA write failed */
+	DMA_TRANS_ERROR,		/* General error not rd/wr */
 	DMA_TRANS_ABORTED,		/* Op never submitted / aborted */
 };
 
@@ -831,6 +832,10 @@ struct dma_device {
 					    dma_cookie_t cookie,
 					    struct dma_tx_state *txstate);
 	void (*device_issue_pending)(struct dma_chan *chan);
+
+	/* function calls for request API */
+	int (*device_submit_request)(struct dma_chan *chan,
+				     struct dma_request *req);
 };
 
 static inline int dmaengine_slave_config(struct dma_chan *chan,
@@ -1390,6 +1395,66 @@ static inline int dma_get_slave_caps(struct dma_chan *chan,
 }
 #endif
 
+/* dmaengine_submit_request - helper routine for caller to submit
+ *				a DMA request.
+ * @chan: dma channel context
+ * @req: dma request context
+ */
+static inline int dmaengine_submit_request(struct dma_chan *chan,
+					   struct dma_request *req)
+{
+	struct dma_device *ddev;
+
+	if (!chan)
+		return -EINVAL;
+
+	ddev = chan->device;
+	if (!ddev->device_submit_request)
+		return -EINVAL;
+
+	return ddev->device_submit_request(chan, req);
+}
+
+/* dmaengine_submit_request_and_wait - helper routine for caller to submit
+ *					a DMA request and wait until
+ *					completion or timeout.
+ * @chan: dma channel context
+ * @req: dma request context
+ * @timeout: time in jiffies to wait for completion. A timeout of 0
+ *		means wait indefinitely.
+ */
+static inline int dmaengine_submit_request_and_wait(struct dma_chan *chan,
+						    struct dma_request *req,
+						    int timeout)
+{
+	int rc;
+	DECLARE_COMPLETION_ONSTACK(done);
+
+	req->rq_private = &done;
+	rc = dmaengine_submit_request(chan, req);
+	if (rc < 0)
+		return rc;
+
+	if (timeout)
+		return wait_for_completion_timeout(&done, timeout);
+
+	wait_for_completion(&done);
+	return 0;
+}
+
+/* dmaengine_request_complete - helper function to complete dma request.
+ *				If a callback exists, invoke the callback.
+ *
+ * @req - dma request context
+ */
+static inline void dmaengine_request_complete(struct dma_request *req)
+{
+	if (req->rq_private)
+		complete(req->rq_private);
+	else if (req->callback)
+		req->callback(req->callback_param, &req->result);
+}
+
 #ifdef CONFIG_DMA_ENGINE_REQUEST
 struct dma_request *dma_chan_alloc_request(struct dma_chan *chan);
 void dma_chan_free_request(struct dma_chan *chan, struct dma_request *rq);
@@ -1454,7 +1519,9 @@ static inline int dmaengine_desc_free(struct dma_async_tx_descriptor *desc)
 /* --- DMA device --- */
 
 int dma_async_device_register(struct dma_device *device);
+int dma_async_request_device_register(struct dma_device *device);
 int dmaenginem_async_device_register(struct dma_device *device);
+int dmaenginem_async_request_device_register(struct dma_device *device);
 void dma_async_device_unregister(struct dma_device *device);
 int dma_async_device_channel_register(struct dma_device *device,
 				      struct dma_chan *chan);



* [PATCH RFC 07/14] dmaengine: update dmatest to support dma request
  2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
                   ` (5 preceding siblings ...)
  2019-11-20 21:24 ` [PATCH RFC 06/14] dmaengine: add dma request submit and completion path support Dave Jiang
@ 2019-11-20 21:24 ` Dave Jiang
  2019-11-20 21:24 ` [PATCH RFC 08/14] dmaengine: idxd: Init and probe for Intel data accelerators Dave Jiang
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:24 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

Add support for the new DMA request API to dmatest so that new drivers
that use the new functionality can be tested. The existing DMA setup code
has been split out into its own function.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/dmatest.c |  366 ++++++++++++++++++++++++++++++++-----------------
 1 file changed, 241 insertions(+), 125 deletions(-)

diff --git a/drivers/dma/dmatest.c b/drivers/dma/dmatest.c
index a2cadfa2e6d7..a544f05b0fd7 100644
--- a/drivers/dma/dmatest.c
+++ b/drivers/dma/dmatest.c
@@ -535,6 +535,240 @@ static int dmatest_alloc_test_data(struct dmatest_data *d,
 	return -ENOMEM;
 }
 
+static int dma_test_req_op(struct dmatest_thread *thread,
+			   struct dmatest_data *src, struct dmatest_data *dst,
+			   int total_tests, unsigned int len)
+{
+	struct dmatest_info *info = thread->info;
+	struct dmatest_params *params = &info->params;
+	struct dma_chan *chan = thread->chan;
+	struct dma_device *dma_dev = chan->device;
+	struct device *dev = dma_dev->dev;
+	enum dma_ctrl_flags flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
+	int ret = -ENOMEM;
+	struct dma_request *rq;
+	struct scatterlist *sg;
+	void *vdst = dst->aligned[0] + dst->off;
+	int req_retry = 100;
+
+	do {
+		rq = dma_chan_alloc_request(chan);
+		if (!rq)
+			msleep(20);
+	} while (!rq && req_retry--);
+
+	if (!rq) {
+		result("get request", total_tests, src->off, dst->off,
+		       len, ret);
+		return -ENXIO;
+	}
+
+	if (thread->type == DMA_MEMCPY) {
+		rq->cmd = DMA_MEMCPY;
+	} else {
+		dma_chan_free_request(chan, rq);
+		result("wrong thread type", total_tests, src->off, dst->off,
+		       len, ret);
+		return -ENXIO;
+	}
+
+	rq->chan = chan;
+	rq->flags = flags;
+	rq->bvec.bv_page = virt_to_page(vdst);
+	rq->bvec.bv_offset = offset_in_page(vdst);
+	rq->pg_dma = dma_map_page(dev, rq->bvec.bv_page, rq->bvec.bv_offset,
+				  len, DMA_FROM_DEVICE);
+	if (rq->pg_dma == DMA_MAPPING_ERROR) {
+		dma_chan_free_request(chan, rq);
+		result("DMA map dest", total_tests, src->off, dst->off,
+		       len, ret);
+		msleep(100);
+		return -ENXIO;
+	}
+
+	rq->bvec.bv_len = len;
+	sg = &rq->sg[0];
+	sg_init_one(sg, src->aligned[0] + src->off, len);
+	rq->sg_nents = 1;
+
+	ret = dma_map_sg(dev, sg, 1, DMA_TO_DEVICE);
+	if (ret == 0) {
+		dma_unmap_page(dev, rq->pg_dma, len, DMA_FROM_DEVICE);
+		dma_chan_free_request(chan, rq);
+		result("DMA map src", total_tests, src->off, dst->off,
+		       len, ret);
+		return -ENXIO;
+	}
+
+	ret = dmaengine_submit_request_and_wait(chan, rq,
+						msecs_to_jiffies(params->timeout));
+	if (ret < 0) {
+		dma_unmap_page(dev, rq->pg_dma, len, DMA_FROM_DEVICE);
+		dma_unmap_sg(dev, sg, 1, DMA_TO_DEVICE);
+		dma_chan_free_request(chan, rq);
+		result("submit error", total_tests, src->off, dst->off,
+		       len, ret);
+		return -ENXIO;
+	}
+
+	if (ret == 0) {
+		result("test timed out", total_tests, src->off,
+		       dst->off, len, 0);
+		ret = -ETIMEDOUT;
+		goto out_unmap;
+	} else if (rq->result.result != DMA_TRANS_NOERROR) {
+		result("completion error", total_tests, src->off,
+		       dst->off, len, ret);
+		ret = -ENXIO;
+		goto out_unmap;
+	}
+
+ out_unmap:
+	dma_unmap_page(dev, rq->pg_dma, len, DMA_FROM_DEVICE);
+	dma_unmap_sg(dev, sg, 1, DMA_TO_DEVICE);
+	dma_chan_free_request(chan, rq);
+
+	return ret;
+}
+
+static int dma_test_op(struct dmatest_thread *thread,
+		       struct dmatest_data *src, struct dmatest_data *dst,
+		       dma_addr_t *srcs, dma_addr_t *dma_pq,
+		       int total_tests, unsigned int len, u8 *pq_coefs)
+{
+	struct dma_async_tx_descriptor *tx = NULL;
+	struct dmaengine_unmap_data *um;
+	struct dmatest_info *info = thread->info;
+	struct dmatest_params *params = &info->params;
+	struct dma_chan *chan = thread->chan;
+	struct dma_device *dev = chan->device;
+	struct dmatest_done *done = &thread->test_done;
+	enum dma_ctrl_flags flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
+	dma_cookie_t cookie;
+	dma_addr_t *dsts;
+	enum dma_status status;
+	int ret = -ENOMEM, i;
+
+	um = dmaengine_get_unmap_data(dev->dev, src->cnt + dst->cnt,
+				      GFP_KERNEL);
+	if (!um) {
+		result("unmap data NULL", total_tests,
+		       src->off, dst->off, len, ret);
+		return -ENXIO;
+	}
+
+	um->len = len;
+	for (i = 0; i < src->cnt; i++) {
+		void *buf = src->aligned[i];
+		struct page *pg = virt_to_page(buf);
+		unsigned long pg_off = offset_in_page(buf);
+
+		um->addr[i] = dma_map_page(dev->dev, pg, pg_off,
+					   um->len, DMA_TO_DEVICE);
+		srcs[i] = um->addr[i] + src->off;
+		ret = dma_mapping_error(dev->dev, um->addr[i]);
+		if (ret) {
+			result("src mapping error", total_tests,
+			       src->off, dst->off, len, ret);
+			goto error_unmap;
+		}
+		um->to_cnt++;
+	}
+	/* map with DMA_BIDIRECTIONAL to force writeback/invalidate */
+	dsts = &um->addr[src->cnt];
+	for (i = 0; i < dst->cnt; i++) {
+		void *buf = dst->aligned[i];
+		struct page *pg = virt_to_page(buf);
+		unsigned long pg_off = offset_in_page(buf);
+
+		dsts[i] = dma_map_page(dev->dev, pg, pg_off, um->len,
+				       DMA_BIDIRECTIONAL);
+		ret = dma_mapping_error(dev->dev, dsts[i]);
+		if (ret) {
+			result("dst mapping error", total_tests,
+			       src->off, dst->off, len, ret);
+			goto error_unmap;
+		}
+		um->bidi_cnt++;
+	}
+
+	if (thread->type == DMA_MEMCPY)
+		tx = dev->device_prep_dma_memcpy(chan,
+						 dsts[0] + dst->off,
+						 srcs[0], len, flags);
+	else if (thread->type == DMA_MEMSET)
+		tx = dev->device_prep_dma_memset(chan,
+					dsts[0] + dst->off,
+					*(src->aligned[0] + src->off),
+					len, flags);
+	else if (thread->type == DMA_XOR)
+		tx = dev->device_prep_dma_xor(chan,
+					      dsts[0] + dst->off,
+					      srcs, src->cnt,
+					      len, flags);
+	else if (thread->type == DMA_PQ) {
+		for (i = 0; i < dst->cnt; i++)
+			dma_pq[i] = dsts[i] + dst->off;
+		tx = dev->device_prep_dma_pq(chan, dma_pq, srcs,
+					     src->cnt, pq_coefs,
+					     len, flags);
+	}
+
+	if (!tx) {
+		result("prep error", total_tests, src->off,
+		       dst->off, len, ret);
+		msleep(100);
+		goto error_unmap;
+	}
+
+	done->done = false;
+	if (!params->polled) {
+		tx->callback = dmatest_callback;
+		tx->callback_param = done;
+	}
+	cookie = tx->tx_submit(tx);
+
+	if (dma_submit_error(cookie)) {
+		result("submit error", total_tests, src->off,
+		       dst->off, len, ret);
+		msleep(100);
+		goto error_unmap;
+	}
+
+	if (params->polled) {
+		status = dma_sync_wait(chan, cookie);
+		dmaengine_terminate_sync(chan);
+		if (status == DMA_COMPLETE)
+			done->done = true;
+	} else {
+		dma_async_issue_pending(chan);
+
+		wait_event_freezable_timeout(thread->done_wait, done->done,
+					     msecs_to_jiffies(params->timeout));
+
+		status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);
+	}
+
+	if (!done->done) {
+		result("test timed out", total_tests, src->off, dst->off,
+		       len, 0);
+		goto error_unmap;
+	} else if (status != DMA_COMPLETE) {
+		result(status == DMA_ERROR ?
+		       "completion error status" :
+		       "completion busy status", total_tests, src->off,
+		       dst->off, len, ret);
+		goto error_unmap;
+	}
+
+	dmaengine_unmap_put(um);
+	return 0;
+
+ error_unmap:
+	dmaengine_unmap_put(um);
+	return -ENXIO;
+}
+
 /*
  * This function repeatedly tests DMA transfers of various lengths and
  * offsets for a given operation type until it is told to exit by
@@ -552,7 +786,6 @@ static int dmatest_alloc_test_data(struct dmatest_data *d,
 static int dmatest_func(void *data)
 {
 	struct dmatest_thread	*thread = data;
-	struct dmatest_done	*done = &thread->test_done;
 	struct dmatest_info	*info;
 	struct dmatest_params	*params;
 	struct dma_chan		*chan;
@@ -560,8 +793,6 @@ static int dmatest_func(void *data)
 	unsigned int		error_count;
 	unsigned int		failed_tests = 0;
 	unsigned int		total_tests = 0;
-	dma_cookie_t		cookie;
-	enum dma_status		status;
 	enum dma_ctrl_flags 	flags;
 	u8			*pq_coefs = NULL;
 	int			ret;
@@ -664,9 +895,6 @@ static int dmatest_func(void *data)
 	ktime = ktime_get();
 	while (!kthread_should_stop()
 	       && !(params->iterations && total_tests >= params->iterations)) {
-		struct dma_async_tx_descriptor *tx = NULL;
-		struct dmaengine_unmap_data *um;
-		dma_addr_t *dsts;
 		unsigned int len;
 
 		total_tests++;
@@ -714,123 +942,17 @@ static int dmatest_func(void *data)
 			filltime = ktime_add(filltime, diff);
 		}
 
-		um = dmaengine_get_unmap_data(dev->dev, src->cnt + dst->cnt,
-					      GFP_KERNEL);
-		if (!um) {
+		if (dev->device_submit_request)
+			ret = dma_test_req_op(thread, src, dst, total_tests,
+					      len);
+		else
+			ret = dma_test_op(thread, src, dst, srcs,
+					  dma_pq, total_tests, len, pq_coefs);
+		if (ret < 0) {
 			failed_tests++;
-			result("unmap data NULL", total_tests,
-			       src->off, dst->off, len, ret);
 			continue;
 		}
 
-		um->len = buf_size;
-		for (i = 0; i < src->cnt; i++) {
-			void *buf = src->aligned[i];
-			struct page *pg = virt_to_page(buf);
-			unsigned long pg_off = offset_in_page(buf);
-
-			um->addr[i] = dma_map_page(dev->dev, pg, pg_off,
-						   um->len, DMA_TO_DEVICE);
-			srcs[i] = um->addr[i] + src->off;
-			ret = dma_mapping_error(dev->dev, um->addr[i]);
-			if (ret) {
-				result("src mapping error", total_tests,
-				       src->off, dst->off, len, ret);
-				goto error_unmap_continue;
-			}
-			um->to_cnt++;
-		}
-		/* map with DMA_BIDIRECTIONAL to force writeback/invalidate */
-		dsts = &um->addr[src->cnt];
-		for (i = 0; i < dst->cnt; i++) {
-			void *buf = dst->aligned[i];
-			struct page *pg = virt_to_page(buf);
-			unsigned long pg_off = offset_in_page(buf);
-
-			dsts[i] = dma_map_page(dev->dev, pg, pg_off, um->len,
-					       DMA_BIDIRECTIONAL);
-			ret = dma_mapping_error(dev->dev, dsts[i]);
-			if (ret) {
-				result("dst mapping error", total_tests,
-				       src->off, dst->off, len, ret);
-				goto error_unmap_continue;
-			}
-			um->bidi_cnt++;
-		}
-
-		if (thread->type == DMA_MEMCPY)
-			tx = dev->device_prep_dma_memcpy(chan,
-							 dsts[0] + dst->off,
-							 srcs[0], len, flags);
-		else if (thread->type == DMA_MEMSET)
-			tx = dev->device_prep_dma_memset(chan,
-						dsts[0] + dst->off,
-						*(src->aligned[0] + src->off),
-						len, flags);
-		else if (thread->type == DMA_XOR)
-			tx = dev->device_prep_dma_xor(chan,
-						      dsts[0] + dst->off,
-						      srcs, src->cnt,
-						      len, flags);
-		else if (thread->type == DMA_PQ) {
-			for (i = 0; i < dst->cnt; i++)
-				dma_pq[i] = dsts[i] + dst->off;
-			tx = dev->device_prep_dma_pq(chan, dma_pq, srcs,
-						     src->cnt, pq_coefs,
-						     len, flags);
-		}
-
-		if (!tx) {
-			result("prep error", total_tests, src->off,
-			       dst->off, len, ret);
-			msleep(100);
-			goto error_unmap_continue;
-		}
-
-		done->done = false;
-		if (!params->polled) {
-			tx->callback = dmatest_callback;
-			tx->callback_param = done;
-		}
-		cookie = tx->tx_submit(tx);
-
-		if (dma_submit_error(cookie)) {
-			result("submit error", total_tests, src->off,
-			       dst->off, len, ret);
-			msleep(100);
-			goto error_unmap_continue;
-		}
-
-		if (params->polled) {
-			status = dma_sync_wait(chan, cookie);
-			dmaengine_terminate_sync(chan);
-			if (status == DMA_COMPLETE)
-				done->done = true;
-		} else {
-			dma_async_issue_pending(chan);
-
-			wait_event_freezable_timeout(thread->done_wait,
-					done->done,
-					msecs_to_jiffies(params->timeout));
-
-			status = dma_async_is_tx_complete(chan, cookie, NULL,
-							  NULL);
-		}
-
-		if (!done->done) {
-			result("test timed out", total_tests, src->off, dst->off,
-			       len, 0);
-			goto error_unmap_continue;
-		} else if (status != DMA_COMPLETE) {
-			result(status == DMA_ERROR ?
-			       "completion error status" :
-			       "completion busy status", total_tests, src->off,
-			       dst->off, len, ret);
-			goto error_unmap_continue;
-		}
-
-		dmaengine_unmap_put(um);
-
 		if (params->noverify) {
 			verbose_result("test passed", total_tests, src->off,
 				       dst->off, len, 0);
@@ -871,12 +993,6 @@ static int dmatest_func(void *data)
 			verbose_result("test passed", total_tests, src->off,
 				       dst->off, len, 0);
 		}
-
-		continue;
-
-error_unmap_continue:
-		dmaengine_unmap_put(um);
-		failed_tests++;
 	}
 	ktime = ktime_sub(ktime_get(), ktime);
 	ktime = ktime_sub(ktime, comparetime);



* [PATCH RFC 08/14] dmaengine: idxd: Init and probe for Intel data accelerators
  2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
                   ` (6 preceding siblings ...)
  2019-11-20 21:24 ` [PATCH RFC 07/14] dmaengine: update dmatest to support dma request Dave Jiang
@ 2019-11-20 21:24 ` Dave Jiang
  2019-11-20 21:24 ` [PATCH RFC 09/14] dmaengine: idxd: add configuration component of driver Dave Jiang
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:24 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

The idxd driver introduces the Intel Data Streaming Accelerator [1] that
will be available on future Intel Xeon CPUs. One of the kernel access
points for the driver is the dmaengine subsystem. It will initially
provide the DMA copy service to the kernel.

The main functionality introduced with this accelerator includes shared
virtual memory (SVM) support and descriptor submission using the Intel
CPU instructions movdir64b and enqcmds. There will be additional
accelerator devices that share the same driver, with variations in
capabilities.

This commit introduces the probe and initialization component of the
driver.

[1]: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
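Note for reviewers (not part of the commit message): idxd_wq_disable() in
device.c builds its command operand as
BIT(wq->id % 16) | ((wq->id / 16) << 16), which reads as a one-hot bit
selecting the WQ within a bank of 16 in the low half and the bank index in
the high half. The standalone userspace sketch below reproduces only that
arithmetic. The helper name wq_disable_operand() is hypothetical, and the
layout is inferred from the expression in this patch rather than quoted
from the specification.

```c
#include <stdint.h>

/*
 * Hypothetical userspace helper mirroring the expression used in
 * idxd_wq_disable(): bits 15:0 hold a one-hot mask selecting the WQ
 * within its bank of 16, bits 31:16 hold the index of that bank.
 */
static uint32_t wq_disable_operand(unsigned int wq_id)
{
	uint32_t mask = UINT32_C(1) << (wq_id % 16);
	uint32_t bank = (uint32_t)(wq_id / 16) << 16;

	return mask | bank;
}
```

Under that reading, WQ 0 yields 0x00000001 and WQ 17 yields 0x00010002.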
 MAINTAINERS                  |    8 +
 drivers/dma/Kconfig          |   13 +
 drivers/dma/Makefile         |    1 
 drivers/dma/idxd/Makefile    |    2 
 drivers/dma/idxd/device.c    |  672 ++++++++++++++++++++++++++++++++++++++++++
 drivers/dma/idxd/idxd.h      |  231 ++++++++++++++
 drivers/dma/idxd/init.c      |  468 +++++++++++++++++++++++++++++
 drivers/dma/idxd/irq.c       |  156 ++++++++++
 drivers/dma/idxd/registers.h |  335 +++++++++++++++++++++
 include/uapi/linux/idxd.h    |  214 +++++++++++++
 10 files changed, 2100 insertions(+)
 create mode 100644 drivers/dma/idxd/Makefile
 create mode 100644 drivers/dma/idxd/device.c
 create mode 100644 drivers/dma/idxd/idxd.h
 create mode 100644 drivers/dma/idxd/init.c
 create mode 100644 drivers/dma/idxd/irq.c
 create mode 100644 drivers/dma/idxd/registers.h
 create mode 100644 include/uapi/linux/idxd.h

diff --git a/MAINTAINERS b/MAINTAINERS
index eb19fad370d7..4f067631e0b4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8281,6 +8281,14 @@ Q:	https://patchwork.kernel.org/project/linux-dmaengine/list/
 S:	Supported
 F:	drivers/dma/ioat*
 
+INTEL IADX DRIVER
+M:	Dave Jiang <dave.jiang@intel.com>
+L:	dmaengine@vger.kernel.org
+S:	Supported
+F:	drivers/dma/idxd/*
+F:	include/uapi/linux/idxd.h
+F:	include/linux/idxd.h
+
 INTEL IDLE DRIVER
 M:	Jacob Pan <jacob.jun.pan@linux.intel.com>
 M:	Len Brown <lenb@kernel.org>
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 8885e9d3f363..5f9419c35960 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -278,6 +278,19 @@ config INTEL_IDMA64
 	  Enable DMA support for Intel Low Power Subsystem such as found on
 	  Intel Skylake PCH.
 
+config INTEL_IDXD
+	tristate "Intel Data Accelerators support"
+	depends on PCI && X86_64
+	select DMA_ENGINE
+	select SBITMAP
+	help
+	  Enable support for the Intel(R) data accelerators present
+	  in Intel Xeon CPUs.
+
+	  Say Y if you have such a platform.
+
+	  If unsure, say N.
+
 config INTEL_IOATDMA
 	tristate "Intel I/OAT DMA support"
 	depends on PCI && X86_64
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index 205f343e39fe..dbe74af8d709 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -42,6 +42,7 @@ obj-$(CONFIG_IMX_DMA) += imx-dma.o
 obj-$(CONFIG_IMX_SDMA) += imx-sdma.o
 obj-$(CONFIG_INTEL_IDMA64) += idma64.o
 obj-$(CONFIG_INTEL_IOATDMA) += ioat/
+obj-$(CONFIG_INTEL_IDXD) += idxd/
 obj-$(CONFIG_INTEL_IOP_ADMA) += iop-adma.o
 obj-$(CONFIG_INTEL_MIC_X100_DMA) += mic_x100_dma.o
 obj-$(CONFIG_K3_DMA) += k3dma.o
diff --git a/drivers/dma/idxd/Makefile b/drivers/dma/idxd/Makefile
new file mode 100644
index 000000000000..0dd1ca77513f
--- /dev/null
+++ b/drivers/dma/idxd/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_INTEL_IDXD) += idxd.o
+idxd-y := init.o irq.o device.o
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
new file mode 100644
index 000000000000..88739c11e163
--- /dev/null
+++ b/drivers/dma/idxd/device.c
@@ -0,0 +1,672 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <uapi/linux/idxd.h>
+#include "idxd.h"
+#include "registers.h"
+
+static int idxd_cmd_wait(struct idxd_device *idxd, u32 *status, int timeout);
+static int idxd_cmd_send(struct idxd_device *idxd, int cmd_code, u32 operand);
+
+/* Interrupt control bits */
+int idxd_mask_msix_vector(struct idxd_device *idxd, int vec_id)
+{
+	struct pci_dev *pdev = idxd->pdev;
+	int msixcnt = pci_msix_vec_count(pdev);
+	union msix_perm perm;
+	u32 offset;
+
+	if (vec_id < 0 || vec_id >= msixcnt)
+		return -EINVAL;
+
+	offset = idxd->msix_perm_offset + vec_id * 8;
+	perm.bits = ioread32(idxd->reg_base + offset);
+	perm.ignore = 1;
+	iowrite32(perm.bits, idxd->reg_base + offset);
+
+	return 0;
+}
+
+void idxd_mask_msix_vectors(struct idxd_device *idxd)
+{
+	struct pci_dev *pdev = idxd->pdev;
+	int msixcnt = pci_msix_vec_count(pdev);
+	int i, rc;
+
+	for (i = 0; i < msixcnt; i++) {
+		rc = idxd_mask_msix_vector(idxd, i);
+		if (rc < 0)
+			dev_warn(&pdev->dev,
+				 "Failed disabling msix vec %d\n", i);
+	}
+}
+
+int idxd_unmask_msix_vector(struct idxd_device *idxd, int vec_id)
+{
+	struct pci_dev *pdev = idxd->pdev;
+	int msixcnt = pci_msix_vec_count(pdev);
+	union msix_perm perm;
+	u32 offset;
+
+	if (vec_id < 0 || vec_id >= msixcnt)
+		return -EINVAL;
+
+	offset = idxd->msix_perm_offset + vec_id * 8;
+	perm.bits = ioread32(idxd->reg_base + offset);
+	perm.ignore = 0;
+	iowrite32(perm.bits, idxd->reg_base + offset);
+
+	return 0;
+}
+
+void idxd_unmask_error_interrupts(struct idxd_device *idxd)
+{
+	union genctrl_reg genctrl;
+
+	genctrl.bits = ioread32(idxd->reg_base + IDXD_GENCTRL_OFFSET);
+	genctrl.softerr_int_en = 1;
+	iowrite32(genctrl.bits, idxd->reg_base + IDXD_GENCTRL_OFFSET);
+}
+
+void idxd_mask_error_interrupts(struct idxd_device *idxd)
+{
+	union genctrl_reg genctrl;
+
+	genctrl.bits = ioread32(idxd->reg_base + IDXD_GENCTRL_OFFSET);
+	genctrl.softerr_int_en = 0;
+	iowrite32(genctrl.bits, idxd->reg_base + IDXD_GENCTRL_OFFSET);
+}
+
+static void free_hw_descs(struct idxd_wq *wq)
+{
+	int i;
+
+	for (i = 0; i < wq->num_descs; i++)
+		kfree(wq->hw_descs[i]);
+
+	kfree(wq->hw_descs);
+}
+
+static int alloc_hw_descs(struct idxd_wq *wq, int num)
+{
+	struct device *dev = &wq->idxd->pdev->dev;
+	int i;
+	int node = dev_to_node(dev);
+
+	wq->hw_descs = kcalloc_node(num, sizeof(struct dsa_hw_desc *),
+				    GFP_KERNEL, node);
+	if (!wq->hw_descs)
+		return -ENOMEM;
+
+	for (i = 0; i < num; i++) {
+		wq->hw_descs[i] = kzalloc_node(sizeof(*wq->hw_descs[i]),
+					       GFP_KERNEL, node);
+		if (!wq->hw_descs[i]) {
+			free_hw_descs(wq);
+			return -ENOMEM;
+		}
+	}
+
+	return 0;
+}
+
+static void free_descs(struct idxd_wq *wq)
+{
+	int i;
+
+	for (i = 0; i < wq->num_descs; i++)
+		kfree(wq->descs[i]);
+
+	kfree(wq->descs);
+}
+
+static int alloc_descs(struct idxd_wq *wq, int num)
+{
+	struct device *dev = &wq->idxd->pdev->dev;
+	int i;
+	int node = dev_to_node(dev);
+
+	wq->descs = kcalloc_node(num, sizeof(struct idxd_desc *),
+				 GFP_KERNEL, node);
+	if (!wq->descs)
+		return -ENOMEM;
+
+	for (i = 0; i < num; i++) {
+		wq->descs[i] = kzalloc_node(sizeof(*wq->descs[i]),
+					    GFP_KERNEL, node);
+		if (!wq->descs[i]) {
+			free_descs(wq);
+			return -ENOMEM;
+		}
+	}
+
+	return 0;
+}
+
+/* WQ control bits */
+int idxd_wq_alloc_resources(struct idxd_wq *wq)
+{
+	struct idxd_device *idxd = wq->idxd;
+	struct idxd_group *group = wq->group;
+	struct device *dev = &idxd->pdev->dev;
+	int rc, num_descs, i;
+
+	num_descs = wq->size +
+		idxd->hw.gen_cap.max_descs_per_engine * group->num_engines;
+	wq->num_descs = num_descs;
+
+	rc = alloc_hw_descs(wq, num_descs);
+	if (rc < 0)
+		return rc;
+
+	wq->compls_size = num_descs * sizeof(struct dsa_completion_record);
+	wq->compls = dma_alloc_coherent(dev, wq->compls_size,
+					&wq->compls_addr, GFP_KERNEL);
+	if (!wq->compls) {
+		rc = -ENOMEM;
+		goto fail_alloc_compls;
+	}
+
+	rc = alloc_descs(wq, num_descs);
+	if (rc < 0)
+		goto fail_alloc_descs;
+
+	wq->batch_size = min(IDXD_ALLOCATED_BATCH_SIZE, idxd->max_batch_size);
+	wq->batches_size = num_descs * sizeof(struct dsa_hw_desc)
+		* wq->batch_size;
+	wq->batches = dma_alloc_coherent(dev, wq->batches_size,
+					 &wq->batches_addr, GFP_KERNEL);
+	if (!wq->batches) {
+		rc = -ENOMEM;
+		goto fail_batch_alloc;
+	}
+
+	rc = sbitmap_init_node(&wq->sbmap, num_descs, -1, GFP_KERNEL,
+			       dev_to_node(dev));
+	if (rc < 0)
+		goto fail_sbitmap_init;
+
+	for (i = 0; i < num_descs; i++) {
+		struct idxd_desc *desc = wq->descs[i];
+
+		desc->hw = wq->hw_descs[i];
+		desc->completion = &wq->compls[i];
+		desc->compl_dma  = wq->compls_addr +
+			sizeof(struct dsa_completion_record) * i;
+		desc->batch = &wq->batches[i * wq->batch_size];
+		desc->batch_dma = wq->batches_addr +
+			sizeof(struct dsa_hw_desc) * wq->batch_size * i;
+		desc->id = i;
+		desc->wq = wq;
+	}
+
+	return 0;
+
+ fail_sbitmap_init:
+	dma_free_coherent(dev, wq->batches_size, wq->batches, wq->batches_addr);
+ fail_batch_alloc:
+	free_descs(wq);
+ fail_alloc_descs:
+	dma_free_coherent(dev, wq->compls_size, wq->compls, wq->compls_addr);
+ fail_alloc_compls:
+	free_hw_descs(wq);
+	return rc;
+}
+
+void idxd_wq_free_resources(struct idxd_wq *wq)
+{
+	struct device *dev = &wq->idxd->pdev->dev;
+
+	free_hw_descs(wq);
+	free_descs(wq);
+	dma_free_coherent(dev, wq->compls_size, wq->compls, wq->compls_addr);
+	sbitmap_free(&wq->sbmap);
+}
+
+int idxd_wq_enable(struct idxd_wq *wq)
+{
+	struct idxd_device *idxd = wq->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	u32 status;
+	int rc;
+
+	lockdep_assert_held(&idxd->dev_lock);
+
+	if (wq->state == IDXD_WQ_ENABLED) {
+		dev_dbg(dev, "WQ %d already enabled\n", wq->id);
+		return -ENXIO;
+	}
+
+	rc = idxd_cmd_send(idxd, IDXD_CMD_ENABLE_WQ, wq->id);
+	if (rc < 0)
+		return rc;
+	rc = idxd_cmd_wait(idxd, &status, IDXD_REG_TIMEOUT);
+	if (rc < 0)
+		return rc;
+
+	if (status != IDXD_CMDSTS_SUCCESS &&
+	    status != IDXD_CMDSTS_ERR_WQ_ENABLED) {
+		dev_dbg(dev, "WQ enable failed: %#x\n", status);
+		return -ENXIO;
+	}
+
+	wq->state = IDXD_WQ_ENABLED;
+	dev_dbg(dev, "WQ %d enabled\n", wq->id);
+	return 0;
+}
+
+int idxd_wq_disable(struct idxd_wq *wq)
+{
+	struct idxd_device *idxd = wq->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	u32 status, operand;
+	int rc;
+
+	lockdep_assert_held(&idxd->dev_lock);
+	dev_dbg(dev, "Disabling WQ %d\n", wq->id);
+
+	if (wq->state != IDXD_WQ_ENABLED) {
+		dev_dbg(dev, "WQ %d in wrong state: %d\n", wq->id, wq->state);
+		return 0;
+	}
+
+	operand = BIT(wq->id % 16) | ((wq->id / 16) << 16);
+	rc = idxd_cmd_send(idxd, IDXD_CMD_DISABLE_WQ, operand);
+	if (rc < 0)
+		return rc;
+	rc = idxd_cmd_wait(idxd, &status, IDXD_REG_TIMEOUT);
+	if (rc < 0)
+		return rc;
+
+	if (status != IDXD_CMDSTS_SUCCESS) {
+		dev_dbg(dev, "WQ disable failed: %#x\n", status);
+		return -ENXIO;
+	}
+
+	wq->state = IDXD_WQ_DISABLED;
+	dev_dbg(dev, "WQ %d disabled\n", wq->id);
+	return 0;
+}
+
+/* Device control bits */
+static inline bool idxd_is_enabled(struct idxd_device *idxd)
+{
+	union gensts_reg gensts;
+
+	gensts.bits = ioread32(idxd->reg_base + IDXD_GENSTATS_OFFSET);
+
+	if (gensts.state == IDXD_DEVICE_STATE_ENABLED)
+		return true;
+	return false;
+}
+
+static int idxd_cmd_wait(struct idxd_device *idxd, u32 *status, int timeout)
+{
+	u32 sts, to = timeout;
+
+	lockdep_assert_held(&idxd->dev_lock);
+	sts = ioread32(idxd->reg_base + IDXD_CMDSTS_OFFSET);
+	while (sts & IDXD_CMDSTS_ACTIVE && --to) {
+		cpu_relax();
+		sts = ioread32(idxd->reg_base + IDXD_CMDSTS_OFFSET);
+	}
+
+	if (to == 0 && sts & IDXD_CMDSTS_ACTIVE) {
+		dev_warn(&idxd->pdev->dev, "%s timed out!\n", __func__);
+		*status = 0;
+		return -EBUSY;
+	}
+
+	*status = sts;
+	return 0;
+}
+
+static int idxd_cmd_send(struct idxd_device *idxd, int cmd_code, u32 operand)
+{
+	union idxd_command_reg cmd;
+	int rc;
+	u32 status;
+
+	lockdep_assert_held(&idxd->dev_lock);
+	rc = idxd_cmd_wait(idxd, &status, IDXD_REG_TIMEOUT);
+	if (rc < 0)
+		return rc;
+
+	memset(&cmd, 0, sizeof(cmd));
+	cmd.cmd = cmd_code;
+	cmd.operand = operand;
+	dev_dbg(&idxd->pdev->dev, "%s: sending cmd: %#x op: %#x\n",
+		__func__, cmd_code, operand);
+	iowrite32(cmd.bits, idxd->reg_base + IDXD_CMD_OFFSET);
+
+	return 0;
+}
+
+int idxd_device_enable(struct idxd_device *idxd)
+{
+	struct device *dev = &idxd->pdev->dev;
+	int rc;
+	u32 status;
+
+	lockdep_assert_held(&idxd->dev_lock);
+	if (idxd_is_enabled(idxd)) {
+		dev_dbg(dev, "Device already enabled\n");
+		return -ENXIO;
+	}
+
+	rc = idxd_cmd_send(idxd, IDXD_CMD_ENABLE_DEVICE, 0);
+	if (rc < 0)
+		return rc;
+	rc = idxd_cmd_wait(idxd, &status, IDXD_REG_TIMEOUT);
+	if (rc < 0)
+		return rc;
+
+	/* Fail unless the command succeeded or the device was enabled */
+	if (status != IDXD_CMDSTS_SUCCESS &&
+	    status != IDXD_CMDSTS_ERR_DEV_ENABLED) {
+		dev_dbg(dev, "%s: err_code: %#x\n", __func__, status);
+		return -ENXIO;
+	}
+
+	idxd->state = IDXD_DEV_ENABLED;
+	return 0;
+}
+
+int idxd_device_disable(struct idxd_device *idxd)
+{
+	struct device *dev = &idxd->pdev->dev;
+	int rc;
+	u32 status;
+
+	lockdep_assert_held(&idxd->dev_lock);
+	if (!idxd_is_enabled(idxd)) {
+		dev_dbg(dev, "Device is not enabled\n");
+		return 0;
+	}
+
+	rc = idxd_cmd_send(idxd, IDXD_CMD_DISABLE_DEVICE, 0);
+	if (rc < 0)
+		return rc;
+	rc = idxd_cmd_wait(idxd, &status, IDXD_REG_TIMEOUT);
+	if (rc < 0)
+		return rc;
+
+	/* Fail unless the command succeeded or the device was disabled */
+	if (status != IDXD_CMDSTS_SUCCESS &&
+	    !(status & IDXD_CMDSTS_ERR_DIS_DEV_EN)) {
+		dev_dbg(dev, "%s: err_code: %#x\n", __func__, status);
+		rc = -ENXIO;
+		return rc;
+	}
+
+	idxd->state = IDXD_DEV_CONF_READY;
+	return 0;
+}
+
+int __idxd_device_reset(struct idxd_device *idxd)
+{
+	u32 status;
+	int rc;
+
+	rc = idxd_cmd_send(idxd, IDXD_CMD_RESET_DEVICE, 0);
+	if (rc < 0)
+		return rc;
+	rc = idxd_cmd_wait(idxd, &status, IDXD_REG_TIMEOUT);
+	if (rc < 0)
+		return rc;
+
+	return 0;
+}
+
+int idxd_device_reset(struct idxd_device *idxd)
+{
+	unsigned long flags;
+	int rc;
+
+	spin_lock_irqsave(&idxd->dev_lock, flags);
+	rc = __idxd_device_reset(idxd);
+	spin_unlock_irqrestore(&idxd->dev_lock, flags);
+	return rc;
+}
+
+/* Device configuration bits */
+static void idxd_group_config_write(struct idxd_group *group)
+{
+	struct idxd_device *idxd = group->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	int i;
+	u32 grpcfg_offset;
+
+	dev_dbg(dev, "Writing group %d cfg registers\n", group->id);
+
+	/* setup GRPWQCFG */
+	for (i = 0; i < 4; i++) {
+		grpcfg_offset = idxd->grpcfg_offset +
+			group->id * 64 + i * sizeof(u64);
+		iowrite64(group->grpcfg.wqs[i],
+			  idxd->reg_base + grpcfg_offset);
+		dev_dbg(dev, "GRPCFG wq[%d:%d: %#x]: %#llx\n",
+			group->id, i, grpcfg_offset,
+			ioread64(idxd->reg_base + grpcfg_offset));
+	}
+
+	/* setup GRPENGCFG */
+	grpcfg_offset = idxd->grpcfg_offset + group->id * 64 + 32;
+	iowrite64(group->grpcfg.engines, idxd->reg_base + grpcfg_offset);
+	dev_dbg(dev, "GRPCFG engs[%d: %#x]: %#llx\n", group->id,
+		grpcfg_offset, ioread64(idxd->reg_base + grpcfg_offset));
+
+	/* setup GRPFLAGS */
+	grpcfg_offset = idxd->grpcfg_offset + group->id * 64 + 40;
+	iowrite32(group->grpcfg.flags.bits, idxd->reg_base + grpcfg_offset);
+	dev_dbg(dev, "GRPFLAGS flags[%d: %#x]: %#x\n",
+		group->id, grpcfg_offset,
+		ioread32(idxd->reg_base + grpcfg_offset));
+}
+
+static int idxd_groups_config_write(struct idxd_device *idxd)
+
+{
+	union gencfg_reg reg;
+	int i;
+	struct device *dev = &idxd->pdev->dev;
+
+	/* Setup bandwidth token limit */
+	if (idxd->token_limit) {
+		reg.bits = ioread32(idxd->reg_base + IDXD_GENCFG_OFFSET);
+		reg.token_limit = idxd->token_limit;
+		iowrite32(reg.bits, idxd->reg_base + IDXD_GENCFG_OFFSET);
+	}
+
+	dev_dbg(dev, "GENCFG(%#x): %#x\n", IDXD_GENCFG_OFFSET,
+		ioread32(idxd->reg_base + IDXD_GENCFG_OFFSET));
+
+	for (i = 0; i < idxd->max_groups; i++) {
+		struct idxd_group *group = &idxd->groups[i];
+
+		idxd_group_config_write(group);
+	}
+
+	return 0;
+}
+
+static int idxd_wq_config_write(struct idxd_wq *wq)
+{
+	struct idxd_device *idxd = wq->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	u32 wq_offset;
+	int i;
+
+	if (!wq->group)
+		return 0;
+
+	memset(&wq->wqcfg, 0, sizeof(union wqcfg));
+
+	/* bytes 0-3 */
+	wq->wqcfg.wq_size = wq->size;
+
+	if (wq->size == 0) {
+		dev_warn(dev, "Incorrect work queue size: 0\n");
+		return -EINVAL;
+	}
+
+	/* bytes 4-7 */
+	wq->wqcfg.wq_thresh = wq->threshold;
+
+	/* bytes 8-11 */
+	wq->wqcfg.priv = 1; /* kernel, therefore priv */
+	wq->wqcfg.mode = 1;
+
+	wq->wqcfg.priority = wq->priority;
+
+	/* bytes 12-15 */
+	wq->wqcfg.max_xfer_shift = idxd->hw.gen_cap.max_xfer_shift;
+	wq->wqcfg.max_batch_shift = idxd->hw.gen_cap.max_batch_shift;
+
+	dev_dbg(dev, "WQ %d CFGs\n", wq->id);
+	for (i = 0; i < 8; i++) {
+		wq_offset = idxd->wqcfg_offset + wq->id * 32 + i * sizeof(u32);
+		iowrite32(wq->wqcfg.bits[i], idxd->reg_base + wq_offset);
+		dev_dbg(dev, "WQ[%d][%d][%#x]: %#x\n",
+			wq->id, i, wq_offset,
+			ioread32(idxd->reg_base + wq_offset));
+	}
+
+	return 0;
+}
+
+static int idxd_wqs_config_write(struct idxd_device *idxd)
+{
+	int i, rc;
+
+	for (i = 0; i < idxd->max_wqs; i++) {
+		struct idxd_wq *wq = &idxd->wqs[i];
+
+		rc = idxd_wq_config_write(wq);
+		if (rc < 0)
+			return rc;
+	}
+
+	return 0;
+}
+
+static void idxd_group_flags_setup(struct idxd_device *idxd)
+{
+	int i;
+
+	/* TC-A 0 and TC-B 1 should be defaults */
+	for (i = 0; i < idxd->max_groups; i++) {
+		struct idxd_group *group = &idxd->groups[i];
+
+		if (group->tc_a == -1)
+			group->grpcfg.flags.tc_a = 0;
+		else
+			group->grpcfg.flags.tc_a = group->tc_a;
+		if (group->tc_b == -1)
+			group->grpcfg.flags.tc_b = 1;
+		else
+			group->grpcfg.flags.tc_b = group->tc_b;
+		group->grpcfg.flags.use_token_limit = group->use_token_limit;
+		group->grpcfg.flags.tokens_reserved = group->tokens_reserved;
+		if (group->tokens_allowed)
+			group->grpcfg.flags.tokens_allowed =
+				group->tokens_allowed;
+		else
+			group->grpcfg.flags.tokens_allowed = idxd->max_tokens;
+	}
+}
+
+static int idxd_engines_setup(struct idxd_device *idxd)
+{
+	int i, engines = 0;
+	struct idxd_engine *eng;
+	struct idxd_group *group;
+
+	for (i = 0; i < idxd->max_groups; i++) {
+		group = &idxd->groups[i];
+		group->grpcfg.engines = 0;
+	}
+
+	for (i = 0; i < idxd->max_engines; i++) {
+		eng = &idxd->engines[i];
+		group = eng->group;
+
+		if (!group)
+			continue;
+
+		group->grpcfg.engines |= BIT(eng->id);
+		engines++;
+	}
+
+	if (!engines)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int idxd_wqs_setup(struct idxd_device *idxd)
+{
+	struct idxd_wq *wq;
+	struct idxd_group *group;
+	int i, j, configured = 0;
+	struct device *dev = &idxd->pdev->dev;
+
+	for (i = 0; i < idxd->max_groups; i++) {
+		group = &idxd->groups[i];
+		for (j = 0; j < 4; j++)
+			group->grpcfg.wqs[j] = 0;
+	}
+
+	for (i = 0; i < idxd->max_wqs; i++) {
+		wq = &idxd->wqs[i];
+		group = wq->group;
+
+		if (!wq->group)
+			continue;
+		if (!wq->size)
+			continue;
+
+		if (!wq_dedicated(wq)) {
+			dev_warn(dev, "No shared workqueue support.\n");
+			return -EINVAL;
+		}
+
+		group->grpcfg.wqs[wq->id / 64] |= BIT(wq->id % 64);
+		configured++;
+	}
+
+	if (configured == 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+int idxd_device_config(struct idxd_device *idxd)
+{
+	int rc;
+
+	lockdep_assert_held(&idxd->dev_lock);
+	rc = idxd_wqs_setup(idxd);
+	if (rc < 0)
+		return rc;
+
+	rc = idxd_engines_setup(idxd);
+	if (rc < 0)
+		return rc;
+
+	idxd_group_flags_setup(idxd);
+
+	rc = idxd_wqs_config_write(idxd);
+	if (rc < 0)
+		return rc;
+
+	rc = idxd_groups_config_write(idxd);
+	if (rc < 0)
+		return rc;
+
+	return 0;
+}
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
new file mode 100644
index 000000000000..039a3cb84214
--- /dev/null
+++ b/drivers/dma/idxd/idxd.h
@@ -0,0 +1,231 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#ifndef _IDXD_H_
+#define _IDXD_H_
+
+#include <linux/sbitmap.h>
+#include <linux/percpu-rwsem.h>
+#include <linux/wait.h>
+#include "registers.h"
+
+#define IDXD_DRIVER_VERSION	"1.00"
+
+extern struct kmem_cache *idxd_desc_pool;
+
+#define IDXD_REG_TIMEOUT	50
+#define IDXD_DRAIN_TIMEOUT	5000
+
+enum idxd_type {
+	IDXD_TYPE_UNKNOWN = -1,
+	IDXD_TYPE_DSA = 0,
+	IDXD_TYPE_MAX
+};
+
+#define IDXD_NAME_SIZE		128
+
+struct idxd_device_driver {
+	struct device_driver drv;
+};
+
+struct idxd_irq_entry {
+	struct idxd_device *idxd;
+	int id;
+	struct llist_head pending_llist;
+	struct list_head work_list;
+};
+
+struct idxd_group {
+	struct device conf_dev;
+	struct idxd_device *idxd;
+	struct grpcfg grpcfg;
+	int id;
+	int num_engines;
+	int num_wqs;
+	bool use_token_limit;
+	u8 tokens_allowed;
+	u8 tokens_reserved;
+	int tc_a;
+	int tc_b;
+};
+
+#define IDXD_MAX_PRIORITY	0xf
+
+enum idxd_wq_state {
+	IDXD_WQ_DISABLED = 0,
+	IDXD_WQ_ENABLED,
+};
+
+enum idxd_wq_flag {
+	WQ_FLAG_DEDICATED = 0,
+};
+
+enum idxd_wq_type {
+	IDXD_WQT_NONE = 0,
+	IDXD_WQT_KERNEL,
+};
+
+#define IDXD_ALLOCATED_BATCH_SIZE	128U
+#define WQ_NAME_SIZE   1024
+#define WQ_TYPE_SIZE   10
+
+struct idxd_wq {
+	void __iomem *dportal;
+	struct device conf_dev;
+	struct idxd_device *idxd;
+	int id;
+	enum idxd_wq_type type;
+	struct idxd_group *group;
+	int client_count;
+	struct mutex wq_lock;	/* mutex for workqueue */
+	u32 size;
+	u32 threshold;
+	u32 priority;
+	enum idxd_wq_state state;
+	unsigned long flags;
+	union wqcfg wqcfg;
+	atomic_t dq_count;	/* dedicated queue flow control */
+	u32 vec_ptr;		/* interrupt steering */
+	struct dsa_hw_desc **hw_descs;
+	int num_descs;
+	struct dsa_completion_record *compls;
+	dma_addr_t compls_addr;
+	int compls_size;
+	struct idxd_desc **descs;
+	struct dsa_hw_desc *batches;
+	dma_addr_t batches_addr;
+	int batches_size;
+	int batch_size;
+	struct sbitmap sbmap;
+	struct percpu_rw_semaphore submit_lock;
+	wait_queue_head_t submit_waitq;
+	char name[WQ_NAME_SIZE + 1];
+};
+
+struct idxd_engine {
+	struct device conf_dev;
+	int id;
+	struct idxd_group *group;
+	struct idxd_device *idxd;
+};
+
+/* shadow registers */
+struct idxd_hw {
+	u32 version;
+	union gen_cap_reg gen_cap;
+	union wq_cap_reg wq_cap;
+	union group_cap_reg group_cap;
+	union engine_cap_reg engine_cap;
+	struct opcap opcap;
+};
+
+enum idxd_device_state {
+	IDXD_DEV_HALTED = -1,
+	IDXD_DEV_DISABLED = 0,
+	IDXD_DEV_CONF_READY,
+	IDXD_DEV_ENABLED,
+};
+
+enum idxd_device_flag {
+	IDXD_FLAG_CONFIGURABLE = 0,
+};
+
+struct idxd_device {
+	enum idxd_type type;
+	struct device conf_dev;
+	struct list_head list;
+	struct idxd_hw hw;
+	enum idxd_device_state state;
+	unsigned long flags;
+	int id;
+
+	struct pci_dev *pdev;
+	void __iomem *reg_base;
+
+	spinlock_t dev_lock;	/* spinlock for device */
+	struct idxd_group *groups;
+	struct idxd_wq *wqs;
+	struct idxd_engine *engines;
+
+	int num_groups;
+
+	u32 msix_perm_offset;
+	u32 wqcfg_offset;
+	u32 grpcfg_offset;
+	u32 perfmon_offset;
+
+	u64 max_xfer_bytes;
+	u32 max_batch_size;
+	int max_groups;
+	int max_engines;
+	int max_tokens;
+	int max_wqs;
+	int max_wq_size;
+	int token_limit;
+
+	union sw_err_reg sw_err;
+
+	struct msix_entry *msix_entries;
+	int num_wq_irqs;
+	struct idxd_irq_entry *irq_entries;
+};
+
+/* IDXD software descriptor */
+struct idxd_desc {
+	struct dsa_hw_desc *hw;
+	dma_addr_t desc_dma;
+	struct dsa_completion_record *completion;
+	dma_addr_t compl_dma;
+	struct dsa_hw_desc *batch;
+	dma_addr_t batch_dma;
+	struct llist_node llnode;
+	struct list_head list;
+	int id;
+	struct idxd_wq *wq;
+};
+
+#define confdev_to_idxd(dev) container_of(dev, struct idxd_device, conf_dev)
+#define confdev_to_wq(dev) container_of(dev, struct idxd_wq, conf_dev)
+
+static inline bool wq_dedicated(struct idxd_wq *wq)
+{
+	return test_bit(WQ_FLAG_DEDICATED, &wq->flags);
+}
+
+static inline void idxd_set_type(struct idxd_device *idxd)
+{
+	struct pci_dev *pdev = idxd->pdev;
+
+	if (pdev->device == PCI_DEVICE_ID_INTEL_DSA_SPR0)
+		idxd->type = IDXD_TYPE_DSA;
+	else
+		idxd->type = IDXD_TYPE_UNKNOWN;
+}
+
+const char *idxd_get_dev_name(struct idxd_device *idxd);
+
+/* device interrupt control */
+irqreturn_t idxd_irq_handler(int vec, void *data);
+irqreturn_t idxd_misc_thread(int vec, void *data);
+irqreturn_t idxd_wq_thread(int irq, void *data);
+void idxd_mask_error_interrupts(struct idxd_device *idxd);
+void idxd_unmask_error_interrupts(struct idxd_device *idxd);
+void idxd_mask_msix_vectors(struct idxd_device *idxd);
+int idxd_mask_msix_vector(struct idxd_device *idxd, int vec_id);
+int idxd_unmask_msix_vector(struct idxd_device *idxd, int vec_id);
+
+/* device control */
+int idxd_device_enable(struct idxd_device *idxd);
+int idxd_device_disable(struct idxd_device *idxd);
+int idxd_device_reset(struct idxd_device *idxd);
+int __idxd_device_reset(struct idxd_device *idxd);
+void idxd_device_cleanup(struct idxd_device *idxd);
+int idxd_device_config(struct idxd_device *idxd);
+void idxd_device_wqs_clear_state(struct idxd_device *idxd);
+
+/* work queue control */
+int idxd_wq_alloc_resources(struct idxd_wq *wq);
+void idxd_wq_free_resources(struct idxd_wq *wq);
+int idxd_wq_enable(struct idxd_wq *wq);
+int idxd_wq_disable(struct idxd_wq *wq);
+
+#endif
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
new file mode 100644
index 000000000000..aeafc87b0c7e
--- /dev/null
+++ b/drivers/dma/idxd/init.c
@@ -0,0 +1,468 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/pci.h>
+#include <linux/interrupt.h>
+#include <linux/delay.h>
+#include <linux/dma-mapping.h>
+#include <linux/workqueue.h>
+#include <linux/aer.h>
+#include <linux/fs.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/device.h>
+#include <linux/idr.h>
+#include <uapi/linux/idxd.h>
+#include "registers.h"
+#include "idxd.h"
+
+MODULE_VERSION(IDXD_DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
+
+#define DRV_NAME "idxd"
+
+static struct idr idxd_idrs[IDXD_TYPE_MAX];
+static struct mutex idxd_idr_lock;
+
+static struct pci_device_id idxd_pci_tbl[] = {
+	/* DSA ver 1.0 platforms */
+	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_DSA_SPR0) },
+	{ 0, }
+};
+MODULE_DEVICE_TABLE(pci, idxd_pci_tbl);
+
+static char *idxd_name[] = {
+	"dsa",
+};
+
+const char *idxd_get_dev_name(struct idxd_device *idxd)
+{
+	return idxd_name[idxd->type];
+}
+
+static int idxd_setup_interrupts(struct idxd_device *idxd)
+{
+	struct pci_dev *pdev = idxd->pdev;
+	struct device *dev = &pdev->dev;
+	struct msix_entry *msix;
+	struct idxd_irq_entry *irq_entry;
+	int i, msixcnt;
+	int rc = 0;
+
+	msixcnt = pci_msix_vec_count(pdev);
+	if (msixcnt < 0) {
+		dev_err(dev, "Not MSI-X interrupt capable.\n");
+		return msixcnt;
+	}
+
+	idxd->msix_entries = devm_kzalloc(dev, sizeof(struct msix_entry) *
+			msixcnt, GFP_KERNEL);
+	if (!idxd->msix_entries) {
+		rc = -ENOMEM;
+		goto err_no_irq;
+	}
+
+	for (i = 0; i < msixcnt; i++)
+		idxd->msix_entries[i].entry = i;
+
+	rc = pci_enable_msix_exact(pdev, idxd->msix_entries, msixcnt);
+	if (rc) {
+		dev_err(dev, "Failed enabling %d MSIX entries.\n", msixcnt);
+		goto err_no_irq;
+	}
+	dev_dbg(dev, "Enabled %d msix vectors\n", msixcnt);
+
+	/*
+	 * We implement 1 completion list per MSI-X entry except for
+	 * entry 0, which is for errors and others.
+	 */
+	idxd->irq_entries = devm_kcalloc(dev, msixcnt,
+					 sizeof(struct idxd_irq_entry),
+					 GFP_KERNEL);
+	if (!idxd->irq_entries) {
+		rc = -ENOMEM;
+		goto err_no_irq;
+	}
+
+	for (i = 0; i < msixcnt; i++) {
+		idxd->irq_entries[i].id = i;
+		idxd->irq_entries[i].idxd = idxd;
+	}
+
+	msix = &idxd->msix_entries[0];
+	irq_entry = &idxd->irq_entries[0];
+	rc = devm_request_threaded_irq(dev, msix->vector, idxd_irq_handler,
+				       idxd_misc_thread, 0, "idxd-misc",
+				       irq_entry);
+	if (rc < 0) {
+		dev_err(dev, "Failed to allocate misc interrupt.\n");
+		goto err_no_irq;
+	}
+
+	dev_dbg(dev, "Allocated idxd-misc handler on msix vector %d\n",
+		msix->vector);
+
+	/* first MSI-X entry is not for wq interrupts */
+	idxd->num_wq_irqs = msixcnt - 1;
+
+	for (i = 1; i < msixcnt; i++) {
+		msix = &idxd->msix_entries[i];
+		irq_entry = &idxd->irq_entries[i];
+
+		init_llist_head(&idxd->irq_entries[i].pending_llist);
+		INIT_LIST_HEAD(&idxd->irq_entries[i].work_list);
+		rc = devm_request_threaded_irq(dev, msix->vector,
+					       idxd_irq_handler,
+					       idxd_wq_thread, 0,
+					       "idxd-portal", irq_entry);
+		if (rc < 0) {
+			dev_err(dev, "Failed to allocate irq %d.\n",
+				msix->vector);
+			goto err_no_irq;
+		}
+		dev_dbg(dev, "Allocated idxd-msix %d for vector %d\n",
+			i, msix->vector);
+	}
+
+	idxd_unmask_error_interrupts(idxd);
+
+	return 0;
+
+ err_no_irq:
+	/* Disable error interrupt generation */
+	idxd_mask_error_interrupts(idxd);
+	pci_disable_msix(pdev);
+	dev_err(dev, "No usable interrupts\n");
+	return rc;
+}
+
+static void idxd_wqs_free_lock(struct idxd_device *idxd)
+{
+	int i;
+
+	for (i = 0; i < idxd->max_wqs; i++) {
+		struct idxd_wq *wq = &idxd->wqs[i];
+
+		percpu_free_rwsem(&wq->submit_lock);
+	}
+}
+
+static int idxd_setup_internals(struct idxd_device *idxd)
+{
+	struct device *dev = &idxd->pdev->dev;
+	int i;
+
+	idxd->groups = devm_kcalloc(dev, idxd->max_groups,
+				    sizeof(struct idxd_group), GFP_KERNEL);
+	if (!idxd->groups)
+		return -ENOMEM;
+
+	for (i = 0; i < idxd->max_groups; i++) {
+		idxd->groups[i].idxd = idxd;
+		idxd->groups[i].id = i;
+		idxd->groups[i].tc_a = -1;
+		idxd->groups[i].tc_b = -1;
+	}
+
+	idxd->wqs = devm_kcalloc(dev, idxd->max_wqs, sizeof(struct idxd_wq),
+				 GFP_KERNEL);
+	if (!idxd->wqs)
+		return -ENOMEM;
+
+	idxd->engines = devm_kcalloc(dev, idxd->max_engines,
+				     sizeof(struct idxd_engine), GFP_KERNEL);
+	if (!idxd->engines)
+		return -ENOMEM;
+
+	for (i = 0; i < idxd->max_wqs; i++) {
+		struct idxd_wq *wq = &idxd->wqs[i];
+		int rc;
+
+		wq->id = i;
+		wq->idxd = idxd;
+		mutex_init(&wq->wq_lock);
+		atomic_set(&wq->dq_count, 0);
+		init_waitqueue_head(&wq->submit_waitq);
+		rc = percpu_init_rwsem(&wq->submit_lock);
+		if (rc < 0) {
+			idxd_wqs_free_lock(idxd);
+			return rc;
+		}
+	}
+
+	for (i = 0; i < idxd->max_engines; i++) {
+		idxd->engines[i].idxd = idxd;
+		idxd->engines[i].id = i;
+	}
+
+	return 0;
+}
+
+static void idxd_read_table_offsets(struct idxd_device *idxd)
+{
+	union offsets_reg offsets;
+	struct device *dev = &idxd->pdev->dev;
+
+	offsets.bits[0] = ioread64(idxd->reg_base + IDXD_TABLE_OFFSET);
+	offsets.bits[1] = ioread64(idxd->reg_base + IDXD_TABLE_OFFSET
+			+ sizeof(u64));
+	idxd->grpcfg_offset = offsets.grpcfg * 0x100;
+	dev_dbg(dev, "IDXD Group Config Offset: %#x\n", idxd->grpcfg_offset);
+	idxd->wqcfg_offset = offsets.wqcfg * 0x100;
+	dev_dbg(dev, "IDXD Work Queue Config Offset: %#x\n",
+		idxd->wqcfg_offset);
+	idxd->msix_perm_offset = offsets.msix_perm * 0x100;
+	dev_dbg(dev, "IDXD MSIX Permission Offset: %#x\n",
+		idxd->msix_perm_offset);
+	idxd->perfmon_offset = offsets.perfmon * 0x100;
+	dev_dbg(dev, "IDXD Perfmon Offset: %#x\n", idxd->perfmon_offset);
+}
+
+static void idxd_read_caps(struct idxd_device *idxd)
+{
+	struct device *dev = &idxd->pdev->dev;
+	int i;
+
+	/* reading generic capabilities */
+	idxd->hw.gen_cap.bits = ioread64(idxd->reg_base + IDXD_GENCAP_OFFSET);
+	dev_dbg(dev, "gen_cap: %#llx\n", idxd->hw.gen_cap.bits);
+	idxd->max_xfer_bytes = 1ULL << idxd->hw.gen_cap.max_xfer_shift;
+	dev_dbg(dev, "max xfer size: %llu bytes\n", idxd->max_xfer_bytes);
+	idxd->max_batch_size = 1U << idxd->hw.gen_cap.max_batch_shift;
+	dev_dbg(dev, "max batch size: %u\n", idxd->max_batch_size);
+	if (idxd->hw.gen_cap.config_en)
+		set_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags);
+
+	/* reading group capabilities */
+	idxd->hw.group_cap.bits =
+		ioread64(idxd->reg_base + IDXD_GRPCAP_OFFSET);
+	dev_dbg(dev, "group_cap: %#llx\n", idxd->hw.group_cap.bits);
+	idxd->max_groups = idxd->hw.group_cap.num_groups;
+	dev_dbg(dev, "max groups: %u\n", idxd->max_groups);
+	idxd->max_tokens = idxd->hw.group_cap.total_tokens;
+	dev_dbg(dev, "max tokens: %u\n", idxd->max_tokens);
+
+	/* read engine capabilities */
+	idxd->hw.engine_cap.bits =
+		ioread64(idxd->reg_base + IDXD_ENGCAP_OFFSET);
+	dev_dbg(dev, "engine_cap: %#llx\n", idxd->hw.engine_cap.bits);
+	idxd->max_engines = idxd->hw.engine_cap.num_engines;
+	dev_dbg(dev, "max engines: %u\n", idxd->max_engines);
+
+	/* read workqueue capabilities */
+	idxd->hw.wq_cap.bits = ioread64(idxd->reg_base + IDXD_WQCAP_OFFSET);
+	dev_dbg(dev, "wq_cap: %#llx\n", idxd->hw.wq_cap.bits);
+	idxd->max_wq_size = idxd->hw.wq_cap.total_wq_size;
+	dev_dbg(dev, "total workqueue size: %u\n", idxd->max_wq_size);
+	idxd->max_wqs = idxd->hw.wq_cap.num_wqs;
+	dev_dbg(dev, "max workqueues: %u\n", idxd->max_wqs);
+
+	/* reading operation capabilities */
+	for (i = 0; i < 4; i++) {
+		idxd->hw.opcap.bits[i] = ioread64(idxd->reg_base +
+				IDXD_OPCAP_OFFSET + i * sizeof(u64));
+		dev_dbg(dev, "opcap[%d]: %#llx\n", i, idxd->hw.opcap.bits[i]);
+	}
+}
+
+static struct idxd_device *idxd_alloc(struct pci_dev *pdev,
+				      void __iomem * const *iomap)
+{
+	struct device *dev = &pdev->dev;
+	struct idxd_device *idxd;
+
+	idxd = devm_kzalloc(dev, sizeof(struct idxd_device), GFP_KERNEL);
+	if (!idxd)
+		return NULL;
+
+	idxd->pdev = pdev;
+	idxd->reg_base = iomap[IDXD_MMIO_BAR];
+	spin_lock_init(&idxd->dev_lock);
+
+	return idxd;
+}
+
+static int idxd_probe(struct idxd_device *idxd)
+{
+	struct pci_dev *pdev = idxd->pdev;
+	struct device *dev = &pdev->dev;
+	int rc;
+
+	dev_dbg(dev, "%s entered and resetting device\n", __func__);
+	rc = idxd_device_reset(idxd);
+	if (rc < 0)
+		return rc;
+	dev_dbg(dev, "IDXD reset complete\n");
+
+	idxd_read_caps(idxd);
+	idxd_read_table_offsets(idxd);
+
+	rc = idxd_setup_internals(idxd);
+	if (rc)
+		goto err_setup;
+
+	rc = idxd_setup_interrupts(idxd);
+	if (rc)
+		goto err_setup;
+
+	dev_dbg(dev, "IDXD interrupt setup complete.\n");
+
+	mutex_lock(&idxd_idr_lock);
+	idxd->id = idr_alloc(&idxd_idrs[idxd->type], idxd, 0, 0, GFP_KERNEL);
+	mutex_unlock(&idxd_idr_lock);
+	if (idxd->id < 0) {
+		rc = -ENOMEM;
+		goto err_idr_fail;
+	}
+
+	dev_dbg(dev, "IDXD device %d probed successfully\n", idxd->id);
+	return 0;
+
+ err_idr_fail:
+	idxd_mask_error_interrupts(idxd);
+	idxd_mask_msix_vectors(idxd);
+ err_setup:
+	return rc;
+}
+
+static int idxd_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	void __iomem * const *iomap;
+	struct device *dev = &pdev->dev;
+	struct idxd_device *idxd;
+	int rc;
+	unsigned int mask;
+
+	/*
+	 * If the CPU does not support write512, there's no point in
+	 * enumerating the device. We can not utilize it.
+	 */
+	if (!cpu_has_write512())
+		return -ENXIO;
+
+	dev_dbg(dev, "CPU has write512 support\n");
+
+	rc = pcim_enable_device(pdev);
+	if (rc)
+		return rc;
+
+	dev_dbg(dev, "Mapping BARs\n");
+	mask = (1 << IDXD_MMIO_BAR);
+	rc = pcim_iomap_regions(pdev, mask, DRV_NAME);
+	if (rc)
+		return rc;
+
+	iomap = pcim_iomap_table(pdev);
+	if (!iomap)
+		return -ENOMEM;
+
+	dev_dbg(dev, "Set DMA masks\n");
+	rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(64));
+	if (rc)
+		rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
+	if (rc)
+		return rc;
+
+	rc = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));
+	if (rc)
+		rc = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
+	if (rc)
+		return rc;
+
+	dev_dbg(dev, "Alloc IDXD context\n");
+	idxd = idxd_alloc(pdev, iomap);
+	if (!idxd)
+		return -ENOMEM;
+
+	idxd_set_type(idxd);
+
+	dev_dbg(dev, "Set PCI master\n");
+	pci_set_master(pdev);
+	pci_set_drvdata(pdev, idxd);
+
+	idxd->hw.version = ioread32(idxd->reg_base + IDXD_VER_OFFSET);
+	rc = idxd_probe(idxd);
+	if (rc) {
+		dev_err(dev, "Intel(R) IDXD DMA Engine init failed\n");
+		return -ENODEV;
+	}
+
+	dev_info(&pdev->dev, "Intel(R) Accelerator Device (v%x)\n",
+		 idxd->hw.version);
+
+	return 0;
+}
+
+static void idxd_shutdown(struct pci_dev *pdev)
+{
+	struct idxd_device *idxd = pci_get_drvdata(pdev);
+	int rc, i;
+	struct idxd_irq_entry *irq_entry;
+	int msixcnt = pci_msix_vec_count(pdev);
+	unsigned long flags;
+
+	spin_lock_irqsave(&idxd->dev_lock, flags);
+	rc = idxd_device_disable(idxd);
+	spin_unlock_irqrestore(&idxd->dev_lock, flags);
+	if (rc)
+		dev_err(&pdev->dev, "Disabling device failed\n");
+
+	dev_dbg(&pdev->dev, "%s called\n", __func__);
+	idxd_mask_msix_vectors(idxd);
+	idxd_mask_error_interrupts(idxd);
+
+	for (i = 0; i < msixcnt; i++) {
+		irq_entry = &idxd->irq_entries[i];
+		synchronize_irq(idxd->msix_entries[i].vector);
+		if (i == 0)
+			continue;
+	}
+}
+
+static void idxd_remove(struct pci_dev *pdev)
+{
+	struct idxd_device *idxd = pci_get_drvdata(pdev);
+
+	dev_dbg(&pdev->dev, "%s called\n", __func__);
+	idxd_shutdown(pdev);
+	idxd_wqs_free_lock(idxd);
+	mutex_lock(&idxd_idr_lock);
+	idr_remove(&idxd_idrs[idxd->type], idxd->id);
+	mutex_unlock(&idxd_idr_lock);
+}
+
+static struct pci_driver idxd_pci_driver = {
+	.name		= DRV_NAME,
+	.id_table	= idxd_pci_tbl,
+	.probe		= idxd_pci_probe,
+	.remove		= idxd_remove,
+	.shutdown	= idxd_shutdown,
+};
+
+static int __init idxd_init_module(void)
+{
+	int err, i;
+
+	pr_info("%s: Intel(R) Accelerator Devices Driver %s\n",
+		DRV_NAME, IDXD_DRIVER_VERSION);
+
+	mutex_init(&idxd_idr_lock);
+	for (i = 0; i < IDXD_TYPE_MAX; i++)
+		idr_init(&idxd_idrs[i]);
+
+	err = pci_register_driver(&idxd_pci_driver);
+	if (err)
+		return err;
+
+	return 0;
+}
+module_init(idxd_init_module);
+
+static void __exit idxd_exit_module(void)
+{
+	pci_unregister_driver(&idxd_pci_driver);
+}
+module_exit(idxd_exit_module);
diff --git a/drivers/dma/idxd/irq.c b/drivers/dma/idxd/irq.c
new file mode 100644
index 000000000000..de4b80973c2f
--- /dev/null
+++ b/drivers/dma/idxd/irq.c
@@ -0,0 +1,156 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <uapi/linux/idxd.h>
+#include "idxd.h"
+#include "registers.h"
+
+void idxd_device_wqs_clear_state(struct idxd_device *idxd)
+{
+	int i;
+
+	lockdep_assert_held(&idxd->dev_lock);
+	for (i = 0; i < idxd->max_wqs; i++) {
+		struct idxd_wq *wq = &idxd->wqs[i];
+
+		wq->state = IDXD_WQ_DISABLED;
+	}
+}
+
+static int idxd_restart(struct idxd_device *idxd)
+{
+	int i, rc;
+
+	lockdep_assert_held(&idxd->dev_lock);
+
+	rc = __idxd_device_reset(idxd);
+	if (rc < 0)
+		goto out;
+
+	rc = idxd_device_config(idxd);
+	if (rc < 0)
+		goto out;
+
+	rc = idxd_device_enable(idxd);
+	if (rc < 0)
+		goto out;
+
+	for (i = 0; i < idxd->max_wqs; i++) {
+		struct idxd_wq *wq = &idxd->wqs[i];
+
+		if (wq->state == IDXD_WQ_ENABLED) {
+			rc = idxd_wq_enable(wq);
+			if (rc < 0) {
+				dev_warn(&idxd->pdev->dev,
+					 "Unable to re-enable wq %s\n",
+					 dev_name(&wq->conf_dev));
+			}
+		}
+	}
+
+	return 0;
+
+ out:
+	idxd_device_wqs_clear_state(idxd);
+	idxd->state = IDXD_DEV_HALTED;
+	return rc;
+}
+
+irqreturn_t idxd_irq_handler(int vec, void *data)
+{
+	struct idxd_irq_entry *irq_entry = data;
+	struct idxd_device *idxd = irq_entry->idxd;
+
+	idxd_mask_msix_vector(idxd, irq_entry->id);
+	return IRQ_WAKE_THREAD;
+}
+
+irqreturn_t idxd_misc_thread(int vec, void *data)
+{
+	struct idxd_irq_entry *irq_entry = data;
+	struct idxd_device *idxd = irq_entry->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	union gensts_reg gensts;
+	u32 cause, val = 0;
+	int i, rc;
+	bool err = false;
+
+	cause = ioread32(idxd->reg_base + IDXD_INTCAUSE_OFFSET);
+
+	if (cause & IDXD_INTC_ERR) {
+		spin_lock_bh(&idxd->dev_lock);
+		for (i = 0; i < 4; i++)
+			idxd->sw_err.bits[i] = ioread64(idxd->reg_base +
+					IDXD_SWERR_OFFSET + i * sizeof(u64));
+		iowrite64(IDXD_SWERR_ACK, idxd->reg_base + IDXD_SWERR_OFFSET);
+		spin_unlock_bh(&idxd->dev_lock);
+		val |= IDXD_INTC_ERR;
+
+		for (i = 0; i < 4; i++)
+			dev_warn(dev, "err[%d]: %#16.16llx\n",
+				 i, idxd->sw_err.bits[i]);
+		err = true;
+	}
+
+	if (cause & IDXD_INTC_CMD) {
+		/* Driver does not use command interrupts */
+		val |= IDXD_INTC_CMD;
+	}
+
+	if (cause & IDXD_INTC_OCCUPY) {
+		/* Driver does not utilize occupancy interrupt */
+		val |= IDXD_INTC_OCCUPY;
+	}
+
+	if (cause & IDXD_INTC_PERFMON_OVFL) {
+		/*
+		 * Driver does not utilize perfmon counter overflow interrupt
+		 * yet.
+		 */
+		val |= IDXD_INTC_PERFMON_OVFL;
+	}
+
+	val ^= cause;
+	if (val)
+		dev_warn_once(dev, "Unexpected interrupt cause bits set: %#x\n",
+			      val);
+
+	iowrite32(cause, idxd->reg_base + IDXD_INTCAUSE_OFFSET);
+	gensts.bits = 0;	/* no error: skip recovery, just unmask */
+	if (err)
+		gensts.bits = ioread32(idxd->reg_base + IDXD_GENSTATS_OFFSET);
+
+	if (gensts.state == IDXD_DEVICE_STATE_HALT) {
+		spin_lock_bh(&idxd->dev_lock);
+		if (gensts.reset_type == IDXD_DEVICE_RESET_SOFTWARE) {
+			rc = idxd_restart(idxd);
+			if (rc < 0)
+				dev_err(&idxd->pdev->dev,
+					"idxd restart failed, device halted.\n");
+		} else {
+			idxd_device_wqs_clear_state(idxd);
+			idxd->state = IDXD_DEV_HALTED;
+			dev_err(&idxd->pdev->dev,
+				"idxd halted, need %s.\n",
+				gensts.reset_type == IDXD_DEVICE_RESET_FLR ?
+				"FLR" : "system reset");
+		}
+		spin_unlock_bh(&idxd->dev_lock);
+	}
+
+	idxd_unmask_msix_vector(idxd, irq_entry->id);
+	return IRQ_HANDLED;
+}
+
+irqreturn_t idxd_wq_thread(int irq, void *data)
+{
+	struct idxd_irq_entry *irq_entry = data;
+
+	idxd_unmask_msix_vector(irq_entry->idxd, irq_entry->id);
+
+	return IRQ_HANDLED;
+}
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
new file mode 100644
index 000000000000..77275a07fa61
--- /dev/null
+++ b/drivers/dma/idxd/registers.h
@@ -0,0 +1,335 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#ifndef _IDXD_REGISTERS_H_
+#define _IDXD_REGISTERS_H_
+
+/* PCI Config */
+#define PCI_DEVICE_ID_INTEL_DSA_SPR0	0x0b25
+
+#define IDXD_MMIO_BAR		0
+#define IDXD_WQ_BAR		2
+
+/* MMIO Device BAR0 Registers */
+#define IDXD_VER_OFFSET			0x00
+#define IDXD_VER_MAJOR_MASK		0xf0
+#define IDXD_VER_MINOR_MASK		0x0f
+#define GET_IDXD_VER_MAJOR(x)		(((x) & IDXD_VER_MAJOR_MASK) >> 4)
+#define GET_IDXD_VER_MINOR(x)		((x) & IDXD_VER_MINOR_MASK)
+
+union gen_cap_reg {
+	struct {
+		u64 block_on_fault:1;
+		u64 overlap_copy:1;
+		u64 cache_control_mem:1;
+		u64 cache_control_cache:1;
+		u64 rsvd:3;
+		u64 int_handle_req:1;
+		u64 dest_readback:1;
+		u64 drain_readback:1;
+		u64 rsvd2:6;
+		u64 max_xfer_shift:5;
+		u64 max_batch_shift:4;
+		u64 max_ims_mult:6;
+		u64 config_en:1;
+		u64 max_descs_per_engine:8;
+		u64 rsvd3:24;
+	};
+	u64 bits;
+} __packed;
+#define IDXD_GENCAP_OFFSET		0x10
+
+union wq_cap_reg {
+	struct {
+		u64 total_wq_size:16;
+		u64 num_wqs:8;
+		u64 rsvd:24;
+		u64 shared_mode:1;
+		u64 dedicated_mode:1;
+		u64 rsvd2:1;
+		u64 priority:1;
+		u64 occupancy:1;
+		u64 occupancy_int:1;
+		u64 rsvd3:10;
+	};
+	u64 bits;
+} __packed;
+#define IDXD_WQCAP_OFFSET		0x20
+
+union group_cap_reg {
+	struct {
+		u64 num_groups:8;
+		u64 total_tokens:8;
+		u64 token_en:1;
+		u64 token_limit:1;
+		u64 rsvd:46;
+	};
+	u64 bits;
+} __packed;
+#define IDXD_GRPCAP_OFFSET		0x30
+
+union engine_cap_reg {
+	struct {
+		u64 num_engines:8;
+		u64 rsvd:56;
+	};
+	u64 bits;
+} __packed;
+
+#define IDXD_ENGCAP_OFFSET		0x38
+
+#define IDXD_OPCAP_NOOP			0x0001
+#define IDXD_OPCAP_BATCH			0x0002
+#define IDXD_OPCAP_MEMMOVE		0x0008
+struct opcap {
+	u64 bits[4];
+};
+
+#define IDXD_OPCAP_OFFSET		0x40
+
+#define IDXD_TABLE_OFFSET		0x60
+union offsets_reg {
+	struct {
+		u64 grpcfg:16;
+		u64 wqcfg:16;
+		u64 msix_perm:16;
+		u64 ims:16;
+		u64 perfmon:16;
+		u64 rsvd:48;
+	};
+	u64 bits[2];
+} __packed;
+
+#define IDXD_GENCFG_OFFSET		0x80
+union gencfg_reg {
+	struct {
+		u32 token_limit:8;
+		u32 rsvd:4;
+		u32 user_int_en:1;
+		u32 rsvd2:19;
+	};
+	u32 bits;
+} __packed;
+
+#define IDXD_GENCTRL_OFFSET		0x88
+union genctrl_reg {
+	struct {
+		u32 softerr_int_en:1;
+		u32 rsvd:31;
+	};
+	u32 bits;
+} __packed;
+
+#define IDXD_GENSTATS_OFFSET		0x90
+union gensts_reg {
+	struct {
+		u32 state:2;
+		u32 reset_type:2;
+		u32 rsvd:28;
+	};
+	u32 bits;
+} __packed;
+
+enum idxd_device_status_state {
+	IDXD_DEVICE_STATE_DISABLED = 0,
+	IDXD_DEVICE_STATE_ENABLED,
+	IDXD_DEVICE_STATE_DRAIN,
+	IDXD_DEVICE_STATE_HALT,
+};
+
+enum idxd_device_reset_type {
+	IDXD_DEVICE_RESET_SOFTWARE = 0,
+	IDXD_DEVICE_RESET_FLR,
+	IDXD_DEVICE_RESET_WARM,
+	IDXD_DEVICE_RESET_COLD,
+};
+
+#define IDXD_INTCAUSE_OFFSET		0x98
+#define IDXD_INTC_ERR			0x01
+#define IDXD_INTC_CMD			0x02
+#define IDXD_INTC_OCCUPY			0x04
+#define IDXD_INTC_PERFMON_OVFL		0x08
+
+#define IDXD_CMD_OFFSET			0xa0
+union idxd_command_reg {
+	struct {
+		u32 operand:20;
+		u32 cmd:5;
+		u32 rsvd:6;
+		u32 int_req:1;
+	};
+	u32 bits;
+} __packed;
+
+enum idxd_cmd {
+	IDXD_CMD_ENABLE_DEVICE = 1,
+	IDXD_CMD_DISABLE_DEVICE,
+	IDXD_CMD_DRAIN_ALL,
+	IDXD_CMD_ABORT_ALL,
+	IDXD_CMD_RESET_DEVICE,
+	IDXD_CMD_ENABLE_WQ,
+	IDXD_CMD_DISABLE_WQ,
+	IDXD_CMD_DRAIN_WQ,
+	IDXD_CMD_ABORT_WQ,
+	IDXD_CMD_RESET_WQ,
+	IDXD_CMD_DRAIN_PASID,
+	IDXD_CMD_ABORT_PASID,
+	IDXD_CMD_REQUEST_INT_HANDLE,
+};
+
+#define IDXD_CMDSTS_OFFSET		0xa8
+union cmdsts_reg {
+	struct {
+		u8 err;
+		u16 result;
+		u8 rsvd:7;
+		u8 active:1;
+	};
+	u32 bits;
+} __packed;
+#define IDXD_CMDSTS_ACTIVE		0x80000000
+
+enum idxd_cmdsts_err {
+	IDXD_CMDSTS_SUCCESS = 0,
+	IDXD_CMDSTS_INVAL_CMD,
+	IDXD_CMDSTS_INVAL_WQIDX,
+	IDXD_CMDSTS_HW_ERR,
+	/* enable device errors */
+	IDXD_CMDSTS_ERR_DEV_ENABLED = 0x10,
+	IDXD_CMDSTS_ERR_CONFIG,
+	IDXD_CMDSTS_ERR_BUSMASTER_EN,
+	IDXD_CMDSTS_ERR_PASID_INVAL,
+	IDXD_CMDSTS_ERR_WQ_SIZE_ERANGE,
+	IDXD_CMDSTS_ERR_GRP_CONFIG,
+	IDXD_CMDSTS_ERR_GRP_CONFIG2,
+	IDXD_CMDSTS_ERR_GRP_CONFIG3,
+	IDXD_CMDSTS_ERR_GRP_CONFIG4,
+	/* enable wq errors */
+	IDXD_CMDSTS_ERR_DEV_NOTEN = 0x20,
+	IDXD_CMDSTS_ERR_WQ_ENABLED,
+	IDXD_CMDSTS_ERR_WQ_SIZE,
+	IDXD_CMDSTS_ERR_WQ_PRIOR,
+	IDXD_CMDSTS_ERR_WQ_MODE,
+	IDXD_CMDSTS_ERR_BOF_EN,
+	IDXD_CMDSTS_ERR_PASID_EN,
+	IDXD_CMDSTS_ERR_MAX_BATCH_SIZE,
+	IDXD_CMDSTS_ERR_MAX_XFER_SIZE,
+	/* disable device errors */
+	IDXD_CMDSTS_ERR_DIS_DEV_EN = 0x31,
+	/* disable WQ, drain WQ, abort WQ, reset WQ */
+	IDXD_CMDSTS_ERR_DEV_NOT_EN,
+	/* request interrupt handle */
+	IDXD_CMDSTS_ERR_INVAL_INT_IDX = 0x41,
+	IDXD_CMDSTS_ERR_NO_HANDLE,
+};
+
+#define IDXD_SWERR_OFFSET		0xc0
+#define IDXD_SWERR_VALID			0x00000001
+#define IDXD_SWERR_OVERFLOW		0x00000002
+#define IDXD_SWERR_ACK			(IDXD_SWERR_VALID | IDXD_SWERR_OVERFLOW)
+union sw_err_reg {
+	struct {
+		u64 valid:1;
+		u64 overflow:1;
+		u64 desc_valid:1;
+		u64 wq_idx_valid:1;
+		u64 batch:1;
+		u64 fault_rw:1;
+		u64 priv:1;
+		u64 rsvd:1;
+		u64 error:8;
+		u64 wq_idx:8;
+		u64 rsvd2:8;
+		u64 operation:8;
+		u64 pasid:20;
+		u64 rsvd3:4;
+
+		u64 batch_idx:16;
+		u64 rsvd4:16;
+		u64 invalid_flags:32;
+
+		u64 fault_addr;
+
+		u64 rsvd5;
+	};
+	u64 bits[4];
+};
+
+union msix_perm {
+	struct {
+		u32 rsvd:2;
+		u32 ignore:1;
+		u32 pasid_en:1;
+		u32 rsvd2:8;
+		u32 pasid:20;
+	};
+	u32 bits;
+};
+
+union group_flags {
+	struct {
+		u32 tc_a:3;
+		u32 tc_b:3;
+		u32 rsvd:1;
+		u32 use_token_limit:1;
+		u32 tokens_reserved:8;
+		u32 rsvd2:4;
+		u32 tokens_allowed:8;
+		u32 rsvd3:4;
+	};
+	u32 bits;
+} __packed;
+
+struct grpcfg {
+	u64 wqs[4];
+	u64 engines;
+	union group_flags flags;
+} __packed;
+
+union wqcfg {
+	struct {
+		/* bytes 0-3 */
+		u16 wq_size;
+		u16 rsvd;
+
+		/* bytes 4-7 */
+		u16 wq_thresh;
+		u16 rsvd1;
+
+		/* bytes 8-11 */
+		u32 mode:1;	/* shared or dedicated */
+		u32 bof:1;	/* block on fault */
+		u32 rsvd2:2;
+		u32 priority:4;
+		u32 pasid:20;
+		u32 pasid_en:1;
+		u32 priv:1;
+		u32 rsvd3:2;
+
+		/* bytes 12-15 */
+		u32 max_xfer_shift:5;
+		u32 max_batch_shift:4;
+		u32 rsvd4:23;
+
+		/* bytes 16-19 */
+		u16 occupancy_inth;
+		u16 occupancy_table_sel:1;
+		u16 rsvd5:15;
+
+		/* bytes 20-23 */
+		u16 occupancy_limit;
+		u16 occupancy_int_en:1;
+		u16 rsvd6:15;
+
+		/* bytes 24-27 */
+		u16 occupancy;
+		u16 occupancy_int:1;
+		u16 rsvd7:12;
+		u16 mode_support:1;
+		u16 wq_state:2;
+
+		/* bytes 28-31 */
+		u32 rsvd8;
+	};
+	u32 bits[8];
+} __packed;
+#endif
diff --git a/include/uapi/linux/idxd.h b/include/uapi/linux/idxd.h
new file mode 100644
index 000000000000..d562842cc4e2
--- /dev/null
+++ b/include/uapi/linux/idxd.h
@@ -0,0 +1,214 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#ifndef _USR_IDXD_H_
+#define _USR_IDXD_H_
+
+#include <linux/types.h>
+
+/* Descriptor flags */
+#define IDXD_OP_FLAG_FENCE	0x0001
+#define IDXD_OP_FLAG_BOF	0x0002
+#define IDXD_OP_FLAG_CRAV	0x0004
+#define IDXD_OP_FLAG_RCR	0x0008
+#define IDXD_OP_FLAG_RCI	0x0010
+#define IDXD_OP_FLAG_CRSTS	0x0020
+#define IDXD_OP_FLAG_CR		0x0080
+#define IDXD_OP_FLAG_CC		0x0100
+#define IDXD_OP_FLAG_ADDR1_TCS	0x0200
+#define IDXD_OP_FLAG_ADDR2_TCS	0x0400
+#define IDXD_OP_FLAG_ADDR3_TCS	0x0800
+#define IDXD_OP_FLAG_CR_TCS	0x1000
+#define IDXD_OP_FLAG_STORD	0x2000
+#define IDXD_OP_FLAG_DRDBK	0x4000
+#define IDXD_OP_FLAG_DSTS	0x8000
+
+/* Opcode */
+enum dsa_opcode {
+	DSA_OPCODE_NOOP = 0,
+	DSA_OPCODE_BATCH,
+	DSA_OPCODE_DRAIN,
+	DSA_OPCODE_MEMMOVE,
+	DSA_OPCODE_MEMFILL,
+	DSA_OPCODE_COMPARE,
+	DSA_OPCODE_COMPVAL,
+	DSA_OPCODE_CR_DELTA,
+	DSA_OPCODE_AP_DELTA,
+	DSA_OPCODE_DUALCAST,
+	DSA_OPCODE_CRCGEN,
+	DSA_OPCODE_COPY_CRC,
+	DSA_OPCODE_DIF_INS,
+	DSA_OPCODE_DIF_STRP,
+	DSA_OPCODE_DIF_UPDT,
+	DSA_OPCODE_CFLUSH = 0x20,
+};
+
+/* Completion record status */
+enum dsa_completion_status {
+	DSA_COMP_NONE = 0,
+	DSA_COMP_SUCCESS,
+	DSA_COMP_SUCCESS_PRED,
+	DSA_COMP_PAGE_FAULT_NOBOF,
+	DSA_COMP_PAGE_FAULT_IR,
+	DSA_COMP_BATCH_FAIL,
+	DSA_COMP_BATCH_PAGE_FAULT,
+	DSA_COMP_BAD_OPCODE = 0x10,
+	DSA_COMP_INVALID_FLAGS,
+	DSA_COMP_NOZERO_RESERVE,
+	DSA_COMP_XFER_ERANGE,
+	DSA_COMP_DESC_CNT_ERANGE,
+	DSA_COMP_DESCLIST_ALIGN = 0x18,
+	DSA_COMP_INT_HANDLE_INVAL,
+	DSA_COMP_CRA_XLAT,
+	DSA_COMP_CRA_ALIGN,
+	DSA_COMP_ADDR_ALIGN,
+	DSA_COMP_PRIV_BAD,
+	DSA_COMP_HW_ERR1 = 0x20,
+	DSA_COMP_TRANSLATION_FAIL = 0x22,
+};
+
+#define DSA_COMP_STATUS_MASK		0x7f
+#define DSA_COMP_STATUS_WRITE		0x80
+
+struct dsa_batch_desc {
+	uint32_t	pasid:20;
+	uint32_t	rsvd:11;
+	uint32_t	priv:1;
+	uint32_t	flags:24;
+	uint32_t	opcode:8;
+	uint64_t	completion_addr;
+	uint64_t	desc_list_addr;
+	uint64_t	rsvd1;
+	uint32_t	desc_count;
+	uint16_t	interrupt_handle;
+	uint16_t	rsvd2;
+	uint8_t		rsvd3[24];
+} __attribute__((packed));
+
+struct dsa_hw_desc {
+	uint32_t	pasid:20;
+	uint32_t	rsvd:11;
+	uint32_t	priv:1;
+	uint32_t	flags:24;
+	uint32_t	opcode:8;
+	uint64_t	completion_addr;
+	union {
+		uint64_t	src_addr;
+		uint64_t	rdback_addr;
+		uint64_t	pattern;
+	};
+	union {
+		uint64_t	dst_addr;
+		uint64_t	rdback_addr2;
+		uint64_t	src2_addr;
+		uint64_t	comp_pattern;
+	};
+	uint32_t	xfer_size;
+	uint16_t	int_handle;
+	uint16_t	rsvd1;
+	union {
+		uint8_t		expected_res;
+		struct {
+			uint64_t	delta_addr;
+			uint32_t	max_delta_size;
+		};
+		uint32_t	delta_rec_size;
+		uint64_t	dest2;
+		/* CRC */
+		struct {
+			uint32_t	crc_seed;
+			uint32_t	crc_rsvd;
+			uint64_t	seed_addr;
+		};
+		/* DIF check or strip */
+		struct {
+			uint8_t		src_dif_flags;
+			uint8_t		dif_chk_res;
+			uint8_t		dif_chk_flags;
+			uint8_t		dif_chk_res2[5];
+			uint32_t	chk_ref_tag_seed;
+			uint16_t	chk_app_tag_mask;
+			uint16_t	chk_app_tag_seed;
+		};
+		/* DIF insert */
+		struct {
+			uint8_t		dif_ins_res;
+			uint8_t		dest_dif_flag;
+			uint8_t		dif_ins_flags;
+			uint8_t		dif_ins_res2[13];
+			uint32_t	ins_ref_tag_seed;
+			uint16_t	ins_app_tag_mask;
+			uint16_t	ins_app_tag_seed;
+		};
+		/* DIF update */
+		struct {
+			uint8_t		src_upd_flags;
+			uint8_t		upd_dest_flags;
+			uint8_t		dif_upd_flags;
+			uint8_t		dif_upd_res[5];
+			uint32_t	src_ref_tag_seed;
+			uint16_t	src_app_tag_mask;
+			uint16_t	src_app_tag_seed;
+			uint32_t	dest_ref_tag_seed;
+			uint16_t	dest_app_tag_mask;
+			uint16_t	dest_app_tag_seed;
+		};
+
+		uint8_t		op_specific[24];
+	};
+} __attribute__((packed));
+
+struct dsa_raw_desc {
+	uint64_t	field[8];
+} __attribute__((packed));
+
+/*
+ * The status field will be modified by hardware, therefore it should be
+ * volatile to prevent the compiler from optimizing away the read.
+ */
+struct dsa_completion_record {
+	volatile uint8_t	status;
+	union {
+		uint8_t		result;
+		uint8_t		dif_status;
+	};
+	uint16_t		rsvd;
+	uint32_t		bytes_completed;
+	uint64_t		fault_addr;
+	union {
+		uint16_t	delta_rec_size;
+		uint16_t	crc_val;
+
+		/* DIF check & strip */
+		struct {
+			uint32_t	dif_chk_ref_tag;
+			uint16_t	dif_chk_app_tag_mask;
+			uint16_t	dif_chk_app_tag;
+		};
+
+		/* DIF insert */
+		struct {
+			uint64_t	dif_ins_res;
+			uint32_t	dif_ins_ref_tag;
+			uint16_t	dif_ins_app_tag_mask;
+			uint16_t	dif_ins_app_tag;
+		};
+
+		/* DIF update */
+		struct {
+			uint32_t	dif_upd_src_ref_tag;
+			uint16_t	dif_upd_src_app_tag_mask;
+			uint16_t	dif_upd_src_app_tag;
+			uint32_t	dif_upd_dest_ref_tag;
+			uint16_t	dif_upd_dest_app_tag_mask;
+			uint16_t	dif_upd_dest_app_tag;
+		};
+
+		uint8_t		op_specific[16];
+	};
+} __attribute__((packed));
+
+struct dsa_raw_completion_record {
+	uint64_t	field[4];
+} __attribute__((packed));
+
+#endif
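
As a rough illustration of the volatile status comment on struct
dsa_completion_record above, here is a userspace sketch of polling that
byte and splitting it with the two masks from the header. The helper names
(dsa_poll_status, dsa_status_code, dsa_fault_was_write) and the spin budget
are hypothetical and not part of this series. Reading bit 7 as a "fault was
a write" indicator is an assumption based only on the DSA_COMP_STATUS_WRITE
name. The constants are local copies so the sketch builds without the new
uapi header installed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of values from the patch's include/uapi/linux/idxd.h. */
#define DSA_COMP_NONE		0x00
#define DSA_COMP_SUCCESS	0x01
#define DSA_COMP_STATUS_MASK	0x7f
#define DSA_COMP_STATUS_WRITE	0x80

/*
 * Hypothetical helper: re-read the status byte until the device has
 * written something other than DSA_COMP_NONE, or the spin budget runs out.
 * The volatile qualifier forces a fresh load on every iteration.
 * Returns 0 and stores the raw byte on completion, -1 on timeout.
 */
int dsa_poll_status(const volatile uint8_t *status,
		    unsigned long max_spins, uint8_t *raw_out)
{
	assert(status != NULL && raw_out != NULL);

	while (max_spins--) {
		uint8_t raw = *status;

		if (raw != DSA_COMP_NONE) {
			*raw_out = raw;
			return 0;
		}
	}
	return -1;
}

/* Low seven bits: the enum dsa_completion_status code. */
static inline uint8_t dsa_status_code(uint8_t raw)
{
	return raw & DSA_COMP_STATUS_MASK;
}

/* Assumed meaning of bit 7: the faulting access was a write. */
static inline int dsa_fault_was_write(uint8_t raw)
{
	return (raw & DSA_COMP_STATUS_WRITE) != 0;
}
```

A caller would presumably pass &record->status for a completion record
that was zeroed before the descriptor was submitted, so that DSA_COMP_NONE
reliably means "not written yet".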



* [PATCH RFC 09/14] dmaengine: idxd: add configuration component of driver
  2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
                   ` (7 preceding siblings ...)
  2019-11-20 21:24 ` [PATCH RFC 08/14] dmaengine: idxd: Init and probe for Intel data accelerators Dave Jiang
@ 2019-11-20 21:24 ` Dave Jiang
  2019-11-20 21:24 ` [PATCH RFC 10/14] dmaengine: idxd: add descriptor manipulation routines Dave Jiang
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:24 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

The device is left unconfigured when the driver is loaded. Its components
are configured through the driver's sysfs attributes. Once configuration
is done, the device can be enabled by writing the device name to the bind
attribute of the device driver in sysfs. Disabling is done the same way.
Individual work queues can likewise be enabled and disabled through the
bind/unbind attributes. A device hierarchy is built on the struct device
framework to provide the configuration points along with device state and
status. This hierarchy is presented off the virtual DSA bus.

i.e. /sys/bus/dsa/...


Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/Makefile    |    2 
 drivers/dma/idxd/device.c    |   25 +
 drivers/dma/idxd/idxd.h      |   24 +
 drivers/dma/idxd/init.c      |   26 +
 drivers/dma/idxd/registers.h |    1 
 drivers/dma/idxd/sysfs.c     | 1434 ++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 1510 insertions(+), 2 deletions(-)
 create mode 100644 drivers/dma/idxd/sysfs.c
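
For reference, a minimal userspace sketch of the bind step described in
the changelog: writing a device name to a bind (or unbind) attribute. The
helper sysfs_write_name() is hypothetical and not part of this series. The
attribute path is left to the caller because the exact location under
/sys/bus/dsa/ is not spelled out in this message, and "dsa0" in the comment
is only an assumed example of a device name.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a device name (for example "dsa0", an
 * assumed name) to a driver's bind or unbind attribute.
 * Returns 0 on success or a negative errno value.
 */
int sysfs_write_name(const char *attr_path, const char *dev_name)
{
	size_t len;
	ssize_t n;
	int fd, err = 0;

	assert(attr_path != NULL && dev_name != NULL);
	len = strlen(dev_name);

	fd = open(attr_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* Hand the whole value to the attribute in a single write(). */
	n = write(fd, dev_name, len);
	if (n < 0)
		err = -errno;
	else if ((size_t)n != len)
		err = -EIO;

	if (close(fd) < 0 && !err)
		err = -errno;

	return err;
}
```

The same helper would cover disabling, by pointing attr_path at the unbind
attribute instead.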

diff --git a/drivers/dma/idxd/Makefile b/drivers/dma/idxd/Makefile
index 0dd1ca77513f..a552560a03dc 100644
--- a/drivers/dma/idxd/Makefile
+++ b/drivers/dma/idxd/Makefile
@@ -1,2 +1,2 @@
 obj-$(CONFIG_INTEL_IDXD) += idxd.o
-idxd-y := init.o irq.o device.o
+idxd-y := init.o irq.o device.o sysfs.o
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 88739c11e163..74a60a8bef76 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -292,6 +292,31 @@ int idxd_wq_disable(struct idxd_wq *wq)
 	return 0;
 }
 
+int idxd_wq_map_portal(struct idxd_wq *wq)
+{
+	struct idxd_device *idxd = wq->idxd;
+	struct pci_dev *pdev = idxd->pdev;
+	struct device *dev = &pdev->dev;
+	resource_size_t start;
+
+	start = pci_resource_start(pdev, IDXD_WQ_BAR);
+	start = start + wq->id * IDXD_PORTAL_SIZE;
+
+	wq->dportal = devm_ioremap(dev, start, IDXD_PORTAL_SIZE);
+	if (!wq->dportal)
+		return -ENOMEM;
+	dev_dbg(dev, "wq %d portal mapped at %p\n", wq->id, wq->dportal);
+
+	return 0;
+}
+
+void idxd_wq_unmap_portal(struct idxd_wq *wq)
+{
+	struct device *dev = &wq->idxd->pdev->dev;
+
+	devm_iounmap(dev, wq->dportal);
+}
+
 /* Device control bits */
 static inline bool idxd_is_enabled(struct idxd_device *idxd)
 {
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 039a3cb84214..654351105d85 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -201,7 +201,28 @@ static inline void idxd_set_type(struct idxd_device *idxd)
 		idxd->type = IDXD_TYPE_UNKNOWN;
 }
 
+static inline void idxd_wq_get(struct idxd_wq *wq)
+{
+	wq->client_count++;
+}
+
+static inline void idxd_wq_put(struct idxd_wq *wq)
+{
+	wq->client_count--;
+}
+
+static inline int idxd_wq_refcount(struct idxd_wq *wq)
+{
+	return wq->client_count;
+}
+
 const char *idxd_get_dev_name(struct idxd_device *idxd);
+int idxd_register_bus_type(void);
+void idxd_unregister_bus_type(void);
+int idxd_setup_sysfs(struct idxd_device *idxd);
+void idxd_cleanup_sysfs(struct idxd_device *idxd);
+int idxd_register_driver(void);
+void idxd_unregister_driver(void);
 
 /* device interrupt control */
 irqreturn_t idxd_irq_handler(int vec, void *data);
@@ -223,9 +244,12 @@ int idxd_device_config(struct idxd_device *idxd);
 void idxd_device_wqs_clear_state(struct idxd_device *idxd);
 
 /* work queue control */
+void idxd_free_desc(struct idxd_wq *wq, struct idxd_desc *desc);
 int idxd_wq_alloc_resources(struct idxd_wq *wq);
 void idxd_wq_free_resources(struct idxd_wq *wq);
 int idxd_wq_enable(struct idxd_wq *wq);
 int idxd_wq_disable(struct idxd_wq *wq);
+int idxd_wq_map_portal(struct idxd_wq *wq);
+void idxd_wq_unmap_portal(struct idxd_wq *wq);
 
 #endif
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index aeafc87b0c7e..b9a7bfea93b5 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -390,6 +390,14 @@ static int idxd_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		return -ENODEV;
 	}
 
+	rc = idxd_setup_sysfs(idxd);
+	if (rc) {
+		dev_err(dev, "IDXD sysfs setup failed\n");
+		return -ENODEV;
+	}
+
+	idxd->state = IDXD_DEV_CONF_READY;
+
 	dev_info(&pdev->dev, "Intel(R) Accelerator Device (v%x)\n",
 		 idxd->hw.version);
 
@@ -427,6 +435,7 @@ static void idxd_remove(struct pci_dev *pdev)
 	struct idxd_device *idxd = pci_get_drvdata(pdev);
 
 	dev_dbg(&pdev->dev, "%s called\n", __func__);
+	idxd_cleanup_sysfs(idxd);
 	idxd_shutdown(pdev);
 	idxd_wqs_free_lock(idxd);
 	mutex_lock(&idxd_idr_lock);
@@ -453,16 +462,32 @@ static int __init idxd_init_module(void)
 	for (i = 0; i < IDXD_TYPE_MAX; i++)
 		idr_init(&idxd_idrs[i]);
 
+	err = idxd_register_bus_type();
+	if (err < 0)
+		return err;
+
+	err = idxd_register_driver();
+	if (err < 0)
+		goto err_idxd_driver_register;
+
 	err = pci_register_driver(&idxd_pci_driver);
 	if (err)
-		return err;
+		goto err_pci_register;
 
 	return 0;
+
+err_pci_register:
+	idxd_unregister_driver();
+err_idxd_driver_register:
+	idxd_unregister_bus_type();
+	return err;
 }
 module_init(idxd_init_module);
 
 static void __exit idxd_exit_module(void)
 {
 	pci_unregister_driver(&idxd_pci_driver);
+	idxd_unregister_driver();
+	idxd_unregister_bus_type();
 }
 module_exit(idxd_exit_module);
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index 77275a07fa61..b81b4c8d6f58 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -8,6 +8,7 @@
 
 #define IDXD_MMIO_BAR		0
 #define IDXD_WQ_BAR		2
+#define IDXD_PORTAL_SIZE	0x4000
 
 /* MMIO Device BAR0 Registers */
 #define IDXD_VER_OFFSET			0x00
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
new file mode 100644
index 000000000000..fab916a555b2
--- /dev/null
+++ b/drivers/dma/idxd/sysfs.c
@@ -0,0 +1,1434 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <uapi/linux/idxd.h>
+#include "registers.h"
+#include "idxd.h"
+
+static char *idxd_wq_type_names[] = {
+	[IDXD_WQT_NONE]		= "none",
+	[IDXD_WQT_KERNEL]	= "kernel",
+};
+
+static void idxd_conf_device_release(struct device *dev)
+{
+	dev_dbg(dev, "%s for %s\n", __func__, dev_name(dev));
+}
+
+static struct device_type idxd_group_device_type = {
+	.name = "group",
+	.release = idxd_conf_device_release,
+};
+
+static struct device_type idxd_wq_device_type = {
+	.name = "wq",
+	.release = idxd_conf_device_release,
+};
+
+static struct device_type idxd_engine_device_type = {
+	.name = "engine",
+	.release = idxd_conf_device_release,
+};
+
+static struct device_type dsa_device_type = {
+	.name = "dsa",
+	.release = idxd_conf_device_release,
+};
+
+static inline bool is_dsa_dev(struct device *dev)
+{
+	return dev ? dev->type == &dsa_device_type : false;
+}
+
+static inline bool is_idxd_dev(struct device *dev)
+{
+	return is_dsa_dev(dev);
+}
+
+static inline bool is_idxd_wq_dev(struct device *dev)
+{
+	return dev ? dev->type == &idxd_wq_device_type : false;
+}
+
+static int idxd_config_bus_match(struct device *dev,
+				 struct device_driver *drv)
+{
+	int matched = 0;
+
+	if (is_idxd_dev(dev)) {
+		struct idxd_device *idxd = confdev_to_idxd(dev);
+
+		if (idxd->state != IDXD_DEV_CONF_READY)
+			return 0;
+		matched = 1;
+	} else if (is_idxd_wq_dev(dev)) {
+		struct idxd_wq *wq = confdev_to_wq(dev);
+		struct idxd_device *idxd = wq->idxd;
+
+		if (idxd->state < IDXD_DEV_CONF_READY)
+			return 0;
+
+		if (wq->state != IDXD_WQ_DISABLED) {
+			dev_dbg(dev, "%s not disabled\n", dev_name(dev));
+			return 0;
+		}
+		matched = 1;
+	}
+
+	if (matched)
+		dev_dbg(dev, "%s matched\n", dev_name(dev));
+
+	return matched;
+}
+
+static int idxd_config_bus_probe(struct device *dev)
+{
+	int rc;
+	unsigned long flags;
+
+	dev_dbg(dev, "%s called\n", __func__);
+
+	if (is_idxd_dev(dev)) {
+		struct idxd_device *idxd = confdev_to_idxd(dev);
+
+		if (idxd->state != IDXD_DEV_CONF_READY) {
+			dev_warn(dev, "Device not ready for config\n");
+			return -EBUSY;
+		}
+
+		spin_lock_irqsave(&idxd->dev_lock, flags);
+
+		/* Perform IDXD configuration and enabling */
+		rc = idxd_device_config(idxd);
+		if (rc < 0) {
+			spin_unlock_irqrestore(&idxd->dev_lock, flags);
+			dev_warn(dev, "Device config failed: %d\n", rc);
+			return rc;
+		}
+
+		/* start device */
+		rc = idxd_device_enable(idxd);
+		if (rc < 0) {
+			spin_unlock_irqrestore(&idxd->dev_lock, flags);
+			dev_warn(dev, "Device enable failed: %d\n", rc);
+			return rc;
+		}
+
+		spin_unlock_irqrestore(&idxd->dev_lock, flags);
+		dev_info(dev, "Device %s enabled\n", dev_name(dev));
+
+		return 0;
+	} else if (is_idxd_wq_dev(dev)) {
+		struct idxd_wq *wq = confdev_to_wq(dev);
+		struct idxd_device *idxd = wq->idxd;
+
+		mutex_lock(&wq->wq_lock);
+
+		if (idxd->state != IDXD_DEV_ENABLED) {
+			mutex_unlock(&wq->wq_lock);
+			dev_warn(dev, "Enabling while device not enabled.\n");
+			return -EPERM;
+		}
+
+		if (wq->state != IDXD_WQ_DISABLED) {
+			mutex_unlock(&wq->wq_lock);
+			dev_warn(dev, "WQ %d already enabled.\n", wq->id);
+			return -EBUSY;
+		}
+
+		if (!wq->group) {
+			mutex_unlock(&wq->wq_lock);
+			dev_warn(dev, "WQ not attached to group.\n");
+			return -EINVAL;
+		}
+
+		if (strlen(wq->name) == 0) {
+			mutex_unlock(&wq->wq_lock);
+			dev_warn(dev, "WQ name not set.\n");
+			return -EINVAL;
+		}
+
+		rc = idxd_wq_alloc_resources(wq);
+		if (rc < 0) {
+			mutex_unlock(&wq->wq_lock);
+			dev_warn(dev, "WQ resource allocation failed\n");
+			return rc;
+		}
+
+		spin_lock_irqsave(&idxd->dev_lock, flags);
+		rc = idxd_device_config(idxd);
+		if (rc < 0) {
+			spin_unlock_irqrestore(&idxd->dev_lock, flags);
+			mutex_unlock(&wq->wq_lock);
+			dev_warn(dev, "Writing WQ %d config failed: %d\n",
+				 wq->id, rc);
+			return rc;
+		}
+
+		rc = idxd_wq_enable(wq);
+		if (rc < 0) {
+			spin_unlock_irqrestore(&idxd->dev_lock, flags);
+			mutex_unlock(&wq->wq_lock);
+			dev_warn(dev, "WQ %d enabling failed: %d\n",
+				 wq->id, rc);
+			return rc;
+		}
+		spin_unlock_irqrestore(&idxd->dev_lock, flags);
+
+		rc = idxd_wq_map_portal(wq);
+		if (rc < 0) {
+			dev_warn(dev, "wq portal mapping failed: %d\n", rc);
+			spin_lock_irqsave(&idxd->dev_lock, flags);
+			if (idxd_wq_disable(wq) < 0)
+				dev_warn(dev, "IDXD wq disable failed\n");
+			spin_unlock_irqrestore(&idxd->dev_lock, flags);
+			mutex_unlock(&wq->wq_lock);
+			return rc;
+		}
+
+		wq->client_count = 0;
+
+		dev_info(dev, "wq %s enabled\n", dev_name(&wq->conf_dev));
+		mutex_unlock(&wq->wq_lock);
+		return 0;
+	}
+
+	return -ENODEV;
+}
+
+static void disable_wq(struct idxd_wq *wq)
+{
+	struct idxd_device *idxd = wq->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	unsigned long flags;
+	int rc;
+
+	mutex_lock(&wq->wq_lock);
+	dev_dbg(dev, "%s removing WQ %s\n", __func__, dev_name(&wq->conf_dev));
+	if (wq->state == IDXD_WQ_DISABLED) {
+		mutex_unlock(&wq->wq_lock);
+		return;
+	}
+
+	if (idxd_wq_refcount(wq))
+		dev_warn(dev, "Clients have claims on wq %d: %d\n",
+			 wq->id, idxd_wq_refcount(wq));
+
+	idxd_wq_unmap_portal(wq);
+
+	spin_lock_irqsave(&idxd->dev_lock, flags);
+	rc = idxd_wq_disable(wq);
+	spin_unlock_irqrestore(&idxd->dev_lock, flags);
+
+	idxd_wq_free_resources(wq);
+	wq->client_count = 0;
+	mutex_unlock(&wq->wq_lock);
+
+	if (rc < 0)
+		dev_warn(dev, "Failed to disable %s: %d\n",
+			 dev_name(&wq->conf_dev), rc);
+	else
+		dev_info(dev, "wq %s disabled\n", dev_name(&wq->conf_dev));
+}
+
+static int idxd_config_bus_remove(struct device *dev)
+{
+	int rc;
+	unsigned long flags;
+
+	dev_dbg(dev, "%s called for %s\n", __func__, dev_name(dev));
+
+	/* disable workqueue here */
+	if (is_idxd_wq_dev(dev)) {
+		struct idxd_wq *wq = confdev_to_wq(dev);
+
+		disable_wq(wq);
+	} else if (is_idxd_dev(dev)) {
+		struct idxd_device *idxd = confdev_to_idxd(dev);
+		int i;
+
+		dev_dbg(dev, "%s removing dev %s\n", __func__,
+			dev_name(&idxd->conf_dev));
+		for (i = 0; i < idxd->max_wqs; i++) {
+			struct idxd_wq *wq = &idxd->wqs[i];
+
+			if (wq->state == IDXD_WQ_DISABLED)
+				continue;
+			dev_warn(dev, "Active wq %d on disable %s.\n", i,
+				 dev_name(&idxd->conf_dev));
+			device_release_driver(&wq->conf_dev);
+		}
+
+		spin_lock_irqsave(&idxd->dev_lock, flags);
+		rc = idxd_device_disable(idxd);
+		spin_unlock_irqrestore(&idxd->dev_lock, flags);
+		if (rc < 0)
+			dev_warn(dev, "Device disable failed\n");
+		else
+			dev_info(dev, "Device %s disabled\n", dev_name(dev));
+	}
+
+	return 0;
+}
+
+static void idxd_config_bus_shutdown(struct device *dev)
+{
+	dev_dbg(dev, "%s called\n", __func__);
+}
+
+static struct bus_type dsa_bus_type = {
+	.name = "dsa",
+	.match = idxd_config_bus_match,
+	.probe = idxd_config_bus_probe,
+	.remove = idxd_config_bus_remove,
+	.shutdown = idxd_config_bus_shutdown,
+};
+
+static struct bus_type *idxd_bus_types[] = {
+	&dsa_bus_type
+};
+
+static struct idxd_device_driver dsa_drv = {
+	.drv = {
+		.name = "dsa",
+		.bus = &dsa_bus_type,
+		.owner = THIS_MODULE,
+		.mod_name = KBUILD_MODNAME,
+	},
+};
+
+static struct idxd_device_driver *idxd_drvs[] = {
+	&dsa_drv
+};
+
+static struct bus_type *idxd_get_bus_type(struct idxd_device *idxd)
+{
+	return idxd_bus_types[idxd->type];
+}
+
+static struct device_type *idxd_get_device_type(struct idxd_device *idxd)
+{
+	if (idxd->type == IDXD_TYPE_DSA)
+		return &dsa_device_type;
+	else
+		return NULL;
+}
+
+/* IDXD generic driver setup */
+int idxd_register_driver(void)
+{
+	int i, rc;
+
+	for (i = 0; i < IDXD_TYPE_MAX; i++) {
+		rc = driver_register(&idxd_drvs[i]->drv);
+		if (rc < 0)
+			goto drv_fail;
+	}
+
+	return 0;
+
+drv_fail:
+	while (--i >= 0)
+		driver_unregister(&idxd_drvs[i]->drv);
+	return rc;
+}
+
+void idxd_unregister_driver(void)
+{
+	int i;
+
+	for (i = 0; i < IDXD_TYPE_MAX; i++)
+		driver_unregister(&idxd_drvs[i]->drv);
+}
+
+/* IDXD engine attributes */
+static ssize_t engine_group_id_show(struct device *dev,
+				    struct device_attribute *attr, char *buf)
+{
+	struct idxd_engine *engine =
+		container_of(dev, struct idxd_engine, conf_dev);
+
+	if (engine->group)
+		return sprintf(buf, "%d\n", engine->group->id);
+	else
+		return sprintf(buf, "%d\n", -1);
+}
+
+static ssize_t engine_group_id_store(struct device *dev,
+				     struct device_attribute *attr,
+				     const char *buf, size_t count)
+{
+	struct idxd_engine *engine =
+		container_of(dev, struct idxd_engine, conf_dev);
+	struct idxd_device *idxd = engine->idxd;
+	long id;
+	int rc;
+	struct idxd_group *prevg, *group;
+
+	rc = kstrtol(buf, 10, &id);
+	if (rc < 0)
+		return -EINVAL;
+
+	if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+		return -EPERM;
+
+	if (id > idxd->max_groups - 1 || id < -1)
+		return -EINVAL;
+
+	if (id == -1) {
+		if (engine->group) {
+			engine->group->num_engines--;
+			engine->group = NULL;
+		}
+		return count;
+	}
+
+	group = &idxd->groups[id];
+	prevg = engine->group;
+
+	if (prevg)
+		prevg->num_engines--;
+	engine->group = group;
+	group->num_engines++;
+
+	return count;
+}
+
+static struct device_attribute dev_attr_engine_group =
+		__ATTR(group_id, 0644, engine_group_id_show,
+		       engine_group_id_store);
+
+static struct attribute *idxd_engine_attributes[] = {
+	&dev_attr_engine_group.attr,
+	NULL,
+};
+
+static const struct attribute_group idxd_engine_attribute_group = {
+	.attrs = idxd_engine_attributes,
+};
+
+static const struct attribute_group *idxd_engine_attribute_groups[] = {
+	&idxd_engine_attribute_group,
+	NULL,
+};
+
+/* Group attributes */
+static ssize_t group_tokens_reserved_show(struct device *dev,
+					  struct device_attribute *attr,
+					  char *buf)
+{
+	struct idxd_group *group =
+		container_of(dev, struct idxd_group, conf_dev);
+
+	return sprintf(buf, "%u\n", group->tokens_reserved);
+}
+
+static ssize_t group_tokens_reserved_store(struct device *dev,
+					   struct device_attribute *attr,
+					   const char *buf, size_t count)
+{
+	struct idxd_group *group =
+		container_of(dev, struct idxd_group, conf_dev);
+	struct idxd_device *idxd = group->idxd;
+	unsigned long val;
+	int rc;
+
+	rc = kstrtoul(buf, 10, &val);
+	if (rc < 0)
+		return -EINVAL;
+
+	if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+		return -EPERM;
+
+	if (idxd->state == IDXD_DEV_ENABLED)
+		return -EPERM;
+
+	if (idxd->token_limit == 0)
+		return -EPERM;
+
+	if (val > 0xff)
+		return -EINVAL;
+
+	group->tokens_reserved = val;
+	return count;
+}
+
+static struct device_attribute dev_attr_group_tokens_reserved =
+		__ATTR(tokens_reserved, 0644, group_tokens_reserved_show,
+		       group_tokens_reserved_store);
+
+static ssize_t group_tokens_allowed_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	struct idxd_group *group =
+		container_of(dev, struct idxd_group, conf_dev);
+
+	return sprintf(buf, "%u\n", group->tokens_allowed);
+}
+
+static ssize_t group_tokens_allowed_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct idxd_group *group =
+		container_of(dev, struct idxd_group, conf_dev);
+	struct idxd_device *idxd = group->idxd;
+	unsigned long val;
+	int rc;
+
+	rc = kstrtoul(buf, 10, &val);
+	if (rc < 0)
+		return -EINVAL;
+
+	if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+		return -EPERM;
+
+	if (idxd->state == IDXD_DEV_ENABLED)
+		return -EPERM;
+
+	if (idxd->token_limit == 0)
+		return -EPERM;
+
+	if (val > 0xff)
+		return -EINVAL;
+
+	group->tokens_allowed = val;
+	return count;
+}
+
+static struct device_attribute dev_attr_group_tokens_allowed =
+		__ATTR(tokens_allowed, 0644, group_tokens_allowed_show,
+		       group_tokens_allowed_store);
+
+static ssize_t group_use_token_limit_show(struct device *dev,
+					  struct device_attribute *attr,
+					  char *buf)
+{
+	struct idxd_group *group =
+		container_of(dev, struct idxd_group, conf_dev);
+
+	return sprintf(buf, "%u\n", group->use_token_limit);
+}
+
+static ssize_t group_use_token_limit_store(struct device *dev,
+					   struct device_attribute *attr,
+					   const char *buf, size_t count)
+{
+	struct idxd_group *group =
+		container_of(dev, struct idxd_group, conf_dev);
+	struct idxd_device *idxd = group->idxd;
+	unsigned long val;
+	int rc;
+
+	rc = kstrtoul(buf, 10, &val);
+	if (rc < 0)
+		return -EINVAL;
+
+	if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+		return -EPERM;
+
+	if (idxd->state == IDXD_DEV_ENABLED)
+		return -EPERM;
+
+	if (idxd->token_limit == 0)
+		return -EPERM;
+
+	group->use_token_limit = !!val;
+	return count;
+}
+
+static struct device_attribute dev_attr_group_use_token_limit =
+		__ATTR(use_token_limit, 0644, group_use_token_limit_show,
+		       group_use_token_limit_store);
+
+static ssize_t group_engines_show(struct device *dev,
+				  struct device_attribute *attr, char *buf)
+{
+	struct idxd_group *group =
+		container_of(dev, struct idxd_group, conf_dev);
+	int i, rc = 0;
+	char *tmp = buf;
+	struct idxd_device *idxd = group->idxd;
+
+	for (i = 0; i < idxd->max_engines; i++) {
+		struct idxd_engine *engine = &idxd->engines[i];
+
+		if (!engine->group)
+			continue;
+
+		if (engine->group->id == group->id)
+			rc += sprintf(tmp + rc, "engine%d.%d ",
+					idxd->id, engine->id);
+	}
+
+	rc = max(rc - 1, 0);
+	rc += sprintf(tmp + rc, "\n");
+
+	return rc;
+}
+
+static struct device_attribute dev_attr_group_engines =
+		__ATTR(engines, 0444, group_engines_show, NULL);
+
+static ssize_t group_work_queues_show(struct device *dev,
+				      struct device_attribute *attr, char *buf)
+{
+	struct idxd_group *group =
+		container_of(dev, struct idxd_group, conf_dev);
+	int i, rc = 0;
+	char *tmp = buf;
+	struct idxd_device *idxd = group->idxd;
+
+	for (i = 0; i < idxd->max_wqs; i++) {
+		struct idxd_wq *wq = &idxd->wqs[i];
+
+		if (!wq->group)
+			continue;
+
+		if (wq->group->id == group->id)
+			rc += sprintf(tmp + rc, "wq%d.%d ",
+					idxd->id, wq->id);
+	}
+
+	rc = max(rc - 1, 0);
+	rc += sprintf(tmp + rc, "\n");
+
+	return rc;
+}
+
+static struct device_attribute dev_attr_group_work_queues =
+		__ATTR(work_queues, 0444, group_work_queues_show, NULL);
+
+static ssize_t group_traffic_class_a_show(struct device *dev,
+					  struct device_attribute *attr,
+					  char *buf)
+{
+	struct idxd_group *group =
+		container_of(dev, struct idxd_group, conf_dev);
+
+	return sprintf(buf, "%d\n", group->tc_a);
+}
+
+static ssize_t group_traffic_class_a_store(struct device *dev,
+					   struct device_attribute *attr,
+					   const char *buf, size_t count)
+{
+	struct idxd_group *group =
+		container_of(dev, struct idxd_group, conf_dev);
+	struct idxd_device *idxd = group->idxd;
+	long val;
+	int rc;
+
+	rc = kstrtol(buf, 10, &val);
+	if (rc < 0)
+		return -EINVAL;
+
+	if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+		return -EPERM;
+
+	if (idxd->state == IDXD_DEV_ENABLED)
+		return -EPERM;
+
+	if (val < 0 || val > 7)
+		return -EINVAL;
+
+	group->tc_a = val;
+	return count;
+}
+
+static struct device_attribute dev_attr_group_traffic_class_a =
+		__ATTR(traffic_class_a, 0644, group_traffic_class_a_show,
+		       group_traffic_class_a_store);
+
+static ssize_t group_traffic_class_b_show(struct device *dev,
+					  struct device_attribute *attr,
+					  char *buf)
+{
+	struct idxd_group *group =
+		container_of(dev, struct idxd_group, conf_dev);
+
+	return sprintf(buf, "%d\n", group->tc_b);
+}
+
+static ssize_t group_traffic_class_b_store(struct device *dev,
+					   struct device_attribute *attr,
+					   const char *buf, size_t count)
+{
+	struct idxd_group *group =
+		container_of(dev, struct idxd_group, conf_dev);
+	struct idxd_device *idxd = group->idxd;
+	long val;
+	int rc;
+
+	rc = kstrtol(buf, 10, &val);
+	if (rc < 0)
+		return -EINVAL;
+
+	if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+		return -EPERM;
+
+	if (idxd->state == IDXD_DEV_ENABLED)
+		return -EPERM;
+
+	if (val < 0 || val > 7)
+		return -EINVAL;
+
+	group->tc_b = val;
+	return count;
+}
+
+static struct device_attribute dev_attr_group_traffic_class_b =
+		__ATTR(traffic_class_b, 0644, group_traffic_class_b_show,
+		       group_traffic_class_b_store);
+
+static struct attribute *idxd_group_attributes[] = {
+	&dev_attr_group_work_queues.attr,
+	&dev_attr_group_engines.attr,
+	&dev_attr_group_use_token_limit.attr,
+	&dev_attr_group_tokens_allowed.attr,
+	&dev_attr_group_tokens_reserved.attr,
+	&dev_attr_group_traffic_class_a.attr,
+	&dev_attr_group_traffic_class_b.attr,
+	NULL,
+};
+
+static const struct attribute_group idxd_group_attribute_group = {
+	.attrs = idxd_group_attributes,
+};
+
+static const struct attribute_group *idxd_group_attribute_groups[] = {
+	&idxd_group_attribute_group,
+	NULL,
+};
+
+/* IDXD work queue attribs */
+static ssize_t wq_clients_show(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+
+	return sprintf(buf, "%d\n", wq->client_count);
+}
+
+static struct device_attribute dev_attr_wq_clients =
+		__ATTR(clients, 0444, wq_clients_show, NULL);
+
+static ssize_t wq_state_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+
+	switch (wq->state) {
+	case IDXD_WQ_DISABLED:
+		return sprintf(buf, "disabled\n");
+	case IDXD_WQ_ENABLED:
+		return sprintf(buf, "enabled\n");
+	}
+
+	return sprintf(buf, "unknown\n");
+}
+
+static struct device_attribute dev_attr_wq_state =
+		__ATTR(state, 0444, wq_state_show, NULL);
+
+static ssize_t wq_group_id_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+
+	if (wq->group)
+		return sprintf(buf, "%u\n", wq->group->id);
+	else
+		return sprintf(buf, "-1\n");
+}
+
+static ssize_t wq_group_id_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+	struct idxd_device *idxd = wq->idxd;
+	long id;
+	int rc;
+	struct idxd_group *prevg, *group;
+
+	rc = kstrtol(buf, 10, &id);
+	if (rc < 0)
+		return -EINVAL;
+
+	if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+		return -EPERM;
+
+	if (wq->state != IDXD_WQ_DISABLED)
+		return -EPERM;
+
+	if (id > idxd->max_groups - 1 || id < -1)
+		return -EINVAL;
+
+	if (id == -1) {
+		if (wq->group) {
+			wq->group->num_wqs--;
+			wq->group = NULL;
+		}
+		return count;
+	}
+
+	group = &idxd->groups[id];
+	prevg = wq->group;
+
+	if (prevg)
+		prevg->num_wqs--;
+	wq->group = group;
+	group->num_wqs++;
+	return count;
+}
+
+static struct device_attribute dev_attr_wq_group_id =
+		__ATTR(group_id, 0644, wq_group_id_show, wq_group_id_store);
+
+static ssize_t wq_mode_show(struct device *dev, struct device_attribute *attr,
+			    char *buf)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+
+	return sprintf(buf, "%s\n",
+			wq_dedicated(wq) ? "dedicated" : "shared");
+}
+
+static ssize_t wq_mode_store(struct device *dev,
+			     struct device_attribute *attr, const char *buf,
+			     size_t count)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+	struct idxd_device *idxd = wq->idxd;
+
+	if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+		return -EPERM;
+
+	if (wq->state != IDXD_WQ_DISABLED)
+		return -EPERM;
+
+	if (sysfs_streq(buf, "dedicated")) {
+		set_bit(WQ_FLAG_DEDICATED, &wq->flags);
+		wq->threshold = 0;
+	} else {
+		return -EINVAL;
+	}
+
+	return count;
+}
+
+static struct device_attribute dev_attr_wq_mode =
+		__ATTR(mode, 0644, wq_mode_show, wq_mode_store);
+
+static ssize_t wq_size_show(struct device *dev, struct device_attribute *attr,
+			    char *buf)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+
+	return sprintf(buf, "%u\n", wq->size);
+}
+
+static ssize_t wq_size_store(struct device *dev,
+			     struct device_attribute *attr, const char *buf,
+			     size_t count)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+	unsigned long size;
+	struct idxd_device *idxd = wq->idxd;
+	int rc;
+
+	rc = kstrtoul(buf, 10, &size);
+	if (rc < 0)
+		return -EINVAL;
+
+	if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+		return -EPERM;
+
+	if (wq->state != IDXD_WQ_DISABLED)
+		return -EPERM;
+
+	if (size > idxd->max_wq_size)
+		return -EINVAL;
+
+	wq->size = size;
+	return count;
+}
+
+static struct device_attribute dev_attr_wq_size =
+		__ATTR(size, 0644, wq_size_show, wq_size_store);
+
+static ssize_t wq_priority_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+
+	return sprintf(buf, "%u\n", wq->priority);
+}
+
+static ssize_t wq_priority_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+	unsigned long prio;
+	struct idxd_device *idxd = wq->idxd;
+	int rc;
+
+	rc = kstrtoul(buf, 10, &prio);
+	if (rc < 0)
+		return -EINVAL;
+
+	if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+		return -EPERM;
+
+	if (wq->state != IDXD_WQ_DISABLED)
+		return -EPERM;
+
+	if (prio > IDXD_MAX_PRIORITY)
+		return -EINVAL;
+
+	wq->priority = prio;
+	return count;
+}
+
+static struct device_attribute dev_attr_wq_priority =
+		__ATTR(priority, 0644, wq_priority_show, wq_priority_store);
+
+static ssize_t wq_type_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+
+	switch (wq->type) {
+	case IDXD_WQT_KERNEL:
+		return sprintf(buf, "%s\n",
+			       idxd_wq_type_names[IDXD_WQT_KERNEL]);
+	case IDXD_WQT_NONE:
+	default:
+		return sprintf(buf, "%s\n",
+			       idxd_wq_type_names[IDXD_WQT_NONE]);
+	}
+
+	return -EINVAL;
+}
+
+static ssize_t wq_type_store(struct device *dev,
+			     struct device_attribute *attr, const char *buf,
+			     size_t count)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+	enum idxd_wq_type old_type;
+
+	if (wq->state != IDXD_WQ_DISABLED)
+		return -EPERM;
+
+	old_type = wq->type;
+	if (sysfs_streq(buf, idxd_wq_type_names[IDXD_WQT_KERNEL]))
+		wq->type = IDXD_WQT_KERNEL;
+	else
+		wq->type = IDXD_WQT_NONE;
+
+	/* If we are changing queue type, clear the name */
+	if (wq->type != old_type)
+		memset(wq->name, 0, WQ_NAME_SIZE + 1);
+
+	return count;
+}
+
+static struct device_attribute dev_attr_wq_type =
+		__ATTR(type, 0644, wq_type_show, wq_type_store);
+
+static ssize_t wq_name_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+
+	return sprintf(buf, "%s\n", wq->name);
+}
+
+static ssize_t wq_name_store(struct device *dev,
+			     struct device_attribute *attr, const char *buf,
+			     size_t count)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+
+	if (wq->state != IDXD_WQ_DISABLED)
+		return -EPERM;
+
+	if (strlen(buf) > WQ_NAME_SIZE || strlen(buf) == 0)
+		return -EINVAL;
+
+	memset(wq->name, 0, WQ_NAME_SIZE + 1);
+	strncpy(wq->name, buf, WQ_NAME_SIZE);
+	strreplace(wq->name, '\n', '\0');
+	return count;
+}
+
+static struct device_attribute dev_attr_wq_name =
+		__ATTR(name, 0644, wq_name_show, wq_name_store);
+
+static struct attribute *idxd_wq_attributes[] = {
+	&dev_attr_wq_clients.attr,
+	&dev_attr_wq_state.attr,
+	&dev_attr_wq_group_id.attr,
+	&dev_attr_wq_mode.attr,
+	&dev_attr_wq_size.attr,
+	&dev_attr_wq_priority.attr,
+	&dev_attr_wq_type.attr,
+	&dev_attr_wq_name.attr,
+	NULL,
+};
+
+static const struct attribute_group idxd_wq_attribute_group = {
+	.attrs = idxd_wq_attributes,
+};
+
+static const struct attribute_group *idxd_wq_attribute_groups[] = {
+	&idxd_wq_attribute_group,
+	NULL,
+};
+
+/* IDXD device attribs */
+static ssize_t max_work_queues_size_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+
+	return sprintf(buf, "%u\n", idxd->max_wq_size);
+}
+static DEVICE_ATTR_RO(max_work_queues_size);
+
+static ssize_t max_groups_show(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+
+	return sprintf(buf, "%u\n", idxd->max_groups);
+}
+static DEVICE_ATTR_RO(max_groups);
+
+static ssize_t max_work_queues_show(struct device *dev,
+				    struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+
+	return sprintf(buf, "%u\n", idxd->max_wqs);
+}
+static DEVICE_ATTR_RO(max_work_queues);
+
+static ssize_t max_engines_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+
+	return sprintf(buf, "%u\n", idxd->max_engines);
+}
+static DEVICE_ATTR_RO(max_engines);
+
+static ssize_t numa_node_show(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+
+	return sprintf(buf, "%d\n", dev_to_node(&idxd->pdev->dev));
+}
+static DEVICE_ATTR_RO(numa_node);
+
+static ssize_t max_batch_size_show(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+
+	return sprintf(buf, "%u\n", idxd->max_batch_size);
+}
+static DEVICE_ATTR_RO(max_batch_size);
+
+static ssize_t max_transfer_size_show(struct device *dev,
+				      struct device_attribute *attr,
+				      char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+
+	return sprintf(buf, "%llu\n", idxd->max_xfer_bytes);
+}
+static DEVICE_ATTR_RO(max_transfer_size);
+
+static ssize_t op_cap_show(struct device *dev,
+			   struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+
+	return sprintf(buf, "%#llx\n", idxd->hw.opcap.bits[0]);
+}
+static DEVICE_ATTR_RO(op_cap);
+
+static ssize_t configurable_show(struct device *dev,
+				 struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+
+	return sprintf(buf, "%u\n",
+			test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags));
+}
+static DEVICE_ATTR_RO(configurable);
+
+static ssize_t clients_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+	unsigned long flags;
+	int count = 0, i;
+
+	spin_lock_irqsave(&idxd->dev_lock, flags);
+	for (i = 0; i < idxd->max_wqs; i++) {
+		struct idxd_wq *wq = &idxd->wqs[i];
+
+		count += wq->client_count;
+	}
+	spin_unlock_irqrestore(&idxd->dev_lock, flags);
+
+	return sprintf(buf, "%d\n", count);
+}
+static DEVICE_ATTR_RO(clients);
+
+static ssize_t state_show(struct device *dev,
+			  struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+
+	switch (idxd->state) {
+	case IDXD_DEV_DISABLED:
+	case IDXD_DEV_CONF_READY:
+		return sprintf(buf, "disabled\n");
+	case IDXD_DEV_ENABLED:
+		return sprintf(buf, "enabled\n");
+	case IDXD_DEV_HALTED:
+		return sprintf(buf, "halted\n");
+	}
+
+	return sprintf(buf, "unknown\n");
+}
+static DEVICE_ATTR_RO(state);
+
+static ssize_t errors_show(struct device *dev,
+			   struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+	int i, out = 0;
+	unsigned long flags;
+
+	spin_lock_irqsave(&idxd->dev_lock, flags);
+	for (i = 0; i < 4; i++)
+		out += sprintf(buf + out, "%#018llx ", idxd->sw_err.bits[i]);
+	spin_unlock_irqrestore(&idxd->dev_lock, flags);
+	out--;
+	out += sprintf(buf + out, "\n");
+	return out;
+}
+static DEVICE_ATTR_RO(errors);
+
+static ssize_t max_tokens_show(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+
+	return sprintf(buf, "%u\n", idxd->max_tokens);
+}
+static DEVICE_ATTR_RO(max_tokens);
+
+static ssize_t token_limit_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+
+	return sprintf(buf, "%u\n", idxd->token_limit);
+}
+
+static ssize_t token_limit_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+	unsigned long val;
+	int rc;
+
+	rc = kstrtoul(buf, 10, &val);
+	if (rc < 0)
+		return -EINVAL;
+
+	if (idxd->state == IDXD_DEV_ENABLED)
+		return -EPERM;
+
+	if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+		return -EPERM;
+
+	if (!idxd->hw.group_cap.token_limit)
+		return -EPERM;
+
+	if (val > idxd->hw.group_cap.total_tokens)
+		return -EINVAL;
+
+	idxd->token_limit = val;
+	return count;
+}
+static DEVICE_ATTR_RW(token_limit);
+
+static struct attribute *idxd_device_attributes[] = {
+	&dev_attr_max_groups.attr,
+	&dev_attr_max_work_queues.attr,
+	&dev_attr_max_work_queues_size.attr,
+	&dev_attr_max_engines.attr,
+	&dev_attr_numa_node.attr,
+	&dev_attr_max_batch_size.attr,
+	&dev_attr_max_transfer_size.attr,
+	&dev_attr_op_cap.attr,
+	&dev_attr_configurable.attr,
+	&dev_attr_clients.attr,
+	&dev_attr_state.attr,
+	&dev_attr_errors.attr,
+	&dev_attr_max_tokens.attr,
+	&dev_attr_token_limit.attr,
+	NULL,
+};
+
+static const struct attribute_group idxd_device_attribute_group = {
+	.attrs = idxd_device_attributes,
+};
+
+static const struct attribute_group *idxd_attribute_groups[] = {
+	&idxd_device_attribute_group,
+	NULL,
+};
+
+static int idxd_setup_engine_sysfs(struct idxd_device *idxd)
+{
+	struct device *dev = &idxd->pdev->dev;
+	int i, rc;
+
+	for (i = 0; i < idxd->max_engines; i++) {
+		struct idxd_engine *engine = &idxd->engines[i];
+
+		engine->conf_dev.parent = &idxd->conf_dev;
+		dev_set_name(&engine->conf_dev, "engine%d.%d",
+			     idxd->id, engine->id);
+		engine->conf_dev.bus = idxd_get_bus_type(idxd);
+		engine->conf_dev.groups = idxd_engine_attribute_groups;
+		engine->conf_dev.type = &idxd_engine_device_type;
+		dev_dbg(dev, "Engine device register: %s\n",
+			dev_name(&engine->conf_dev));
+		rc = device_register(&engine->conf_dev);
+		if (rc < 0) {
+			put_device(&engine->conf_dev);
+			goto cleanup;
+		}
+	}
+
+	return 0;
+
+cleanup:
+	while (i--) {
+		struct idxd_engine *engine = &idxd->engines[i];
+
+		device_unregister(&engine->conf_dev);
+	}
+	return rc;
+}
+
+static int idxd_setup_group_sysfs(struct idxd_device *idxd)
+{
+	struct device *dev = &idxd->pdev->dev;
+	int i, rc;
+
+	for (i = 0; i < idxd->max_groups; i++) {
+		struct idxd_group *group = &idxd->groups[i];
+
+		group->conf_dev.parent = &idxd->conf_dev;
+		dev_set_name(&group->conf_dev, "group%d.%d",
+			     idxd->id, group->id);
+		group->conf_dev.bus = idxd_get_bus_type(idxd);
+		group->conf_dev.groups = idxd_group_attribute_groups;
+		group->conf_dev.type = &idxd_group_device_type;
+		dev_dbg(dev, "Group device register: %s\n",
+			dev_name(&group->conf_dev));
+		rc = device_register(&group->conf_dev);
+		if (rc < 0) {
+			put_device(&group->conf_dev);
+			goto cleanup;
+		}
+	}
+
+	return 0;
+
+cleanup:
+	while (i--) {
+		struct idxd_group *group = &idxd->groups[i];
+
+		device_unregister(&group->conf_dev);
+	}
+	return rc;
+}
+
+static int idxd_setup_wq_sysfs(struct idxd_device *idxd)
+{
+	struct device *dev = &idxd->pdev->dev;
+	int i, rc;
+
+	for (i = 0; i < idxd->max_wqs; i++) {
+		struct idxd_wq *wq = &idxd->wqs[i];
+
+		wq->conf_dev.parent = &idxd->conf_dev;
+		dev_set_name(&wq->conf_dev, "wq%d.%d", idxd->id, wq->id);
+		wq->conf_dev.bus = idxd_get_bus_type(idxd);
+		wq->conf_dev.groups = idxd_wq_attribute_groups;
+		wq->conf_dev.type = &idxd_wq_device_type;
+		dev_dbg(dev, "WQ device register: %s\n",
+			dev_name(&wq->conf_dev));
+		rc = device_register(&wq->conf_dev);
+		if (rc < 0) {
+			put_device(&wq->conf_dev);
+			goto cleanup;
+		}
+	}
+
+	return 0;
+
+cleanup:
+	while (i--) {
+		struct idxd_wq *wq = &idxd->wqs[i];
+
+		device_unregister(&wq->conf_dev);
+	}
+	return rc;
+}
+
+static int idxd_setup_device_sysfs(struct idxd_device *idxd)
+{
+	struct device *dev = &idxd->pdev->dev;
+	int rc;
+	char devname[IDXD_NAME_SIZE];
+
+	sprintf(devname, "%s%d", idxd_get_dev_name(idxd), idxd->id);
+	idxd->conf_dev.parent = dev;
+	dev_set_name(&idxd->conf_dev, "%s", devname);
+	idxd->conf_dev.bus = idxd_get_bus_type(idxd);
+	idxd->conf_dev.groups = idxd_attribute_groups;
+	idxd->conf_dev.type = idxd_get_device_type(idxd);
+
+	dev_dbg(dev, "IDXD device register: %s\n", dev_name(&idxd->conf_dev));
+	rc = device_register(&idxd->conf_dev);
+	if (rc < 0) {
+		put_device(&idxd->conf_dev);
+		return rc;
+	}
+
+	return 0;
+}
+
+int idxd_setup_sysfs(struct idxd_device *idxd)
+{
+	struct device *dev = &idxd->pdev->dev;
+	int rc;
+
+	rc = idxd_setup_device_sysfs(idxd);
+	if (rc < 0) {
+		dev_dbg(dev, "Device sysfs registering failed: %d\n", rc);
+		return rc;
+	}
+
+	rc = idxd_setup_wq_sysfs(idxd);
+	if (rc < 0) {
+		/* unregister conf dev */
+		dev_dbg(dev, "Work Queue sysfs registering failed: %d\n", rc);
+		return rc;
+	}
+
+	rc = idxd_setup_group_sysfs(idxd);
+	if (rc < 0) {
+		/* unregister conf dev */
+		dev_dbg(dev, "Group sysfs registering failed: %d\n", rc);
+		return rc;
+	}
+
+	rc = idxd_setup_engine_sysfs(idxd);
+	if (rc < 0) {
+		/* unregister conf dev */
+		dev_dbg(dev, "Engine sysfs registering failed: %d\n", rc);
+		return rc;
+	}
+
+	return 0;
+}
+
+void idxd_cleanup_sysfs(struct idxd_device *idxd)
+{
+	int i;
+
+	for (i = 0; i < idxd->max_wqs; i++) {
+		struct idxd_wq *wq = &idxd->wqs[i];
+
+		device_unregister(&wq->conf_dev);
+	}
+
+	for (i = 0; i < idxd->max_engines; i++) {
+		struct idxd_engine *engine = &idxd->engines[i];
+
+		device_unregister(&engine->conf_dev);
+	}
+
+	for (i = 0; i < idxd->max_groups; i++) {
+		struct idxd_group *group = &idxd->groups[i];
+
+		device_unregister(&group->conf_dev);
+	}
+
+	device_unregister(&idxd->conf_dev);
+}
+
+int idxd_register_bus_type(void)
+{
+	int i, rc;
+
+	for (i = 0; i < IDXD_TYPE_MAX; i++) {
+		rc = bus_register(idxd_bus_types[i]);
+		if (rc < 0)
+			goto bus_err;
+	}
+
+	return 0;
+
+bus_err:
+	while (--i >= 0)
+		bus_unregister(idxd_bus_types[i]);
+	return rc;
+}
+
+void idxd_unregister_bus_type(void)
+{
+	int i;
+
+	for (i = 0; i < IDXD_TYPE_MAX; i++)
+		bus_unregister(idxd_bus_types[i]);
+}
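
Note on the error unwind in idxd_register_bus_type(): when registration of
entry i fails, only entries [0, i) have been registered, so only those need
to be undone, in reverse order. Below is a minimal userspace sketch of that
pattern. NBUS, fake_register(), fake_unregister() and fail_at are hypothetical
stand-ins for illustration and are not part of the driver.

```c
#include <assert.h>

#define NBUS 3			/* hypothetical stand-in for IDXD_TYPE_MAX */

static int registered[NBUS];	/* 1 while slot i is registered */
static int fail_at = -1;	/* test knob: index whose registration fails */

static int fake_register(int i)
{
	if (i == fail_at)
		return -1;
	registered[i] = 1;
	return 0;
}

static void fake_unregister(int i)
{
	assert(registered[i]);	/* only undo what was actually done */
	registered[i] = 0;
}

/* Register all slots; on failure undo slots [0, i) in reverse order. */
static int register_all(void)
{
	int i, rc;

	for (i = 0; i < NBUS; i++) {
		rc = fake_register(i);
		if (rc < 0)
			goto err;
	}
	return 0;

err:
	while (--i >= 0)
		fake_unregister(i);
	return rc;
}
```

The pre-decrement in the unwind loop skips the slot that failed and still
reaches slot 0.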



* [PATCH RFC 10/14] dmaengine: idxd: add descriptor manipulation routines
  2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
                   ` (8 preceding siblings ...)
  2019-11-20 21:24 ` [PATCH RFC 09/14] dmaengine: idxd: add configuration component of driver Dave Jiang
@ 2019-11-20 21:24 ` Dave Jiang
  2019-11-20 21:24 ` [PATCH RFC 11/14] dmaengine: idxd: connect idxd to dmaengine subsystem Dave Jiang
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:24 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

This commit adds helper functions for DSA descriptor allocation, setup,
submission, and free operations.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>

---

idxd_submit_desc() and idxd_alloc_desc() are used in the next patch in the
series.
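
For readers following the allocation path: idxd_alloc_desc() admits a
submitter only while the in-flight count is below the queue size, using
atomic_add_unless(). The sketch below shows that admission check in userspace
C11. add_unless(), struct slot_gate, gate_enter() and gate_leave() are
hypothetical names for illustration; the real code also handles blocking
waits and the submit_lock, which are left out here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's atomic_add_unless(). */
static bool add_unless(atomic_int *v, int a, int u)
{
	int cur = atomic_load(v);

	while (cur != u) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &cur, cur + a))
			return true;
	}
	return false;
}

struct slot_gate {
	atomic_int in_flight;	/* plays the role of wq->dq_count */
	int size;		/* plays the role of wq->size */
};

/* Try to claim a slot; fails once in_flight has reached size. */
static bool gate_enter(struct slot_gate *g)
{
	return add_unless(&g->in_flight, 1, g->size);
}

/* Release a slot, as idxd_free_desc() does with atomic_dec(). */
static void gate_leave(struct slot_gate *g)
{
	int prev = atomic_fetch_sub(&g->in_flight, 1);

	assert(prev > 0);
	(void)prev;
}
```

A caller that gets false from gate_enter() would either return -EAGAIN
(nonblocking) or sleep until a slot is released, as the patch does.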
---
 drivers/dma/idxd/Makefile |    2 -
 drivers/dma/idxd/submit.c |  127 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 128 insertions(+), 1 deletion(-)
 create mode 100644 drivers/dma/idxd/submit.c

diff --git a/drivers/dma/idxd/Makefile b/drivers/dma/idxd/Makefile
index a552560a03dc..50eca12015e2 100644
--- a/drivers/dma/idxd/Makefile
+++ b/drivers/dma/idxd/Makefile
@@ -1,2 +1,2 @@
 obj-$(CONFIG_INTEL_IDXD) += idxd.o
-idxd-y := init.o irq.o device.o sysfs.o
+idxd-y := init.o irq.o device.o sysfs.o submit.o
diff --git a/drivers/dma/idxd/submit.c b/drivers/dma/idxd/submit.c
new file mode 100644
index 000000000000..2dcd13f9f654
--- /dev/null
+++ b/drivers/dma/idxd/submit.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/dmaengine.h>
+#include <uapi/linux/idxd.h>
+#include "../dmaengine.h"
+#include "idxd.h"
+#include "registers.h"
+
+static struct idxd_desc *idxd_alloc_desc(struct idxd_wq *wq, bool nonblock)
+{
+	struct idxd_desc *desc;
+	int idx;
+	struct idxd_device *idxd = wq->idxd;
+
+	if (idxd->state != IDXD_DEV_ENABLED)
+		return ERR_PTR(-EIO);
+
+	if (!nonblock)
+		percpu_down_read(&wq->submit_lock);
+	else if (!percpu_down_read_trylock(&wq->submit_lock))
+		return ERR_PTR(-EBUSY);
+
+	if (!atomic_add_unless(&wq->dq_count, 1, wq->size)) {
+		int rc;
+
+		if (nonblock) {
+			percpu_up_read(&wq->submit_lock);
+			return ERR_PTR(-EAGAIN);
+		}
+
+		percpu_up_read(&wq->submit_lock);
+		percpu_down_write(&wq->submit_lock);
+		rc = wait_event_interruptible(wq->submit_waitq,
+				atomic_add_unless(&wq->dq_count, 1, wq->size) ||
+				idxd->state != IDXD_DEV_ENABLED);
+		percpu_up_write(&wq->submit_lock);
+		if (rc < 0)
+			return ERR_PTR(-EINTR);
+		if (idxd->state != IDXD_DEV_ENABLED)
+			return ERR_PTR(-EIO);
+	} else {
+		percpu_up_read(&wq->submit_lock);
+	}
+
+	idx = sbitmap_get(&wq->sbmap, 0, false);
+	if (idx < 0) {
+		atomic_dec(&wq->dq_count);
+		return ERR_PTR(-EAGAIN);
+	}
+
+	desc = wq->descs[idx];
+	memset(desc->hw, 0, sizeof(struct dsa_hw_desc));
+	memset(desc->completion, 0, sizeof(struct dsa_completion_record));
+	return desc;
+}
+
+void idxd_free_desc(struct idxd_wq *wq, struct idxd_desc *desc)
+{
+	atomic_dec(&wq->dq_count);
+
+	sbitmap_clear_bit(&wq->sbmap, desc->id);
+	wake_up(&wq->submit_waitq);
+}
+
+static int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc,
+			    bool nonblock)
+{
+	struct idxd_device *idxd = wq->idxd;
+	int vec = desc->hw->int_handle;
+
+	if (idxd->state != IDXD_DEV_ENABLED)
+		return -EIO;
+
+	/*
+	 * The wmb() flushes writes to coherent DMA data before possibly
+	 * triggering a DMA read. The wmb() is necessary even on UP because
+	 * the recipient is a device.
+	 */
+	wmb();
+	iosubmit_cmds512(wq->dportal, desc->hw, 1);
+
+	/*
+	 * Pending the descriptor to the lockless list for the irq_entry
+	 * that we designated the descriptor to.
+	 */
+	llist_add(&desc->llnode, &idxd->irq_entries[vec].pending_llist);
+
+	return 0;
+}
+
+static inline void idxd_prep_desc_common(struct idxd_wq *wq,
+					 struct dsa_hw_desc *hw, char opcode,
+					 u64 addr_f1, u64 addr_f2, u64 len,
+					 u64 compl, u32 flags)
+{
+	hw->flags = flags;
+	hw->opcode = opcode;
+	hw->src_addr = addr_f1;
+	hw->dst_addr = addr_f2;
+	hw->xfer_size = len;
+	hw->priv = !!(wq->type == IDXD_WQT_KERNEL);
+	hw->completion_addr = compl;
+
+	/*
+	 * Descriptor completion vectors are 1-8 for MSIX. We will round
+	 * robin through the 8 vectors.
+	 */
+	hw->int_handle = ++wq->vec_ptr;
+	wq->vec_ptr = wq->vec_ptr & 7;
+}
+
+static inline void set_desc_addresses(struct dma_request *req,
+				      u64 *src, u64 *dst)
+{
+		*src = sg_dma_address(&req->sg[0]);
+		*dst = req->pg_dma;
+}
+
+static inline void set_completion_address(struct idxd_desc *desc,
+					  u64 *compl_addr)
+{
+		*compl_addr = desc->compl_dma;
+}



* [PATCH RFC 11/14] dmaengine: idxd: connect idxd to dmaengine subsystem
  2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
                   ` (9 preceding siblings ...)
  2019-11-20 21:24 ` [PATCH RFC 10/14] dmaengine: idxd: add descriptor manipulation routines Dave Jiang
@ 2019-11-20 21:24 ` Dave Jiang
  2019-11-20 21:24 ` [PATCH RFC 12/14] dmaengine: request submit optimization Dave Jiang
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:24 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

Add plumbing for dmaengine subsystem connection. The driver registers a DMA
device per DSA device. The channels are dynamically registered when a
workqueue is configured to be "kernel:dmaengine" type. The driver will
utilize the newly introduced DMA request API calls to provide a lockless
descriptor submission path.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
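
Note on the completion path: idxd_wq_thread() drains a lockless pending list
fed by many submitters and keeps unfinished descriptors on a private work
list. The following is a rough userspace sketch of that two-list scheme using
C11 atomics. struct desc, struct irq_entry, pending_add(), sweep() and
handle_irq() are hypothetical simplifications: the work list here is a plain
singly linked stack rather than a list_head, so it does not preserve
submission order, and the retry loop is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct desc {
	int status;		/* 0 = still owned by hardware */
	struct desc *next;	/* link used by both lists in this sketch */
};

struct irq_entry {
	_Atomic(struct desc *) pending;	/* multi-producer lockless stack */
	struct desc *work;		/* touched by the consumer only */
};

/* Producer side: push without a lock, similar in spirit to llist_add(). */
static void pending_add(struct irq_entry *ie, struct desc *d)
{
	struct desc *head = atomic_load(&ie->pending);

	do {
		d->next = head;
	} while (!atomic_compare_exchange_weak(&ie->pending, &head, d));
}

/*
 * Consumer side: complete finished descriptors from list and chain the
 * unfinished ones onto the work list. Returns the number completed.
 */
static int sweep(struct irq_entry *ie, struct desc *list)
{
	int done = 0;

	while (list) {
		struct desc *next = list->next;

		if (list->status) {
			done++;		/* real code would run the callback */
		} else {
			list->next = ie->work;
			ie->work = list;
		}
		list = next;
	}
	return done;
}

/* One pass of the handler: old work first, then everything newly pending. */
static int handle_irq(struct irq_entry *ie)
{
	struct desc *work = ie->work;
	int done;

	ie->work = NULL;
	done = sweep(ie, work);
	/* Take the whole pending stack in one step, like llist_del_all(). */
	done += sweep(ie, atomic_exchange(&ie->pending, (struct desc *)NULL));
	return done;
}
```

Because only the single consumer ever touches the work list, it needs no
locking; the only shared state is the head pointer of the pending stack.
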
 drivers/dma/Kconfig       |    1 
 drivers/dma/idxd/Makefile |    2 -
 drivers/dma/idxd/device.c |    2 +
 drivers/dma/idxd/dma.c    |  119 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/dma/idxd/idxd.h   |   14 +++++
 drivers/dma/idxd/init.c   |   48 ++++++++++++++++++
 drivers/dma/idxd/irq.c    |  101 ++++++++++++++++++++++++++++++++++++++
 drivers/dma/idxd/submit.c |   51 +++++++++++++++++++
 drivers/dma/idxd/sysfs.c  |   28 +++++++++++
 9 files changed, 364 insertions(+), 2 deletions(-)
 create mode 100644 drivers/dma/idxd/dma.c

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 5f9419c35960..416a40cafc27 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -282,6 +282,7 @@ config INTEL_IDXD
 	tristate "Intel Data Accelerators support"
 	depends on PCI && X86_64
 	select DMA_ENGINE
+	select DMA_ENGINE_REQUEST
 	select SBITMAP
 	help
 	  Enable support for the Intel(R) data accelerators present
diff --git a/drivers/dma/idxd/Makefile b/drivers/dma/idxd/Makefile
index 50eca12015e2..a036ba0e77d2 100644
--- a/drivers/dma/idxd/Makefile
+++ b/drivers/dma/idxd/Makefile
@@ -1,2 +1,2 @@
 obj-$(CONFIG_INTEL_IDXD) += idxd.o
-idxd-y := init.o irq.o device.o sysfs.o submit.o
+idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 74a60a8bef76..49638d3a2151 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -5,7 +5,9 @@
 #include <linux/module.h>
 #include <linux/pci.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/dmaengine.h>
 #include <uapi/linux/idxd.h>
+#include "../dmaengine.h"
 #include "idxd.h"
 #include "registers.h"
 
diff --git a/drivers/dma/idxd/dma.c b/drivers/dma/idxd/dma.c
new file mode 100644
index 000000000000..07fbc98668ae
--- /dev/null
+++ b/drivers/dma/idxd/dma.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/dmaengine.h>
+#include <uapi/linux/idxd.h>
+#include "../dmaengine.h"
+#include "registers.h"
+#include "idxd.h"
+
+void idxd_parse_completion_status(u8 status, enum dmaengine_tx_result *res)
+{
+	u8 code = status & DSA_COMP_STATUS_MASK;
+
+	switch (code) {
+	case DSA_COMP_SUCCESS:
+		*res = DMA_TRANS_NOERROR;
+		break;
+	case DSA_COMP_HW_ERR1:
+		*res = DMA_TRANS_READ_FAILED;
+		break;
+	default:
+		*res = DMA_TRANS_ERROR;
+		break;
+	}
+}
+
+static int idxd_dma_submit_request(struct dma_chan *chan,
+				   struct dma_request *req)
+{
+	struct idxd_wq *wq = container_of(chan, struct idxd_wq, dma_chan);
+
+	if (req->cmd == DMA_MEMCPY)
+		return idxd_submit_memcpy(wq, req);
+
+	return -EINVAL;
+}
+
+static int idxd_dma_alloc_chan_resources(struct dma_chan *chan)
+{
+	struct idxd_wq *wq = container_of(chan, struct idxd_wq, dma_chan);
+	struct device *dev = &wq->idxd->pdev->dev;
+
+	idxd_wq_get(wq);
+	dev_dbg(dev, "%s: client_count: %d\n", __func__, idxd_wq_refcount(wq));
+	return 0;
+}
+
+static void idxd_dma_free_chan_resources(struct dma_chan *chan)
+{
+	struct idxd_wq *wq = container_of(chan, struct idxd_wq, dma_chan);
+	struct device *dev = &wq->idxd->pdev->dev;
+
+	idxd_wq_put(wq);
+	dev_dbg(dev, "%s: client_count: %d\n", __func__, idxd_wq_refcount(wq));
+}
+
+int idxd_register_dma_device(struct idxd_device *idxd)
+{
+	struct dma_device *dma = &idxd->dma_dev;
+
+	INIT_LIST_HEAD(&dma->channels);
+	dma->dev = &idxd->pdev->dev;
+
+	if (idxd->hw.opcap.bits[0] & IDXD_OPCAP_MEMMOVE)
+		dma_cap_set(DMA_MEMCPY, dma->cap_mask);
+
+	if (idxd->hw.opcap.bits[0] & IDXD_OPCAP_NOOP)
+		dma_cap_set(DMA_INTERRUPT, dma->cap_mask);
+
+	dma->device_submit_request = idxd_dma_submit_request;
+	dma->device_alloc_chan_resources = idxd_dma_alloc_chan_resources;
+	dma->device_free_chan_resources = idxd_dma_free_chan_resources;
+
+	return dma_async_request_device_register(&idxd->dma_dev);
+}
+
+void idxd_unregister_dma_device(struct idxd_device *idxd)
+{
+	dma_async_device_unregister(&idxd->dma_dev);
+}
+
+int idxd_register_dma_channel(struct idxd_wq *wq)
+{
+	struct idxd_device *idxd = wq->idxd;
+	struct dma_device *dma = &idxd->dma_dev;
+	struct dma_chan *chan = &wq->dma_chan;
+	struct idxd_group *group = wq->group;
+	int rc;
+
+	memset(&wq->dma_chan, 0, sizeof(struct dma_chan));
+	chan->device = dma;
+	list_add_tail(&chan->device_node, &dma->channels);
+	chan->max_sgs = wq->batch_size;
+	chan->depth = wq->size +
+		idxd->hw.gen_cap.max_descs_per_engine * group->num_engines;
+
+	rc = dma_async_device_channel_register(dma, chan);
+	if (rc < 0)
+		return rc;
+
+	rc = dma_chan_alloc_request_resources(chan);
+	if (rc < 0) {
+		dma_async_device_channel_unregister(dma, chan);
+		return rc;
+	}
+
+	return 0;
+}
+
+void idxd_unregister_dma_channel(struct idxd_wq *wq)
+{
+	dma_chan_free_request_resources(&wq->dma_chan);
+	dma_async_device_channel_unregister(&wq->idxd->dma_dev, &wq->dma_chan);
+}
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 654351105d85..7f94e50267cc 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -4,6 +4,7 @@
 #define _IDXD_H_
 
 #include <linux/sbitmap.h>
+#include <linux/dmaengine.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/wait.h>
 #include "registers.h"
@@ -96,6 +97,7 @@ struct idxd_wq {
 	int batches_size;
 	int batch_size;
 	struct sbitmap sbmap;
+	struct dma_chan dma_chan;
 	struct percpu_rw_semaphore submit_lock;
 	wait_queue_head_t submit_waitq;
 	char name[WQ_NAME_SIZE + 1];
@@ -167,6 +169,8 @@ struct idxd_device {
 	struct msix_entry *msix_entries;
 	int num_wq_irqs;
 	struct idxd_irq_entry *irq_entries;
+
+	struct dma_device dma_dev;
 };
 
 /* IDXD software descriptor */
@@ -181,6 +185,7 @@ struct idxd_desc {
 	struct list_head list;
 	int id;
 	struct idxd_wq *wq;
+	struct dma_request *req;
 };
 
 #define confdev_to_idxd(dev) container_of(dev, struct idxd_device, conf_dev)
@@ -252,4 +257,13 @@ int idxd_wq_disable(struct idxd_wq *wq);
 int idxd_wq_map_portal(struct idxd_wq *wq);
 void idxd_wq_unmap_portal(struct idxd_wq *wq);
 
+/* submission */
+int idxd_submit_memcpy(struct idxd_wq *wq, struct dma_request *req);
+
+/* dmaengine */
+int idxd_register_dma_device(struct idxd_device *idxd);
+void idxd_unregister_dma_device(struct idxd_device *idxd);
+int idxd_register_dma_channel(struct idxd_wq *wq);
+void idxd_unregister_dma_channel(struct idxd_wq *wq);
+void idxd_parse_completion_status(u8 status, enum dmaengine_tx_result *res);
 #endif
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index b9a7bfea93b5..a030eaaf0ab1 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -15,6 +15,8 @@
 #include <linux/device.h>
 #include <linux/idr.h>
 #include <uapi/linux/idxd.h>
+#include <linux/dmaengine.h>
+#include "../dmaengine.h"
 #include "registers.h"
 #include "idxd.h"
 
@@ -404,6 +406,50 @@ static int idxd_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	return 0;
 }
 
+static void idxd_flush_pending_llist(struct idxd_irq_entry *ie)
+{
+	struct idxd_desc *desc, *itr;
+	struct llist_node *head;
+	struct dma_request *req;
+
+	head = llist_del_all(&ie->pending_llist);
+	if (!head)
+		return;
+
+	llist_for_each_entry_safe(desc, itr, head, llnode) {
+		req = desc->req;
+		if (!desc->completion->status)
+			req->result.result = DMA_TRANS_ABORTED;
+		else if (desc->completion->status == DSA_COMP_SUCCESS)
+			req->result.result = DMA_TRANS_NOERROR;
+		else
+			req->result.result = DMA_TRANS_ERROR;
+
+		dmaengine_request_complete(req);
+		idxd_free_desc(desc->wq, desc);
+	}
+}
+
+static void idxd_flush_work_list(struct idxd_irq_entry *ie)
+{
+	struct idxd_desc *desc, *iter;
+	struct dma_request *req;
+
+	list_for_each_entry_safe(desc, iter, &ie->work_list, list) {
+		req = desc->req;
+		list_del(&desc->list);
+		if (!desc->completion->status)
+			req->result.result = DMA_TRANS_ABORTED;
+		else if (desc->completion->status == DSA_COMP_SUCCESS)
+			req->result.result = DMA_TRANS_NOERROR;
+		else
+			req->result.result = DMA_TRANS_ERROR;
+
+		dmaengine_request_complete(req);
+		idxd_free_desc(desc->wq, desc);
+	}
+}
+
 static void idxd_shutdown(struct pci_dev *pdev)
 {
 	struct idxd_device *idxd = pci_get_drvdata(pdev);
@@ -427,6 +473,8 @@ static void idxd_shutdown(struct pci_dev *pdev)
 		synchronize_irq(idxd->msix_entries[i].vector);
 		if (i == 0)
 			continue;
+		idxd_flush_pending_llist(irq_entry);
+		idxd_flush_work_list(irq_entry);
 	}
 }
 
diff --git a/drivers/dma/idxd/irq.c b/drivers/dma/idxd/irq.c
index de4b80973c2f..b4adeb2817d1 100644
--- a/drivers/dma/idxd/irq.c
+++ b/drivers/dma/idxd/irq.c
@@ -5,7 +5,9 @@
 #include <linux/module.h>
 #include <linux/pci.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/dmaengine.h>
 #include <uapi/linux/idxd.h>
+#include "../dmaengine.h"
 #include "idxd.h"
 #include "registers.h"
 
@@ -146,11 +148,110 @@ irqreturn_t idxd_misc_thread(int vec, void *data)
 	return IRQ_HANDLED;
 }
 
+static int irq_process_pending_llist(struct idxd_irq_entry *irq_entry,
+				     int *processed)
+{
+	struct idxd_desc *desc, *t;
+	struct llist_node *head;
+	int queued = 0;
+	struct dma_request *req;
+
+	head = llist_del_all(&irq_entry->pending_llist);
+	if (!head)
+		return 0;
+
+	llist_for_each_entry_safe(desc, t, head, llnode) {
+		req = desc->req;
+		if (desc->completion->status) {
+			if ((desc->completion->status & DSA_COMP_STATUS_MASK) !=
+					DSA_COMP_SUCCESS)
+				idxd_parse_completion_status(desc->completion->status,
+							     &req->result.result);
+
+			dmaengine_request_complete(req);
+			idxd_free_desc(desc->wq, desc);
+			(*processed)++;
+		} else {
+			list_add_tail(&desc->list, &irq_entry->work_list);
+			queued++;
+		}
+	}
+
+	return queued;
+}
+
+static int irq_process_work_list(struct idxd_irq_entry *irq_entry,
+				 int *processed)
+{
+	struct list_head *node, *next;
+	int queued = 0;
+	struct dma_request *req;
+
+	if (list_empty(&irq_entry->work_list))
+		return 0;
+
+	list_for_each_safe(node, next, &irq_entry->work_list) {
+		struct idxd_desc *desc =
+			container_of(node, struct idxd_desc, list);
+
+		req = desc->req;
+		if (desc->completion->status) {
+			list_del(&desc->list);
+			/* process and callback */
+			if ((desc->completion->status & DSA_COMP_STATUS_MASK) !=
+					DSA_COMP_SUCCESS)
+				idxd_parse_completion_status(desc->completion->status,
+							     &req->result.result);
+
+			dmaengine_request_complete(req);
+			idxd_free_desc(desc->wq, desc);
+			(*processed)++;
+		} else {
+			queued++;
+		}
+	}
+
+	return queued;
+}
+
 irqreturn_t idxd_wq_thread(int irq, void *data)
 {
 	struct idxd_irq_entry *irq_entry = data;
+	int rc, processed = 0, retry = 0;
+
+	/*
+	 * There are two lists we are processing. The pending_llist is where
+	 * submitter adds all the submitted descriptors after sending them to
+	 * the workqueue. It's a lockless singly linked list. The work_list
+	 * is the common linux doubly linked list. We are in a scenario of
+	 * multiple producers and a single consumer. The producers are all
+	 * the kernel submitters of descriptors, and the consumer is the
+	 * kernel irq handler thread for the msix vector when using threaded
+	 * irq. To work with the restrictions of llist to remain lockless,
+	 * we are doing the following steps:
+	 * 1. Iterate through the work_list and process any completed
+	 *    descriptor. Delete the completed entries during iteration.
+	 * 2. llist_del_all() from the pending list.
+	 * 3. Iterate through the llist that was deleted from the pending list
+	 *    and process the completed entries.
+	 * 4. If the entry is still waiting on hardware, list_add_tail() to
+	 *    the work_list.
+	 * 5. Repeat until no more descriptors.
+	 */
+	do {
+		rc = irq_process_work_list(irq_entry, &processed);
+		if (rc != 0) {
+			retry++;
+			continue;
+		}
+
+		rc = irq_process_pending_llist(irq_entry, &processed);
+	} while (rc != 0 && retry != 10);
 
 	idxd_unmask_msix_vector(irq_entry->idxd, irq_entry->id);
 
+	if (processed == 0)
+		return IRQ_NONE;
+
 	return IRQ_HANDLED;
 }
diff --git a/drivers/dma/idxd/submit.c b/drivers/dma/idxd/submit.c
index 2dcd13f9f654..f7baa1bbb0c7 100644
--- a/drivers/dma/idxd/submit.c
+++ b/drivers/dma/idxd/submit.c
@@ -87,7 +87,9 @@ static int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc,
 	 * Pending the descriptor to the lockless list for the irq_entry
 	 * that we designated the descriptor to.
 	 */
-	llist_add(&desc->llnode, &idxd->irq_entries[vec].pending_llist);
+	if (desc->req->flags & DMA_PREP_INTERRUPT)
+		llist_add(&desc->llnode,
+			  &idxd->irq_entries[vec].pending_llist);
 
 	return 0;
 }
@@ -125,3 +127,50 @@ static inline void set_completion_address(struct idxd_desc *desc,
 {
 		*compl_addr = desc->compl_dma;
 }
+
+static void op_flag_setup(struct idxd_wq *wq, struct dma_request *req,
+			  u32 *desc_flags)
+{
+	*desc_flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR;
+	if (req->flags & DMA_PREP_INTERRUPT)
+		*desc_flags |= IDXD_OP_FLAG_RCI;
+	if (req->flags & DMA_PREP_FENCE)
+		*desc_flags |= IDXD_OP_FLAG_FENCE;
+}
+
+int idxd_submit_memcpy(struct idxd_wq *wq, struct dma_request *req)
+{
+	u32 desc_flags;
+	struct idxd_device *idxd = wq->idxd;
+	struct idxd_desc *desc;
+	int rc;
+	bool nonblock;
+	u64 compl_addr, src, dst;
+
+	if (wq->state != IDXD_WQ_ENABLED)
+		return -EPERM;
+
+	if (req->bvec.bv_len > idxd->max_xfer_bytes)
+		return -EINVAL;
+
+	op_flag_setup(wq, req, &desc_flags);
+	nonblock = !!(req->flags & DMA_SUBMIT_NONBLOCK);
+	desc = idxd_alloc_desc(wq, nonblock);
+	if (IS_ERR(desc))
+		return PTR_ERR(desc);
+
+	set_completion_address(desc, &compl_addr);
+	set_desc_addresses(req, &src, &dst);
+	idxd_prep_desc_common(wq, desc->hw, DSA_OPCODE_MEMMOVE,
+			      src, dst, req->bvec.bv_len, compl_addr,
+			      desc_flags);
+	desc->req = req;
+
+	rc = idxd_submit_desc(wq, desc, nonblock);
+	if (rc < 0) {
+		idxd_free_desc(wq, desc);
+		return rc;
+	}
+
+	return 0;
+}
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index fab916a555b2..a7468d22c287 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -55,6 +55,14 @@ static inline bool is_idxd_wq_dev(struct device *dev)
 	return dev ? dev->type == &idxd_wq_device_type : false;
 }
 
+static inline bool is_idxd_wq_dmaengine(struct idxd_wq *wq)
+{
+	if (wq->type == IDXD_WQT_KERNEL &&
+	    strcmp(wq->name, "dmaengine") == 0)
+		return true;
+	return false;
+}
+
 static int idxd_config_bus_match(struct device *dev,
 				 struct device_driver *drv)
 {
@@ -122,6 +130,12 @@ static int idxd_config_bus_probe(struct device *dev)
 		spin_unlock_irqrestore(&idxd->dev_lock, flags);
 		dev_info(dev, "Device %s enabled\n", dev_name(dev));
 
+		rc = idxd_register_dma_device(idxd);
+		if (rc < 0) {
+			/* dev_lock was already dropped above */
+			dev_dbg(dev, "Failed to register dmaengine device\n");
+			return rc;
+		}
 		return 0;
 	} else if (is_idxd_wq_dev(dev)) {
 		struct idxd_wq *wq = confdev_to_wq(dev);
@@ -194,6 +208,16 @@ static int idxd_config_bus_probe(struct device *dev)
 		wq->client_count = 0;
 
 		dev_info(dev, "wq %s enabled\n", dev_name(&wq->conf_dev));
+
+		if (is_idxd_wq_dmaengine(wq)) {
+			rc = idxd_register_dma_channel(wq);
+			if (rc < 0) {
+				dev_dbg(dev, "DMA channel register failed\n");
+				mutex_unlock(&wq->wq_lock);
+				return rc;
+			}
+		}
+
 		mutex_unlock(&wq->wq_lock);
 		return 0;
 	}
@@ -215,6 +239,9 @@ static void disable_wq(struct idxd_wq *wq)
 		return;
 	}
 
+	if (is_idxd_wq_dmaengine(wq))
+		idxd_unregister_dma_channel(wq);
+
 	if (idxd_wq_refcount(wq))
 		dev_warn(dev, "Clients has claim on wq %d: %d\n",
 			 wq->id, idxd_wq_refcount(wq));
@@ -264,6 +291,7 @@ static int idxd_config_bus_remove(struct device *dev)
 			device_release_driver(&wq->conf_dev);
 		}
 
+		idxd_unregister_dma_device(idxd);
 		spin_lock_irqsave(&idxd->dev_lock, flags);
 		rc = idxd_device_disable(idxd);
 		spin_unlock_irqrestore(&idxd->dev_lock, flags);



* [PATCH RFC 12/14] dmaengine: request submit optimization
  2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
                   ` (10 preceding siblings ...)
  2019-11-20 21:24 ` [PATCH RFC 11/14] dmaengine: idxd: connect idxd to dmaengine subsystem Dave Jiang
@ 2019-11-20 21:24 ` Dave Jiang
  2019-11-20 21:25 ` [PATCH RFC 13/14] dmaengine: idxd: add char driver to expose submission portal to userland Dave Jiang
  2019-11-20 21:25 ` [PATCH RFC 14/14] dmaengine: idxd: add sysfs ABI for idxd driver Dave Jiang
  13 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:24 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

Add a direct call to the idxd submit function in dmaengine as an
optimization. Spectre-v2 mitigations make indirect branches expensive, so
calling the driver directly avoids that cost and reduces the cycles needed
to initiate a descriptor submission.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
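
The idea in dmaengine_submit_request() is to compare the function pointer
against a known function and, on a match, call it directly. A minimal
standalone sketch of the technique follows. struct chan, fast_submit(),
other_submit() and submit_request() are hypothetical names for illustration,
not the dmaengine API.

```c
#include <assert.h>
#include <stddef.h>

struct chan;
typedef int (*submit_fn)(struct chan *c, int req);

struct chan {
	submit_fn submit;	/* plays the role of device_submit_request */
};

/* Hypothetical "hot" driver entry point known at compile time. */
static int fast_submit(struct chan *c, int req)
{
	(void)c;
	return req + 1;
}

/* Hypothetical second driver, reached through the pointer only. */
static int other_submit(struct chan *c, int req)
{
	(void)c;
	return req + 100;
}

static int submit_request(struct chan *c, int req)
{
	assert(c != NULL);
	if (!c->submit)
		return -1;

	/*
	 * Comparing the pointer is cheap; when it matches, the compiler can
	 * emit a direct (or inlined) call instead of an indirect branch.
	 */
	if (c->submit == fast_submit)
		return fast_submit(c, req);

	return c->submit(c, req);
}
```

The trade-off is that the generic helper now names one specific driver, which
is why the patch needs a stub for builds without CONFIG_INTEL_IDXD.
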
 drivers/dma/idxd/dma.c    |    5 +++--
 include/linux/dmaengine.h |    6 +++++-
 include/linux/idxd.h      |   23 +++++++++++++++++++++++
 usr/include/Makefile      |    1 +
 4 files changed, 32 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/idxd.h

diff --git a/drivers/dma/idxd/dma.c b/drivers/dma/idxd/dma.c
index 07fbc98668ae..b9b4621e504a 100644
--- a/drivers/dma/idxd/dma.c
+++ b/drivers/dma/idxd/dma.c
@@ -8,6 +8,7 @@
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/dmaengine.h>
 #include <uapi/linux/idxd.h>
+#include <linux/idxd.h>
 #include "../dmaengine.h"
 #include "registers.h"
 #include "idxd.h"
@@ -29,8 +30,7 @@ void idxd_parse_completion_status(u8 status, enum dmaengine_tx_result *res)
 	}
 }
 
-static int idxd_dma_submit_request(struct dma_chan *chan,
-				   struct dma_request *req)
+int idxd_dma_submit_request(struct dma_chan *chan, struct dma_request *req)
 {
 	struct idxd_wq *wq = container_of(chan, struct idxd_wq, dma_chan);
 
@@ -39,6 +39,7 @@ static int idxd_dma_submit_request(struct dma_chan *chan,
 
 	return -EINVAL;
 }
+EXPORT_SYMBOL_GPL(idxd_dma_submit_request);
 
 static int idxd_dma_alloc_chan_resources(struct dma_chan *chan)
 {
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 220d241d71ed..cebfa8db60a0 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -1395,6 +1395,7 @@ static inline int dma_get_slave_caps(struct dma_chan *chan,
 }
 #endif
 
+#include <linux/idxd.h>
 /* dmaengine_submit_request - helper routine for caller to submit
  *				a DMA request.
  * @chan: dma channel context
@@ -1412,7 +1413,10 @@ static inline int dmaengine_submit_request(struct dma_chan *chan,
 	if (!ddev->device_submit_request)
 		return -EINVAL;
 
-	return ddev->device_submit_request(chan, req);
+	if (ddev->device_submit_request == idxd_dma_submit_request)
+		return idxd_dma_submit_request(chan, req);
+	else
+		return ddev->device_submit_request(chan, req);
 }
 
 /* dmaengine_submit_request_and_wait - helper routine for caller to submit
diff --git a/include/linux/idxd.h b/include/linux/idxd.h
new file mode 100644
index 000000000000..3b7b3fbe86c8
--- /dev/null
+++ b/include/linux/idxd.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#ifndef _LINUX_IDXD_H_
+#define _LINUX_IDXD_H_
+
+struct dmaengine_result;
+struct dma_request;
+struct dma_chan;
+
+typedef void (*dma_async_tx_callback_result)(void *dma_async_param,
+		const struct dmaengine_result *result);
+
+#if IS_ENABLED(CONFIG_INTEL_IDXD)
+int idxd_dma_submit_request(struct dma_chan *chan, struct dma_request *req);
+#else
+static inline int idxd_dma_submit_request(struct dma_chan *chan,
+					  struct dma_request *req)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
+#endif
diff --git a/usr/include/Makefile b/usr/include/Makefile
index 57b20f7b6729..e5cf680efbc2 100644
--- a/usr/include/Makefile
+++ b/usr/include/Makefile
@@ -29,6 +29,7 @@ header-test- += linux/android/binderfs.h
 header-test-$(CONFIG_CPU_BIG_ENDIAN) += linux/byteorder/big_endian.h
 header-test-$(CONFIG_CPU_LITTLE_ENDIAN) += linux/byteorder/little_endian.h
 header-test- += linux/coda.h
+header-test- += linux/idxd.h
 header-test- += linux/elfcore.h
 header-test- += linux/errqueue.h
 header-test- += linux/fsmap.h



* [PATCH RFC 13/14] dmaengine: idxd: add char driver to expose submission portal to userland
  2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
                   ` (11 preceding siblings ...)
  2019-11-20 21:24 ` [PATCH RFC 12/14] dmaengine: request submit optimization Dave Jiang
@ 2019-11-20 21:25 ` Dave Jiang
  2019-11-20 21:25 ` [PATCH RFC 14/14] dmaengine: idxd: add sysfs ABI for idxd driver Dave Jiang
  13 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:25 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

Create a char device region that lets applications acquire user portals and
submit DMA operations. One char device is created per exposed work queue.
The workqueue type "user" marks a work queue for exposure as a user char
device. For example, if workqueue 0 of DSA device 0 is marked this way, the
device node /dev/dsa/wq0.0 is created.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
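
For illustration, the device node name in the example above can be built as
shown below. This assumes the node mirrors the sysfs naming wq<device>.<wq>
under a per-type directory such as "dsa". wq_node_path() is a hypothetical
helper, not something this patch provides.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the char device node for a work queue.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int wq_node_path(char *buf, size_t len, const char *type,
			int dev_id, int wq_id)
{
	int n;

	assert(buf && type && strlen(type) > 0);
	n = snprintf(buf, len, "/dev/%s/wq%d.%d", type, dev_id, wq_id);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

An application would pass the resulting path to open(2) to obtain a file
descriptor for the work queue.
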
 drivers/dma/idxd/Makefile |    2 
 drivers/dma/idxd/cdev.c   |  304 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/dma/idxd/device.c |    2 
 drivers/dma/idxd/idxd.h   |   38 ++++++
 drivers/dma/idxd/init.c   |   10 +
 drivers/dma/idxd/irq.c    |   18 +++
 drivers/dma/idxd/submit.c |    4 -
 drivers/dma/idxd/sysfs.c  |   52 +++++++-
 8 files changed, 425 insertions(+), 5 deletions(-)
 create mode 100644 drivers/dma/idxd/cdev.c

diff --git a/drivers/dma/idxd/Makefile b/drivers/dma/idxd/Makefile
index a036ba0e77d2..8978b898d777 100644
--- a/drivers/dma/idxd/Makefile
+++ b/drivers/dma/idxd/Makefile
@@ -1,2 +1,2 @@
 obj-$(CONFIG_INTEL_IDXD) += idxd.o
-idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o
+idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o cdev.o
diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
new file mode 100644
index 000000000000..a90d38a9769a
--- /dev/null
+++ b/drivers/dma/idxd/cdev.c
@@ -0,0 +1,304 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/sched/task.h>
+#include <linux/intel-svm.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/cdev.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <uapi/linux/idxd.h>
+#include "registers.h"
+#include "idxd.h"
+
+struct idxd_cdev_context {
+	const char *name;
+	dev_t devt;
+	struct ida minor_ida;
+};
+
+/*
+ * ictx is an array based off of accelerator types. enum idxd_type
+ * is used as index
+ */
+static struct idxd_cdev_context ictx[IDXD_TYPE_MAX] = {
+	{ .name = "dsa" },
+};
+
+struct idxd_user_context {
+	struct idxd_wq *wq;
+	struct task_struct *task;
+	unsigned int flags;
+};
+
+enum idxd_cdev_cleanup {
+	CDEV_NORMAL = 0,
+	CDEV_FAILED,
+};
+
+static void idxd_cdev_dev_release(struct device *dev)
+{
+	dev_dbg(dev, "releasing cdev device\n");
+	kfree(dev);
+}
+
+static struct device_type idxd_cdev_device_type = {
+	.name = "idxd_cdev",
+	.release = idxd_cdev_dev_release,
+};
+
+static inline struct idxd_cdev *inode_idxd_cdev(struct inode *inode)
+{
+	struct cdev *cdev = inode->i_cdev;
+
+	return container_of(cdev, struct idxd_cdev, cdev);
+}
+
+static inline struct idxd_wq *idxd_cdev_wq(struct idxd_cdev *idxd_cdev)
+{
+	return container_of(idxd_cdev, struct idxd_wq, idxd_cdev);
+}
+
+static inline struct idxd_wq *inode_wq(struct inode *inode)
+{
+	return idxd_cdev_wq(inode_idxd_cdev(inode));
+}
+
+static int idxd_cdev_open(struct inode *inode, struct file *filp)
+{
+	struct idxd_user_context *ctx;
+	struct idxd_device *idxd;
+	struct idxd_wq *wq;
+	struct device *dev;
+	struct idxd_cdev *idxd_cdev;
+
+	wq = inode_wq(inode);
+	idxd = wq->idxd;
+	dev = &idxd->pdev->dev;
+	idxd_cdev = &wq->idxd_cdev;
+
+	dev_dbg(dev, "%s called\n", __func__);
+
+	if (idxd_wq_refcount(wq) > 1 && wq_dedicated(wq))
+		return -EBUSY;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->wq = wq;
+	filp->private_data = ctx;
+	idxd_wq_get(wq);
+	return 0;
+}
+
+static int idxd_cdev_release(struct inode *node, struct file *filep)
+{
+	struct idxd_user_context *ctx = filep->private_data;
+	struct idxd_wq *wq = ctx->wq;
+	struct idxd_device *idxd = wq->idxd;
+	struct device *dev = &idxd->pdev->dev;
+
+	dev_dbg(dev, "%s called\n", __func__);
+	filep->private_data = NULL;
+
+	kfree(ctx);
+	idxd_wq_put(wq);
+	return 0;
+}
+
+static int check_vma(struct idxd_wq *wq, struct vm_area_struct *vma,
+		     const char *func)
+{
+	struct device *dev = &wq->idxd->pdev->dev;
+
+	if ((vma->vm_end - vma->vm_start) > PAGE_SIZE) {
+		dev_info_ratelimited(dev,
+				     "%s: %s: mapping too large: %lu\n",
+				     current->comm, func,
+				     vma->vm_end - vma->vm_start);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int idxd_cdev_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct idxd_user_context *ctx = filp->private_data;
+	struct idxd_wq *wq = ctx->wq;
+	struct idxd_device *idxd = wq->idxd;
+	struct pci_dev *pdev = idxd->pdev;
+	phys_addr_t base = pci_resource_start(pdev, IDXD_WQ_BAR);
+	unsigned long pfn;
+	int rc;
+
+	dev_dbg(&pdev->dev, "%s called\n", __func__);
+	rc = check_vma(wq, vma, __func__);
+
+	vma->vm_flags |= VM_DONTCOPY;
+	pfn = (base + idxd_get_wq_portal_full_offset(wq->id,
+				IDXD_PORTAL_LIMITED)) >> PAGE_SHIFT;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_private_data = ctx;
+
+	return io_remap_pfn_range(vma, vma->vm_start, pfn, PAGE_SIZE,
+			vma->vm_page_prot);
+}
+
+static __poll_t idxd_cdev_poll(struct file *filp,
+			       struct poll_table_struct *wait)
+{
+	struct idxd_user_context *ctx = filp->private_data;
+	struct idxd_wq *wq = ctx->wq;
+	struct idxd_device *idxd = wq->idxd;
+	struct idxd_cdev *idxd_cdev = &wq->idxd_cdev;
+	unsigned long flags;
+	__poll_t out = 0;
+
+	poll_wait(filp, &idxd_cdev->err_queue, wait);
+	spin_lock_irqsave(&idxd->dev_lock, flags);
+	if (idxd->sw_err.valid)
+		out = EPOLLIN | EPOLLRDNORM;
+	spin_unlock_irqrestore(&idxd->dev_lock, flags);
+
+	return out;
+}
+
+static const struct file_operations idxd_cdev_fops = {
+	.owner = THIS_MODULE,
+	.open = idxd_cdev_open,
+	.release = idxd_cdev_release,
+	.mmap = idxd_cdev_mmap,
+	.poll = idxd_cdev_poll,
+};
+
+int idxd_cdev_get_major(struct idxd_device *idxd)
+{
+	return MAJOR(ictx[idxd->type].devt);
+}
+
+static int idxd_wq_cdev_dev_setup(struct idxd_wq *wq)
+{
+	struct idxd_device *idxd = wq->idxd;
+	struct idxd_cdev *idxd_cdev = &wq->idxd_cdev;
+	struct idxd_cdev_context *cdev_ctx;
+	struct device *dev;
+	int minor, rc;
+
+	idxd_cdev->dev = kzalloc(sizeof(*idxd_cdev->dev), GFP_KERNEL);
+	if (!idxd_cdev->dev)
+		return -ENOMEM;
+
+	dev = idxd_cdev->dev;
+	dev->parent = &idxd->pdev->dev;
+	dev_set_name(dev, "%s/wq%u.%u", idxd_get_dev_name(idxd),
+		     idxd->id, wq->id);
+	dev->bus = idxd_get_bus_type(idxd);
+
+	cdev_ctx = &ictx[wq->idxd->type];
+	minor = ida_simple_get(&cdev_ctx->minor_ida, 0, MINORMASK, GFP_KERNEL);
+	if (minor < 0) {
+		rc = minor;
+		goto ida_err;
+	}
+
+	dev->devt = MKDEV(MAJOR(cdev_ctx->devt), minor);
+	dev->type = &idxd_cdev_device_type;
+	rc = device_register(dev);
+	if (rc < 0) {
+		dev_err(&idxd->pdev->dev, "device register failed\n");
+		put_device(dev);
+		goto dev_reg_err;
+	}
+	idxd_cdev->minor = minor;
+
+	return 0;
+
+ dev_reg_err:
+	ida_simple_remove(&cdev_ctx->minor_ida, MINOR(dev->devt));
+ ida_err:
+	kfree(dev);
+	idxd_cdev->dev = NULL;
+	return rc;
+}
+
+static void idxd_wq_cdev_cleanup(struct idxd_wq *wq,
+				 enum idxd_cdev_cleanup cdev_state)
+{
+	struct idxd_cdev *idxd_cdev = &wq->idxd_cdev;
+	struct idxd_cdev_context *cdev_ctx;
+
+	cdev_ctx = &ictx[wq->idxd->type];
+	if (cdev_state == CDEV_NORMAL)
+		cdev_del(&idxd_cdev->cdev);
+	device_unregister(idxd_cdev->dev);
+	/*
+	 * The device_type->release() will be called on the device and free
+	 * the allocated struct device. We can just forget it.
+	 */
+	ida_simple_remove(&cdev_ctx->minor_ida, idxd_cdev->minor);
+	idxd_cdev->dev = NULL;
+	idxd_cdev->minor = -1;
+}
+
+int idxd_wq_add_cdev(struct idxd_wq *wq)
+{
+	struct idxd_cdev *idxd_cdev = &wq->idxd_cdev;
+	struct cdev *cdev = &idxd_cdev->cdev;
+	struct device *dev;
+	int rc;
+
+	rc = idxd_wq_cdev_dev_setup(wq);
+	if (rc < 0)
+		return rc;
+
+	dev = idxd_cdev->dev;
+	cdev_init(cdev, &idxd_cdev_fops);
+	cdev_set_parent(cdev, &dev->kobj);
+	rc = cdev_add(cdev, dev->devt, 1);
+	if (rc) {
+		dev_dbg(&wq->idxd->pdev->dev, "cdev_add failed: %d\n", rc);
+		idxd_wq_cdev_cleanup(wq, CDEV_FAILED);
+		return rc;
+	}
+
+	init_waitqueue_head(&idxd_cdev->err_queue);
+	idxd_wq_get(wq);
+	return 0;
+}
+
+void idxd_wq_del_cdev(struct idxd_wq *wq)
+{
+	idxd_wq_cdev_cleanup(wq, CDEV_NORMAL);
+	idxd_wq_put(wq);
+}
+
+int idxd_cdev_register(void)
+{
+	int rc, i;
+
+	for (i = 0; i < IDXD_TYPE_MAX; i++) {
+		ida_init(&ictx[i].minor_ida);
+		rc = alloc_chrdev_region(&ictx[i].devt, 0, MINORMASK,
+					 ictx[i].name);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+void idxd_cdev_remove(void)
+{
+	int i;
+
+	for (i = 0; i < IDXD_TYPE_MAX; i++) {
+		unregister_chrdev_region(ictx[i].devt, MINORMASK);
+		ida_destroy(&ictx[i].minor_ida);
+	}
+}
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 49638d3a2151..fd33e2985e95 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -545,7 +545,7 @@ static int idxd_wq_config_write(struct idxd_wq *wq)
 	wq->wqcfg.wq_thresh = wq->threshold;
 
 	/* byte 8-11 */
-	wq->wqcfg.priv = 1; /* kernel, therefore priv */
+	wq->wqcfg.priv = !!(wq->type == IDXD_WQT_KERNEL);
 	wq->wqcfg.mode = 1;
 
 	wq->wqcfg.priority = wq->priority;
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 7f94e50267cc..62ee01ab0901 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -7,6 +7,7 @@
 #include <linux/dmaengine.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/wait.h>
+#include <linux/cdev.h>
 #include "registers.h"
 
 #define IDXD_DRIVER_VERSION	"1.00"
@@ -63,6 +64,14 @@ enum idxd_wq_flag {
 enum idxd_wq_type {
 	IDXD_WQT_NONE = 0,
 	IDXD_WQT_KERNEL,
+	IDXD_WQT_USER,
+};
+
+struct idxd_cdev {
+	struct cdev cdev;
+	struct device *dev;
+	int minor;
+	struct wait_queue_head err_queue;
 };
 
 #define IDXD_ALLOCATED_BATCH_SIZE	128U
@@ -72,6 +81,7 @@ enum idxd_wq_type {
 struct idxd_wq {
 	void __iomem *dportal;
 	struct device conf_dev;
+	struct idxd_cdev idxd_cdev;
 	struct idxd_device *idxd;
 	int id;
 	enum idxd_wq_type type;
@@ -139,6 +149,7 @@ struct idxd_device {
 	enum idxd_device_state state;
 	unsigned long flags;
 	int id;
+	int major;
 
 	struct pci_dev *pdev;
 	void __iomem *reg_base;
@@ -191,11 +202,29 @@ struct idxd_desc {
 #define confdev_to_idxd(dev) container_of(dev, struct idxd_device, conf_dev)
 #define confdev_to_wq(dev) container_of(dev, struct idxd_wq, conf_dev)
 
+extern struct bus_type dsa_bus_type;
+
 static inline bool wq_dedicated(struct idxd_wq *wq)
 {
 	return test_bit(WQ_FLAG_DEDICATED, &wq->flags);
 }
 
+enum idxd_portal_prot {
+	IDXD_PORTAL_UNLIMITED = 0,
+	IDXD_PORTAL_LIMITED,
+};
+
+static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot)
+{
+	return prot * 0x1000;
+}
+
+static inline int idxd_get_wq_portal_full_offset(int wq_id,
+						 enum idxd_portal_prot prot)
+{
+	return ((wq_id * 4) << PAGE_SHIFT) + idxd_get_wq_portal_offset(prot);
+}
+
 static inline void idxd_set_type(struct idxd_device *idxd)
 {
 	struct pci_dev *pdev = idxd->pdev;
@@ -228,6 +257,7 @@ int idxd_setup_sysfs(struct idxd_device *idxd);
 void idxd_cleanup_sysfs(struct idxd_device *idxd);
 int idxd_register_driver(void);
 void idxd_unregister_driver(void);
+struct bus_type *idxd_get_bus_type(struct idxd_device *idxd);
 
 /* device interrupt control */
 irqreturn_t idxd_irq_handler(int vec, void *data);
@@ -266,4 +296,12 @@ void idxd_unregister_dma_device(struct idxd_device *idxd);
 int idxd_register_dma_channel(struct idxd_wq *wq);
 void idxd_unregister_dma_channel(struct idxd_wq *wq);
 void idxd_parse_completion_status(u8 status, enum dmaengine_tx_result *res);
+
+/* cdev */
+int idxd_cdev_register(void);
+void idxd_cdev_remove(void);
+int idxd_cdev_get_major(struct idxd_device *idxd);
+int idxd_wq_add_cdev(struct idxd_wq *wq);
+void idxd_wq_del_cdev(struct idxd_wq *wq);
+
 #endif
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index a030eaaf0ab1..224b079f6fdc 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -188,6 +188,7 @@ static int idxd_setup_internals(struct idxd_device *idxd)
 		mutex_init(&wq->wq_lock);
 		atomic_set(&wq->dq_count, 0);
 		init_waitqueue_head(&wq->submit_waitq);
+		wq->idxd_cdev.minor = -1;
 		rc = percpu_init_rwsem(&wq->submit_lock);
 		if (rc < 0) {
 			idxd_wqs_free_lock(idxd);
@@ -320,6 +321,8 @@ static int idxd_probe(struct idxd_device *idxd)
 		goto err_idr_fail;
 	}
 
+	idxd->major = idxd_cdev_get_major(idxd);
+
 	dev_dbg(dev, "IDXD device %d probed successfully\n", idxd->id);
 	return 0;
 
@@ -518,6 +521,10 @@ static int __init idxd_init_module(void)
 	if (err < 0)
 		goto err_idxd_driver_register;
 
+	err = idxd_cdev_register();
+	if (err)
+		goto err_cdev_register;
+
 	err = pci_register_driver(&idxd_pci_driver);
 	if (err)
 		goto err_pci_register;
@@ -525,6 +532,8 @@ static int __init idxd_init_module(void)
 	return 0;
 
 err_pci_register:
+	idxd_cdev_remove();
+err_cdev_register:
 	idxd_unregister_driver();
 err_idxd_driver_register:
 	idxd_unregister_bus_type();
@@ -535,6 +544,7 @@ module_init(idxd_init_module);
 static void __exit idxd_exit_module(void)
 {
 	pci_unregister_driver(&idxd_pci_driver);
+	idxd_cdev_remove();
 	idxd_unregister_bus_type();
 }
 module_exit(idxd_exit_module);
diff --git a/drivers/dma/idxd/irq.c b/drivers/dma/idxd/irq.c
index b4adeb2817d1..442f21eeb0fb 100644
--- a/drivers/dma/idxd/irq.c
+++ b/drivers/dma/idxd/irq.c
@@ -89,6 +89,24 @@ irqreturn_t idxd_misc_thread(int vec, void *data)
 			idxd->sw_err.bits[i] = ioread64(idxd->reg_base +
 					IDXD_SWERR_OFFSET + i * sizeof(u64));
 		iowrite64(IDXD_SWERR_ACK, idxd->reg_base + IDXD_SWERR_OFFSET);
+
+		if (idxd->sw_err.valid && idxd->sw_err.wq_idx_valid) {
+			int id = idxd->sw_err.wq_idx;
+			struct idxd_wq *wq = &idxd->wqs[id];
+
+			if (wq->type == IDXD_WQT_USER)
+				wake_up_interruptible(&wq->idxd_cdev.err_queue);
+		} else {
+			int i;
+
+			for (i = 0; i < idxd->max_wqs; i++) {
+				struct idxd_wq *wq = &idxd->wqs[i];
+
+				if (wq->type == IDXD_WQT_USER)
+					wake_up_interruptible(&wq->idxd_cdev.err_queue);
+			}
+		}
+
 		spin_unlock_bh(&idxd->dev_lock);
 		val |= IDXD_INTC_ERR;
 
diff --git a/drivers/dma/idxd/submit.c b/drivers/dma/idxd/submit.c
index f7baa1bbb0c7..0e6dea34edc9 100644
--- a/drivers/dma/idxd/submit.c
+++ b/drivers/dma/idxd/submit.c
@@ -71,17 +71,19 @@ static int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc,
 {
 	struct idxd_device *idxd = wq->idxd;
 	int vec = desc->hw->int_handle;
+	void __iomem *portal;
 
 	if (idxd->state != IDXD_DEV_ENABLED)
 		return -EIO;
 
+	portal = wq->dportal + idxd_get_wq_portal_offset(IDXD_PORTAL_UNLIMITED);
 	/*
 	 * The wmb() flushes writes to coherent DMA data before possibly
 	 * triggering a DMA read. The wmb() is necessary even on UP because
 	 * the recipient is a device.
 	 */
 	wmb();
-	iosubmit_cmds512(wq->dportal, desc->hw, 1);
+	iosubmit_cmds512(portal, desc->hw, 1);
 
 	/*
 	 * Pending the descriptor to the lockless list for the irq_entry
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index a7468d22c287..81a41a2b84ce 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -13,6 +13,7 @@
 static char *idxd_wq_type_names[] = {
 	[IDXD_WQT_NONE]		= "none",
 	[IDXD_WQT_KERNEL]	= "kernel",
+	[IDXD_WQT_USER]		= "user",
 };
 
 static void idxd_conf_device_release(struct device *dev)
@@ -63,6 +64,11 @@ static inline bool is_idxd_wq_dmaengine(struct idxd_wq *wq)
 	return false;
 }
 
+static inline bool is_idxd_wq_cdev(struct idxd_wq *wq)
+{
+	return wq->type == IDXD_WQT_USER ? true : false;
+}
+
 static int idxd_config_bus_match(struct device *dev,
 				 struct device_driver *drv)
 {
@@ -109,6 +115,9 @@ static int idxd_config_bus_probe(struct device *dev)
 			return -EBUSY;
 		}
 
+		if (!try_module_get(THIS_MODULE))
+			return -ENXIO;
+
 		spin_lock_irqsave(&idxd->dev_lock, flags);
 
 		/* Perform IDXD configuration and enabling */
@@ -216,6 +225,13 @@ static int idxd_config_bus_probe(struct device *dev)
 				mutex_unlock(&wq->wq_lock);
 				return rc;
 			}
+		} else if (is_idxd_wq_cdev(wq)) {
+			rc = idxd_wq_add_cdev(wq);
+			if (rc < 0) {
+				dev_dbg(dev, "Cdev creation failed\n");
+				mutex_unlock(&wq->wq_lock);
+				return rc;
+			}
 		}
 
 		mutex_unlock(&wq->wq_lock);
@@ -241,6 +257,8 @@ static void disable_wq(struct idxd_wq *wq)
 
 	if (is_idxd_wq_dmaengine(wq))
 		idxd_unregister_dma_channel(wq);
+	else if (is_idxd_wq_cdev(wq))
+		idxd_wq_del_cdev(wq);
 
 	if (idxd_wq_refcount(wq))
 		dev_warn(dev, "Clients has claim on wq %d: %d\n",
@@ -295,10 +313,12 @@ static int idxd_config_bus_remove(struct device *dev)
 		spin_lock_irqsave(&idxd->dev_lock, flags);
 		rc = idxd_device_disable(idxd);
 		spin_unlock_irqrestore(&idxd->dev_lock, flags);
+		module_put(THIS_MODULE);
 		if (rc < 0)
 			dev_warn(dev, "Device disable failed\n");
 		else
 			dev_info(dev, "Device %s disabled\n", dev_name(dev));
+
 	}
 
 	return 0;
@@ -309,7 +329,7 @@ static void idxd_config_bus_shutdown(struct device *dev)
 	dev_dbg(dev, "%s called\n", __func__);
 }
 
-static struct bus_type dsa_bus_type = {
+struct bus_type dsa_bus_type = {
 	.name = "dsa",
 	.match = idxd_config_bus_match,
 	.probe = idxd_config_bus_probe,
@@ -334,7 +354,7 @@ static struct idxd_device_driver *idxd_drvs[] = {
 	&dsa_drv
 };
 
-static struct bus_type *idxd_get_bus_type(struct idxd_device *idxd)
+struct bus_type *idxd_get_bus_type(struct idxd_device *idxd)
 {
 	return idxd_bus_types[idxd->type];
 }
@@ -938,6 +958,9 @@ static ssize_t wq_type_show(struct device *dev,
 	case IDXD_WQT_KERNEL:
 		return sprintf(buf, "%s\n",
 			       idxd_wq_type_names[IDXD_WQT_KERNEL]);
+	case IDXD_WQT_USER:
+		return sprintf(buf, "%s\n",
+			       idxd_wq_type_names[IDXD_WQT_USER]);
 	case IDXD_WQT_NONE:
 	default:
 		return sprintf(buf, "%s\n",
@@ -960,6 +983,8 @@ static ssize_t wq_type_store(struct device *dev,
 	old_type = wq->type;
 	if (sysfs_streq(buf, idxd_wq_type_names[IDXD_WQT_KERNEL]))
 		wq->type = IDXD_WQT_KERNEL;
+	else if (sysfs_streq(buf, idxd_wq_type_names[IDXD_WQT_USER]))
+		wq->type = IDXD_WQT_USER;
 	else
 		wq->type = IDXD_WQT_NONE;
 
@@ -1002,6 +1027,17 @@ static ssize_t wq_name_store(struct device *dev,
 static struct device_attribute dev_attr_wq_name =
 		__ATTR(name, 0644, wq_name_show, wq_name_store);
 
+static ssize_t wq_cdev_minor_show(struct device *dev,
+				  struct device_attribute *attr, char *buf)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+
+	return sprintf(buf, "%d\n", wq->idxd_cdev.minor);
+}
+
+static struct device_attribute dev_attr_wq_cdev_minor =
+		__ATTR(cdev_minor, 0444, wq_cdev_minor_show, NULL);
+
 static struct attribute *idxd_wq_attributes[] = {
 	&dev_attr_wq_clients.attr,
 	&dev_attr_wq_state.attr,
@@ -1011,6 +1047,7 @@ static struct attribute *idxd_wq_attributes[] = {
 	&dev_attr_wq_priority.attr,
 	&dev_attr_wq_type.attr,
 	&dev_attr_wq_name.attr,
+	&dev_attr_wq_cdev_minor.attr,
 	NULL,
 };
 
@@ -1224,6 +1261,16 @@ static ssize_t token_limit_store(struct device *dev,
 }
 static DEVICE_ATTR_RW(token_limit);
 
+static ssize_t cdev_major_show(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+
+	return sprintf(buf, "%u\n", idxd->major);
+}
+static DEVICE_ATTR_RO(cdev_major);
+
 static struct attribute *idxd_device_attributes[] = {
 	&dev_attr_max_groups.attr,
 	&dev_attr_max_work_queues.attr,
@@ -1239,6 +1286,7 @@ static struct attribute *idxd_device_attributes[] = {
 	&dev_attr_errors.attr,
 	&dev_attr_max_tokens.attr,
 	&dev_attr_token_limit.attr,
+	&dev_attr_cdev_major.attr,
 	NULL,
 };
 

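The portal address arithmetic added to idxd.h in this patch can be
checked in isolation. The following is a userspace sketch under stated
assumptions, not driver code: it assumes 4 KiB pages (PAGE_SHIFT of 12),
and the names wq_portal_offset, SKETCH_PAGE_SHIFT and
SKETCH_PORTALS_PER_WQ are made up for the illustration. It mirrors
idxd_get_wq_portal_full_offset(): each work queue owns four portal
pages, with the unlimited portal in the first page and the limited
portal in the second.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define SKETCH_PORTALS_PER_WQ	4	/* from the "wq_id * 4" term in the patch */

enum portal_prot {
	PORTAL_UNLIMITED = 0,
	PORTAL_LIMITED = 1,
};

/* Byte offset of a portal page from the start of the work queue BAR. */
static uint64_t wq_portal_offset(unsigned int wq_id, enum portal_prot prot)
{
	uint64_t wq_base;

	assert(prot == PORTAL_UNLIMITED || prot == PORTAL_LIMITED);

	/* Start of this work queue's block of four portal pages. */
	wq_base = ((uint64_t)wq_id * SKETCH_PORTALS_PER_WQ) << SKETCH_PAGE_SHIFT;

	/* Same 0x1000 stride per portal type as in the patch. */
	return wq_base + (uint64_t)prot * 0x1000;
}
```

With these assumptions, the limited portal of work queue 3 sits at
offset 0xd000, which is the page that idxd_cdev_mmap() would hand to
user space for that queue.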


* [PATCH RFC 14/14] dmaengine: idxd: add sysfs ABI for idxd driver
  2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
                   ` (12 preceding siblings ...)
  2019-11-20 21:25 ` [PATCH RFC 13/14] dmaengine: idxd: add char driver to expose submission portal to userland Dave Jiang
@ 2019-11-20 21:25 ` Dave Jiang
  13 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 21:25 UTC (permalink / raw)
  To: dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

From: Jing Lin <jing.lin@intel.com>

Add the sysfs ABI information for the idxd driver in the
Documentation/ABI/stable directory.

Signed-off-by: Jing Lin <jing.lin@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 Documentation/ABI/stable/sysfs-driver-dma-idxd |  171 ++++++++++++++++++++++++
 1 file changed, 171 insertions(+)
 create mode 100644 Documentation/ABI/stable/sysfs-driver-dma-idxd

diff --git a/Documentation/ABI/stable/sysfs-driver-dma-idxd b/Documentation/ABI/stable/sysfs-driver-dma-idxd
new file mode 100644
index 000000000000..f4be46cc6cb6
--- /dev/null
+++ b/Documentation/ABI/stable/sysfs-driver-dma-idxd
@@ -0,0 +1,171 @@
+What:           sys/bus/dsa/devices/dsa<m>/cdev_major
+Date:           Oct 25, 2019
+KernelVersion: 	5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:	The major number that the character device driver assigned to
+		this device.
+
+What:           sys/bus/dsa/devices/dsa<m>/errors
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The error information for this device.
+
+What:           sys/bus/dsa/devices/dsa<m>/max_batch_size
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The largest number of work descriptors in a batch.
+
+What:           sys/bus/dsa/devices/dsa<m>/max_work_queues_size
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The maximum work queue size supported by this device.
+
+What:           sys/bus/dsa/devices/dsa<m>/max_engines
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The maximum number of engines supported by this device.
+
+What:           sys/bus/dsa/devices/dsa<m>/max_groups
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The maximum number of groups that can be created under this device.
+
+What:           sys/bus/dsa/devices/dsa<m>/max_tokens
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The total number of bandwidth tokens supported by this device.
+		The bandwidth tokens represent resources within the DSA
+		implementation, and these resources are allocated by engines to
+		support operations.
+
+What:           sys/bus/dsa/devices/dsa<m>/max_transfer_size
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The number of bytes to be read from the source address to
+		perform the operation. The maximum transfer size is dependent on
+		the workqueue the descriptor was submitted to.
+
+What:           sys/bus/dsa/devices/dsa<m>/max_work_queues
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The maximum number of work queues that this device supports.
+
+What:           sys/bus/dsa/devices/dsa<m>/numa_node
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The numa node number for this device.
+
+What:           sys/bus/dsa/devices/dsa<m>/op_cap
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The operation capability bit mask specifies the operation types
+		supported by this device.
+
+What:           sys/bus/dsa/devices/dsa<m>/state
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The state information of this device. It can be either enabled
+		or disabled.
+
+What:           sys/bus/dsa/devices/dsa<m>/group<m>.<n>
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The assigned group under this device.
+
+What:           sys/bus/dsa/devices/dsa<m>/engine<m>.<n>
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The assigned engine under this device.
+
+What:           sys/bus/dsa/devices/dsa<m>/wq<m>.<n>
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The assigned work queue under this device.
+
+What:           sys/bus/dsa/devices/dsa<m>/configurable
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    Indicates whether this device is configurable.
+
+What:           sys/bus/dsa/devices/dsa<m>/token_limit
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The maximum number of bandwidth tokens that may be in use at
+		one time by operations that access low bandwidth memory in the
+		device.
+
+What:           sys/bus/dsa/devices/wq<m>.<n>/group_id
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The group id that this work queue belongs to.
+
+What:           sys/bus/dsa/devices/wq<m>.<n>/size
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The work queue size for this work queue.
+
+What:           sys/bus/dsa/devices/wq<m>.<n>/type
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The type of this work queue. It can be "kernel" for work queues
+		used in kernel space or "user" for work queues used by
+		applications in user space.
+
+What:           sys/bus/dsa/devices/wq<m>.<n>/cdev_minor
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The minor number assigned to this work queue by the character
+		device driver.
+
+What:           sys/bus/dsa/devices/wq<m>.<n>/mode
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The work queue mode type for this work queue.
+
+What:           sys/bus/dsa/devices/wq<m>.<n>/priority
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The priority value of this work queue. It is a value relative to
+		other work queues in the same group to control quality of service
+		for dispatching work from multiple workqueues in the same group.
+
+What:           sys/bus/dsa/devices/wq<m>.<n>/state
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The current state of the work queue.
+
+What:           sys/bus/dsa/devices/wq<m>.<n>/threshold
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The number of entries in this work queue that may be filled
+		via a limited portal.
+
+What:           sys/bus/dsa/devices/engine<m>.<n>/group_id
+Date:           Oct 25, 2019
+KernelVersion:  5.6.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The group that this engine belongs to.



* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-20 21:23 ` [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction Dave Jiang
@ 2019-11-20 21:50   ` Dave Hansen
  2019-11-20 23:46     ` Dave Jiang
  2019-11-20 21:53   ` Borislav Petkov
  1 sibling, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2019-11-20 21:50 UTC (permalink / raw)
  To: Dave Jiang, dmaengine, linux-kernel, vkoul
  Cc: dan.j.williams, tony.luck, jing.lin, ashok.raj, sanjay.k.kumar,
	megha.dey, jacob.jun.pan, yi.l.liu, axboe, akpm, tglx, mingo, bp,
	fenghua.yu, hpa

On 11/20/19 1:23 PM, Dave Jiang wrote:
> +static inline void __iowrite512(void __iomem *__dst, const void *src)
> +{
> +	volatile struct { char _[64]; } *dst = __dst;

This _looks_ like gibberish.  I know it's not, but it is subtle enough
that it really needs specific comments.

> +static inline void iosubmit_cmds512(void __iomem *dst, const void *src,
> +				    size_t count)
> +{
> +	const u8 *from = src;
> +	const u8 *end = from + count * 64;
> +
> +	if (!cpu_has_write512())
> +		return;
> +
> +	while (from < end) {
> +		__iowrite512(dst, from);
> +		from += 64;
> +	}
> +}

Won't this silently just drop things if the CPU doesn't have movdir64b
support?

It seems like this shouldn't be called at all if
!cpu_has_write512(), but wouldn't something like this be more appropriate?

	if (!cpu_has_write512()) {
		WARN_ON_ONCE(1);
		return;
	}

Is the caller just supposed to infer that "dst" was never overwritten?


* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-20 21:23 ` [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction Dave Jiang
  2019-11-20 21:50   ` Dave Hansen
@ 2019-11-20 21:53   ` Borislav Petkov
  2019-11-20 23:19     ` Luck, Tony
  2019-11-21  0:10     ` Dave Jiang
  1 sibling, 2 replies; 35+ messages in thread
From: Borislav Petkov @ 2019-11-20 21:53 UTC (permalink / raw)
  To: Dave Jiang
  Cc: dmaengine, linux-kernel, vkoul, dan.j.williams, tony.luck,
	jing.lin, ashok.raj, sanjay.k.kumar, megha.dey, jacob.jun.pan,
	yi.l.liu, axboe, akpm, tglx, mingo, fenghua.yu, hpa

On Wed, Nov 20, 2019 at 02:23:49PM -0700, Dave Jiang wrote:
> +/**
> + * iosubmit_cmds512 - copy data to single MMIO location, in 512-bit units

Where is the alignment check on that data before doing the copying?

> + * @dst: destination, in MMIO space (must be 512-bit aligned)
> + * @src: source
> + * @count: number of 512 bits quantities to submit

Where's that check on the data?

> + *
> + * Submit data from kernel space to MMIO space, in units of 512 bits at a
> + * time.  Order of access is not guaranteed, nor is a memory barrier
> + * performed afterwards.
> + */
> +static inline void iosubmit_cmds512(void __iomem *dst, const void *src,
> +				    size_t count)

An iosubmit function which returns void and doesn't tell its callers
whether it succeeded or not? That looks non-optimal to say the least.

Why isn't there a fallback function to call when the CPU doesn't
support movdir64b?

Because then you can use alternative_call() and have the thing work
regardless of hardware support for MOVDIR*.

> +{
> +	const u8 *from = src;
> +	const u8 *end = from + count * 64;
> +
> +	if (!cpu_has_write512())

If anything, that thing needs to go and you should use

  static_cpu_has(X86_FEATURE_MOVDIR64B)

as it looks to me like you would care about speed on this fast path?
Yes, no?
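
To make the review points above concrete (an alignment check on the
destination and a return value for the caller), here is a userspace
sketch. It is an illustration only and not a proposed implementation:
submit_cmds512_checked() and mock_write512() are hypothetical names,
memcpy() stands in for the MOVDIR64B store so that the sketch runs on
any machine, and the feature test is passed in as a plain flag. The
underlying write is posted, so a return value of 0 can only mean that
the commands were issued, not that the device accepted them.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CMD_SIZE 64	/* 512 bits */

/*
 * Hypothetical stand-in for a 64-byte direct store. Real code would
 * issue MOVDIR64B; memcpy() is used only so the sketch runs anywhere.
 */
static void mock_write512(void *dst, const void *src)
{
	memcpy(dst, src, CMD_SIZE);
}

/*
 * Submit @count 64-byte commands to the single location @dst.
 * Returns 0 once the commands have been issued, -ENODEV if the feature
 * is missing, and -EINVAL if @dst is not 64-byte aligned.
 */
static int submit_cmds512_checked(void *dst, const void *src, size_t count,
				  bool have_write512)
{
	const uint8_t *from = src;
	const uint8_t *end = from + count * CMD_SIZE;

	assert(dst != NULL && src != NULL);

	if (!have_write512)
		return -ENODEV;
	if ((uintptr_t)dst % CMD_SIZE)
		return -EINVAL;

	/* Every command goes to the same destination, as in the patch. */
	while (from < end) {
		mock_write512(dst, from);
		from += CMD_SIZE;
	}
	return 0;
}
```

In kernel code the flag would presumably be replaced by a feature test
such as the static_cpu_has() check suggested above.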

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-20 21:53   ` Borislav Petkov
@ 2019-11-20 23:19     ` Luck, Tony
  2019-11-20 23:26       ` Borislav Petkov
                         ` (2 more replies)
  2019-11-21  0:10     ` Dave Jiang
  1 sibling, 3 replies; 35+ messages in thread
From: Luck, Tony @ 2019-11-20 23:19 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Jiang, dmaengine, linux-kernel, vkoul, dan.j.williams,
	jing.lin, ashok.raj, sanjay.k.kumar, megha.dey, jacob.jun.pan,
	yi.l.liu, axboe, akpm, tglx, mingo, fenghua.yu, hpa

On Wed, Nov 20, 2019 at 10:53:39PM +0100, Borislav Petkov wrote:
> On Wed, Nov 20, 2019 at 02:23:49PM -0700, Dave Jiang wrote:
> > +static inline void iosubmit_cmds512(void __iomem *dst, const void *src,
> > +				    size_t count)
> 
> An iosubmit function which returns void and doesn't tell its callers
> whether it succeeded or not? That looks non-optimal to say the least.

That's the underlying functionality of the MOVDIR64B instruction. A
posted write so no way to know if it succeeded. When using dedicated
queues the caller must keep count of how many operations are in flight
and not send more than the depth of the queue.
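
The bookkeeping described above can be sketched as a small credit
counter. This is a userspace illustration using C11 atomics, not code
from the series; struct wq_credits and its helpers are hypothetical
names, and the queue depth is whatever size the dedicated queue was
configured with.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct wq_credits {
	atomic_int in_flight;	/* descriptors submitted but not yet completed */
	int depth;		/* configured size of the dedicated queue */
};

static void wq_credits_init(struct wq_credits *c, int depth)
{
	assert(depth > 0);
	atomic_init(&c->in_flight, 0);
	c->depth = depth;
}

/* Claim a slot before writing a descriptor to the portal. */
static bool wq_credits_try_get(struct wq_credits *c)
{
	int cur = atomic_load(&c->in_flight);

	while (cur < c->depth) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&c->in_flight, &cur, cur + 1))
			return true;
	}
	return false;	/* queue full: wait for a completion first */
}

/* Return a slot once the completion for a descriptor has been seen. */
static void wq_credits_put(struct wq_credits *c)
{
	int prev = atomic_fetch_sub(&c->in_flight, 1);

	assert(prev > 0);
	(void)prev;
}
```

A submitter would call wq_credits_try_get() before each posted write
and back off when it returns false, so the device never sees more
descriptors than the queue can hold.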

> Why isn't there a fallback function to call when the CPU doesn't
> support movdir64b?

This particular driver has no option for fallback. Descriptors can
only be submitted with MOVDIR64B (to dedicated queues ... in later
patch series support for shared queues will be added, but those require
ENQCMD or ENQCMDS to submit).

The driver bails out at the beginning of the probe routine if the
necessary instructions are not supported:

+       /*
+        * If the CPU does not support write512, there's no point in
+        * enumerating the device. We can not utilize it.
+        */
+       if (!cpu_has_write512())
+               return -ENXIO;

Though we should always get past that as this PCI device ID shouldn't
ever appear on a system that doesn't have the support. Device is on
the die, not a plug-in card.
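
For reference, MOVDIR64B support is enumerated in CPUID leaf 7,
subleaf 0, ECX bit 28. A minimal userspace sketch of decoding that bit
and bailing out early, in the spirit of the probe check above, might look
like this. has_movdir64b() and demo_probe() are hypothetical helpers, and
the caller is assumed to have read the ECX value itself (for example with
__get_cpuid_count() from <cpuid.h> on GCC or Clang). In the kernel the
corresponding test is the X86_FEATURE_MOVDIR64B feature flag.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0):ECX bit 28 enumerates MOVDIR64B. */
#define CPUID7_ECX_MOVDIR64B	(UINT32_C(1) << 28)

/* Decode the feature bit from an already-read CPUID leaf 7 ECX value. */
static inline bool has_movdir64b(uint32_t cpuid7_ecx)
{
	return (cpuid7_ecx & CPUID7_ECX_MOVDIR64B) != 0;
}

/* Probe-time check: refuse the device if the instruction is missing. */
static inline int demo_probe(uint32_t cpuid7_ecx)
{
	if (!has_movdir64b(cpuid7_ecx))
		return -ENXIO;
	return 0;
}
```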

> Because then you can use alternative_call() and have the thing work
> regardless of hardware support for MOVDIR*.
> 
> > +{
> > +	const u8 *from = src;
> > +	const u8 *end = from + count * 64;
> > +
> > +	if (!cpu_has_write512())
> 
> If anything, that thing needs to go and you should use
> 
>   static_cpu_has(X86_FEATURE_MOVDIR64B)
> 
> as it looks to me like you would care about speed on this fast path?
> Yes, no?

That might be better.

-Tony

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-20 23:19     ` Luck, Tony
@ 2019-11-20 23:26       ` Borislav Petkov
  2019-11-21  0:15         ` Luck, Tony
  2019-11-21  0:27         ` Dan Williams
  2019-11-21  0:21       ` Dan Williams
  2019-11-21  0:22       ` Thomas Gleixner
  2 siblings, 2 replies; 35+ messages in thread
From: Borislav Petkov @ 2019-11-20 23:26 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Dave Jiang, dmaengine, linux-kernel, vkoul, dan.j.williams,
	jing.lin, ashok.raj, sanjay.k.kumar, megha.dey, jacob.jun.pan,
	yi.l.liu, axboe, akpm, tglx, mingo, fenghua.yu, hpa

On Wed, Nov 20, 2019 at 03:19:23PM -0800, Luck, Tony wrote:
> That's the underlying functionality of the MOVDIR64B instruction. A
> posted write so no way to know if it succeeded.

So how do you know whether any of the writes went through?

> When using dedicated queues the caller must keep count of how many
> operations are in flight and not send more than the depth of the
> queue.

This way?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-20 21:50   ` Dave Hansen
@ 2019-11-20 23:46     ` Dave Jiang
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-20 23:46 UTC (permalink / raw)
  To: Hansen, Dave, dmaengine, linux-kernel, vkoul
  Cc: Williams, Dan J, Luck, Tony, Lin, Jing, Raj, Ashok, Kumar,
	Sanjay K, Dey, Megha, Pan, Jacob jun, Liu, Yi L, axboe, akpm,
	tglx, mingo, bp, Yu, Fenghua, hpa



On 11/20/19 2:50 PM, Hansen, Dave wrote:
> On 11/20/19 1:23 PM, Dave Jiang wrote:
>> +static inline void __iowrite512(void __iomem *__dst, const void *src)
>> +{
>> +	volatile struct { char _[64]; } *dst = __dst;
> 
> This _looks_ like gibberish.  I know it's not, but it is subtle enough
> that it really needs specific comments.

I'll add comments explaining.

> 
>> +static inline void iosubmit_cmds512(void __iomem *dst, const void *src,
>> +				    size_t count)
>> +{
>> +	const u8 *from = src;
>> +	const u8 *end = from + count * 64;
>> +
>> +	if (!cpu_has_write512())
>> +		return;
>> +
>> +	while (from < end) {
>> +		__iowrite512(dst, from);
>> +		from += 64;
>> +	}
>> +}
> 
> Won't this silently just drop things if the CPU doesn't have movdir64b
> support?
> 
> It seems like this shouldn't be called at all if
> !cpu_has_write512(), but wouldn't something like this be mroe appropriate?
> 
> 	if (!cpu_has_write512()) {
> 		WARN_ON_ONCE(1);
> 		return;
> 	}
> 
> Is the caller just supposed to infer that "dst" was never overwritten?
> 

Thanks. I'll add the WARN().

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-20 21:53   ` Borislav Petkov
  2019-11-20 23:19     ` Luck, Tony
@ 2019-11-21  0:10     ` Dave Jiang
  2019-11-21 10:59       ` Borislav Petkov
  1 sibling, 1 reply; 35+ messages in thread
From: Dave Jiang @ 2019-11-21  0:10 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: dmaengine, linux-kernel, vkoul, Williams, Dan J, Luck, Tony, Lin,
	Jing, Raj, Ashok, Kumar, Sanjay K, Dey, Megha, Pan, Jacob jun,
	Liu, Yi L, axboe, akpm, tglx, mingo, Yu, Fenghua, hpa



On 11/20/19 2:53 PM, Borislav Petkov wrote:
> On Wed, Nov 20, 2019 at 02:23:49PM -0700, Dave Jiang wrote:
>> +/**
>> + * iosubmit_cmds512 - copy data to single MMIO location, in 512-bit units
> 
> Where is the alignment check on that data before doing the copying?

I'll add the check on the destination address. The call is modeled after 
__iowrite64_copy() / __iowrite32_copy() in lib/iomap_copy.c. Looks like 
those functions do not check for the alignment requirements either.

> 
>> + * @dst: destination, in MMIO space (must be 512-bit aligned)
>> + * @src: source
>> + * @count: number of 512 bits quantities to submit
> 
> Where's that check on the data?

I don't follow?

> 
>> + *
>> + * Submit data from kernel space to MMIO space, in units of 512 bits at a
>> + * time.  Order of access is not guaranteed, nor is a memory barrier
>> + * performed afterwards.
>> + */
>> +static inline void iosubmit_cmds512(void __iomem *dst, const void *src,
>> +				    size_t count)
> 
> An iosubmit function which returns void and doesn't tell its callers
> whether it succeeded or not? That looks non-optimal to say the least.
> 
> Why isn't there a fallback function which to call when the CPU doesn't
> support movdir64b?
> 
> Because then you can use alternative_call() and have the thing work
> regardless of hardware support for MOVDIR*.

Looks like Tony answered this part.

> 
>> +{
>> +	const u8 *from = src;
>> +	const u8 *end = from + count * 64;
>> +
>> +	if (!cpu_has_write512())
> 
> If anything, that thing needs to go and you should use
> 
>    static_cpu_has(X86_FEATURE_MOVDIR64B)
> 
> as it looks to me like you would care about speed on this fast path?
> Yes, no?
> 

Yes thank you!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-20 23:26       ` Borislav Petkov
@ 2019-11-21  0:15         ` Luck, Tony
  2019-11-21  0:27         ` Dan Williams
  1 sibling, 0 replies; 35+ messages in thread
From: Luck, Tony @ 2019-11-21  0:15 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Jiang, Dave, dmaengine, linux-kernel, vkoul, Williams, Dan J,
	Lin, Jing, Raj, Ashok, Kumar, Sanjay K, Dey, Megha, Pan,
	Jacob jun, Liu, Yi L, axboe, akpm, tglx, mingo, Yu, Fenghua, hpa

>> When using dedicated queues the caller must keep count of how many
>> operations are in flight and not send more than the depth of the
>> queue.
>
> This way?

That's the only practical way. The device does keep a count of dropped
attempts ... so in theory you could go read that ... but that would give up
much of the value proposition of low cost to submit work if you had to do
an MMIO read after every submission.

-Tony

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-20 23:19     ` Luck, Tony
  2019-11-20 23:26       ` Borislav Petkov
@ 2019-11-21  0:21       ` Dan Williams
  2019-11-21  0:22       ` Thomas Gleixner
  2 siblings, 0 replies; 35+ messages in thread
From: Dan Williams @ 2019-11-21  0:21 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Dave Jiang, dmaengine,
	Linux Kernel Mailing List, Vinod Koul, Jing Lin, Raj, Ashok,
	Kumar, Sanjay K, Dey, Megha, Pan, Jacob jun, Liu, Yi L,
	Jens Axboe, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Fenghua Yu, H. Peter Anvin

On Wed, Nov 20, 2019 at 3:19 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> On Wed, Nov 20, 2019 at 10:53:39PM +0100, Borislav Petkov wrote:
> > On Wed, Nov 20, 2019 at 02:23:49PM -0700, Dave Jiang wrote:
> > > +static inline void iosubmit_cmds512(void __iomem *dst, const void *src,
> > > +                               size_t count)
> >
> > An iosubmit function which returns void and doesn't tell its callers
> > whether it succeeded or not? That looks non-optimal to say the least.
>
> That's the underlying functionality of the MOVDIR64B instruction. A
> posted write so no way to know if it succeeded. When using dedicated
> queues the caller must keep count of how many operations are in flight
> and not send more than the depth of the queue.
>
> > Why isn't there a fallback function which to call when the CPU doesn't
> > support movdir64b?
>
> This particular driver has no option for fallback. Descriptors can
> only be submitted with MOVDIR64B (to dedicated queues ... in later
> patch series support for shared queues will be added, but those require
> ENQCMD or ENQCMDS to submit).

I think
>
> The driver bails out at the beginning of the probe routine if the
> necessary instructions are not supported:
>
> +       /*
> +        * If the CPU does not support write512, there's no point in
> +        * enumerating the device. We can not utilize it.
> +        */
> +       if (!cpu_has_write512())
> +               return -ENXIO;
>
> Though we should always get past that as this PCI device ID shouldn't
> every appear on a system that doesn't have the support. Device is on
> the die, not a plug-in card.
>
> > Because then you can use alternative_call() and have the thing work
> > regardless of hardware support for MOVDIR*.
> >
> > > +{
> > > +   const u8 *from = src;
> > > +   const u8 *end = from + count * 64;
> > > +
> > > +   if (!cpu_has_write512())
> >
> > If anything, that thing needs to go and you should use
> >
> >   static_cpu_has(X86_FEATURE_MOVDIR64B)
> >
> > as it looks to me like you would care about speed on this fast path?
> > Yes, no?
>
> That might be a better.

It's meant to be identical.

The expectation was that cpu_has_write512() could be used generically
in drivers like the pmem driver that have a use for movdir64b outside
of the DSA command submission use case. On x86 it would just be #define'd
to static_cpu_has(X86_FEATURE_MOVDIR64B), on other archs something
else in the future. For pmem if cpu_has_write512() is false it falls
back to talking to platform firmware for error management. Case1 from
the changelog.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-20 23:19     ` Luck, Tony
  2019-11-20 23:26       ` Borislav Petkov
  2019-11-21  0:21       ` Dan Williams
@ 2019-11-21  0:22       ` Thomas Gleixner
  2019-11-21  0:27         ` Dave Jiang
  2 siblings, 1 reply; 35+ messages in thread
From: Thomas Gleixner @ 2019-11-21  0:22 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Dave Jiang, dmaengine, linux-kernel, vkoul,
	dan.j.williams, jing.lin, ashok.raj, sanjay.k.kumar, megha.dey,
	jacob.jun.pan, yi.l.liu, axboe, akpm, mingo, fenghua.yu, hpa

On Wed, 20 Nov 2019, Luck, Tony wrote:
> On Wed, Nov 20, 2019 at 10:53:39PM +0100, Borislav Petkov wrote:
> > On Wed, Nov 20, 2019 at 02:23:49PM -0700, Dave Jiang wrote:
> > > +static inline void iosubmit_cmds512(void __iomem *dst, const void *src,
> > > +				    size_t count)
> > 
> > An iosubmit function which returns void and doesn't tell its callers
> > whether it succeeded or not? That looks non-optimal to say the least.
> 
> That's the underlying functionality of the MOVDIR64B instruction. A
> posted write so no way to know if it succeeded. When using dedicated
> queues the caller must keep count of how many operations are in flight
> and not send more than the depth of the queue.
> 
> > Why isn't there a fallback function which to call when the CPU doesn't
> > support movdir64b?
> 
> This particular driver has no option for fallback. Descriptors can
> only be submitted with MOVDIR64B (to dedicated queues ... in later
> patch series support for shared queues will be added, but those require
> ENQCMD or ENQCMDS to submit).
> 
> The driver bails out at the beginning of the probe routine if the
> necessary instructions are not supported:
> 
> +       /*
> +        * If the CPU does not support write512, there's no point in
> +        * enumerating the device. We can not utilize it.
> +        */
> +       if (!cpu_has_write512())
> +               return -ENXIO;
> 
> Though we should always get past that as this PCI device ID shouldn't
> every appear on a system that doesn't have the support. Device is on
> the die, not a plug-in card.

Then the condition in the iosubmit function is just prone to silently paper
over any bug in a driver:

> +       if (!cpu_has_write512())
> +               return;

This should at least issue a WARN_ON_ONCE()

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-21  0:22       ` Thomas Gleixner
@ 2019-11-21  0:27         ` Dave Jiang
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Jiang @ 2019-11-21  0:27 UTC (permalink / raw)
  To: Thomas Gleixner, Luck, Tony
  Cc: Borislav Petkov, dmaengine, linux-kernel, vkoul, Williams, Dan J,
	Lin, Jing, Raj, Ashok, Kumar, Sanjay K, Dey, Megha, Pan,
	Jacob jun, Liu, Yi L, axboe, akpm, mingo, Yu, Fenghua, hpa



On 11/20/19 5:22 PM, Thomas Gleixner wrote:
> On Wed, 20 Nov 2019, Luck, Tony wrote:
>> On Wed, Nov 20, 2019 at 10:53:39PM +0100, Borislav Petkov wrote:
>>> On Wed, Nov 20, 2019 at 02:23:49PM -0700, Dave Jiang wrote:
>>>> +static inline void iosubmit_cmds512(void __iomem *dst, const void *src,
>>>> +				    size_t count)
>>>
>>> An iosubmit function which returns void and doesn't tell its callers
>>> whether it succeeded or not? That looks non-optimal to say the least.
>>
>> That's the underlying functionality of the MOVDIR64B instruction. A
>> posted write so no way to know if it succeeded. When using dedicated
>> queues the caller must keep count of how many operations are in flight
>> and not send more than the depth of the queue.
>>
>>> Why isn't there a fallback function which to call when the CPU doesn't
>>> support movdir64b?
>>
>> This particular driver has no option for fallback. Descriptors can
>> only be submitted with MOVDIR64B (to dedicated queues ... in later
>> patch series support for shared queues will be added, but those require
>> ENQCMD or ENQCMDS to submit).
>>
>> The driver bails out at the beginning of the probe routine if the
>> necessary instructions are not supported:
>>
>> +       /*
>> +        * If the CPU does not support write512, there's no point in
>> +        * enumerating the device. We can not utilize it.
>> +        */
>> +       if (!cpu_has_write512())
>> +               return -ENXIO;
>>
>> Though we should always get past that as this PCI device ID shouldn't
>> every appear on a system that doesn't have the support. Device is on
>> the die, not a plug-in card.
> 
> Then the condition in the iosubmit function is just prone to silently paper
> over any bug in a driver:
> 
>> +       if (!cpu_has_write512())
>> +               return;
> 
> This should at least issue a WARN_ON_ONCE()

Thanks! I'll be adding the WARN_ON_ONCE() for the cap check. Also with 
the alignment check Borislav mentioned, I'll add a WARN_ON() for failures.
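
A rough userspace sketch of the two checks mentioned above, assuming a
warn-once policy for the capability check and a warn-every-time policy
for the alignment check. struct submit_checks and submit_allowed() are
hypothetical names, and fprintf() merely stands in for the kernel's
WARN_ON_ONCE() and WARN_ON().

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical state for the warn-once behaviour. */
struct submit_checks {
	bool cap_warned;	/* capability warning already printed? */
	unsigned int warnings;	/* total warnings printed so far */
};

/* Returns true if a submission to dst may proceed. */
static inline bool submit_allowed(struct submit_checks *st,
				  bool cpu_has_movdir64b, const void *dst)
{
	if (!cpu_has_movdir64b) {
		/* WARN_ON_ONCE()-like: complain only the first time. */
		if (!st->cap_warned) {
			st->cap_warned = true;
			st->warnings++;
			fprintf(stderr, "WARNING: no MOVDIR64B support\n");
		}
		return false;
	}

	if (((uintptr_t)dst & 63) != 0) {
		/* WARN_ON()-like: complain on every failure. */
		st->warnings++;
		fprintf(stderr, "WARNING: dst is not 64-byte aligned\n");
		return false;
	}

	return true;
}
```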


> 
> Thanks,
> 
> 	tglx
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-20 23:26       ` Borislav Petkov
  2019-11-21  0:15         ` Luck, Tony
@ 2019-11-21  0:27         ` Dan Williams
  2019-11-21  0:53           ` Thomas Gleixner
  1 sibling, 1 reply; 35+ messages in thread
From: Dan Williams @ 2019-11-21  0:27 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Dave Jiang, dmaengine, Linux Kernel Mailing List,
	Vinod Koul, Jing Lin, Raj, Ashok, Kumar, Sanjay K, Dey, Megha,
	Pan, Jacob jun, Liu, Yi L, Jens Axboe, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Fenghua Yu, H. Peter Anvin

On Wed, Nov 20, 2019 at 3:27 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Wed, Nov 20, 2019 at 03:19:23PM -0800, Luck, Tony wrote:
> > That's the underlying functionality of the MOVDIR64B instruction. A
> > posted write so no way to know if it succeeded.
>
> So how do you know whether any of the writes went through?

It's identical to the writel() mmio-write to start a SATA command
transfer. The higher level device driver protocol validates that the
command went through, ultimately with a timeout. There's no return
value for iosubmit_cmds512() for the same reason there's no return
value for the other iowrite primitives.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-21  0:27         ` Dan Williams
@ 2019-11-21  0:53           ` Thomas Gleixner
  2019-11-21  1:32             ` Dan Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Thomas Gleixner @ 2019-11-21  0:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: Borislav Petkov, Luck, Tony, Dave Jiang, dmaengine,
	Linux Kernel Mailing List, Vinod Koul, Jing Lin, Raj, Ashok,
	Kumar, Sanjay K, Dey, Megha, Pan, Jacob jun, Liu, Yi L,
	Jens Axboe, Andrew Morton, Ingo Molnar, Fenghua Yu,
	H. Peter Anvin

On Wed, 20 Nov 2019, Dan Williams wrote:
> On Wed, Nov 20, 2019 at 3:27 PM Borislav Petkov <bp@alien8.de> wrote:
> >
> > On Wed, Nov 20, 2019 at 03:19:23PM -0800, Luck, Tony wrote:
> > > That's the underlying functionality of the MOVDIR64B instruction. A
> > > posted write so no way to know if it succeeded.
> >
> > So how do you know whether any of the writes went through?
> 
> It's identical to the writel() mmio-write to start a SATA command
> transfer. The higher level device driver protocol validates that the
> command went through, ultimately with a timeout. There's no return
> value for iosubmit_cmds512() for the same reason there's no return
> value for the other iowrite primitives.

With the difference that the other iowrite primitives have no dependencies on
cpu feature bits and cannot fail on the software level.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-21  0:53           ` Thomas Gleixner
@ 2019-11-21  1:32             ` Dan Williams
  2019-11-21 10:37               ` Borislav Petkov
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2019-11-21  1:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, Luck, Tony, Dave Jiang, dmaengine,
	Linux Kernel Mailing List, Vinod Koul, Jing Lin, Raj, Ashok,
	Kumar, Sanjay K, Dey, Megha, Pan, Jacob jun, Liu, Yi L,
	Jens Axboe, Andrew Morton, Ingo Molnar, Fenghua Yu,
	H. Peter Anvin

On Wed, Nov 20, 2019 at 4:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, 20 Nov 2019, Dan Williams wrote:
> > On Wed, Nov 20, 2019 at 3:27 PM Borislav Petkov <bp@alien8.de> wrote:
> > >
> > > On Wed, Nov 20, 2019 at 03:19:23PM -0800, Luck, Tony wrote:
> > > > That's the underlying functionality of the MOVDIR64B instruction. A
> > > > posted write so no way to know if it succeeded.
> > >
> > > So how do you know whether any of the writes went through?
> >
> > It's identical to the writel() mmio-write to start a SATA command
> > transfer. The higher level device driver protocol validates that the
> > command went through, ultimately with a timeout. There's no return
> > value for iosubmit_cmds512() for the same reason there's no return
> > value for the other iowrite primitives.
>
> With the difference that other iowrite primitive have no dependencies on
> cpu feature bits and cannot fail on the software level.

True, but that would be a driver coding mistake flagged by the
WARN_ON_ONCE, and the failure is static. The driver must check for
static_cpu_has(X86_FEATURE_MOVDIR64B) once at init, but it need not
check again on every command submission.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-21  1:32             ` Dan Williams
@ 2019-11-21 10:37               ` Borislav Petkov
  0 siblings, 0 replies; 35+ messages in thread
From: Borislav Petkov @ 2019-11-21 10:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Thomas Gleixner, Luck, Tony, Dave Jiang, dmaengine,
	Linux Kernel Mailing List, Vinod Koul, Jing Lin, Raj, Ashok,
	Kumar, Sanjay K, Dey, Megha, Pan, Jacob jun, Liu, Yi L,
	Jens Axboe, Andrew Morton, Ingo Molnar, Fenghua Yu,
	H. Peter Anvin

On Wed, Nov 20, 2019 at 05:32:51PM -0800, Dan Williams wrote:
> True, but that would be a driver coding mistake flagged by the
> WARN_ON_ONCE, and the failure is static. The driver must check for
> static_cpu_has(X86_FEATURE_MOVDIR64B) once at init,

So if you do that at driver init time, you don't need the static variant
- simply use boot_cpu_has(). But if this function is going to be used on
other platforms, then you need the check in the function and it must be
static_cpu_has() for speed.

The static_cpu_has() thing is sort of a one-time check anyway because it
gets patched by alternatives and after that it has zero overhead.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-21  0:10     ` Dave Jiang
@ 2019-11-21 10:59       ` Borislav Petkov
  2019-11-21 16:52         ` Dave Jiang
  0 siblings, 1 reply; 35+ messages in thread
From: Borislav Petkov @ 2019-11-21 10:59 UTC (permalink / raw)
  To: Dave Jiang
  Cc: dmaengine, linux-kernel, vkoul, Williams, Dan J, Luck, Tony, Lin,
	Jing, Raj, Ashok, Kumar, Sanjay K, Dey, Megha, Pan, Jacob jun,
	Liu, Yi L, axboe, akpm, tglx, mingo, Yu, Fenghua, hpa

On Wed, Nov 20, 2019 at 05:10:41PM -0700, Dave Jiang wrote:
> I'll add the check on the destination address. The call is modeled after
> __iowrite64_copy() / __iowrite32_copy() in lib/iomap_copy.c. Looks like
> those functions do not check for the alignment requirements either.

So just because they don't check, you don't need to check either?

Can you guarantee that all callers will always do the right thing?

I mean, if you don't care too much, why even write "(must be 512-bit
aligned)"? Who cares then if the data is aligned or not...

> > > + * @dst: destination, in MMIO space (must be 512-bit aligned)
> > > + * @src: source
> > > + * @count: number of 512 bits quantities to submit
> > 
> > Where's that check on the data?
> 
> I don't follow?

What do you do if the caller doesn't submit data in 512 bits quantities?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-21 10:59       ` Borislav Petkov
@ 2019-11-21 16:52         ` Dave Jiang
  2019-11-22  8:59           ` Borislav Petkov
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Jiang @ 2019-11-21 16:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: dmaengine, linux-kernel, vkoul, Williams, Dan J, Luck, Tony, Lin,
	Jing, Raj, Ashok, Kumar, Sanjay K, Dey, Megha, Pan, Jacob jun,
	Liu, Yi L, axboe, akpm, tglx, mingo, Yu, Fenghua, hpa



On 11/21/19 3:59 AM, Borislav Petkov wrote:
> On Wed, Nov 20, 2019 at 05:10:41PM -0700, Dave Jiang wrote:
>> I'll add the check on the destination address. The call is modeled after
>> __iowrite64_copy() / __iowrite32_copy() in lib/iomap_copy.c. Looks like
>> those functions do not check for the alignment requirements either.
> 
> So just because they don't check, you don't need to check either?

No, what I meant was that those primitives are missing the checks and we
should probably address that at some point.

> 
> Can you guarantee that all callers will always do the right thing?
> 
> I mean, if you don't care too much, why even write "(must be 512-bit
> aligned)"? Who cares then if the data is aligned or not...
> 


>>>> + * @dst: destination, in MMIO space (must be 512-bit aligned)
>>>> + * @src: source
>>>> + * @count: number of 512 bits quantities to submit
>>>
>>> Where's that check on the data?
>>
>> I don't follow?
> 
> What do you do if the caller doesn't submit data in 512 bits quantities?
> 

How would I detect that? Add a size (in bytes) parameter for the total 
source data?
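
One possible reading of that idea, as a userspace sketch: take the source
length in bytes and refuse anything that is not a whole number of 64-byte
units. cmds512_units() is a hypothetical helper, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CMD512_BYTES	64u	/* one 512-bit unit */

/*
 * Convert a source length in bytes into a count of 512-bit units.
 * Returns false, leaving *count untouched, if len is not a multiple of 64.
 */
static inline bool cmds512_units(size_t len, size_t *count)
{
	assert(count != NULL);

	if (len % CMD512_BYTES != 0)
		return false;

	*count = len / CMD512_BYTES;
	return true;
}
```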

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-21 16:52         ` Dave Jiang
@ 2019-11-22  8:59           ` Borislav Petkov
  2019-11-22 17:20             ` Dan Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Borislav Petkov @ 2019-11-22  8:59 UTC (permalink / raw)
  To: Dave Jiang
  Cc: dmaengine, linux-kernel, vkoul, Williams, Dan J, Luck, Tony, Lin,
	Jing, Raj, Ashok, Kumar, Sanjay K, Dey, Megha, Pan, Jacob jun,
	Liu, Yi L, axboe, akpm, tglx, mingo, Yu, Fenghua, hpa

On Thu, Nov 21, 2019 at 09:52:19AM -0700, Dave Jiang wrote:
> No what I mean was those primitives are missing the checks and we should
> probably address that at some point.

Oh, patches are always welcome! :)

> How would I detect that? Add a size (in bytes) parameter for the total
> source data?

Sure.

So, here's the deal: the more I look at this thing, the more I think
this iosubmit_cmds512() function should not be in a generic header but
in an intel-/driver-specific one. Why?

Well, movdir64b is Intel-only for now, you don't have a fallback
option for the platforms which do not support that insn, and it is
preferable for you to do the feature check once at driver init and
then call the function because you *know* you have movdir64b support
and not have any feature check in the function itself, not even a fast
static_cpu_has() one.

And this way you can do away with alignment and size checks because you
control what your driver does.

If it turns out that this function needs to be shared with other
platforms, then we can consider lifting it into a generic header and
making it more generic.

Ok?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-22  8:59           ` Borislav Petkov
@ 2019-11-22 17:20             ` Dan Williams
  2019-11-22 18:44               ` Borislav Petkov
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2019-11-22 17:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Jiang, dmaengine, linux-kernel, vkoul, Luck, Tony, Lin,
	Jing, Raj, Ashok, Kumar, Sanjay K, Dey, Megha, Pan, Jacob jun,
	Liu, Yi L, axboe, akpm, tglx, mingo, Yu, Fenghua, hpa

On Fri, Nov 22, 2019 at 1:00 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Nov 21, 2019 at 09:52:19AM -0700, Dave Jiang wrote:
> > No what I mean was those primitives are missing the checks and we should
> > probably address that at some point.
>
> Oh, patches are always welcome! :)
>
> > How would I detect that? Add a size (in bytes) parameter for the total
> > source data?
>
> Sure.
>
> So, here's the deal: the more I look at this thing, the more I think
> this iosubmit_cmds512() function should not be in a generic header but
> in an intel-/driver-specific one. Why?
>
> Well, movdir64b is Intel-only for now, you don't have a fallback
> option for the platforms which do not support that insn and it is more
> preferential for you to do the feature check once at driver init and
> then call the function because you *know* you have movdir64b support
> and not have any feature check in the function itself, not even a fast
> static_cpu_has() one.
>
> And this way you can do away with alignment and size checks because you
> control what your driver does.
>
> If it turns out that this function needs to be shared with other
> platforms, then we can consider lifting it into a generic header and
> making it more generic.
>
> Ok?

I do agree that iosubmit_cmds512() can live in a driver specific
header, and it was my fault for advising Dave to make it generic. The
long story of how that came to pass below, but the short story is yes,
let's just make this one driver specific.

The long story is that there is already a line of sight to a need for
other generic movdir64b() helpers, as mentioned in the changelog, and
iosubmit_cmds512() got wrapped up in that momentum.

For those cases the thought would be to have memset512() for case1 and
__iowrite512_copy() for case3. Where memset512() writes a
non-incrementing source to an incrementing destination, and
__iowrite512_copy() copies an incrementing source to an incrementing
destination. Those 2 helpers *would* have fallbacks, but with the
option to use something like cpu_has_write512() to check in advance
whether those routines will fall back or not.
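
To make the three access patterns concrete, here is a userspace sketch
that uses plain memcpy() as a stand-in for the 64-byte store. The demo_*
names are hypothetical, nothing here uses MOVDIR64B, and ordinary memory
takes the place of MMIO, so only the source and destination stepping is
meaningful.

```c
#include <stddef.h>
#include <string.h>

#define UNIT	64	/* bytes in one 512-bit unit */

/* memset512()-style: fixed 64-byte source, incrementing destination. */
static inline void demo_memset512(void *dst, const void *pattern,
				  size_t count)
{
	unsigned char *to = dst;

	for (size_t i = 0; i < count; i++)
		memcpy(to + i * UNIT, pattern, UNIT);
}

/* __iowrite512_copy()-style: incrementing source and destination. */
static inline void demo_copy512(void *dst, const void *src, size_t count)
{
	unsigned char *to = dst;
	const unsigned char *from = src;

	for (size_t i = 0; i < count; i++)
		memcpy(to + i * UNIT, from + i * UNIT, UNIT);
}

/* iosubmit_cmds512()-style: incrementing source, one fixed destination. */
static inline void demo_submit512(void *portal, const void *src,
				  size_t count)
{
	const unsigned char *from = src;

	for (size_t i = 0; i < count; i++)
		memcpy(portal, from + i * UNIT, UNIT);
}
```

With ordinary memory as the "portal", demo_submit512() simply leaves the
last unit in place. A real portal consumes each 64-byte write as a
separate descriptor.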

That can be a discussion for a future patchset when those users arrive.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-22 17:20             ` Dan Williams
@ 2019-11-22 18:44               ` Borislav Petkov
  2019-11-22 18:50                 ` Dan Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Borislav Petkov @ 2019-11-22 18:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Jiang, dmaengine, linux-kernel, vkoul, Luck, Tony, Lin,
	Jing, Raj, Ashok, Kumar, Sanjay K, Dey, Megha, Pan, Jacob jun,
	Liu, Yi L, axboe, akpm, tglx, mingo, Yu, Fenghua, hpa

On Fri, Nov 22, 2019 at 09:20:39AM -0800, Dan Williams wrote:
> For those cases the thought would be to have memset512() for case1 and
> __iowrite512_copy() for case3. Where memset512() writes a
> non-incrementing source to an incrementing destination, and
> __iowrite512_copy() copies an incrementing source to an incrementing
> destination. Those 2 helpers *would* have fallbacks, but with the
> option to use something like cpu_has_write512() to check in advance
> whether those routines will fallback, or not.
> 
> That can be a discussion for a future patchset when those users arrive.

Oh, sure, of course.

My only angle is very simple: if the MOVDIR* et al is only supported on
upcoming Intel platforms and looking at the use cases:

1. clear poison/MKTME
3. copy iomem in big chunks

I'm going to venture a guess that those two cases are going to be
happening only on Intel platforms which already support MOVDIR*. So we
wouldn't really need to do any generic helpers because those use cases
are very specific already. Which would make your feature detection a
one-time, driver-init time thing anyway...

Unless I misunderstand those cases and there really is a use case
where the thing would fall back and the fallback would really be for an
"unenlightened" platform without that MOVDIR* hw support...?

Hmmm.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction
  2019-11-22 18:44               ` Borislav Petkov
@ 2019-11-22 18:50                 ` Dan Williams
  0 siblings, 0 replies; 35+ messages in thread
From: Dan Williams @ 2019-11-22 18:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Jiang, dmaengine, linux-kernel, vkoul, Luck, Tony, Lin,
	Jing, Raj, Ashok, Kumar, Sanjay K, Dey, Megha, Pan, Jacob jun,
	Liu, Yi L, axboe, akpm, tglx, mingo, Yu, Fenghua, hpa

On Fri, Nov 22, 2019 at 10:44 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Fri, Nov 22, 2019 at 09:20:39AM -0800, Dan Williams wrote:
> > For those cases the thought would be to have memset512() for case1 and
> > __iowrite512_copy() for case3. Where memset512() writes a
> > non-incrementing source to an incrementing destination, and
> > __iowrite512_copy() copies an incrementing source to an incrementing
> > destination. Those 2 helpers *would* have fallbacks, but with the
> > option to use something like cpu_has_write512() to check in advance
> > whether those routines will fall back or not.
> >
> > That can be a discussion for a future patchset when those users arrive.
>
> Oh, sure, of course.
>
> My only angle is very simple: if MOVDIR* et al. are only supported on
> upcoming Intel platforms, and looking at the use cases:
>
> 1. clear poison/MKTME
> 3. copy iomem in big chunks
>
> I'm going to venture a guess that those two cases are going to be
> happening only on Intel platforms which already support MOVDIR*. So we
> wouldn't really need to do any generic helpers because those use cases
> are very specific already. Which would make your feature detection a
> one-time, driver-init time thing anyway...
>
> Unless I misunderstand those cases and there really is a use case
> where the thing would fall back and the fallback would really be for an
> "unenlightened" platform without that MOVDIR* hw support...?

At least for something like __iowrite512_copy() it would indeed be
something an unenlightened driver could call. Those drivers would
simply be looking for opportunistic efficiency and could tolerate a
fallback. Just like current __iowrite64_copy() users don't care if
that routine falls back internally.
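
As a rough illustration of that "opportunistic efficiency" contract, here is
a user-space model of a copy helper that falls back internally. This is only
a sketch under assumptions: iowrite512_copy_model(), the model_has_write512
switch, and the chunk counters are made up for the example, cpu_has_write512()
is the name floated earlier in this thread rather than an existing API, and
memcpy() stands in for the real stores:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical switch standing in for a real CPU feature test. */
static bool model_has_write512;

static bool cpu_has_write512(void)
{
	return model_has_write512;
}

/* Counters so a caller can observe which path ran (illustration only). */
static size_t fast_chunks, fallback_chunks;

/* Models a single 64-byte store; real code would issue MOVDIR64B. */
static void write512_fast_model(unsigned char *dst, const unsigned char *src)
{
	memcpy(dst, src, 64);
	fast_chunks++;
}

/* Fallback: the same 64 bytes moved as eight 8-byte pieces. */
static void write512_fallback(unsigned char *dst, const unsigned char *src)
{
	for (size_t i = 0; i < 8; i++)
		memcpy(dst + i * 8, src + i * 8, 8);
	fallback_chunks++;
}

/* Copy 'count' 64-byte chunks; callers need not know which path is used. */
static void iowrite512_copy_model(void *to, const void *from, size_t count)
{
	unsigned char *d = to;
	const unsigned char *s = from;
	bool fast = cpu_has_write512();

	/* MOVDIR64B requires a 64-byte-aligned destination. */
	assert(((uintptr_t)to & 63) == 0);

	for (size_t i = 0; i < count; i++, d += 64, s += 64) {
		if (fast)
			write512_fast_model(d, s);
		else
			write512_fallback(d, s);
	}
}
```

A caller that does care can still consult cpu_has_write512() up front; one
that does not gets the same result from either path.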

Thread overview: 35+ messages
2019-11-20 21:23 [PATCH RFC 00/14] idxd driver for Intel Data Streaming Accelerator Dave Jiang
2019-11-20 21:23 ` [PATCH RFC 01/14] x86/asm: add iosubmit_cmds512() based on movdir64b CPU instruction Dave Jiang
2019-11-20 21:50   ` Dave Hansen
2019-11-20 23:46     ` Dave Jiang
2019-11-20 21:53   ` Borislav Petkov
2019-11-20 23:19     ` Luck, Tony
2019-11-20 23:26       ` Borislav Petkov
2019-11-21  0:15         ` Luck, Tony
2019-11-21  0:27         ` Dan Williams
2019-11-21  0:53           ` Thomas Gleixner
2019-11-21  1:32             ` Dan Williams
2019-11-21 10:37               ` Borislav Petkov
2019-11-21  0:21       ` Dan Williams
2019-11-21  0:22       ` Thomas Gleixner
2019-11-21  0:27         ` Dave Jiang
2019-11-21  0:10     ` Dave Jiang
2019-11-21 10:59       ` Borislav Petkov
2019-11-21 16:52         ` Dave Jiang
2019-11-22  8:59           ` Borislav Petkov
2019-11-22 17:20             ` Dan Williams
2019-11-22 18:44               ` Borislav Petkov
2019-11-22 18:50                 ` Dan Williams
2019-11-20 21:23 ` [PATCH RFC 02/14] dmaengine: break out channel registration Dave Jiang
2019-11-20 21:24 ` [PATCH RFC 03/14] dmaengine: add new dma device registration Dave Jiang
2019-11-20 21:24 ` [PATCH RFC 04/14] mm: create common code from request allocation based from blk-mq code Dave Jiang
2019-11-20 21:24 ` [PATCH RFC 05/14] dmaengine: add dma_request support functions Dave Jiang
2019-11-20 21:24 ` [PATCH RFC 06/14] dmaengine: add dma request submit and completion path support Dave Jiang
2019-11-20 21:24 ` [PATCH RFC 07/14] dmaengine: update dmatest to support dma request Dave Jiang
2019-11-20 21:24 ` [PATCH RFC 08/14] dmaengine: idxd: Init and probe for Intel data accelerators Dave Jiang
2019-11-20 21:24 ` [PATCH RFC 09/14] dmaengine: idxd: add configuration component of driver Dave Jiang
2019-11-20 21:24 ` [PATCH RFC 10/14] dmaengine: idxd: add descriptor manipulation routines Dave Jiang
2019-11-20 21:24 ` [PATCH RFC 11/14] dmaengine: idxd: connect idxd to dmaengine subsystem Dave Jiang
2019-11-20 21:24 ` [PATCH RFC 12/14] dmaengine: request submit optimization Dave Jiang
2019-11-20 21:25 ` [PATCH RFC 13/14] dmaengine: idxd: add char driver to expose submission portal to userland Dave Jiang
2019-11-20 21:25 ` [PATCH RFC 14/14] dmaengine: idxd: add sysfs ABI for idxd driver Dave Jiang
