* [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
@ 2020-04-21 23:33 Dave Jiang
  2020-04-21 23:33 ` [PATCH RFC 01/15] drivers/base: Introduce platform_msi_ops Dave Jiang
                   ` (15 more replies)
  0 siblings, 16 replies; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:33 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

The actual code is independent of the stage 2 driver code submission that adds
support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will
support dedicated workqueues on a guest with no vIOMMU.
  
A new device type "mdev" is introduced for the idxd driver. This allows a work
queue (wq) to be dedicated to the usage of a VFIO mediated device (mdev). Once
the wq is enabled, a UUID generated by the user can be added to the wq through
the uuid sysfs attribute for the wq. After the association, an mdev can be
created using this UUID. The mdev driver code will associate the UUID and set
up the mdev on the driver side. When the create operation succeeds, the UUID
can be passed to qemu. When the guest boots up, it should discover a DSA
device during PCI discovery.

For example:
1. Enable wq with "mdev" wq type.
2. A user-generated UUID is associated with a wq:
echo $UUID > /sys/bus/dsa/devices/wq0.0/uuid
3. The UUID is written to the mdev class sysfs create path:
echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-wq/create
4. Pass the following parameter to qemu:
"-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"
 
Since the mdev is an emulated device with a single wq, the guest will see a DSA
device with a single wq. With no vIOMMU support, the behavior will be the same
as the stage 1 driver running with the IOMMU turned off on the bare metal host.
The difference is that the wq exported through mdev will have the read-only
config bit set for configuration. This means that the device does not require
the typical configuration. After enabling the device, the user must set the wq
type and name. That is all that is necessary to enable the wq and start using it.
For now, the single wq configuration is the only way to create the mdev;
multi-wq support for mdev is planned as future work.
 
The mdev utilizes Interrupt Message Store (IMS) [3] instead of MSI-X for guest
interrupts. This preserves MSI-X for host usage and also allows a significantly
larger number of interrupt vectors for guest usage.

The idxd driver implements IMS as on-device, memory-mapped unified storage. Each
interrupt message is stored as a DWORD-size data payload and a 64-bit address
(same as MSI-X). Access to the IMS is through the host idxd driver. All the IMS
interrupt messages are stored in the remappable format. Hence, if the driver
enables IMS, interrupt remapping is also enabled by default.
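
Conceptually, each IMS entry mirrors an MSI-X table entry. A rough sketch of
what one unified-storage slot holds (illustrative only, not the actual idxd
register layout):

	struct ims_entry {
		u32 addr_lo;	/* message address, lower 32 bits */
		u32 addr_hi;	/* message address, upper 32 bits */
		u32 data;	/* DWORD-size message data payload */
		u32 ctrl;	/* per-vector control, e.g. mask bit */
	};

Since the messages are kept in the remappable format, the address/data pair
encodes an interrupt remapping handle rather than a raw CPU destination.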
 
This patchset extends the existing platform-msi.c, which already provides a
generic mechanism to support non-PCI-compliant MSI interrupts for platform
devices, to provide the IMS infrastructure.

More details about IMS, its implementation in the kernel, common
misconceptions about IMS, and the basic driver changes required to support IMS
can be found in Documentation/ims-howto.rst.

[1]: https://lore.kernel.org/lkml/157965011794.73301.15960052071729101309.stgit@djiang5-desk3.ch.intel.com/
[2]: https://software.intel.com/en-us/articles/intel-sdm
[3]: https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
[4]: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
[5]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
[6]: https://intel.github.io/idxd/
[7]: https://github.com/intel/idxd-driver idxd-stage3

---

Dave Jiang (5):
      dmaengine: idxd: add config support for readonly devices
      dmaengine: idxd: add IMS support in base driver
      dmaengine: idxd: add device support functions in prep for mdev
      dmaengine: idxd: add support for VFIO mediated device
      dmaengine: idxd: add error notification from host driver to mediated device

Jing Lin (1):
      dmaengine: idxd: add ABI documentation for mediated device support

Lu Baolu (2):
      vfio/mdev: Add a member for iommu domain in mdev_device
      vfio/type1: Save domain when attach domain to mdev

Megha Dey (7):
      drivers/base: Introduce platform_msi_ops
      drivers/base: Introduce a new platform-msi list
      drivers/base: Allocate/free platform-msi interrupts by group
      drivers/base: Add support for a new IMS irq domain
      ims-msi: Add mask/unmask routines
      ims-msi: Enable IMS interrupts
      Documentation: Interrupt Message store


 Documentation/ABI/stable/sysfs-driver-dma-idxd |   18 
 Documentation/ims-howto.rst                    |  210 +++
 arch/x86/include/asm/hw_irq.h                  |    7 
 arch/x86/include/asm/irq_remapping.h           |    6 
 drivers/base/Kconfig                           |    9 
 drivers/base/Makefile                          |    1 
 drivers/base/core.c                            |    1 
 drivers/base/ims-msi.c                         |  162 ++
 drivers/base/platform-msi.c                    |  202 ++-
 drivers/dma/Kconfig                            |    4 
 drivers/dma/idxd/Makefile                      |    2 
 drivers/dma/idxd/cdev.c                        |    3 
 drivers/dma/idxd/device.c                      |  325 ++++-
 drivers/dma/idxd/dma.c                         |    9 
 drivers/dma/idxd/idxd.h                        |   55 +
 drivers/dma/idxd/init.c                        |   81 +
 drivers/dma/idxd/irq.c                         |    6 
 drivers/dma/idxd/mdev.c                        | 1727 ++++++++++++++++++++++++
 drivers/dma/idxd/mdev.h                        |  105 +
 drivers/dma/idxd/registers.h                   |   10 
 drivers/dma/idxd/submit.c                      |   31 
 drivers/dma/idxd/sysfs.c                       |  199 ++-
 drivers/dma/idxd/vdev.c                        |  603 ++++++++
 drivers/dma/idxd/vdev.h                        |   43 +
 drivers/dma/mv_xor_v2.c                        |    6 
 drivers/dma/qcom/hidma.c                       |    6 
 drivers/iommu/arm-smmu-v3.c                    |    6 
 drivers/iommu/intel-iommu.c                    |    2 
 drivers/iommu/intel_irq_remapping.c            |   31 
 drivers/irqchip/irq-mbigen.c                   |    8 
 drivers/irqchip/irq-mvebu-icu.c                |    6 
 drivers/mailbox/bcm-flexrm-mailbox.c           |    6 
 drivers/perf/arm_smmuv3_pmu.c                  |    6 
 drivers/vfio/mdev/mdev_core.c                  |   22 
 drivers/vfio/mdev/mdev_private.h               |    2 
 drivers/vfio/vfio_iommu_type1.c                |   52 +
 include/linux/device.h                         |    3 
 include/linux/intel-iommu.h                    |    3 
 include/linux/list.h                           |   36 +
 include/linux/mdev.h                           |   13 
 include/linux/msi.h                            |   93 +
 kernel/irq/msi.c                               |   43 -
 42 files changed, 4009 insertions(+), 154 deletions(-)
 create mode 100644 Documentation/ims-howto.rst
 create mode 100644 drivers/base/ims-msi.c
 create mode 100644 drivers/dma/idxd/mdev.c
 create mode 100644 drivers/dma/idxd/mdev.h
 create mode 100644 drivers/dma/idxd/vdev.c
 create mode 100644 drivers/dma/idxd/vdev.h

--


* [PATCH RFC 01/15] drivers/base: Introduce platform_msi_ops
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
@ 2020-04-21 23:33 ` Dave Jiang
  2020-04-26  7:01   ` Greg KH
  2020-04-21 23:33 ` [PATCH RFC 02/15] drivers/base: Introduce a new platform-msi list Dave Jiang
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:33 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

From: Megha Dey <megha.dey@linux.intel.com>

This is a preparatory patch to introduce Interrupt Message Store (IMS).

Until now, platform-msi.c provided a generic way to handle non-PCI MSI
interrupts. Platform-msi uses its parent chip's mask/unmask routines
and only provides a way to write the message to the generating device.

Newly emerging non-PCI-compliant MSI-like interrupts (Intel's IMS for
instance) might need to provide device-specific mask and unmask callbacks
as well, apart from the write function.

Hence, introduce a new structure, platform_msi_ops, which provides a
device-specific write function as well as other device-specific callbacks
(mask/unmask).
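
For illustration, a device with device-specific masking would populate all
three callbacks before allocating its vectors; the foo_* names below are
hypothetical:

	static unsigned int foo_irq_mask(struct msi_desc *desc)
	{
		/* set the per-vector mask bit in device storage */
		return 0;
	}

	static unsigned int foo_irq_unmask(struct msi_desc *desc)
	{
		/* clear the per-vector mask bit in device storage */
		return 0;
	}

	static void foo_write_msg(struct msi_desc *desc, struct msi_msg *msg)
	{
		/* write msg->address_hi/lo and msg->data to device storage */
	}

	static const struct platform_msi_ops foo_msi_ops = {
		.irq_mask	= foo_irq_mask,
		.irq_unmask	= foo_irq_unmask,
		.write_msg	= foo_write_msg,
	};

	ret = platform_msi_domain_alloc_irqs(&pdev->dev, nvec, &foo_msi_ops);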

Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 drivers/base/platform-msi.c          |   27 ++++++++++++++-------------
 drivers/dma/mv_xor_v2.c              |    6 +++++-
 drivers/dma/qcom/hidma.c             |    6 +++++-
 drivers/iommu/arm-smmu-v3.c          |    6 +++++-
 drivers/irqchip/irq-mbigen.c         |    8 ++++++--
 drivers/irqchip/irq-mvebu-icu.c      |    6 +++++-
 drivers/mailbox/bcm-flexrm-mailbox.c |    6 +++++-
 drivers/perf/arm_smmuv3_pmu.c        |    6 +++++-
 include/linux/msi.h                  |   24 ++++++++++++++++++------
 9 files changed, 68 insertions(+), 27 deletions(-)

diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
index 8da314b81eab..1a3af5f33802 100644
--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -21,11 +21,11 @@
  * and the callback to write the MSI message.
  */
 struct platform_msi_priv_data {
-	struct device		*dev;
-	void 			*host_data;
-	msi_alloc_info_t	arg;
-	irq_write_msi_msg_t	write_msg;
-	int			devid;
+	struct device			*dev;
+	void				*host_data;
+	msi_alloc_info_t		arg;
+	const struct platform_msi_ops	*ops;
+	int				devid;
 };
 
 /* The devid allocator */
@@ -83,7 +83,7 @@ static void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
 
 	priv_data = desc->platform.msi_priv_data;
 
-	priv_data->write_msg(desc, msg);
+	priv_data->ops->write_msg(desc, msg);
 }
 
 static void platform_msi_update_chip_ops(struct msi_domain_info *info)
@@ -194,7 +194,7 @@ struct irq_domain *platform_msi_create_irq_domain(struct fwnode_handle *fwnode,
 
 static struct platform_msi_priv_data *
 platform_msi_alloc_priv_data(struct device *dev, unsigned int nvec,
-			     irq_write_msi_msg_t write_msi_msg)
+			     const struct platform_msi_ops *platform_ops)
 {
 	struct platform_msi_priv_data *datap;
 	/*
@@ -203,7 +203,8 @@ platform_msi_alloc_priv_data(struct device *dev, unsigned int nvec,
 	 * accordingly (which would impact the max number of MSI
 	 * capable devices).
 	 */
-	if (!dev->msi_domain || !write_msi_msg || !nvec || nvec > MAX_DEV_MSIS)
+	if (!dev->msi_domain || !platform_ops->write_msg || !nvec ||
+	    nvec > MAX_DEV_MSIS)
 		return ERR_PTR(-EINVAL);
 
 	if (dev->msi_domain->bus_token != DOMAIN_BUS_PLATFORM_MSI) {
@@ -227,7 +228,7 @@ platform_msi_alloc_priv_data(struct device *dev, unsigned int nvec,
 		return ERR_PTR(err);
 	}
 
-	datap->write_msg = write_msi_msg;
+	datap->ops = platform_ops;
 	datap->dev = dev;
 
 	return datap;
@@ -249,12 +250,12 @@ static void platform_msi_free_priv_data(struct platform_msi_priv_data *data)
  * Zero for success, or an error code in case of failure
  */
 int platform_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
-				   irq_write_msi_msg_t write_msi_msg)
+				   const struct platform_msi_ops *platform_ops)
 {
 	struct platform_msi_priv_data *priv_data;
 	int err;
 
-	priv_data = platform_msi_alloc_priv_data(dev, nvec, write_msi_msg);
+	priv_data = platform_msi_alloc_priv_data(dev, nvec, platform_ops);
 	if (IS_ERR(priv_data))
 		return PTR_ERR(priv_data);
 
@@ -324,7 +325,7 @@ struct irq_domain *
 __platform_msi_create_device_domain(struct device *dev,
 				    unsigned int nvec,
 				    bool is_tree,
-				    irq_write_msi_msg_t write_msi_msg,
+				    const struct platform_msi_ops *platform_ops,
 				    const struct irq_domain_ops *ops,
 				    void *host_data)
 {
@@ -332,7 +333,7 @@ __platform_msi_create_device_domain(struct device *dev,
 	struct irq_domain *domain;
 	int err;
 
-	data = platform_msi_alloc_priv_data(dev, nvec, write_msi_msg);
+	data = platform_msi_alloc_priv_data(dev, nvec, platform_ops);
 	if (IS_ERR(data))
 		return NULL;
 
diff --git a/drivers/dma/mv_xor_v2.c b/drivers/dma/mv_xor_v2.c
index 157c959311ea..426f520f3765 100644
--- a/drivers/dma/mv_xor_v2.c
+++ b/drivers/dma/mv_xor_v2.c
@@ -706,6 +706,10 @@ static int mv_xor_v2_resume(struct platform_device *dev)
 	return 0;
 }
 
+static const struct platform_msi_ops mv_xor_v2_msi_ops = {
+	.write_msg	= mv_xor_v2_set_msi_msg,
+};
+
 static int mv_xor_v2_probe(struct platform_device *pdev)
 {
 	struct mv_xor_v2_device *xor_dev;
@@ -761,7 +765,7 @@ static int mv_xor_v2_probe(struct platform_device *pdev)
 	}
 
 	ret = platform_msi_domain_alloc_irqs(&pdev->dev, 1,
-					     mv_xor_v2_set_msi_msg);
+					     &mv_xor_v2_msi_ops);
 	if (ret)
 		goto disable_clk;
 
diff --git a/drivers/dma/qcom/hidma.c b/drivers/dma/qcom/hidma.c
index 411f91fde734..65371535ba26 100644
--- a/drivers/dma/qcom/hidma.c
+++ b/drivers/dma/qcom/hidma.c
@@ -678,6 +678,10 @@ static void hidma_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
 		writel(msg->data, dmadev->dev_evca + 0x120);
 	}
 }
+
+static const struct platform_msi_ops hidma_msi_ops = {
+	.write_msg	= hidma_write_msi_msg,
+};
 #endif
 
 static void hidma_free_msis(struct hidma_dev *dmadev)
@@ -703,7 +707,7 @@ static int hidma_request_msi(struct hidma_dev *dmadev,
 	struct msi_desc *failed_desc = NULL;
 
 	rc = platform_msi_domain_alloc_irqs(&pdev->dev, HIDMA_MSI_INTS,
-					    hidma_write_msi_msg);
+					    &hidma_msi_ops);
 	if (rc)
 		return rc;
 
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 82508730feb7..764e284202f1 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -3425,6 +3425,10 @@ static void arm_smmu_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
 	writel_relaxed(ARM_SMMU_MEMATTR_DEVICE_nGnRE, smmu->base + cfg[2]);
 }
 
+static const struct platform_msi_ops arm_smmu_msi_ops = {
+	.write_msg	= arm_smmu_write_msi_msg,
+};
+
 static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
 {
 	struct msi_desc *desc;
@@ -3449,7 +3453,7 @@ static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
 	}
 
 	/* Allocate MSIs for evtq, gerror and priq. Ignore cmdq */
-	ret = platform_msi_domain_alloc_irqs(dev, nvec, arm_smmu_write_msi_msg);
+	ret = platform_msi_domain_alloc_irqs(dev, nvec, &arm_smmu_msi_ops);
 	if (ret) {
 		dev_warn(dev, "failed to allocate MSIs - falling back to wired irqs\n");
 		return;
diff --git a/drivers/irqchip/irq-mbigen.c b/drivers/irqchip/irq-mbigen.c
index 6b566bba263b..ff5b75751974 100644
--- a/drivers/irqchip/irq-mbigen.c
+++ b/drivers/irqchip/irq-mbigen.c
@@ -226,6 +226,10 @@ static const struct irq_domain_ops mbigen_domain_ops = {
 	.free		= irq_domain_free_irqs_common,
 };
 
+static const struct platform_msi_ops mbigen_msi_ops = {
+	.write_msg	= mbigen_write_msg,
+};
+
 static int mbigen_of_create_domain(struct platform_device *pdev,
 				   struct mbigen_device *mgn_chip)
 {
@@ -254,7 +258,7 @@ static int mbigen_of_create_domain(struct platform_device *pdev,
 		}
 
 		domain = platform_msi_create_device_domain(&child->dev, num_pins,
-							   mbigen_write_msg,
+							   &mbigen_msi_ops,
 							   &mbigen_domain_ops,
 							   mgn_chip);
 		if (!domain) {
@@ -302,7 +306,7 @@ static int mbigen_acpi_create_domain(struct platform_device *pdev,
 		return -EINVAL;
 
 	domain = platform_msi_create_device_domain(&pdev->dev, num_pins,
-						   mbigen_write_msg,
+						   &mbigen_msi_ops,
 						   &mbigen_domain_ops,
 						   mgn_chip);
 	if (!domain)
diff --git a/drivers/irqchip/irq-mvebu-icu.c b/drivers/irqchip/irq-mvebu-icu.c
index 547045d89c4b..49b6390470bb 100644
--- a/drivers/irqchip/irq-mvebu-icu.c
+++ b/drivers/irqchip/irq-mvebu-icu.c
@@ -295,6 +295,10 @@ static const struct of_device_id mvebu_icu_subset_of_match[] = {
 	{},
 };
 
+static const struct platform_msi_ops mvebu_icu_msi_ops = {
+	.write_msg	= mvebu_icu_write_msg,
+};
+
 static int mvebu_icu_subset_probe(struct platform_device *pdev)
 {
 	struct mvebu_icu_msi_data *msi_data;
@@ -324,7 +328,7 @@ static int mvebu_icu_subset_probe(struct platform_device *pdev)
 		return -ENODEV;
 
 	irq_domain = platform_msi_create_device_tree_domain(dev, ICU_MAX_IRQS,
-							    mvebu_icu_write_msg,
+							    &mvebu_icu_msi_ops,
 							    &mvebu_icu_domain_ops,
 							    msi_data);
 	if (!irq_domain) {
diff --git a/drivers/mailbox/bcm-flexrm-mailbox.c b/drivers/mailbox/bcm-flexrm-mailbox.c
index bee33abb5308..0268337e08e3 100644
--- a/drivers/mailbox/bcm-flexrm-mailbox.c
+++ b/drivers/mailbox/bcm-flexrm-mailbox.c
@@ -1492,6 +1492,10 @@ static void flexrm_mbox_msi_write(struct msi_desc *desc, struct msi_msg *msg)
 	writel_relaxed(msg->data, ring->regs + RING_MSI_DATA_VALUE);
 }
 
+static const struct platform_msi_ops flexrm_mbox_msi_ops = {
+	.write_msg	= flexrm_mbox_msi_write,
+};
+
 static int flexrm_mbox_probe(struct platform_device *pdev)
 {
 	int index, ret = 0;
@@ -1604,7 +1608,7 @@ static int flexrm_mbox_probe(struct platform_device *pdev)
 
 	/* Allocate platform MSIs for each ring */
 	ret = platform_msi_domain_alloc_irqs(dev, mbox->num_rings,
-						flexrm_mbox_msi_write);
+						&flexrm_mbox_msi_ops);
 	if (ret)
 		goto fail_destroy_cmpl_pool;
 
diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c
index f01a57e5a5f3..bcbd7f5e3d0f 100644
--- a/drivers/perf/arm_smmuv3_pmu.c
+++ b/drivers/perf/arm_smmuv3_pmu.c
@@ -652,6 +652,10 @@ static void smmu_pmu_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
 		       pmu->reg_base + SMMU_PMCG_IRQ_CFG2);
 }
 
+static const struct platform_msi_ops smmu_pmu_msi_ops = {
+	.write_msg	= smmu_pmu_write_msi_msg,
+};
+
 static void smmu_pmu_setup_msi(struct smmu_pmu *pmu)
 {
 	struct msi_desc *desc;
@@ -665,7 +669,7 @@ static void smmu_pmu_setup_msi(struct smmu_pmu *pmu)
 	if (!(readl(pmu->reg_base + SMMU_PMCG_CFGR) & SMMU_PMCG_CFGR_MSI))
 		return;
 
-	ret = platform_msi_domain_alloc_irqs(dev, 1, smmu_pmu_write_msi_msg);
+	ret = platform_msi_domain_alloc_irqs(dev, 1, &smmu_pmu_msi_ops);
 	if (ret) {
 		dev_warn(dev, "failed to allocate MSIs\n");
 		return;
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 8ad679e9d9c0..8e08907d70cb 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -321,6 +321,18 @@ enum {
 	MSI_FLAG_LEVEL_CAPABLE		= (1 << 6),
 };
 
+/*
+ * platform_msi_ops - Callbacks for platform MSI ops
+ * @irq_mask:   mask an interrupt source
+ * @irq_unmask: unmask an interrupt source
+ * @write_msg: write message content
+ */
+struct platform_msi_ops {
+	unsigned int		(*irq_mask)(struct msi_desc *desc);
+	unsigned int		(*irq_unmask)(struct msi_desc *desc);
+	irq_write_msi_msg_t	write_msg;
+};
+
 int msi_domain_set_affinity(struct irq_data *data, const struct cpumask *mask,
 			    bool force);
 
@@ -336,7 +348,7 @@ struct irq_domain *platform_msi_create_irq_domain(struct fwnode_handle *fwnode,
 						  struct msi_domain_info *info,
 						  struct irq_domain *parent);
 int platform_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
-				   irq_write_msi_msg_t write_msi_msg);
+				   const struct platform_msi_ops *platform_ops);
 void platform_msi_domain_free_irqs(struct device *dev);
 
 /* When an MSI domain is used as an intermediate domain */
@@ -348,14 +360,14 @@ struct irq_domain *
 __platform_msi_create_device_domain(struct device *dev,
 				    unsigned int nvec,
 				    bool is_tree,
-				    irq_write_msi_msg_t write_msi_msg,
+				    const struct platform_msi_ops *platform_ops,
 				    const struct irq_domain_ops *ops,
 				    void *host_data);
 
-#define platform_msi_create_device_domain(dev, nvec, write, ops, data)	\
-	__platform_msi_create_device_domain(dev, nvec, false, write, ops, data)
-#define platform_msi_create_device_tree_domain(dev, nvec, write, ops, data) \
-	__platform_msi_create_device_domain(dev, nvec, true, write, ops, data)
+#define platform_msi_create_device_domain(dev, nvec, p_ops, ops, data)	\
+	__platform_msi_create_device_domain(dev, nvec, false, p_ops, ops, data)
+#define platform_msi_create_device_tree_domain(dev, nvec, p_ops, ops, data) \
+	__platform_msi_create_device_domain(dev, nvec, true, p_ops, ops, data)
 
 int platform_msi_domain_alloc(struct irq_domain *domain, unsigned int virq,
 			      unsigned int nr_irqs);



* [PATCH RFC 02/15] drivers/base: Introduce a new platform-msi list
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
  2020-04-21 23:33 ` [PATCH RFC 01/15] drivers/base: Introduce platform_msi_ops Dave Jiang
@ 2020-04-21 23:33 ` Dave Jiang
  2020-04-25 21:13   ` Thomas Gleixner
  2020-04-21 23:34 ` [PATCH RFC 03/15] drivers/base: Allocate/free platform-msi interrupts by group Dave Jiang
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:33 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

From: Megha Dey <megha.dey@linux.intel.com>

This is a preparatory patch to introduce Interrupt Message Store (IMS).

The struct device has a linked list ('msi_list') of the MSI (MSI/MSI-X,
platform-msi) descriptors of that device. This list holds only one type
of descriptor, since so far it has not been possible for a device to support
more than one of these descriptor types concurrently.

However, with the introduction of IMS, a device can support IMS as well
as MSI-X at the same time. Instead of sharing this list between IMS (a
type of platform-msi) and MSI-X descriptors, introduce a new linked list,
platform_msi_list, which will hold all the platform-msi descriptors.

Thus, msi_list will point to the MSI/MSI-X descriptors of a device, while
platform_msi_list will point to the platform-msi descriptors of a device.
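
With the split in place, a driver walks its platform-msi (e.g. IMS)
descriptors through the new helpers while PCI MSI/MSI-X code keeps using
msi_list; a short sketch using the helpers added here (foo_program_vector is
hypothetical):

	struct msi_desc *desc;

	/* iterate only the platform-msi descriptors of the device */
	for_each_platform_msi_entry(desc, dev)
		foo_program_vector(dev, desc->platform.msi_index, desc->irq);

	/* generic MSI code picks the right list via platform_msi_type */
	desc = first_msi_entry_common(dev);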

Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 drivers/base/core.c         |    1 +
 drivers/base/platform-msi.c |   19 +++++++++++--------
 include/linux/device.h      |    2 ++
 include/linux/list.h        |   36 ++++++++++++++++++++++++++++++++++++
 include/linux/msi.h         |   21 +++++++++++++++++++++
 kernel/irq/msi.c            |   16 ++++++++--------
 6 files changed, 79 insertions(+), 16 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 139cdf7e7327..5a0116d1a8d0 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1984,6 +1984,7 @@ void device_initialize(struct device *dev)
 	set_dev_node(dev, -1);
 #ifdef CONFIG_GENERIC_MSI_IRQ
 	INIT_LIST_HEAD(&dev->msi_list);
+	INIT_LIST_HEAD(&dev->platform_msi_list);
 #endif
 	INIT_LIST_HEAD(&dev->links.consumers);
 	INIT_LIST_HEAD(&dev->links.suppliers);
diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
index 1a3af5f33802..b25c52f734dc 100644
--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -110,7 +110,8 @@ static void platform_msi_free_descs(struct device *dev, int base, int nvec)
 {
 	struct msi_desc *desc, *tmp;
 
-	list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
+	list_for_each_entry_safe(desc, tmp, dev_to_platform_msi_list(dev),
+				 list) {
 		if (desc->platform.msi_index >= base &&
 		    desc->platform.msi_index < (base + nvec)) {
 			list_del(&desc->list);
@@ -127,8 +128,8 @@ static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
 	struct msi_desc *desc;
 	int i, base = 0;
 
-	if (!list_empty(dev_to_msi_list(dev))) {
-		desc = list_last_entry(dev_to_msi_list(dev),
+	if (!list_empty(dev_to_platform_msi_list(dev))) {
+		desc = list_last_entry(dev_to_platform_msi_list(dev),
 				       struct msi_desc, list);
 		base = desc->platform.msi_index + 1;
 	}
@@ -142,7 +143,7 @@ static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
 		desc->platform.msi_index = base + i;
 		desc->irq = virq ? virq + i : 0;
 
-		list_add_tail(&desc->list, dev_to_msi_list(dev));
+		list_add_tail(&desc->list, dev_to_platform_msi_list(dev));
 	}
 
 	if (i != nvec) {
@@ -213,7 +214,7 @@ platform_msi_alloc_priv_data(struct device *dev, unsigned int nvec,
 	}
 
 	/* Already had a helping of MSI? Greed... */
-	if (!list_empty(dev_to_msi_list(dev)))
+	if (!list_empty(dev_to_platform_msi_list(dev)))
 		return ERR_PTR(-EBUSY);
 
 	datap = kzalloc(sizeof(*datap), GFP_KERNEL);
@@ -255,6 +256,8 @@ int platform_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
 	struct platform_msi_priv_data *priv_data;
 	int err;
 
+	dev->platform_msi_type = GEN_PLAT_MSI;
+
 	priv_data = platform_msi_alloc_priv_data(dev, nvec, platform_ops);
 	if (IS_ERR(priv_data))
 		return PTR_ERR(priv_data);
@@ -284,10 +287,10 @@ EXPORT_SYMBOL_GPL(platform_msi_domain_alloc_irqs);
  */
 void platform_msi_domain_free_irqs(struct device *dev)
 {
-	if (!list_empty(dev_to_msi_list(dev))) {
+	if (!list_empty(dev_to_platform_msi_list(dev))) {
 		struct msi_desc *desc;
 
-		desc = first_msi_entry(dev);
+		desc = first_platform_msi_entry(dev);
 		platform_msi_free_priv_data(desc->platform.msi_priv_data);
 	}
 
@@ -370,7 +373,7 @@ void platform_msi_domain_free(struct irq_domain *domain, unsigned int virq,
 {
 	struct platform_msi_priv_data *data = domain->host_data;
 	struct msi_desc *desc, *tmp;
-	for_each_msi_entry_safe(desc, tmp, data->dev) {
+	for_each_platform_msi_entry_safe(desc, tmp, data->dev) {
 		if (WARN_ON(!desc->irq || desc->nvec_used != 1))
 			return;
 		if (!(desc->irq >= virq && desc->irq < (virq + nvec)))
diff --git a/include/linux/device.h b/include/linux/device.h
index ac8e37cd716a..cbcecb14584e 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -567,6 +567,8 @@ struct device {
 #endif
 #ifdef CONFIG_GENERIC_MSI_IRQ
 	struct list_head	msi_list;
+	struct list_head	platform_msi_list;
+	unsigned int		platform_msi_type;
 #endif
 
 	const struct dma_map_ops *dma_ops;
diff --git a/include/linux/list.h b/include/linux/list.h
index aff44d34f4e4..7a5ea40cb945 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -492,6 +492,18 @@ static inline void list_splice_tail_init(struct list_head *list,
 #define list_entry(ptr, type, member) \
 	container_of(ptr, type, member)
 
+/**
+ * list_entry_select - get the correct struct for this entry based on condition
+ * @condition:	the condition to choose a particular &struct list_head pointer
+ * @ptr_a:      the &struct list_head pointer if @condition is met.
+ * @ptr_b:      the &struct list_head pointer if @condition is not met.
+ * @type:       the type of the struct this is embedded in.
+ * @member:     the name of the list_head within the struct.
+ */
+#define list_entry_select(condition, ptr_a, ptr_b, type, member)\
+	(condition) ? list_entry(ptr_a, type, member) :		\
+		      list_entry(ptr_b, type, member)
+
 /**
  * list_first_entry - get the first element from a list
  * @ptr:	the list head to take the element from.
@@ -503,6 +515,17 @@ static inline void list_splice_tail_init(struct list_head *list,
 #define list_first_entry(ptr, type, member) \
 	list_entry((ptr)->next, type, member)
 
+/**
+ * list_first_entry_select - get the first element from list based on condition
+ * @condition:  the condition to choose a particular &struct list_head pointer
+ * @ptr_a:      the &struct list_head pointer if @condition is met.
+ * @ptr_b:      the &struct list_head pointer if @condition is not met.
+ * @type:       the type of the struct this is embedded in.
+ * @member:     the name of the list_head within the struct.
+ */
+#define list_first_entry_select(condition, ptr_a, ptr_b, type, member)  \
+	list_entry_select((condition), (ptr_a)->next, (ptr_b)->next, type, member)
+
 /**
  * list_last_entry - get the last element from a list
  * @ptr:	the list head to take the element from.
@@ -602,6 +625,19 @@ static inline void list_splice_tail_init(struct list_head *list,
 	     &pos->member != (head);					\
 	     pos = list_next_entry(pos, member))
 
+/**
+ * list_for_each_entry_select - iterate over list of given type based on condition
+ * @condition:  the condition to choose a particular &struct list_head pointer
+ * @pos:        the type * to use as a loop cursor.
+ * @head_a:     the head for your list if condition is met.
+ * @head_b:     the head for your list if condition is not met.
+ * @member:     the name of the list_head within the struct.
+ */
+#define list_for_each_entry_select(condition, pos, head_a, head_b, member)\
+	for (pos = list_first_entry_select((condition), head_a, head_b, typeof(*pos), member);\
+	     (condition) ? &pos->member != (head_a) : &pos->member != (head_b);\
+	     pos = list_next_entry(pos, member))
+
 /**
  * list_for_each_entry_reverse - iterate backwards over list of given type.
  * @pos:	the type * to use as a loop cursor.
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 8e08907d70cb..9c15b7403694 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -130,6 +130,11 @@ struct msi_desc {
 	};
 };
 
+enum platform_msi_type {
+	NOT_PLAT_MSI = 0,
+	GEN_PLAT_MSI = 1,
+};
+
 /* Helpers to hide struct msi_desc implementation details */
 #define msi_desc_to_dev(desc)		((desc)->dev)
 #define dev_to_msi_list(dev)		(&(dev)->msi_list)
@@ -140,6 +145,22 @@ struct msi_desc {
 #define for_each_msi_entry_safe(desc, tmp, dev)	\
 	list_for_each_entry_safe((desc), (tmp), dev_to_msi_list((dev)), list)
 
+#define dev_to_platform_msi_list(dev)	(&(dev)->platform_msi_list)
+#define first_platform_msi_entry(dev)		\
+	list_first_entry(dev_to_platform_msi_list((dev)), struct msi_desc, list)
+#define for_each_platform_msi_entry(desc, dev)	\
+	list_for_each_entry((desc), dev_to_platform_msi_list((dev)), list)
+#define for_each_platform_msi_entry_safe(desc, tmp, dev)	\
+	list_for_each_entry_safe((desc), (tmp), dev_to_platform_msi_list((dev)), list)
+
+#define first_msi_entry_common(dev)	\
+	list_first_entry_select((dev)->platform_msi_type, dev_to_platform_msi_list((dev)),	\
+				dev_to_msi_list((dev)), struct msi_desc, list)
+
+#define for_each_msi_entry_common(desc, dev)	\
+	list_for_each_entry_select((dev)->platform_msi_type, desc, dev_to_platform_msi_list((dev)), \
+				   dev_to_msi_list((dev)), list)	\
+
 #ifdef CONFIG_IRQ_MSI_IOMMU
 static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
 {
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index eb95f6106a1e..bc5f9e32387f 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -320,7 +320,7 @@ int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
 	struct msi_desc *desc;
 	int ret = 0;
 
-	for_each_msi_entry(desc, dev) {
+	for_each_msi_entry_common(desc, dev) {
 		/* Don't even try the multi-MSI brain damage. */
 		if (WARN_ON(!desc->irq || desc->nvec_used != 1)) {
 			ret = -EINVAL;
@@ -342,7 +342,7 @@ int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
 
 	if (ret) {
 		/* Mop up the damage */
-		for_each_msi_entry(desc, dev) {
+		for_each_msi_entry_common(desc, dev) {
 			if (!(desc->irq >= virq && desc->irq < (virq + nvec)))
 				continue;
 
@@ -383,7 +383,7 @@ static bool msi_check_reservation_mode(struct irq_domain *domain,
 	 * Checking the first MSI descriptor is sufficient. MSIX supports
 	 * masking and MSI does so when the maskbit is set.
 	 */
-	desc = first_msi_entry(dev);
+	desc = first_msi_entry_common(dev);
 	return desc->msi_attrib.is_msix || desc->msi_attrib.maskbit;
 }
 
@@ -411,7 +411,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 	if (ret)
 		return ret;
 
-	for_each_msi_entry(desc, dev) {
+	for_each_msi_entry_common(desc, dev) {
 		ops->set_desc(&arg, desc);
 
 		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
@@ -437,7 +437,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 
 	can_reserve = msi_check_reservation_mode(domain, info, dev);
 
-	for_each_msi_entry(desc, dev) {
+	for_each_msi_entry_common(desc, dev) {
 		virq = desc->irq;
 		if (desc->nvec_used == 1)
 			dev_dbg(dev, "irq %d for MSI\n", virq);
@@ -468,7 +468,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 	 * so request_irq() will assign the final vector.
 	 */
 	if (can_reserve) {
-		for_each_msi_entry(desc, dev) {
+		for_each_msi_entry_common(desc, dev) {
 			irq_data = irq_domain_get_irq_data(domain, desc->irq);
 			irqd_clr_activated(irq_data);
 		}
@@ -476,7 +476,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 	return 0;
 
 cleanup:
-	for_each_msi_entry(desc, dev) {
+	for_each_msi_entry_common(desc, dev) {
 		struct irq_data *irqd;
 
 		if (desc->irq == virq)
@@ -500,7 +500,7 @@ void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
 {
 	struct msi_desc *desc;
 
-	for_each_msi_entry(desc, dev) {
+	for_each_msi_entry_common(desc, dev) {
 		/*
 		 * We might have failed to allocate an MSI early
 		 * enough that there is no IRQ associated to this



* [PATCH RFC 03/15] drivers/base: Allocate/free platform-msi interrupts by group
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
  2020-04-21 23:33 ` [PATCH RFC 01/15] drivers/base: Introduce platform_msi_ops Dave Jiang
  2020-04-21 23:33 ` [PATCH RFC 02/15] drivers/base: Introduce a new platform-msi list Dave Jiang
@ 2020-04-21 23:34 ` Dave Jiang
  2020-04-25 21:23   ` Thomas Gleixner
  2020-04-21 23:34 ` [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain Dave Jiang
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:34 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

From: Megha Dey <megha.dey@linux.intel.com>

This is a preparatory patch to introduce Interrupt Message Store (IMS).

Dynamic allocation of IMS vectors is a requirement for devices which
support Scalable I/O Virtualization. A driver can allocate and free
vectors not just once during probe (as is the case with MSI/MSI-X)
but also in the post-probe phase, where the actual demand is known.

Thus, introduce an API, platform_msi_domain_alloc_irqs_group(), which
drivers using IMS can call multiple times. The vectors allocated each
time this API is called are associated with a group ID, starting from 1.
To free the vectors associated with a particular group, the
platform_msi_domain_free_irqs_group() API can be called.

The existing drivers using the platform-msi infrastructure will continue to
use the existing alloc (platform_msi_domain_alloc_irqs) and free
(platform_msi_domain_free_irqs) APIs and are assigned a default group 0.
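
A driver using IMS could then grow and shrink its vector pool at run time; a
minimal sketch (error handling trimmed, foo_msi_ops as defined by the driver):

	unsigned int group;
	int ret;

	/* post probe: allocate 8 more vectors once demand is known */
	ret = platform_msi_domain_alloc_irqs_group(dev, 8, &foo_msi_ops,
						   &group);
	if (ret)
		return ret;

	/* ... request_irq() and use the vectors of this group ... */

	/* later: release only this group, other groups stay allocated */
	platform_msi_domain_free_irqs_group(dev, group);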

Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 drivers/base/platform-msi.c |  131 ++++++++++++++++++++++++++++++++-----------
 include/linux/device.h      |    1 
 include/linux/msi.h         |   47 +++++++++++----
 kernel/irq/msi.c            |   43 +++++++++++---
 4 files changed, 169 insertions(+), 53 deletions(-)

diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
index b25c52f734dc..2696aa75983b 100644
--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -106,16 +106,28 @@ static void platform_msi_update_chip_ops(struct msi_domain_info *info)
 		info->flags &= ~MSI_FLAG_LEVEL_CAPABLE;
 }
 
-static void platform_msi_free_descs(struct device *dev, int base, int nvec)
+static void platform_msi_free_descs(struct device *dev, int base, int nvec,
+				    unsigned int group)
 {
 	struct msi_desc *desc, *tmp;
-
-	list_for_each_entry_safe(desc, tmp, dev_to_platform_msi_list(dev),
-				 list) {
-		if (desc->platform.msi_index >= base &&
-		    desc->platform.msi_index < (base + nvec)) {
-			list_del(&desc->list);
-			free_msi_entry(desc);
+	struct platform_msi_group_entry *platform_msi_group,
+					*tmp_platform_msi_group;
+
+	list_for_each_entry_safe(platform_msi_group, tmp_platform_msi_group,
+				 dev_to_platform_msi_group_list(dev),
+				 group_list) {
+		if (platform_msi_group->group_id == group) {
+			list_for_each_entry_safe(desc, tmp,
+						&platform_msi_group->entry_list,
+						list) {
+				if (desc->platform.msi_index >= base &&
+				    desc->platform.msi_index < (base + nvec)) {
+					list_del(&desc->list);
+					free_msi_entry(desc);
+				}
+			}
+			list_del(&platform_msi_group->group_list);
+			kfree(platform_msi_group);
 		}
 	}
 }
@@ -128,8 +140,8 @@ static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
 	struct msi_desc *desc;
 	int i, base = 0;
 
-	if (!list_empty(dev_to_platform_msi_list(dev))) {
-		desc = list_last_entry(dev_to_platform_msi_list(dev),
+	if (!list_empty(platform_msi_current_group_entry_list(dev))) {
+		desc = list_last_entry(platform_msi_current_group_entry_list(dev),
 				       struct msi_desc, list);
 		base = desc->platform.msi_index + 1;
 	}
@@ -143,12 +155,13 @@ static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
 		desc->platform.msi_index = base + i;
 		desc->irq = virq ? virq + i : 0;
 
-		list_add_tail(&desc->list, dev_to_platform_msi_list(dev));
+		list_add_tail(&desc->list,
+			      platform_msi_current_group_entry_list(dev));
 	}
 
 	if (i != nvec) {
 		/* Clean up the mess */
-		platform_msi_free_descs(dev, base, nvec);
+		platform_msi_free_descs(dev, base, nvec, dev->group_id);
 
 		return -ENOMEM;
 	}
@@ -214,7 +227,7 @@ platform_msi_alloc_priv_data(struct device *dev, unsigned int nvec,
 	}
 
 	/* Already had a helping of MSI? Greed... */
-	if (!list_empty(dev_to_platform_msi_list(dev)))
+	if (!list_empty(platform_msi_current_group_entry_list(dev)))
 		return ERR_PTR(-EBUSY);
 
 	datap = kzalloc(sizeof(*datap), GFP_KERNEL);
@@ -253,11 +266,36 @@ static void platform_msi_free_priv_data(struct platform_msi_priv_data *data)
 int platform_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
 				   const struct platform_msi_ops *platform_ops)
 {
+	return platform_msi_domain_alloc_irqs_group(dev, nvec, platform_ops,
+									NULL);
+}
+EXPORT_SYMBOL_GPL(platform_msi_domain_alloc_irqs);
+
+int platform_msi_domain_alloc_irqs_group(struct device *dev, unsigned int nvec,
+					 const struct platform_msi_ops *platform_ops,
+					 unsigned int *group_id)
+{
+	struct platform_msi_group_entry *platform_msi_group;
 	struct platform_msi_priv_data *priv_data;
 	int err;
 
 	dev->platform_msi_type = GEN_PLAT_MSI;
 
+	if (group_id)
+		*group_id = ++dev->group_id;
+
+	platform_msi_group = kzalloc(sizeof(*platform_msi_group), GFP_KERNEL);
+	if (!platform_msi_group) {
+		err = -ENOMEM;
+		goto out_platform_msi_group;
+	}
+
+	INIT_LIST_HEAD(&platform_msi_group->group_list);
+	INIT_LIST_HEAD(&platform_msi_group->entry_list);
+	platform_msi_group->group_id = dev->group_id;
+	list_add_tail(&platform_msi_group->group_list,
+		      dev_to_platform_msi_group_list(dev));
+
 	priv_data = platform_msi_alloc_priv_data(dev, nvec, platform_ops);
 	if (IS_ERR(priv_data))
 		return PTR_ERR(priv_data);
@@ -273,13 +311,14 @@ int platform_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
 	return 0;
 
 out_free_desc:
-	platform_msi_free_descs(dev, 0, nvec);
+	platform_msi_free_descs(dev, 0, nvec, dev->group_id);
 out_free_priv_data:
 	platform_msi_free_priv_data(priv_data);
-
+out_platform_msi_group:
+	kfree(platform_msi_group);
 	return err;
 }
-EXPORT_SYMBOL_GPL(platform_msi_domain_alloc_irqs);
+EXPORT_SYMBOL_GPL(platform_msi_domain_alloc_irqs_group);
 
 /**
  * platform_msi_domain_free_irqs - Free MSI interrupts for @dev
@@ -287,17 +326,30 @@ EXPORT_SYMBOL_GPL(platform_msi_domain_alloc_irqs);
  */
 void platform_msi_domain_free_irqs(struct device *dev)
 {
-	if (!list_empty(dev_to_platform_msi_list(dev))) {
-		struct msi_desc *desc;
+	platform_msi_domain_free_irqs_group(dev, 0);
+}
+EXPORT_SYMBOL_GPL(platform_msi_domain_free_irqs);
 
-		desc = first_platform_msi_entry(dev);
-		platform_msi_free_priv_data(desc->platform.msi_priv_data);
+void platform_msi_domain_free_irqs_group(struct device *dev, unsigned int group)
+{
+	struct platform_msi_group_entry *platform_msi_group;
+
+	list_for_each_entry(platform_msi_group,
+			    dev_to_platform_msi_group_list((dev)), group_list) {
+		if (platform_msi_group->group_id == group) {
+			if (!list_empty(&platform_msi_group->entry_list)) {
+				struct msi_desc *desc;
+
+				desc = list_first_entry(&(platform_msi_group)->entry_list,
+							struct msi_desc, list);
+				platform_msi_free_priv_data(desc->platform.msi_priv_data);
+			}
+		}
 	}
-
-	msi_domain_free_irqs(dev->msi_domain, dev);
-	platform_msi_free_descs(dev, 0, MAX_DEV_MSIS);
+	msi_domain_free_irqs_group(dev->msi_domain, dev, group);
+	platform_msi_free_descs(dev, 0, MAX_DEV_MSIS, group);
 }
-EXPORT_SYMBOL_GPL(platform_msi_domain_free_irqs);
+EXPORT_SYMBOL_GPL(platform_msi_domain_free_irqs_group);
 
 /**
  * platform_msi_get_host_data - Query the private data associated with
@@ -373,15 +425,28 @@ void platform_msi_domain_free(struct irq_domain *domain, unsigned int virq,
 {
 	struct platform_msi_priv_data *data = domain->host_data;
 	struct msi_desc *desc, *tmp;
-	for_each_platform_msi_entry_safe(desc, tmp, data->dev) {
-		if (WARN_ON(!desc->irq || desc->nvec_used != 1))
-			return;
-		if (!(desc->irq >= virq && desc->irq < (virq + nvec)))
-			continue;
-
-		irq_domain_free_irqs_common(domain, desc->irq, 1);
-		list_del(&desc->list);
-		free_msi_entry(desc);
+	struct platform_msi_group_entry *platform_msi_group,
+					*tmp_platform_msi_group;
+
+	list_for_each_entry_safe(platform_msi_group, tmp_platform_msi_group,
+				 dev_to_platform_msi_group_list(data->dev),
+				 group_list) {
+		if (platform_msi_group->group_id == data->dev->group_id) {
+			list_for_each_entry_safe(desc, tmp,
+						&platform_msi_group->entry_list,
+						list) {
+				if (WARN_ON(!desc->irq || desc->nvec_used != 1))
+					return;
+				if (!(desc->irq >= virq && desc->irq < (virq + nvec)))
+					continue;
+
+				irq_domain_free_irqs_common(domain, desc->irq, 1);
+				list_del(&desc->list);
+				free_msi_entry(desc);
+			}
+			list_del(&platform_msi_group->group_list);
+			kfree(platform_msi_group);
+		}
 	}
 }
 
diff --git a/include/linux/device.h b/include/linux/device.h
index cbcecb14584e..f6700b85eb95 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -624,6 +624,7 @@ struct device {
     defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
 	bool			dma_coherent:1;
 #endif
+	unsigned int		group_id;
 };
 
 static inline struct device *kobj_to_dev(struct kobject *kobj)
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 9c15b7403694..3890b143b04d 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -135,6 +135,12 @@ enum platform_msi_type {
 	GEN_PLAT_MSI = 1,
 };
 
+struct platform_msi_group_entry {
+	unsigned int group_id;
+	struct list_head group_list;
+	struct list_head entry_list;
+};
+
 /* Helpers to hide struct msi_desc implementation details */
 #define msi_desc_to_dev(desc)		((desc)->dev)
 #define dev_to_msi_list(dev)		(&(dev)->msi_list)
@@ -145,21 +151,31 @@ enum platform_msi_type {
 #define for_each_msi_entry_safe(desc, tmp, dev)	\
 	list_for_each_entry_safe((desc), (tmp), dev_to_msi_list((dev)), list)
 
-#define dev_to_platform_msi_list(dev)	(&(dev)->platform_msi_list)
-#define first_platform_msi_entry(dev)		\
-	list_first_entry(dev_to_platform_msi_list((dev)), struct msi_desc, list)
-#define for_each_platform_msi_entry(desc, dev)	\
-	list_for_each_entry((desc), dev_to_platform_msi_list((dev)), list)
-#define for_each_platform_msi_entry_safe(desc, tmp, dev)	\
-	list_for_each_entry_safe((desc), (tmp), dev_to_platform_msi_list((dev)), list)
+#define dev_to_platform_msi_group_list(dev)    (&(dev)->platform_msi_list)
+
+#define first_platform_msi_group_entry(dev)				\
+	list_first_entry(dev_to_platform_msi_group_list((dev)),		\
+			 struct platform_msi_group_entry, group_list)
 
-#define first_msi_entry_common(dev)	\
-	list_first_entry_select((dev)->platform_msi_type, dev_to_platform_msi_list((dev)),	\
+#define platform_msi_current_group_entry_list(dev)			\
+	(&((list_last_entry(dev_to_platform_msi_group_list((dev)),	\
+			    struct platform_msi_group_entry,		\
+			    group_list))->entry_list))
+
+#define first_msi_entry_current_group(dev)				\
+	list_first_entry_select((dev)->platform_msi_type,		\
+				platform_msi_current_group_entry_list((dev)),	\
 				dev_to_msi_list((dev)), struct msi_desc, list)
 
-#define for_each_msi_entry_common(desc, dev)	\
-	list_for_each_entry_select((dev)->platform_msi_type, desc, dev_to_platform_msi_list((dev)), \
-				   dev_to_msi_list((dev)), list)	\
+#define for_each_msi_entry_current_group(desc, dev)			\
+	list_for_each_entry_select((dev)->platform_msi_type, desc,	\
+				   platform_msi_current_group_entry_list((dev)),\
+				   dev_to_msi_list((dev)), list)
+
+#define for_each_platform_msi_entry_in_group(desc, platform_msi_group, group, dev)	\
+	list_for_each_entry((platform_msi_group), dev_to_platform_msi_group_list((dev)), group_list)	\
+		if (((platform_msi_group)->group_id) == (group))			\
+			list_for_each_entry((desc), (&(platform_msi_group)->entry_list), list)
 
 #ifdef CONFIG_IRQ_MSI_IOMMU
 static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
@@ -363,6 +379,8 @@ struct irq_domain *msi_create_irq_domain(struct fwnode_handle *fwnode,
 int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 			  int nvec);
 void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
+void msi_domain_free_irqs_group(struct irq_domain *domain,
+				struct device *dev, unsigned int group);
 struct msi_domain_info *msi_get_domain_info(struct irq_domain *domain);
 
 struct irq_domain *platform_msi_create_irq_domain(struct fwnode_handle *fwnode,
@@ -371,6 +389,11 @@ struct irq_domain *platform_msi_create_irq_domain(struct fwnode_handle *fwnode,
 int platform_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
 				   const struct platform_msi_ops *platform_ops);
 void platform_msi_domain_free_irqs(struct device *dev);
+int platform_msi_domain_alloc_irqs_group(struct device *dev, unsigned int nvec,
+					 const struct platform_msi_ops *platform_ops,
+					 unsigned int *group_id);
+void platform_msi_domain_free_irqs_group(struct device *dev,
+					 unsigned int group_id);
 
 /* When an MSI domain is used as an intermediate domain */
 int msi_domain_prepare_irqs(struct irq_domain *domain, struct device *dev,
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index bc5f9e32387f..899ade394ec8 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -320,7 +320,7 @@ int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
 	struct msi_desc *desc;
 	int ret = 0;
 
-	for_each_msi_entry_common(desc, dev) {
+	for_each_msi_entry_current_group(desc, dev) {
 		/* Don't even try the multi-MSI brain damage. */
 		if (WARN_ON(!desc->irq || desc->nvec_used != 1)) {
 			ret = -EINVAL;
@@ -342,7 +342,7 @@ int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
 
 	if (ret) {
 		/* Mop up the damage */
-		for_each_msi_entry_common(desc, dev) {
+		for_each_msi_entry_current_group(desc, dev) {
 			if (!(desc->irq >= virq && desc->irq < (virq + nvec)))
 				continue;
 
@@ -383,7 +383,7 @@ static bool msi_check_reservation_mode(struct irq_domain *domain,
 	 * Checking the first MSI descriptor is sufficient. MSIX supports
 	 * masking and MSI does so when the maskbit is set.
 	 */
-	desc = first_msi_entry_common(dev);
+	desc = first_msi_entry_current_group(dev);
 	return desc->msi_attrib.is_msix || desc->msi_attrib.maskbit;
 }
 
@@ -411,7 +411,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 	if (ret)
 		return ret;
 
-	for_each_msi_entry_common(desc, dev) {
+	for_each_msi_entry_current_group(desc, dev) {
 		ops->set_desc(&arg, desc);
 
 		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
@@ -437,7 +437,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 
 	can_reserve = msi_check_reservation_mode(domain, info, dev);
 
-	for_each_msi_entry_common(desc, dev) {
+	for_each_msi_entry_current_group(desc, dev) {
 		virq = desc->irq;
 		if (desc->nvec_used == 1)
 			dev_dbg(dev, "irq %d for MSI\n", virq);
@@ -468,7 +468,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 	 * so request_irq() will assign the final vector.
 	 */
 	if (can_reserve) {
-		for_each_msi_entry_common(desc, dev) {
+		for_each_msi_entry_current_group(desc, dev) {
 			irq_data = irq_domain_get_irq_data(domain, desc->irq);
 			irqd_clr_activated(irq_data);
 		}
@@ -476,7 +476,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 	return 0;
 
 cleanup:
-	for_each_msi_entry_common(desc, dev) {
+	for_each_msi_entry_current_group(desc, dev) {
 		struct irq_data *irqd;
 
 		if (desc->irq == virq)
@@ -500,7 +500,34 @@ void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
 {
 	struct msi_desc *desc;
 
-	for_each_msi_entry_common(desc, dev) {
+	for_each_msi_entry_current_group(desc, dev) {
+		/*
+		 * We might have failed to allocate an MSI early
+		 * enough that there is no IRQ associated to this
+		 * entry. If that's the case, don't do anything.
+		 */
+		if (desc->irq) {
+			irq_domain_free_irqs(desc->irq, desc->nvec_used);
+			desc->irq = 0;
+		}
+	}
+}
+
+/**
+ * msi_domain_free_irqs_group - Free interrupts from an MSI interrupt @domain
+ * associated to @dev from a particular group
+ * @domain:	The domain managing the interrupts
+ * @dev:	Pointer to device struct of the device for which the interrupts
+ *		are freed
+ * @group:	group from which interrupts are to be freed
+ */
+void msi_domain_free_irqs_group(struct irq_domain *domain,
+				struct device *dev, unsigned int group)
+{
+	struct msi_desc *desc;
+	struct platform_msi_group_entry *platform_msi_group;
+
+	for_each_platform_msi_entry_in_group(desc, platform_msi_group, group, dev) {
 		/*
 		 * We might have failed to allocate an MSI early
 		 * enough that there is no IRQ associated to this



* [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
                   ` (2 preceding siblings ...)
  2020-04-21 23:34 ` [PATCH RFC 03/15] drivers/base: Allocate/free platform-msi interrupts by group Dave Jiang
@ 2020-04-21 23:34 ` Dave Jiang
  2020-04-23 20:11   ` Jason Gunthorpe
  2020-04-25 21:38   ` Thomas Gleixner
  2020-04-21 23:34 ` [PATCH RFC 05/15] ims-msi: Add mask/unmask routines Dave Jiang
                   ` (11 subsequent siblings)
  15 siblings, 2 replies; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:34 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

From: Megha Dey <megha.dey@linux.intel.com>

Add support for the creation of a new IMS irq domain. It creates a new
irq chip associated with the IMS domain and adds the necessary domain
operations to it.

Also, add a new config option, MSI_IMS, which must be enabled by any driver
that wants to use the IMS infrastructure.
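
The expectation is that the IMS domain is stacked on top of the interrupt
remapping domain, and devices that want IMS vectors point their msi_domain at
it; a sketch under that assumption (ir_domain being the parent IR domain):

	struct irq_domain *ims_domain;

	ims_domain = arch_create_ims_irq_domain(ir_domain, "IMS-FOO");
	if (!ims_domain)
		return -ENOMEM;

	/*
	 * IMS vectors for this device are now allocated through the
	 * platform-msi APIs against this domain.
	 */
	dev_set_msi_domain(dev, ims_domain);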

Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 arch/x86/include/asm/hw_irq.h    |    7 +++
 drivers/base/Kconfig             |    9 +++
 drivers/base/Makefile            |    1 
 drivers/base/ims-msi.c           |  100 ++++++++++++++++++++++++++++++++++++++
 drivers/base/platform-msi.c      |    6 +-
 drivers/vfio/mdev/mdev_core.c    |    6 ++
 drivers/vfio/mdev/mdev_private.h |    1 
 include/linux/mdev.h             |    3 +
 include/linux/msi.h              |    2 +
 9 files changed, 131 insertions(+), 4 deletions(-)
 create mode 100644 drivers/base/ims-msi.c

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 4154bc5f6a4e..2e355aa6ba50 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -62,6 +62,7 @@ enum irq_alloc_type {
 	X86_IRQ_ALLOC_TYPE_MSIX,
 	X86_IRQ_ALLOC_TYPE_DMAR,
 	X86_IRQ_ALLOC_TYPE_UV,
+	X86_IRQ_ALLOC_TYPE_IMS,
 };
 
 struct irq_alloc_info {
@@ -83,6 +84,12 @@ struct irq_alloc_info {
 			irq_hw_number_t	msi_hwirq;
 		};
 #endif
+#ifdef	CONFIG_MSI_IMS
+		struct {
+			struct device	*dev;
+			irq_hw_number_t	ims_hwirq;
+		};
+#endif
 #ifdef	CONFIG_X86_IO_APIC
 		struct {
 			int		ioapic_id;
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 5f0bc74d2409..877e0fdee013 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -209,4 +209,13 @@ config GENERIC_ARCH_TOPOLOGY
 	  appropriate scaling, sysfs interface for reading capacity values at
 	  runtime.
 
+config MSI_IMS
+	bool "Device Specific Interrupt Message Storage (IMS)"
+	depends on X86
+	select GENERIC_MSI_IRQ_DOMAIN
+	select IRQ_REMAP
+	help
+	  This allows device drivers to enable device specific
+	  interrupt message storage (IMS) besides standard MSI-X interrupts.
+
 endmenu
diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 157452080f3d..659b9b0c0b8a 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_SOC_BUS) += soc.o
 obj-$(CONFIG_PINCTRL) += pinctrl.o
 obj-$(CONFIG_DEV_COREDUMP) += devcoredump.o
 obj-$(CONFIG_GENERIC_MSI_IRQ_DOMAIN) += platform-msi.o
+obj-$(CONFIG_MSI_IMS) += ims-msi.o
 obj-$(CONFIG_GENERIC_ARCH_TOPOLOGY) += arch_topology.o
 
 obj-y			+= test/
diff --git a/drivers/base/ims-msi.c b/drivers/base/ims-msi.c
new file mode 100644
index 000000000000..738f6d153155
--- /dev/null
+++ b/drivers/base/ims-msi.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Support for Device Specific IMS interrupts.
+ *
+ * Copyright © 2019 Intel Corporation.
+ *
+ * Author: Megha Dey <megha.dey@intel.com>
+ */
+
+#include <linux/dmar.h>
+#include <linux/irq.h>
+#include <linux/mdev.h>
+#include <linux/pci.h>
+
+/*
+ * Determine if a dev is an mdev or not. Return NULL if it is not an mdev
+ * device. Return the mdev's parent dev on success.
+ */
+static inline struct device *mdev_to_parent(struct device *dev)
+{
+	struct device *ret = NULL;
+	struct device *(*fn)(struct device *dev);
+	struct bus_type *bus = symbol_get(mdev_bus_type);
+
+	if (bus && dev->bus == bus) {
+		fn = symbol_get(mdev_dev_to_parent_dev);
+		ret = fn(dev);
+		symbol_put(mdev_dev_to_parent_dev);
+		symbol_put(mdev_bus_type);
+	}
+
+	return ret;
+}
+
+static irq_hw_number_t dev_ims_get_hwirq(struct msi_domain_info *info,
+					 msi_alloc_info_t *arg)
+{
+	return arg->ims_hwirq;
+}
+
+static int dev_ims_prepare(struct irq_domain *domain, struct device *dev,
+			   int nvec, msi_alloc_info_t *arg)
+{
+	if (dev_is_mdev(dev))
+		dev = mdev_to_parent(dev);
+
+	init_irq_alloc_info(arg, NULL);
+	arg->dev = dev;
+	arg->type = X86_IRQ_ALLOC_TYPE_IMS;
+
+	return 0;
+}
+
+static void dev_ims_set_desc(msi_alloc_info_t *arg, struct msi_desc *desc)
+{
+	arg->ims_hwirq = platform_msi_calc_hwirq(desc);
+}
+
+static struct msi_domain_ops dev_ims_domain_ops = {
+	.get_hwirq	= dev_ims_get_hwirq,
+	.msi_prepare	= dev_ims_prepare,
+	.set_desc	= dev_ims_set_desc,
+};
+
+static struct irq_chip dev_ims_ir_controller = {
+	.name			= "IR-DEV-IMS",
+	.irq_ack		= irq_chip_ack_parent,
+	.irq_retrigger		= irq_chip_retrigger_hierarchy,
+	.irq_set_vcpu_affinity	= irq_chip_set_vcpu_affinity_parent,
+	.flags			= IRQCHIP_SKIP_SET_WAKE,
+	.irq_write_msi_msg	= platform_msi_write_msg,
+};
+
+static struct msi_domain_info ims_ir_domain_info = {
+	.flags		= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
+	.ops		= &dev_ims_domain_ops,
+	.chip		= &dev_ims_ir_controller,
+	.handler	= handle_edge_irq,
+	.handler_name	= "edge",
+};
+
+struct irq_domain *arch_create_ims_irq_domain(struct irq_domain *parent,
+					      const char *name)
+{
+	struct fwnode_handle *fn;
+	struct irq_domain *domain;
+
+	fn = irq_domain_alloc_named_fwnode(name);
+	if (!fn)
+		return NULL;
+
+	domain = msi_create_irq_domain(fn, &ims_ir_domain_info, parent);
+	if (!domain)
+		return NULL;
+
+	irq_domain_update_bus_token(domain, DOMAIN_BUS_PLATFORM_MSI);
+	irq_domain_free_fwnode(fn);
+
+	return domain;
+}
diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
index 2696aa75983b..59160e8cbfb1 100644
--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -31,12 +31,11 @@ struct platform_msi_priv_data {
 /* The devid allocator */
 static DEFINE_IDA(platform_msi_devid_ida);
 
-#ifdef GENERIC_MSI_DOMAIN_OPS
 /*
  * Convert an msi_desc to a globaly unique identifier (per-device
  * devid + msi_desc position in the msi_list).
  */
-static irq_hw_number_t platform_msi_calc_hwirq(struct msi_desc *desc)
+irq_hw_number_t platform_msi_calc_hwirq(struct msi_desc *desc)
 {
 	u32 devid;
 
@@ -45,6 +44,7 @@ static irq_hw_number_t platform_msi_calc_hwirq(struct msi_desc *desc)
 	return (devid << (32 - DEV_ID_SHIFT)) | desc->platform.msi_index;
 }
 
+#ifdef GENERIC_MSI_DOMAIN_OPS
 static void platform_msi_set_desc(msi_alloc_info_t *arg, struct msi_desc *desc)
 {
 	arg->desc = desc;
@@ -76,7 +76,7 @@ static void platform_msi_update_dom_ops(struct msi_domain_info *info)
 		ops->set_desc = platform_msi_set_desc;
 }
 
-static void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
+void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
 {
 	struct msi_desc *desc = irq_data_get_msi_desc(data);
 	struct platform_msi_priv_data *priv_data;
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index b558d4cfd082..cecc6a6bdbef 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -33,6 +33,12 @@ struct device *mdev_parent_dev(struct mdev_device *mdev)
 }
 EXPORT_SYMBOL(mdev_parent_dev);
 
+struct device *mdev_dev_to_parent_dev(struct device *dev)
+{
+	return to_mdev_device(dev)->parent->dev;
+}
+EXPORT_SYMBOL(mdev_dev_to_parent_dev);
+
 void *mdev_get_drvdata(struct mdev_device *mdev)
 {
 	return mdev->driver_data;
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index 7d922950caaf..c21f1305a76b 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -36,7 +36,6 @@ struct mdev_device {
 };
 
 #define to_mdev_device(dev)	container_of(dev, struct mdev_device, dev)
-#define dev_is_mdev(d)		((d)->bus == &mdev_bus_type)
 
 struct mdev_type {
 	struct kobject kobj;
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index 0ce30ca78db0..fa2344e239ef 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -144,5 +144,8 @@ void mdev_unregister_driver(struct mdev_driver *drv);
 struct device *mdev_parent_dev(struct mdev_device *mdev);
 struct device *mdev_dev(struct mdev_device *mdev);
 struct mdev_device *mdev_from_dev(struct device *dev);
+struct device *mdev_dev_to_parent_dev(struct device *dev);
+
+#define dev_is_mdev(dev) ((dev)->bus == symbol_get(mdev_bus_type))
 
 #endif /* MDEV_H */
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 3890b143b04d..80386468a7bc 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -418,6 +418,8 @@ int platform_msi_domain_alloc(struct irq_domain *domain, unsigned int virq,
 void platform_msi_domain_free(struct irq_domain *domain, unsigned int virq,
 			      unsigned int nvec);
 void *platform_msi_get_host_data(struct irq_domain *domain);
+irq_hw_number_t platform_msi_calc_hwirq(struct msi_desc *desc);
+void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg);
 #endif /* CONFIG_GENERIC_MSI_IRQ_DOMAIN */
 
 #ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
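
For reference, the hwirq that dev_ims_get_hwirq() hands back is the value
packed by platform_msi_calc_hwirq() above: with DEV_ID_SHIFT defined as 21
in platform-msi.c, the per-device devid sits in the upper bits and the
msi_index in the low 11 bits, which is where the MAX_DEV_MSIS limit of
2048 vectors per device comes from. A standalone sketch of the encoding
(example_pack_hwirq() is illustrative only, not part of the patch):

#define DEV_ID_SHIFT	21

static irq_hw_number_t example_pack_hwirq(u32 devid, u16 msi_index)
{
	/* devid in bits 31..11, msi_index in bits 10..0 */
	return (devid << (32 - DEV_ID_SHIFT)) | msi_index;
}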


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH RFC 05/15] ims-msi: Add mask/unmask routines
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
                   ` (3 preceding siblings ...)
  2020-04-21 23:34 ` [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain Dave Jiang
@ 2020-04-21 23:34 ` Dave Jiang
  2020-04-25 21:49   ` Thomas Gleixner
  2020-04-21 23:34 ` [PATCH RFC 06/15] ims-msi: Enable IMS interrupts Dave Jiang
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:34 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

From: Megha Dey <megha.dey@linux.intel.com>

Introduce the mask/unmask functions to be used as callbacks by the
IRQ chip associated with the IMS irqdomain.

Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 drivers/base/ims-msi.c      |   47 +++++++++++++++++++++++++++++++++++++++++++
 drivers/base/platform-msi.c |   12 -----------
 include/linux/msi.h         |   14 +++++++++++++
 3 files changed, 61 insertions(+), 12 deletions(-)

diff --git a/drivers/base/ims-msi.c b/drivers/base/ims-msi.c
index 738f6d153155..896a5a1b2252 100644
--- a/drivers/base/ims-msi.c
+++ b/drivers/base/ims-msi.c
@@ -7,11 +7,56 @@
  * Author: Megha Dey <megha.dey@intel.com>
  */
 
+#include <linux/device.h>
 #include <linux/dmar.h>
+#include <linux/export.h>
 #include <linux/irq.h>
 #include <linux/mdev.h>
+#include <linux/msi.h>
 #include <linux/pci.h>
 
+static u32 __dev_ims_desc_mask_irq(struct msi_desc *desc, u32 flag)
+{
+	u32 mask_bits = desc->platform.masked;
+	const struct platform_msi_ops *ops;
+
+	ops = desc->platform.msi_priv_data->ops;
+	if (!ops)
+		return 0;
+
+	if (flag) {
+		if (ops->irq_mask)
+			mask_bits = ops->irq_mask(desc);
+	} else {
+		if (ops->irq_unmask)
+			mask_bits = ops->irq_unmask(desc);
+	}
+
+	return mask_bits;
+}
+
+/**
+ * dev_ims_mask_irq - Generic irq chip callback to mask IMS interrupts
+ * @data: pointer to irqdata associated to that interrupt
+ */
+static void dev_ims_mask_irq(struct irq_data *data)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+
+	desc->platform.masked = __dev_ims_desc_mask_irq(desc, 1);
+}
+
+/**
+ * dev_ims_unmask_irq - Generic irq chip callback to unmask IMS interrupts
+ * @data: pointer to irqdata associated to that interrupt
+ */
+static void dev_ims_unmask_irq(struct irq_data *data)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+
+	desc->platform.masked = __dev_ims_desc_mask_irq(desc, 0);
+}
+
 /*
  * Determine if a dev is mdev or not. Return NULL if not mdev device.
  * Return mdev's parent dev if success.
@@ -69,6 +114,8 @@ static struct irq_chip dev_ims_ir_controller = {
 	.irq_set_vcpu_affinity	= irq_chip_set_vcpu_affinity_parent,
 	.flags			= IRQCHIP_SKIP_SET_WAKE,
 	.irq_write_msi_msg	= platform_msi_write_msg,
+	.irq_unmask             = dev_ims_unmask_irq,
+	.irq_mask               = dev_ims_mask_irq,
 };
 
 static struct msi_domain_info ims_ir_domain_info = {
diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
index 59160e8cbfb1..6d8840db4a85 100644
--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -16,18 +16,6 @@
 #define DEV_ID_SHIFT	21
 #define MAX_DEV_MSIS	(1 << (32 - DEV_ID_SHIFT))
 
-/*
- * Internal data structure containing a (made up, but unique) devid
- * and the callback to write the MSI message.
- */
-struct platform_msi_priv_data {
-	struct device			*dev;
-	void				*host_data;
-	msi_alloc_info_t		arg;
-	const struct platform_msi_ops	*ops;
-	int				devid;
-};
-
 /* The devid allocator */
 static DEFINE_IDA(platform_msi_devid_ida);
 
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 80386468a7bc..8b5f24bf3c47 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -33,10 +33,12 @@ typedef void (*irq_write_msi_msg_t)(struct msi_desc *desc,
  * platform_msi_desc - Platform device specific msi descriptor data
  * @msi_priv_data:	Pointer to platform private data
  * @msi_index:		The index of the MSI descriptor for multi MSI
+ * @masked:		mask bits
  */
 struct platform_msi_desc {
 	struct platform_msi_priv_data	*msi_priv_data;
 	u16				msi_index;
+	u32				masked;
 };
 
 /**
@@ -370,6 +372,18 @@ struct platform_msi_ops {
 	irq_write_msi_msg_t	write_msg;
 };
 
+/*
+ * Internal data structure containing a (made up, but unique) devid
+ * and the callback to write the MSI message.
+ */
+struct platform_msi_priv_data {
+	struct device			*dev;
+	void				*host_data;
+	msi_alloc_info_t		arg;
+	const struct platform_msi_ops	*ops;
+	int				devid;
+};
+
 int msi_domain_set_affinity(struct irq_data *data, const struct cpumask *mask,
 			    bool force);
 


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH RFC 06/15] ims-msi: Enable IMS interrupts
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
                   ` (4 preceding siblings ...)
  2020-04-21 23:34 ` [PATCH RFC 05/15] ims-msi: Add mask/unmask routines Dave Jiang
@ 2020-04-21 23:34 ` Dave Jiang
  2020-04-25 22:13   ` Thomas Gleixner
  2020-04-21 23:34 ` [PATCH RFC 07/15] Documentation: Interrupt Message store Dave Jiang
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:34 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

From: Megha Dey <megha.dey@linux.intel.com>

To enable IMS interrupts:

1. Create an IMS irqdomain (arch_create_ims_irq_domain()) associated
with the interrupt remapping unit.

2. Add 'IMS' to enum platform_msi_type to differentiate between the
actions required for the different types of platform-msi, currently
generic platform-msi and IMS.

Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 arch/x86/include/asm/irq_remapping.h |    6 ++++
 drivers/base/ims-msi.c               |   15 ++++++++++
 drivers/base/platform-msi.c          |   51 +++++++++++++++++++++++++---------
 drivers/iommu/intel-iommu.c          |    2 +
 drivers/iommu/intel_irq_remapping.c  |   31 +++++++++++++++++++--
 include/linux/intel-iommu.h          |    3 ++
 include/linux/msi.h                  |    9 ++++++
 7 files changed, 100 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index 4bc985f1e2e4..575e48c31b78 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -53,6 +53,12 @@ irq_remapping_get_irq_domain(struct irq_alloc_info *info);
 extern struct irq_domain *
 arch_create_remap_msi_irq_domain(struct irq_domain *par, const char *n, int id);
 
+/* Create IMS irqdomain, use @parent as the parent irqdomain. */
+#ifdef CONFIG_MSI_IMS
+extern struct irq_domain *arch_create_ims_irq_domain(struct irq_domain *parent,
+						     const char *name);
+#endif
+
 /* Get parent irqdomain for interrupt remapping irqdomain */
 static inline struct irq_domain *arch_get_ir_parent_domain(void)
 {
diff --git a/drivers/base/ims-msi.c b/drivers/base/ims-msi.c
index 896a5a1b2252..ac21088bcb83 100644
--- a/drivers/base/ims-msi.c
+++ b/drivers/base/ims-msi.c
@@ -14,6 +14,7 @@
 #include <linux/mdev.h>
 #include <linux/msi.h>
 #include <linux/pci.h>
+#include <asm/irq_remapping.h>
 
 static u32 __dev_ims_desc_mask_irq(struct msi_desc *desc, u32 flag)
 {
@@ -101,6 +102,20 @@ static void dev_ims_set_desc(msi_alloc_info_t *arg, struct msi_desc *desc)
 	arg->ims_hwirq = platform_msi_calc_hwirq(desc);
 }
 
+struct irq_domain *dev_get_ims_domain(struct device *dev)
+{
+	struct irq_alloc_info info;
+
+	if (dev_is_mdev(dev))
+		dev = mdev_to_parent(dev);
+
+	init_irq_alloc_info(&info, NULL);
+	info.type = X86_IRQ_ALLOC_TYPE_IMS;
+	info.dev = dev;
+
+	return irq_remapping_get_irq_domain(&info);
+}
+
 static struct msi_domain_ops dev_ims_domain_ops = {
 	.get_hwirq	= dev_ims_get_hwirq,
 	.msi_prepare	= dev_ims_prepare,
diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
index 6d8840db4a85..204ce8041c17 100644
--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -118,6 +118,8 @@ static void platform_msi_free_descs(struct device *dev, int base, int nvec,
 			kfree(platform_msi_group);
 		}
 	}
+
+	dev->platform_msi_type = 0;
 }
 
 static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
@@ -205,18 +207,22 @@ platform_msi_alloc_priv_data(struct device *dev, unsigned int nvec,
 	 * accordingly (which would impact the max number of MSI
 	 * capable devices).
 	 */
-	if (!dev->msi_domain || !platform_ops->write_msg || !nvec ||
-	    nvec > MAX_DEV_MSIS)
+	if (!platform_ops->write_msg || !nvec || nvec > MAX_DEV_MSIS)
 		return ERR_PTR(-EINVAL);
 
-	if (dev->msi_domain->bus_token != DOMAIN_BUS_PLATFORM_MSI) {
-		dev_err(dev, "Incompatible msi_domain, giving up\n");
-		return ERR_PTR(-EINVAL);
-	}
+	if (dev->platform_msi_type == GEN_PLAT_MSI) {
+		if (!dev->msi_domain)
+			return ERR_PTR(-EINVAL);
+
+		if (dev->msi_domain->bus_token != DOMAIN_BUS_PLATFORM_MSI) {
+			dev_err(dev, "Incompatible msi_domain, giving up\n");
+			return ERR_PTR(-EINVAL);
+		}
 
-	/* Already had a helping of MSI? Greed... */
-	if (!list_empty(platform_msi_current_group_entry_list(dev)))
-		return ERR_PTR(-EBUSY);
+		/* Already had a helping of MSI? Greed... */
+		if (!list_empty(platform_msi_current_group_entry_list(dev)))
+			return ERR_PTR(-EBUSY);
+	}
 
 	datap = kzalloc(sizeof(*datap), GFP_KERNEL);
 	if (!datap)
@@ -254,6 +260,7 @@ static void platform_msi_free_priv_data(struct platform_msi_priv_data *data)
 int platform_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
 				   const struct platform_msi_ops *platform_ops)
 {
+	dev->platform_msi_type = GEN_PLAT_MSI;
 	return platform_msi_domain_alloc_irqs_group(dev, nvec, platform_ops,
 									NULL);
 }
@@ -265,12 +272,18 @@ int platform_msi_domain_alloc_irqs_group(struct device *dev, unsigned int nvec,
 {
 	struct platform_msi_group_entry *platform_msi_group;
 	struct platform_msi_priv_data *priv_data;
+	struct irq_domain *domain;
 	int err;
 
-	dev->platform_msi_type = GEN_PLAT_MSI;
-
-	if (group_id)
+	if (!dev->platform_msi_type) {
 		*group_id = ++dev->group_id;
+		dev->platform_msi_type = IMS;
+		domain = dev_get_ims_domain(dev);
+		if (!domain)
+			return -ENOSYS;
+	} else {
+		domain = dev->msi_domain;
+	}
 
 	platform_msi_group = kzalloc(sizeof(*platform_msi_group), GFP_KERNEL);
 	if (!platform_msi_group) {
@@ -292,10 +305,11 @@ int platform_msi_domain_alloc_irqs_group(struct device *dev, unsigned int nvec,
 	if (err)
 		goto out_free_priv_data;
 
-	err = msi_domain_alloc_irqs(dev->msi_domain, dev, nvec);
+	err = msi_domain_alloc_irqs(domain, dev, nvec);
 	if (err)
 		goto out_free_desc;
 
+	dev->platform_msi_type = 0;
 	return 0;
 
 out_free_desc:
@@ -314,6 +328,7 @@ EXPORT_SYMBOL_GPL(platform_msi_domain_alloc_irqs_group);
  */
 void platform_msi_domain_free_irqs(struct device *dev)
 {
+	dev->platform_msi_type = GEN_PLAT_MSI;
 	platform_msi_domain_free_irqs_group(dev, 0);
 }
 EXPORT_SYMBOL_GPL(platform_msi_domain_free_irqs);
@@ -321,6 +336,14 @@ EXPORT_SYMBOL_GPL(platform_msi_domain_free_irqs);
 void platform_msi_domain_free_irqs_group(struct device *dev, unsigned int group)
 {
 	struct platform_msi_group_entry *platform_msi_group;
+	struct irq_domain *domain;
+
+	if (!dev->platform_msi_type) {
+		dev->platform_msi_type = IMS;
+		domain = dev_get_ims_domain(dev);
+	} else {
+		domain = dev->msi_domain;
+	}
 
 	list_for_each_entry(platform_msi_group,
 			    dev_to_platform_msi_group_list((dev)), group_list) {
@@ -334,7 +357,7 @@ void platform_msi_domain_free_irqs_group(struct device *dev, unsigned int group)
 			}
 		}
 	}
-	msi_domain_free_irqs_group(dev->msi_domain, dev, group);
+	msi_domain_free_irqs_group(domain, dev, group);
 	platform_msi_free_descs(dev, 0, MAX_DEV_MSIS, group);
 }
 EXPORT_SYMBOL_GPL(platform_msi_domain_free_irqs_group);
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index ef0a5246700e..99bb238caea6 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -794,7 +794,7 @@ is_downstream_to_pci_bridge(struct device *dev, struct device *bridge)
 	return false;
 }
 
-static struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn)
+struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn)
 {
 	struct dmar_drhd_unit *drhd = NULL;
 	struct intel_iommu *iommu;
diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 81e43c1df7ec..1e470c9c3e7d 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -234,6 +234,18 @@ static struct intel_iommu *map_dev_to_ir(struct pci_dev *dev)
 	return drhd->iommu;
 }
 
+static struct intel_iommu *map_gen_dev_to_ir(struct device *dev)
+{
+	u8 bus, devfn;
+
+	/* device_to_iommu() already returns NULL when no unit matches */
+	return device_to_iommu(dev, &bus, &devfn);
+}
+
 static int clear_entries(struct irq_2_iommu *irq_iommu)
 {
 	struct irte *start, *entry, *end;
@@ -572,6 +584,10 @@ static int intel_setup_irq_remapping(struct intel_iommu *iommu)
 		arch_create_remap_msi_irq_domain(iommu->ir_domain,
 						 "INTEL-IR-MSI",
 						 iommu->seq_id);
+#if IS_ENABLED(CONFIG_MSI_IMS)
+	iommu->ir_ims_domain = arch_create_ims_irq_domain(iommu->ir_domain,
+							  "INTEL-IR-IMS");
+#endif
 
 	ir_table->base = page_address(pages);
 	ir_table->bitmap = bitmap;
@@ -637,6 +653,10 @@ static void intel_teardown_irq_remapping(struct intel_iommu *iommu)
 			irq_domain_remove(iommu->ir_domain);
 			iommu->ir_domain = NULL;
 		}
+		if (iommu->ir_ims_domain) {
+			irq_domain_remove(iommu->ir_ims_domain);
+			iommu->ir_ims_domain = NULL;
+		}
 		free_pages((unsigned long)iommu->ir_table->base,
 			   INTR_REMAP_PAGE_ORDER);
 		bitmap_free(iommu->ir_table->bitmap);
@@ -1132,6 +1152,11 @@ static struct irq_domain *intel_get_irq_domain(struct irq_alloc_info *info)
 		if (iommu)
 			return iommu->ir_msi_domain;
 		break;
+	case X86_IRQ_ALLOC_TYPE_IMS:
+		iommu = map_gen_dev_to_ir(info->dev);
+		if (iommu)
+			return iommu->ir_ims_domain;
+		break;
 	default:
 		break;
 	}
@@ -1299,9 +1324,10 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
 	case X86_IRQ_ALLOC_TYPE_HPET:
 	case X86_IRQ_ALLOC_TYPE_MSI:
 	case X86_IRQ_ALLOC_TYPE_MSIX:
+	case X86_IRQ_ALLOC_TYPE_IMS:
 		if (info->type == X86_IRQ_ALLOC_TYPE_HPET)
 			set_hpet_sid(irte, info->hpet_id);
-		else
+		else if (info->type != X86_IRQ_ALLOC_TYPE_IMS)
 			set_msi_sid(irte, info->msi_dev);
 
 		msg->address_hi = MSI_ADDR_BASE_HI;
@@ -1354,7 +1380,8 @@ static int intel_irq_remapping_alloc(struct irq_domain *domain,
 	if (!info || !iommu)
 		return -EINVAL;
 	if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_MSI &&
-	    info->type != X86_IRQ_ALLOC_TYPE_MSIX)
+	    info->type != X86_IRQ_ALLOC_TYPE_MSIX &&
+	    info->type != X86_IRQ_ALLOC_TYPE_IMS)
 		return -EINVAL;
 
 	/*
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 980234ae0312..cdaab83001da 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -557,6 +557,7 @@ struct intel_iommu {
 	struct ir_table *ir_table;	/* Interrupt remapping info */
 	struct irq_domain *ir_domain;
 	struct irq_domain *ir_msi_domain;
+	struct irq_domain *ir_ims_domain;
 #endif
 	struct iommu_device iommu;  /* IOMMU core code handle */
 	int		node;
@@ -701,6 +702,8 @@ extern struct intel_iommu *intel_svm_device_to_iommu(struct device *dev);
 static inline void intel_svm_check(struct intel_iommu *iommu) {}
 #endif
 
+extern struct intel_iommu *device_to_iommu(struct device *dev,
+					   u8 *bus, u8 *devfn);
 #ifdef CONFIG_INTEL_IOMMU_DEBUGFS
 void intel_iommu_debugfs_init(void);
 #else
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 8b5f24bf3c47..2f8fa1391333 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -135,6 +135,7 @@ struct msi_desc {
 enum platform_msi_type {
 	NOT_PLAT_MSI = 0,
 	GEN_PLAT_MSI = 1,
+	IMS =	2,
 };
 
 struct platform_msi_group_entry {
@@ -454,4 +455,12 @@ static inline struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev)
 }
 #endif /* CONFIG_PCI_MSI_IRQ_DOMAIN */
 
+#ifdef CONFIG_MSI_IMS
+struct irq_domain *dev_get_ims_domain(struct device *dev);
+#else
+static inline struct irq_domain *dev_get_ims_domain(struct device *dev)
+{
+	return NULL;
+}
+#endif
 #endif /* LINUX_MSI_H */
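
Putting the pieces together, the flow a driver would follow with the
group API looks roughly like this (a sketch under the APIs introduced in
this series; my_ims_ops is the driver's platform_msi_ops from the
previous patch, and error handling is trimmed):

static int my_driver_enable_ims(struct device *dev, int nvec)
{
	int group_id, rc;

	/* routed to the per-IOMMU IMS irqdomain via dev_get_ims_domain() */
	rc = platform_msi_domain_alloc_irqs_group(dev, nvec, &my_ims_ops,
						  &group_id);
	if (rc)
		return rc;

	/* ... request_irq() on each descriptor in the new group ... */

	return group_id;
}

static void my_driver_disable_ims(struct device *dev, int group_id)
{
	platform_msi_domain_free_irqs_group(dev, group_id);
}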


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH RFC 07/15] Documentation: Interrupt Message store
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
                   ` (5 preceding siblings ...)
  2020-04-21 23:34 ` [PATCH RFC 06/15] ims-msi: Enable IMS interrupts Dave Jiang
@ 2020-04-21 23:34 ` Dave Jiang
  2020-04-23 20:04   ` Jason Gunthorpe
  2020-04-21 23:34 ` [PATCH RFC 08/15] vfio/mdev: Add a member for iommu domain in mdev_device Dave Jiang
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:34 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

From: Megha Dey <megha.dey@linux.intel.com>

Add documentation for interrupt message store. This documentation
describes the basics of Interrupt Message Store (IMS), the need to
introduce a new interrupt mechanism, implementation details in the
kernel, driver changes required to support IMS and the general
misconceptions and FAQs associated with IMS.

Currently the only consumer of the newly introduced IMS APIs is
Intel's Data Streaming Accelerator.

Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 Documentation/ims-howto.rst |  210 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 210 insertions(+)
 create mode 100644 Documentation/ims-howto.rst

diff --git a/Documentation/ims-howto.rst b/Documentation/ims-howto.rst
new file mode 100644
index 000000000000..a18de152b393
--- /dev/null
+++ b/Documentation/ims-howto.rst
@@ -0,0 +1,210 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+==========================
+The IMS Driver Guide HOWTO
+==========================
+
+:Authors: Megha Dey
+
+:Copyright: 2020 Intel Corporation
+
+About this guide
+================
+
+This guide describes the basics of Interrupt Message Store (IMS), the
+need to introduce a new interrupt mechanism, implementation details of
+IMS in the kernel, driver changes required to support IMS and the general
+misconceptions and FAQs associated with IMS.
+
+What is IMS?
+============
+
+Intel has introduced Scalable I/O Virtualization (SIOV)[1], which
+provides a scalable and lightweight approach to hardware-assisted I/O
+virtualization by overcoming many of the shortcomings of SR-IOV.
+
+SIOV shares I/O devices at a much finer granularity, the minimal sharable
+resource being the 'Assignable Device Interface' or ADI. Each ADI can
+support multiple interrupt messages and thus, we need a matching scalable
+interrupt mechanism to process these ADI interrupts. Interrupt Message
+Store or IMS is a new interrupt mechanism to meet such a demand.
+
+Why use IMS?
+============
+
+Until now, the maximum number of interrupts a device could support was 2048
+(using MSI-X). With IMS, there is no such restriction. A device can report
+support for SIOV (and hence IMS) to the kernel through the host device
+driver. Alternatively, if the kernel needs a generic way to discover these
+capabilities without host driver dependency, the PCIE Designated Vendor
+specific Extended capability (DVSEC) can be used ([1] Section 3.7).
+
+IMS is device-specific, which means that the programming of the interrupt
+messages (address/data pairs) is done in some device-specific way, and not
+by using a standard like PCI. Also, by using IMS, the device is free to
+choose where it wants to store the interrupt messages. This makes IMS
+highly flexible. Some devices may organise IMS as a table in device memory
+(like MSI-X) which can be accessed through one or more memory mapped system
+pages; some implementations may organise it in a distributed/replicated
+fashion at each of the “engines” in the device (as with future multi-tile
+devices); and context-based devices (GPUs, for instance) can keep it in
+memory, as part of the supervisory state of a command/context, which the
+hosting function can fetch and cache on demand at the device. Since the
+number of contexts cannot be determined at boot
+time, there cannot be a standard enumeration of the IMS size during boot.
+In any approach, devices may implement IMS as either one unified storage
+structure or de-centralized per ADI storage structures.
+
+Even though the IMS storage organisation is device-specific, IMS entries
+store and generate interrupts using the same interrupt message address and
+data format as the PCI Express MSI-X table entries, a DWORD size data
+payload and a 64-bit address. Interrupt messages are expected to be
+programmed only by the host driver. All the IMS interrupt messages are
+stored in the remappable format. Hence, if a driver enables IMS, interrupt
+remapping is also enabled by default.
+
+A device can support both MSI-X and IMS entries simultaneously, each being
+used for a different purpose. E.g., MSI-X can be used to report device-level
+errors while IMS is used for software-constructed devices created for native
+or guest use.
+
+Implementation of IMS in the kernel
+===================================
+
+The Linux kernel today already provides a generic mechanism to support
+non-PCI compliant MSI interrupts for platform devices (platform-msi.c).
+To support IMS interrupts, we create a new IMS IRQ domain and extend the
+existing infrastructure. Dynamic allocation of IMS vectors is a requirement
+for devices which support Scalable I/O Virtualization. A driver can allocate
+and free vectors not just once during probe (as was the case with MSI/MSI-X)
+but also in the post-probe phase where the actual demand is known. Thus, a
+new API, platform_msi_domain_alloc_irqs_group, is introduced, which drivers
+using IMS can call multiple times. The vectors allocated each
+time this API is called are associated with a group ID. To free the vectors
+associated with a particular group, the platform_msi_domain_free_irqs_group
+API can be called. The existing drivers using platform-msi infrastructure
+will continue to use the existing alloc (platform_msi_domain_alloc_irqs)
+and free (platform_msi_domain_free_irqs) APIs and are assigned a default
+group ID of 0.
+
+Thus, platform-msi.c provides the generic methods which can be used by any
+non-PCI MSI interrupt type, while the newly created ims-msi.c provides
+IMS-specific callbacks that can be used by drivers capable of generating
+IMS interrupts. Intel has introduced the Data Streaming Accelerator
+(DSA)[2], a device which supports SIOV and thus IMS. Currently, only the
+Intel data accelerator (idxd) driver is a consumer of this feature.
+
+FAQs and general misconceptions:
+================================
+
+** There were some concerns raised by Thomas Gleixner and Marc Zyngier
+during Linux plumbers conference 2019:
+
+1. Enumeration of IMS needs to be done by PCI core code and not by
+   individual device drivers:
+
+   Currently, if the kernel needs a generic way to discover IMS capability
+   without host driver dependency, the PCIE Designated Vendor specific
+   Extended capability (DVSEC) can be used ([1] Section 3.7).
+
+   However, we cannot have a standard way of enumerating the IMS size
+   because for context based devices, the interrupt message is part of
+   the context itself which is managed entirely by the driver. Since
+   context creation is done on demand, there is no way to tell at boot
+   time the maximum number of contexts (and hence the number of interrupt
+   messages) that the device can support.
+
+   Also, this seems redundant (given only the driver will use this
+   information). Hence, we thought it may suffice to enumerate it as part
+   of the driver callback interfaces. In the current Linux code, even with
+   MSI-X, the size reported by the MSI-X capability is used only to
+   cross-check whether the driver is asking for more than that (and if so,
+   fail the call).
+
+   However, if you believe it would be useful, we can add the IMS size
+   enumeration to the SIOV DVSEC capability.
+
+   Perhaps there is a misunderstanding about what IMS serves. IMS is not a
+   system-wide interrupt solution which serves all devices; it is a
+   self-serving device level interrupt mechanism (other than using system
+   vector resources). Since both producer and consumer of IMS belong to
+   the same device driver, there wouldn't be any ordering problem. Whereas,
+   if IMS service is provided by one driver which serves multiple drivers,
+   there would be ordering problems to solve.
+
+Some other commonly asked questions about IMS are as follows:
+
+1. Do all SIOV devices support MSI-X (even if they have IMS)?
+
+   Yes, all SIOV hosting functions are expected to have MSI-X capability
+   (irrespective of whether they support IMS or not). This is done for
+   compatibility reasons, because a SIOV hosting function can be used
+   without enabling any SIOV capabilities as a standard PCIe PF.
+
+2. Why is Intel designing a new interrupt mechanism rather than extending
+   MSI-X to address its limitations? Isn't 2048 device interrupts enough?
+
+   MSI-X has a rigid definition of one-table and on-device storage and does
+   not provide the full flexibility required for future multi-tile
+   accelerator designs.
+   IMS was envisioned to be used with a large number of ADIs in devices where
+   each ADI needs unique interrupt resources. For example, a DSA shared
+   work queue can support a large number of clients where each client can
+   have its own interrupt. In the future, with user interrupts, we expect the
+   demand for messages to increase further.
+
+3. Will there be devices which only support IMS in the future?
+
+   No. All Scalable IOV devices will support MSI-X, but the number of MSI-X
+   table entries may be limited compared to the number of IMS entries. Device
+   designs can restrict the number of interrupt messages supported with
+   MSI-X (e.g., support only what is required for the base PF function
+   without SIOV) and offer interrupt message scalability only through
+   IMS. For example, DSA supports only 9 messages with MSI-X but 2K messages
+   with IMS.
+
+Device Driver Changes:
+======================
+
+1. platform_msi_domain_alloc_irqs_group(struct device *dev, unsigned int
+   nvec, const struct platform_msi_ops *platform_ops, int *group_id)
+   to allocate IMS interrupts, where:
+
+   dev: The device for which to allocate interrupts
+   nvec: The number of interrupts to allocate
+   platform_ops: Callbacks for platform MSI ops (to be provided by driver)
+   group_id: returned by the call, to be used to free IRQs of a certain type
+
+   e.g.: static struct platform_msi_ops ims_ops = {
+        .irq_mask               = ims_irq_mask,
+        .irq_unmask             = ims_irq_unmask,
+        .write_msg              = ims_write_msg,
+        };
+
+        int group;
+        platform_msi_domain_alloc_irqs_group(dev, nvec, &ims_ops, &group);
+
+   where, in struct platform_msi_ops:
+   irq_mask:   mask an interrupt source
+   irq_unmask: unmask an interrupt source
+   write_msg:  write the message content
+
+   This API can be called multiple times. Each call associates a new group
+   with the allocated vectors. Group IDs start from 0.
+
+2. platform_msi_domain_free_irqs_group(struct device *dev, int group) to
+   free IMS interrupts from a particular group
+
+3. To traverse the msi_descs associated with a group:
+        struct device *device;
+        struct msi_desc *desc;
+        struct platform_msi_group_entry *platform_msi_group;
+        int group;
+
+        for_each_platform_msi_entry_in_group(desc, platform_msi_group, group, dev) {
+        }
+
+References:
+===========
+
+[1]https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
+[2]https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
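
As a concrete complement to point 3 in the Device Driver Changes section
above, a driver would typically request a handler for each vector while
walking a group. A sketch modeled on the idxd usage later in this series
(my_handler and my_ctx are placeholders):

        struct platform_msi_group_entry *platform_msi_group;
        struct msi_desc *desc;
        int rc;

        for_each_platform_msi_entry_in_group(desc, platform_msi_group, group, dev) {
                rc = devm_request_irq(dev, desc->irq, my_handler, 0,
                                      "my-ims", my_ctx);
                if (rc)
                        return rc;
        }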


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH RFC 08/15] vfio/mdev: Add a member for iommu domain in mdev_device
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
                   ` (6 preceding siblings ...)
  2020-04-21 23:34 ` [PATCH RFC 07/15] Documentation: Interrupt Message store Dave Jiang
@ 2020-04-21 23:34 ` Dave Jiang
  2020-04-21 23:34 ` [PATCH RFC 09/15] vfio/type1: Save domain when attach domain to mdev Dave Jiang
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:34 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

From: Lu Baolu <baolu.lu@linux.intel.com>

This adds a member to save the iommu domain in the mdev_device
structure. Whenever an iommu domain is attached to the
mediated device, it must be saved here so that a VDCM
(Virtual Device Control Module) can retrieve it.

Below member is added in struct mdev_device:
* iommu_domain
  - A place to save the iommu domain attached to this
    mdev.

Below helpers are added to set and get iommu domain in
struct mdev_device.
* mdev_set/get_iommu_domain(domain)
  - An iommu domain which has been attached to the iommu
    device in order to protect and isolate the mediated
    device will be kept in the mdev data structure and
    can be retrieved later.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Liu Yi L <yi.l.liu@intel.com>
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/vfio/mdev/mdev_core.c    |   16 ++++++++++++++++
 drivers/vfio/mdev/mdev_private.h |    1 +
 include/linux/mdev.h             |   10 ++++++++++
 3 files changed, 27 insertions(+)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index cecc6a6bdbef..15863cf83f3f 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -410,6 +410,22 @@ struct device *mdev_get_iommu_device(struct device *dev)
 }
 EXPORT_SYMBOL(mdev_get_iommu_device);
 
+void mdev_set_iommu_domain(struct device *dev, void *domain)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	mdev->iommu_domain = domain;
+}
+EXPORT_SYMBOL(mdev_set_iommu_domain);
+
+void *mdev_get_iommu_domain(struct device *dev)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	return mdev->iommu_domain;
+}
+EXPORT_SYMBOL(mdev_get_iommu_domain);
+
 static int __init mdev_init(void)
 {
 	return mdev_bus_register();
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index c21f1305a76b..c97478b22a02 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -32,6 +32,7 @@ struct mdev_device {
 	struct list_head next;
 	struct kobject *type_kobj;
 	struct device *iommu_device;
+	void *iommu_domain;
 	bool active;
 };
 
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index fa2344e239ef..0d66daaecc67 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -26,6 +26,16 @@ int mdev_set_iommu_device(struct device *dev, struct device *iommu_device);
 
 struct device *mdev_get_iommu_device(struct device *dev);
 
+/*
+ * Called by vfio iommu modules to save the iommu domain after a domain is
+ * attached to the mediated device. The vDCM (virtual device control module)
+ * can call mdev_get_iommu_domain() to retrieve an auxiliary domain attached
+ * to an mdev.
+ */
+void mdev_set_iommu_domain(struct device *dev, void *domain);
+
+void *mdev_get_iommu_domain(struct device *dev);
+
 /**
  * struct mdev_parent_ops - Structure to be registered for each parent device to
  * register the device to mdev module.
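
A sketch of the consumer side this enables (hypothetical VDCM snippet;
iommu_aux_get_pasid() is the existing aux-domain API, while
parent_iommu_device and the surrounding function are placeholders):

	struct device *dev = mdev_dev(mdev);
	struct iommu_domain *domain;
	int pasid;

	/* retrieve the aux domain that vfio attached to this mdev */
	domain = mdev_get_iommu_domain(dev);
	if (!domain)
		return -ENODEV;

	/* e.g. look up the PASID to program into the parent device */
	pasid = iommu_aux_get_pasid(domain, parent_iommu_device);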


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH RFC 09/15] vfio/type1: Save domain when attach domain to mdev
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
                   ` (7 preceding siblings ...)
  2020-04-21 23:34 ` [PATCH RFC 08/15] vfio/mdev: Add a member for iommu domain in mdev_device Dave Jiang
@ 2020-04-21 23:34 ` Dave Jiang
  2020-04-21 23:34 ` [PATCH RFC 10/15] dmaengine: idxd: add config support for readonly devices Dave Jiang
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:34 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

From: Lu Baolu <baolu.lu@linux.intel.com>

This saves the iommu domain in the mdev on attaching a domain
to it and clears it on detaching.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/vfio/vfio_iommu_type1.c |   52 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 85b32c325282..40b22c456b06 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1309,20 +1309,62 @@ static struct device *vfio_mdev_get_iommu_device(struct device *dev)
 	return NULL;
 }
 
+static int vfio_mdev_set_domain(struct device *dev, struct iommu_domain *domain)
+{
+	void (*fn)(struct device *dev, void *domain);
+
+	fn = symbol_get(mdev_set_iommu_domain);
+	if (fn) {
+		fn(dev, domain);
+		symbol_put(mdev_set_iommu_domain);
+
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static struct iommu_domain *vfio_mdev_get_domain(struct device *dev)
+{
+	void *(*fn)(struct device *dev);
+
+	fn = symbol_get(mdev_get_iommu_domain);
+	if (fn) {
+		struct iommu_domain *domain;
+
+		domain = fn(dev);
+		symbol_put(mdev_get_iommu_domain);
+
+		return domain;
+	}
+
+	return NULL;
+}
+
 static int vfio_mdev_attach_domain(struct device *dev, void *data)
 {
-	struct iommu_domain *domain = data;
+	struct iommu_domain *domain;
 	struct device *iommu_device;
+	int ret = -ENODEV;
+
+	/* Only a single domain is allowed to attach to an mdev. */
+	domain = vfio_mdev_get_domain(dev);
+	if (domain)
+		return -EINVAL;
+	domain = data;
 
 	iommu_device = vfio_mdev_get_iommu_device(dev);
 	if (iommu_device) {
 		if (iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
-			return iommu_aux_attach_device(domain, iommu_device);
+			ret = iommu_aux_attach_device(domain, iommu_device);
 		else
-			return iommu_attach_device(domain, iommu_device);
+			ret = iommu_attach_device(domain, iommu_device);
 	}
 
-	return -EINVAL;
+	if (!ret)
+		vfio_mdev_set_domain(dev, domain);
+
+	return ret;
 }
 
 static int vfio_mdev_detach_domain(struct device *dev, void *data)
@@ -1338,6 +1380,8 @@ static int vfio_mdev_detach_domain(struct device *dev, void *data)
 			iommu_detach_device(domain, iommu_device);
 	}
 
+	vfio_mdev_set_domain(dev, NULL);
+
 	return 0;
 }
 


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH RFC 10/15] dmaengine: idxd: add config support for readonly devices
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
                   ` (8 preceding siblings ...)
  2020-04-21 23:34 ` [PATCH RFC 09/15] vfio/type1: Save domain when attach domain to mdev Dave Jiang
@ 2020-04-21 23:34 ` Dave Jiang
  2020-04-21 23:34 ` [PATCH RFC 11/15] dmaengine: idxd: add IMS support in base driver Dave Jiang
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:34 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

The device can have a read-only bit set for its configuration. This is
especially true for software-emulated mediated devices in a guest. Add
support to load the configuration when the device config is read-only.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/device.c |  159 ++++++++++++++++++++++++++++++++++++++++++++-
 drivers/dma/idxd/idxd.h   |    2 +
 drivers/dma/idxd/init.c   |    6 ++
 drivers/dma/idxd/sysfs.c  |   45 +++++++++----
 4 files changed, 194 insertions(+), 18 deletions(-)

diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 684a0e167770..a46b6558984c 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -603,6 +603,36 @@ static int idxd_groups_config_write(struct idxd_device *idxd)
 	return 0;
 }
 
+static int idxd_wq_config_write_ro(struct idxd_wq *wq)
+{
+	struct idxd_device *idxd = wq->idxd;
+	int wq_offset;
+
+	if (!wq->group)
+		return 0;
+
+	if (idxd->pasid_enabled) {
+		wq->wqcfg.pasid_en = 1;
+		if (wq->type == IDXD_WQT_KERNEL && wq_dedicated(wq))
+			wq->wqcfg.pasid = idxd->pasid;
+	} else {
+		wq->wqcfg.pasid_en = 0;
+	}
+
+	if (wq->type == IDXD_WQT_KERNEL)
+		wq->wqcfg.priv = 1;
+
+	if (idxd->type == IDXD_TYPE_DSA &&
+	    idxd->hw.gen_cap.block_on_fault &&
+	    test_bit(WQ_FLAG_BOF, &wq->flags))
+		wq->wqcfg.bof = 1;
+
+	wq_offset = idxd->wqcfg_offset + wq->id * 32 + 2 * sizeof(u32);
+	iowrite32(wq->wqcfg.bits[2], idxd->reg_base + wq_offset);
+
+	return 0;
+}
+
 static int idxd_wq_config_write(struct idxd_wq *wq)
 {
 	struct idxd_device *idxd = wq->idxd;
@@ -633,7 +663,8 @@ static int idxd_wq_config_write(struct idxd_wq *wq)
 
 	if (idxd->pasid_enabled) {
 		wq->wqcfg.pasid_en = 1;
-		wq->wqcfg.pasid = idxd->pasid;
+		if (wq->type == IDXD_WQT_KERNEL && wq_dedicated(wq))
+			wq->wqcfg.pasid = idxd->pasid;
 	}
 
 	wq->wqcfg.priority = wq->priority;
@@ -658,14 +689,17 @@ static int idxd_wq_config_write(struct idxd_wq *wq)
 	return 0;
 }
 
-static int idxd_wqs_config_write(struct idxd_device *idxd)
+static int idxd_wqs_config_write(struct idxd_device *idxd, bool rw)
 {
 	int i, rc;
 
 	for (i = 0; i < idxd->max_wqs; i++) {
 		struct idxd_wq *wq = &idxd->wqs[i];
 
-		rc = idxd_wq_config_write(wq);
+		if (rw)
+			rc = idxd_wq_config_write(wq);
+		else
+			rc = idxd_wq_config_write_ro(wq);
 		if (rc < 0)
 			return rc;
 	}
@@ -764,6 +798,12 @@ static int idxd_wqs_setup(struct idxd_device *idxd)
 	return 0;
 }
 
+int idxd_device_ro_config(struct idxd_device *idxd)
+{
+	lockdep_assert_held(&idxd->dev_lock);
+	return idxd_wqs_config_write(idxd, false);
+}
+
 int idxd_device_config(struct idxd_device *idxd)
 {
 	int rc;
@@ -779,7 +819,7 @@ int idxd_device_config(struct idxd_device *idxd)
 
 	idxd_group_flags_setup(idxd);
 
-	rc = idxd_wqs_config_write(idxd);
+	rc = idxd_wqs_config_write(idxd, true);
 	if (rc < 0)
 		return rc;
 
@@ -789,3 +829,114 @@ int idxd_device_config(struct idxd_device *idxd)
 
 	return 0;
 }
+
+static void idxd_wq_load_config(struct idxd_wq *wq)
+{
+	struct idxd_device *idxd = wq->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	int wqcfg_offset;
+	int i;
+
+	wqcfg_offset = idxd->wqcfg_offset + wq->id * 32;
+	memcpy_fromio(&wq->wqcfg, idxd->reg_base + wqcfg_offset,
+		      sizeof(union wqcfg));
+
+	wq->size = wq->wqcfg.wq_size;
+	wq->threshold = wq->wqcfg.wq_thresh;
+	if (wq->wqcfg.priv)
+		wq->type = IDXD_WQT_KERNEL;
+
+	if (wq->wqcfg.mode)
+		set_bit(WQ_FLAG_DEDICATED, &wq->flags);
+
+	wq->priority = wq->wqcfg.priority;
+
+	if (wq->wqcfg.bof)
+		set_bit(WQ_FLAG_BOF, &wq->flags);
+
+	if (idxd->pasid_enabled) {
+		wq->wqcfg.pasid_en = 1;
+		wqcfg_offset = idxd->wqcfg_offset +
+			       wq->id * 32 + 2 * sizeof(u32);
+		iowrite32(wq->wqcfg.bits[2], idxd->reg_base + wqcfg_offset);
+	}
+
+	for (i = 0; i < 8; i++) {
+		wqcfg_offset = idxd->wqcfg_offset +
+			       wq->id * 32 + i * sizeof(u32);
+		dev_dbg(dev, "WQ[%d][%d][%#x]: %#x\n",
+			wq->id, i, wqcfg_offset, wq->wqcfg.bits[i]);
+	}
+}
+
+static void idxd_group_load_config(struct idxd_group *group)
+{
+	struct idxd_device *idxd = group->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	int i, j, grpcfg_offset;
+
+	/* load wqs */
+	for (i = 0; i < 4; i++) {
+		struct idxd_wq *wq;
+
+		grpcfg_offset = idxd->grpcfg_offset +
+				group->id * 64 + i * sizeof(u64);
+		group->grpcfg.wqs[i] =
+			ioread64(idxd->reg_base + grpcfg_offset);
+		dev_dbg(dev, "GRPCFG wq[%d:%d: %#x]: %#llx\n",
+			group->id, i, grpcfg_offset,
+			group->grpcfg.wqs[i]);
+
+		/* each grpcfg.wqs[] word is a 64-bit WQ bitmap */
+		for (j = 0; j < 64; j++) {
+			int wq_id = i * 64 + j;
+
+			if (wq_id >= idxd->max_wqs)
+				break;
+			if (group->grpcfg.wqs[i] & BIT(j)) {
+				wq = &idxd->wqs[wq_id];
+				wq->group = group;
+			}
+		}
+	}
+
+	grpcfg_offset = idxd->grpcfg_offset + group->id * 64 + 32;
+	group->grpcfg.engines = ioread64(idxd->reg_base + grpcfg_offset);
+	dev_dbg(dev, "GRPCFG engs[%d: %#x]: %#llx\n", group->id,
+		grpcfg_offset, group->grpcfg.engines);
+
+	/* the engines bitmap is likewise a 64-bit word */
+	for (i = 0; i < 64; i++) {
+		if (i >= idxd->max_engines)
+			break;
+		if (group->grpcfg.engines & BIT(i)) {
+			struct idxd_engine *engine = &idxd->engines[i];
+
+			engine->group = group;
+		}
+	}
+
+	grpcfg_offset = idxd->grpcfg_offset + group->id * 64 + 40;
+	group->grpcfg.flags.bits = ioread32(idxd->reg_base + grpcfg_offset);
+	dev_dbg(dev, "GRPFLAGS flags[%d: %#x]: %#x\n",
+		group->id, grpcfg_offset, group->grpcfg.flags.bits);
+}
+
+void idxd_device_load_config(struct idxd_device *idxd)
+{
+	union gencfg_reg reg;
+	int i;
+
+	reg.bits = ioread32(idxd->reg_base + IDXD_GENCFG_OFFSET);
+	idxd->token_limit = reg.token_limit;
+
+	for (i = 0; i < idxd->max_groups; i++) {
+		struct idxd_group *group = &idxd->groups[i];
+
+		idxd_group_load_config(group);
+	}
+
+	for (i = 0; i < idxd->max_wqs; i++) {
+		struct idxd_wq *wq = &idxd->wqs[i];
+
+		idxd_wq_load_config(wq);
+	}
+}
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 304b76169c0d..82a9b6035722 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -286,8 +286,10 @@ int idxd_device_reset(struct idxd_device *idxd);
 int __idxd_device_reset(struct idxd_device *idxd);
 void idxd_device_cleanup(struct idxd_device *idxd);
 int idxd_device_config(struct idxd_device *idxd);
+int idxd_device_ro_config(struct idxd_device *idxd);
 void idxd_device_wqs_clear_state(struct idxd_device *idxd);
 int idxd_device_drain_pasid(struct idxd_device *idxd, int pasid);
+void idxd_device_load_config(struct idxd_device *idxd);
 
 /* work queue control */
 int idxd_wq_alloc_resources(struct idxd_wq *wq);
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index f794ee1c7c1b..c0fd796e9dce 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -367,6 +367,12 @@ static int idxd_probe(struct idxd_device *idxd)
 	if (rc)
 		goto err_setup;
 
+	/* If the configs are readonly, then load them from device */
+	if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags)) {
+		dev_dbg(dev, "Loading RO device config\n");
+		idxd_device_load_config(idxd);
+	}
+
 	rc = idxd_setup_interrupts(idxd);
 	if (rc)
 		goto err_setup;
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index dc38172be42e..1dd3ade2e438 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -120,13 +120,17 @@ static int idxd_config_bus_probe(struct device *dev)
 
 		spin_lock_irqsave(&idxd->dev_lock, flags);
 
-		/* Perform IDXD configuration and enabling */
-		rc = idxd_device_config(idxd);
-		if (rc < 0) {
-			spin_unlock_irqrestore(&idxd->dev_lock, flags);
-			module_put(THIS_MODULE);
-			dev_warn(dev, "Device config failed: %d\n", rc);
-			return rc;
+		if (test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags)) {
+			/* Perform DSA configuration and enabling */
+			rc = idxd_device_config(idxd);
+			if (rc < 0) {
+				spin_unlock_irqrestore(&idxd->dev_lock,
+						       flags);
+				module_put(THIS_MODULE);
+				dev_warn(dev, "Device config failed: %d\n",
+					 rc);
+				return rc;
+			}
 		}
 
 		/* start device */
@@ -211,13 +215,26 @@ static int idxd_config_bus_probe(struct device *dev)
 		}
 
 		spin_lock_irqsave(&idxd->dev_lock, flags);
-		rc = idxd_device_config(idxd);
-		if (rc < 0) {
-			spin_unlock_irqrestore(&idxd->dev_lock, flags);
-			mutex_unlock(&wq->wq_lock);
-			dev_warn(dev, "Writing WQ %d config failed: %d\n",
-				 wq->id, rc);
-			return rc;
+		if (test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags)) {
+			rc = idxd_device_config(idxd);
+			if (rc < 0) {
+				spin_unlock_irqrestore(&idxd->dev_lock,
+						       flags);
+				mutex_unlock(&wq->wq_lock);
+				dev_warn(dev, "Writing WQ %d config failed: %d\n",
+					 wq->id, rc);
+				return rc;
+			}
+		} else {
+			rc = idxd_device_ro_config(idxd);
+			if (rc < 0) {
+				spin_unlock_irqrestore(&idxd->dev_lock,
+						       flags);
+				mutex_unlock(&wq->wq_lock);
+				dev_warn(dev, "Writing WQ %d config failed: %d\n",
+					 wq->id, rc);
+				return rc;
+			}
 		}
 
 		rc = idxd_wq_enable(wq);
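
For orientation, the register arithmetic used throughout this patch: each
WQ owns a 32-byte WQCFG record (eight 32-bit words) starting at
wqcfg_offset + wq->id * 32, and the read-only path rewrites only word 2,
which carries the pasid_en/pasid, priv and bof bits. A sketch of the
addressing (the helper name is illustrative, not part of the patch):

/* byte offset of 32-bit word 'n' within the WQCFG record of wq 'wq_id' */
static int example_wqcfg_offset(struct idxd_device *idxd, int wq_id, int n)
{
	return idxd->wqcfg_offset + wq_id * 32 + n * sizeof(u32);
}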


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH RFC 11/15] dmaengine: idxd: add IMS support in base driver
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
                   ` (9 preceding siblings ...)
  2020-04-21 23:34 ` [PATCH RFC 10/15] dmaengine: idxd: add config support for readonly devices Dave Jiang
@ 2020-04-21 23:34 ` Dave Jiang
  2020-04-21 23:35 ` [PATCH RFC 12/15] dmaengine: idxd: add device support functions in prep for mdev Dave Jiang
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:34 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

In preparation for VFIO mediated device support in the idxd driver, add
enabling for Interrupt Message Store (IMS) interrupts to the idxd base
driver. Until now, the maximum number of interrupts a device could
support was 2048 (MSI-X). With IMS, the maximum number of interrupts can
be expanded significantly for guest support. This commit only provides
the support functions in the base driver, not their use by the VFIO
mdev code.

See Intel SIOV spec for more details:
https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/Kconfig       |    1 
 drivers/dma/idxd/Makefile |    2 -
 drivers/dma/idxd/cdev.c   |    3 +
 drivers/dma/idxd/idxd.h   |   21 ++++-
 drivers/dma/idxd/init.c   |   46 +++++++++++-
 drivers/dma/idxd/mdev.c   |  179 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/dma/idxd/mdev.h   |   82 +++++++++++++++++++++
 drivers/dma/idxd/submit.c |    3 +
 drivers/dma/idxd/sysfs.c  |   11 +++
 9 files changed, 340 insertions(+), 8 deletions(-)
 create mode 100644 drivers/dma/idxd/mdev.c
 create mode 100644 drivers/dma/idxd/mdev.h

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 71ea9f24a8f9..9e7d9eafb1f5 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -290,6 +290,7 @@ config INTEL_IDXD
 	select PCI_PRI
 	select PCI_PASID
 	select PCI_IOV
+	select MSI_IMS
 	help
 	  Enable support for the Intel(R) data accelerators present
 	  in Intel Xeon CPU.
diff --git a/drivers/dma/idxd/Makefile b/drivers/dma/idxd/Makefile
index 8978b898d777..308e12869f96 100644
--- a/drivers/dma/idxd/Makefile
+++ b/drivers/dma/idxd/Makefile
@@ -1,2 +1,2 @@
 obj-$(CONFIG_INTEL_IDXD) += idxd.o
-idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o cdev.o
+idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o cdev.o mdev.o
diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
index 27be9250606d..ddd3ce16620d 100644
--- a/drivers/dma/idxd/cdev.c
+++ b/drivers/dma/idxd/cdev.c
@@ -186,7 +186,8 @@ static int idxd_cdev_mmap(struct file *filp, struct vm_area_struct *vma)
 
 	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_WIPEONFORK;
 	pfn = (base + idxd_get_wq_portal_full_offset(wq->id,
-				IDXD_PORTAL_LIMITED)) >> PAGE_SHIFT;
+				IDXD_PORTAL_LIMITED,
+				IDXD_IRQ_MSIX)) >> PAGE_SHIFT;
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_private_data = ctx;
 
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 82a9b6035722..3a942e9c5980 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -172,6 +172,7 @@ struct idxd_device {
 
 	int num_groups;
 
+	u32 ims_offset;
 	u32 msix_perm_offset;
 	u32 wqcfg_offset;
 	u32 grpcfg_offset;
@@ -179,6 +180,7 @@ struct idxd_device {
 
 	u64 max_xfer_bytes;
 	u32 max_batch_size;
+	int ims_size;
 	int max_groups;
 	int max_engines;
 	int max_tokens;
@@ -194,6 +196,9 @@ struct idxd_device {
 	struct idxd_irq_entry *irq_entries;
 
 	struct dma_device dma_dev;
+
+	atomic_t num_allocated_ims;
+	struct sbitmap ims_sbmap;
 };
 
 /* IDXD software descriptor */
@@ -224,15 +229,23 @@ enum idxd_portal_prot {
 	IDXD_PORTAL_LIMITED,
 };
 
-static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot)
+enum idxd_interrupt_type {
+	IDXD_IRQ_MSIX = 0,
+	IDXD_IRQ_IMS,
+};
+
+static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot,
+					    enum idxd_interrupt_type irq_type)
 {
-	return prot * 0x1000;
+	return prot * 0x1000 + irq_type * 0x2000;
 }
 
 static inline int idxd_get_wq_portal_full_offset(int wq_id,
-						 enum idxd_portal_prot prot)
+						 enum idxd_portal_prot prot,
+						 enum idxd_interrupt_type irq_type)
 {
-	return ((wq_id * 4) << PAGE_SHIFT) + idxd_get_wq_portal_offset(prot);
+	return ((wq_id * 4) << PAGE_SHIFT) +
+		idxd_get_wq_portal_offset(prot, irq_type);
 }
 
 static inline void idxd_set_type(struct idxd_device *idxd)
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index c0fd796e9dce..15b3ef73cac3 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -231,10 +231,42 @@ static void idxd_read_table_offsets(struct idxd_device *idxd)
 	idxd->msix_perm_offset = offsets.msix_perm * 0x100;
 	dev_dbg(dev, "IDXD MSIX Permission Offset: %#x\n",
 		idxd->msix_perm_offset);
+	idxd->ims_offset = offsets.ims * 0x100;
+	dev_dbg(dev, "IDXD IMS Offset: %#x\n", idxd->ims_offset);
 	idxd->perfmon_offset = offsets.perfmon * 0x100;
 	dev_dbg(dev, "IDXD Perfmon Offset: %#x\n", idxd->perfmon_offset);
 }
 
+static int device_supports_ims(struct pci_dev *pdev)
+{
+	int dvsec;
+	u16 val16;
+	u32 val32;
+
+
+	/* 0x23 is the PCI Express extended capability ID for DVSEC */
+	dvsec = pci_find_ext_capability(pdev, 0x23);
+	if (!dvsec)
+		return -EOPNOTSUPP;
+
+	/* the DVSEC vendor id at offset 0x4 must be Intel (0x8086) */
+	pci_read_config_word(pdev, dvsec + 0x4, &val16);
+	if (val16 != 0x8086) {
+		dev_dbg(&pdev->dev, "DVSEC vendor id is not Intel\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* a DVSEC ID of 0x5 at offset 0x8 identifies the SIOV DVSEC */
+	pci_read_config_word(pdev, dvsec + 0x8, &val16);
+	if (val16 != 0x5) {
+		dev_dbg(&pdev->dev, "DVSEC ID is not SIOV\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* SIOV capability dword at offset 0x14: bit 0 advertises IMS */
+	pci_read_config_dword(pdev, dvsec + 0x14, &val32);
+	if (val32 & 0x1) {
+		dev_dbg(&pdev->dev, "IMS supported for device\n");
+		return 0;
+	}
+
+	dev_dbg(&pdev->dev, "IMS unsupported for device\n");
+
+	return -EOPNOTSUPP;
+}
+
 static void idxd_read_caps(struct idxd_device *idxd)
 {
 	struct device *dev = &idxd->pdev->dev;
@@ -247,9 +279,11 @@ static void idxd_read_caps(struct idxd_device *idxd)
 	dev_dbg(dev, "max xfer size: %llu bytes\n", idxd->max_xfer_bytes);
 	idxd->max_batch_size = 1U << idxd->hw.gen_cap.max_batch_shift;
 	dev_dbg(dev, "max batch size: %u\n", idxd->max_batch_size);
+	if (device_supports_ims(idxd->pdev) == 0) {
+		idxd->ims_size = idxd->hw.gen_cap.max_ims_mult * 256ULL;
+		dev_dbg(dev, "IMS size: %u\n", idxd->ims_size);
+	}
 	if (idxd->hw.gen_cap.config_en)
 		set_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags);
-
 	/* reading group capabilities */
 	idxd->hw.group_cap.bits =
 		ioread64(idxd->reg_base + IDXD_GRPCAP_OFFSET);
@@ -294,6 +328,7 @@ static struct idxd_device *idxd_alloc(struct pci_dev *pdev)
 
 	idxd->pdev = pdev;
 	spin_lock_init(&idxd->dev_lock);
+	atomic_set(&idxd->num_allocated_ims, 0);
 
 	return idxd;
 }
@@ -389,9 +424,18 @@ static int idxd_probe(struct idxd_device *idxd)
 
 	idxd->major = idxd_cdev_get_major(idxd);
 
+	rc = sbitmap_init_node(&idxd->ims_sbmap, idxd->ims_size, -1,
+			       GFP_KERNEL, dev_to_node(dev));
+	if (rc < 0)
+		goto sbitmap_fail;
+
 	dev_dbg(dev, "IDXD device %d probed successfully\n", idxd->id);
 	return 0;
 
+ sbitmap_fail:
+	mutex_lock(&idxd_idr_lock);
+	idr_remove(&idxd_idrs[idxd->type], idxd->id);
+	mutex_unlock(&idxd_idr_lock);
  err_idr_fail:
 	idxd_mask_error_interrupts(idxd);
 	idxd_mask_msix_vectors(idxd);
diff --git a/drivers/dma/idxd/mdev.c b/drivers/dma/idxd/mdev.c
new file mode 100644
index 000000000000..2cf0cdf149b7
--- /dev/null
+++ b/drivers/dma/idxd/mdev.c
@@ -0,0 +1,179 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/msi.h>
+#include <linux/mdev.h>
+#include <linux/vfio.h>
+#include "../../vfio/pci/vfio_pci_private.h"
+#include <uapi/linux/idxd.h>
+#include "registers.h"
+#include "idxd.h"
+#include "mdev.h"
+
+static void idxd_free_ims_index(struct idxd_device *idxd,
+				unsigned long ims_idx)
+{
+	sbitmap_clear_bit(&idxd->ims_sbmap, ims_idx);
+	atomic_dec(&idxd->num_allocated_ims);
+}
+
+static int vidxd_free_ims_entries(struct vdcm_idxd *vidxd)
+{
+	struct idxd_device *idxd = vidxd->idxd;
+	struct ims_irq_entry *irq_entry;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	struct msi_desc *desc;
+	int i = 0;
+	struct platform_msi_group_entry *platform_msi_group;
+
+	for_each_platform_msi_entry_in_group(desc, platform_msi_group, 0, dev) {
+		irq_entry = &vidxd->irq_entries[i];
+		devm_free_irq(dev, desc->irq, irq_entry);
+		i++;
+	}
+
+	platform_msi_domain_free_irqs(dev);
+
+	for (i = 0; i < vidxd->num_wqs; i++)
+		idxd_free_ims_index(idxd, vidxd->ims_index[i]);
+	return 0;
+}
+
+static int idxd_alloc_ims_index(struct idxd_device *idxd)
+{
+	int index;
+
+	index = sbitmap_get(&idxd->ims_sbmap, 0, false);
+	if (index < 0)
+		return -ENOSPC;
+	return index;
+}
+
+static unsigned int idxd_ims_irq_mask(struct msi_desc *desc)
+{
+	int ims_offset;
+	u32 mask_bits = desc->platform.masked;
+	struct device *dev = desc->dev;
+	struct mdev_device *mdev = mdev_from_dev(dev);
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	struct idxd_device *idxd = vidxd->idxd;
+	void __iomem *base;
+	int ims_id = desc->platform.msi_index;
+
+	dev_dbg(dev, "idxd irq mask: %d\n", ims_id);
+
+	mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
+	ims_offset = idxd->ims_offset + vidxd->ims_index[ims_id] * 0x10;
+	base = idxd->reg_base + ims_offset;
+	iowrite32(mask_bits, base + PCI_MSIX_ENTRY_VECTOR_CTRL);
+
+	return mask_bits;
+}
+
+static unsigned int idxd_ims_irq_unmask(struct msi_desc *desc)
+{
+	int ims_offset;
+	u32 mask_bits = desc->platform.masked;
+	struct device *dev = desc->dev;
+	struct mdev_device *mdev = mdev_from_dev(dev);
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	struct idxd_device *idxd = vidxd->idxd;
+	void __iomem *base;
+	int ims_id = desc->platform.msi_index;
+
+	dev_dbg(dev, "idxd irq unmask: %d\n", ims_id);
+
+	mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
+	ims_offset = idxd->ims_offset + vidxd->ims_index[ims_id] * 0x10;
+	base = idxd->reg_base + ims_offset;
+	iowrite32(mask_bits, base + PCI_MSIX_ENTRY_VECTOR_CTRL);
+
+	return mask_bits;
+}
+
+static void idxd_ims_write_msg(struct msi_desc *desc, struct msi_msg *msg)
+{
+	int ims_offset;
+	struct device *dev = desc->dev;
+	struct mdev_device *mdev = mdev_from_dev(dev);
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	struct idxd_device *idxd = vidxd->idxd;
+	void __iomem *base;
+	int ims_id = desc->platform.msi_index;
+
+	dev_dbg(dev, "ims_write: %d %x\n", ims_id, msg->address_lo);
+
+	ims_offset = idxd->ims_offset + vidxd->ims_index[ims_id] * 0x10;
+	base = idxd->reg_base + ims_offset;
+	iowrite32(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
+	iowrite32(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
+	iowrite32(msg->data, base + PCI_MSIX_ENTRY_DATA);
+}
+
+static struct platform_msi_ops idxd_ims_ops  = {
+	.irq_mask		= idxd_ims_irq_mask,
+	.irq_unmask		= idxd_ims_irq_unmask,
+	.write_msg		= idxd_ims_write_msg,
+};
+
+static irqreturn_t idxd_guest_wq_completion_interrupt(int irq, void *data)
+{
+	/* send virtual interrupt */
+	return IRQ_HANDLED;
+}
+
+static int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd)
+{
+	struct idxd_device *idxd = vidxd->idxd;
+	struct ims_irq_entry *irq_entry;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	struct msi_desc *desc;
+	int err, i = 0;
+	int group;
+	struct platform_msi_group_entry *platform_msi_group;
+
+	if (!atomic_add_unless(&idxd->num_allocated_ims, vidxd->num_wqs,
+			       idxd->ims_size))
+		return -ENOSPC;
+
+	err = idxd_alloc_ims_index(idxd);
+	if (err < 0) {
+		atomic_sub(vidxd->num_wqs, &idxd->num_allocated_ims);
+		return err;
+	}
+	vidxd->ims_index[0] = err;
+
+	err = platform_msi_domain_alloc_irqs_group(dev, vidxd->num_wqs,
+						   &idxd_ims_ops, &group);
+	if (err < 0) {
+		dev_dbg(dev, "Failed to allocate IMS entries: %d\n", err);
+		idxd_free_ims_index(idxd, vidxd->ims_index[0]);
+		return err;
+	}
+
+	i = 0;
+	for_each_platform_msi_entry_in_group(desc, platform_msi_group, group, dev) {
+		irq_entry = &vidxd->irq_entries[i];
+		irq_entry->vidxd = vidxd;
+		irq_entry->int_src = i;
+		err = devm_request_irq(dev, desc->irq,
+				       idxd_guest_wq_completion_interrupt, 0,
+				       "idxd-ims", irq_entry);
+		if (err)
+			break;
+		i++;
+	}
+
+	if (err) {
+		int j = 0;
+
+		/* Unwind only the vectors that were successfully requested */
+		for_each_platform_msi_entry_in_group(desc, platform_msi_group, group, dev) {
+			if (j == i)
+				break;
+			irq_entry = &vidxd->irq_entries[j];
+			devm_free_irq(dev, desc->irq, irq_entry);
+			j++;
+		}
+		platform_msi_domain_free_irqs_group(dev, group);
+		idxd_free_ims_index(idxd, vidxd->ims_index[0]);
+		return err;
+	}
+
+	return 0;
+}
diff --git a/drivers/dma/idxd/mdev.h b/drivers/dma/idxd/mdev.h
new file mode 100644
index 000000000000..5b05b6cb2b7b
--- /dev/null
+++ b/drivers/dma/idxd/mdev.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+
+#ifndef _IDXD_MDEV_H_
+#define _IDXD_MDEV_H_
+
+/* two 64-bit BARs implemented */
+#define VIDXD_MAX_BARS 2
+#define VIDXD_MAX_CFG_SPACE_SZ 4096
+#define VIDXD_MSIX_TBL_SZ_OFFSET 0x42
+#define VIDXD_CAP_CTRL_SZ 0x100
+#define VIDXD_GRP_CTRL_SZ 0x100
+#define VIDXD_WQ_CTRL_SZ 0x100
+#define VIDXD_WQ_OCPY_INT_SZ 0x20
+#define VIDXD_MSIX_TBL_SZ 0x90
+#define VIDXD_MSIX_PERM_TBL_SZ 0x48
+
+#define VIDXD_MSIX_TABLE_OFFSET 0x600
+#define VIDXD_MSIX_PERM_OFFSET 0x300
+#define VIDXD_GRPCFG_OFFSET 0x400
+#define VIDXD_WQCFG_OFFSET 0x500
+#define VIDXD_IMS_OFFSET 0x1000
+
+#define VIDXD_BAR0_SIZE  0x2000
+#define VIDXD_BAR2_SIZE  0x20000
+#define VIDXD_MAX_MSIX_ENTRIES  (VIDXD_MSIX_TBL_SZ / 0x10)
+#define VIDXD_MAX_WQS	1
+
+#define	VIDXD_ATS_OFFSET 0x100
+#define	VIDXD_PRS_OFFSET 0x110
+#define VIDXD_PASID_OFFSET 0x120
+#define VIDXD_MSIX_PBA_OFFSET 0x700
+
+struct vdcm_idxd_pci_bar0 {
+	u8 cap_ctrl_regs[VIDXD_CAP_CTRL_SZ];
+	u8 grp_ctrl_regs[VIDXD_GRP_CTRL_SZ];
+	u8 wq_ctrl_regs[VIDXD_WQ_CTRL_SZ];
+	u8 wq_ocpy_int_regs[VIDXD_WQ_OCPY_INT_SZ];
+	u8 msix_table[VIDXD_MSIX_TBL_SZ];
+	u8 msix_perm_table[VIDXD_MSIX_PERM_TBL_SZ];
+	unsigned long msix_pba;
+};
+
+struct ims_irq_entry {
+	struct vdcm_idxd *vidxd;
+	int int_src;
+};
+
+struct idxd_vdev {
+	struct mdev_device *mdev;
+	struct eventfd_ctx *msix_trigger[VIDXD_MAX_MSIX_ENTRIES];
+	struct notifier_block group_notifier;
+	struct kvm *kvm;
+	struct work_struct release_work;
+	atomic_t released;
+};
+
+struct vdcm_idxd {
+	struct idxd_device *idxd;
+	struct idxd_wq *wq;
+	struct idxd_vdev vdev;
+	struct vdcm_idxd_type *type;
+	int num_wqs;
+	unsigned long handle;
+	u64 ims_index[VIDXD_MAX_WQS];
+	struct msix_entry ims_entry;
+	struct ims_irq_entry irq_entries[VIDXD_MAX_WQS];
+
+	/* For VM use case */
+	u64 bar_val[VIDXD_MAX_BARS];
+	u64 bar_size[VIDXD_MAX_BARS];
+	u8 cfg[VIDXD_MAX_CFG_SPACE_SZ];
+	struct vdcm_idxd_pci_bar0 bar0;
+	struct list_head list;
+};
+
+static inline struct vdcm_idxd *to_vidxd(struct idxd_vdev *vdev)
+{
+	return container_of(vdev, struct vdcm_idxd, vdev);
+}
+
+#endif
diff --git a/drivers/dma/idxd/submit.c b/drivers/dma/idxd/submit.c
index 741bc3aa7267..bdcac933bb28 100644
--- a/drivers/dma/idxd/submit.c
+++ b/drivers/dma/idxd/submit.c
@@ -123,7 +123,8 @@ int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc,
 		return -EIO;
 
 	portal = wq->portal +
-		 idxd_get_wq_portal_offset(IDXD_PORTAL_UNLIMITED);
+		 idxd_get_wq_portal_offset(IDXD_PORTAL_UNLIMITED,
+					   IDXD_IRQ_MSIX);
 	if (wq_dedicated(wq)) {
 		/*
 		 * The wmb() flushes writes to coherent DMA data before
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index 1dd3ade2e438..07bad4f6c7fb 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -1282,6 +1282,16 @@ static ssize_t numa_node_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(numa_node);
 
+static ssize_t ims_size_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd =
+		container_of(dev, struct idxd_device, conf_dev);
+
+	return sprintf(buf, "%u\n", idxd->ims_size);
+}
+static DEVICE_ATTR_RO(ims_size);
+
 static ssize_t max_batch_size_show(struct device *dev,
 				   struct device_attribute *attr, char *buf)
 {
@@ -1467,6 +1477,7 @@ static struct attribute *idxd_device_attributes[] = {
 	&dev_attr_max_work_queues_size.attr,
 	&dev_attr_max_engines.attr,
 	&dev_attr_numa_node.attr,
+	&dev_attr_ims_size.attr,
 	&dev_attr_max_batch_size.attr,
 	&dev_attr_max_transfer_size.attr,
 	&dev_attr_op_cap.attr,

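[ A minimal standalone sketch of the portal arithmetic introduced above,
  assuming a 4K page size (PAGE_SHIFT == 12, as on x86); the names here
  are illustrative and not part of the patch. Each WQ now owns four
  portal pages: {unlimited, limited} x {MSI-X, IMS}. ]

#include <stdio.h>

#define SKETCH_PAGE_SHIFT 12	/* stand-in for the kernel's PAGE_SHIFT */

enum portal_prot { PORTAL_UNLIMITED = 0, PORTAL_LIMITED };
enum irq_type    { IRQ_MSIX = 0, IRQ_IMS };

/* Same arithmetic as idxd_get_wq_portal_full_offset() in the hunk above */
static int wq_portal_full_offset(int wq_id, enum portal_prot prot,
				 enum irq_type irq)
{
	return ((wq_id * 4) << SKETCH_PAGE_SHIFT) +
		prot * 0x1000 + irq * 0x2000;
}

int main(void)
{
	/* WQ 1, limited portal, IMS: 0x4000 + 0x1000 + 0x2000 = 0x7000 */
	printf("%#x\n", wq_portal_full_offset(1, PORTAL_LIMITED, IRQ_IMS));
	return 0;
}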

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH RFC 12/15] dmaengine: idxd: add device support functions in prep for mdev
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
                   ` (10 preceding siblings ...)
  2020-04-21 23:34 ` [PATCH RFC 11/15] dmaengine: idxd: add IMS support in base driver Dave Jiang
@ 2020-04-21 23:35 ` Dave Jiang
  2020-04-21 23:35 ` [PATCH RFC 13/15] dmaengine: idxd: add support for VFIO mediated device Dave Jiang
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:35 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

Add device support helper functions that will be used by the VFIO mediated
device, in preparation for adding VFIO mdev support.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/device.c |  130 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/dma/idxd/idxd.h   |    7 ++
 drivers/dma/idxd/init.c   |   19 +++++++
 3 files changed, 156 insertions(+)

diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index a46b6558984c..830aa5859646 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -319,6 +319,40 @@ void idxd_wq_unmap_portal(struct idxd_wq *wq)
 	devm_iounmap(dev, wq->portal);
 }
 
+int idxd_wq_abort(struct idxd_wq *wq)
+{
+	int rc;
+	struct idxd_device *idxd = wq->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	u32 operand, status;
+
+	lockdep_assert_held(&idxd->dev_lock);
+
+	dev_dbg(dev, "Abort WQ %d\n", wq->id);
+	if (wq->state != IDXD_WQ_ENABLED) {
+		dev_dbg(dev, "WQ %d not active\n", wq->id);
+		return -ENXIO;
+	}
+
+	operand = BIT(wq->id % 16) | ((wq->id / 16) << 16);
+	dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_ABORT_WQ, operand);
+	rc = idxd_cmd_send(idxd, IDXD_CMD_ABORT_WQ, operand);
+	if (rc < 0)
+		return rc;
+
+	rc = idxd_cmd_wait(idxd, &status, IDXD_DRAIN_TIMEOUT);
+	if (rc < 0)
+		return rc;
+
+	if (status != IDXD_CMDSTS_SUCCESS) {
+		dev_dbg(dev, "WQ abort failed: %#x\n", status);
+		return -ENXIO;
+	}
+
+	dev_dbg(dev, "WQ %d aborted\n", wq->id);
+	return 0;
+}
+
 int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid)
 {
 	struct idxd_device *idxd = wq->idxd;
@@ -372,6 +406,66 @@ int idxd_wq_disable_pasid(struct idxd_wq *wq)
 	return 0;
 }
 
+void idxd_wq_update_pasid(struct idxd_wq *wq, int pasid)
+{
+	struct idxd_device *idxd = wq->idxd;
+	int offset;
+
+	lockdep_assert_held(&idxd->dev_lock);
+
+	/* PASID fields are 8 bytes into the WQCFG register */
+	offset = idxd->wqcfg_offset + wq->id * 32 + 8;
+	wq->wqcfg.pasid = pasid;
+	iowrite32(wq->wqcfg.bits[2], idxd->reg_base + offset);
+}
+
+void idxd_wq_update_priv(struct idxd_wq *wq, int priv)
+{
+	struct idxd_device *idxd = wq->idxd;
+	int offset;
+
+	lockdep_assert_held(&idxd->dev_lock);
+
+	/* priv field is 8 bytes into the WQCFG register */
+	offset = idxd->wqcfg_offset + wq->id * 32 + 8;
+	wq->wqcfg.priv = !!priv;
+	iowrite32(wq->wqcfg.bits[2], idxd->reg_base + offset);
+}
+
+int idxd_wq_drain(struct idxd_wq *wq)
+{
+	int rc;
+	struct idxd_device *idxd = wq->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	u32 operand, status;
+
+	lockdep_assert_held(&idxd->dev_lock);
+
+	dev_dbg(dev, "Drain WQ %d\n", wq->id);
+	if (wq->state != IDXD_WQ_ENABLED) {
+		dev_dbg(dev, "WQ %d not active\n", wq->id);
+		return -ENXIO;
+	}
+
+	operand = BIT(wq->id % 16) | ((wq->id / 16) << 16);
+	dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_DRAIN_WQ, operand);
+	rc = idxd_cmd_send(idxd, IDXD_CMD_DRAIN_WQ, operand);
+	if (rc < 0)
+		return rc;
+
+	rc = idxd_cmd_wait(idxd, &status, IDXD_DRAIN_TIMEOUT);
+	if (rc < 0)
+		return rc;
+
+	if (status != IDXD_CMDSTS_SUCCESS) {
+		dev_dbg(dev, "WQ drain failed: %#x\n", status);
+		return -ENXIO;
+	}
+
+	dev_dbg(dev, "WQ %d drained\n", wq->id);
+	return 0;
+}
+
 /* Device control bits */
 static inline bool idxd_is_enabled(struct idxd_device *idxd)
 {
@@ -542,6 +636,42 @@ int idxd_device_drain_pasid(struct idxd_device *idxd, int pasid)
 	return 0;
 }
 
+int idxd_device_request_int_handle(struct idxd_device *idxd, int idx,
+				   int *handle)
+{
+	int rc;
+	struct device *dev = &idxd->pdev->dev;
+	u32 operand, status;
+
+	lockdep_assert_held(&idxd->dev_lock);
+
+	if (!idxd->hw.gen_cap.int_handle_req)
+		return -EOPNOTSUPP;
+
+	dev_dbg(dev, "get int handle, idx %d\n", idx);
+
+	operand = idx & 0xffff;
+	dev_dbg(dev, "cmd: %u operand: %#x\n",
+		IDXD_CMD_REQUEST_INT_HANDLE, operand);
+	rc = idxd_cmd_send(idxd, IDXD_CMD_REQUEST_INT_HANDLE, operand);
+	if (rc < 0)
+		return rc;
+
+	rc = idxd_cmd_wait(idxd, &status, IDXD_REG_TIMEOUT);
+	if (rc < 0)
+		return rc;
+
+	if (status != IDXD_CMDSTS_SUCCESS) {
+		dev_dbg(dev, "request int handle failed: %#x\n", status);
+		return -ENXIO;
+	}
+
+	*handle = (status >> 8) & 0xffff;
+
+	dev_dbg(dev, "int handle acquired: %u\n", *handle);
+	return 0;
+}
+
 /* Device configuration bits */
 static void idxd_group_config_write(struct idxd_group *group)
 {
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 3a942e9c5980..9b56a4c7f3fc 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -199,6 +199,7 @@ struct idxd_device {
 
 	atomic_t num_allocated_ims;
 	struct sbitmap ims_sbmap;
+	int *int_handles;
 };
 
 /* IDXD software descriptor */
@@ -303,6 +304,8 @@ int idxd_device_ro_config(struct idxd_device *idxd);
 void idxd_device_wqs_clear_state(struct idxd_device *idxd);
 int idxd_device_drain_pasid(struct idxd_device *idxd, int pasid);
 void idxd_device_load_config(struct idxd_device *idxd);
+int idxd_device_request_int_handle(struct idxd_device *idxd,
+				   int idx, int *handle);
 
 /* work queue control */
 int idxd_wq_alloc_resources(struct idxd_wq *wq);
@@ -313,6 +316,10 @@ int idxd_wq_map_portal(struct idxd_wq *wq);
 void idxd_wq_unmap_portal(struct idxd_wq *wq);
 int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid);
 int idxd_wq_disable_pasid(struct idxd_wq *wq);
+int idxd_wq_abort(struct idxd_wq *wq);
+void idxd_wq_update_pasid(struct idxd_wq *wq, int pasid);
+void idxd_wq_update_priv(struct idxd_wq *wq, int priv);
+int idxd_wq_drain(struct idxd_wq *wq);
 
 /* submission */
 int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc,
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 15b3ef73cac3..babe6e614087 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -56,6 +56,7 @@ static int idxd_setup_interrupts(struct idxd_device *idxd)
 	int i, msixcnt;
 	int rc = 0;
 	union msix_perm mperm;
+	unsigned long flags;
 
 	msixcnt = pci_msix_vec_count(pdev);
 	if (msixcnt < 0) {
@@ -130,6 +131,17 @@ static int idxd_setup_interrupts(struct idxd_device *idxd)
 		}
 		dev_dbg(dev, "Allocated idxd-msix %d for vector %d\n",
 			i, msix->vector);
+
+		if (idxd->hw.gen_cap.int_handle_req) {
+			spin_lock_irqsave(&idxd->dev_lock, flags);
+			rc = idxd_device_request_int_handle(idxd, i,
+							    &idxd->int_handles[i]);
+			spin_unlock_irqrestore(&idxd->dev_lock, flags);
+			if (rc < 0)
+				goto err_no_irq;
+			dev_dbg(dev, "int handle requested: %u\n",
+				idxd->int_handles[i]);
+		}
 	}
 
 	idxd_unmask_error_interrupts(idxd);
@@ -168,6 +180,13 @@ static int idxd_setup_internals(struct idxd_device *idxd)
 	struct device *dev = &idxd->pdev->dev;
 	int i;
 
+	if (idxd->hw.gen_cap.int_handle_req) {
+		idxd->int_handles = devm_kcalloc(dev, idxd->max_wqs,
+						 sizeof(int), GFP_KERNEL);
+		if (!idxd->int_handles)
+			return -ENOMEM;
+	}
+
 	idxd->groups = devm_kcalloc(dev, idxd->max_groups,
 				    sizeof(struct idxd_group), GFP_KERNEL);
 	if (!idxd->groups)

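[ A minimal usage sketch of the new interrupt handle API, mirroring the
  init.c hunk above; example_request_handle() is an illustrative helper,
  not part of the patch. Per the lockdep assertion in device.c,
  idxd_device_request_int_handle() must be called with dev_lock held. ]

/* Request an interrupt handle for a given MSI-X index */
static int example_request_handle(struct idxd_device *idxd, int idx)
{
	unsigned long flags;
	int handle, rc;

	spin_lock_irqsave(&idxd->dev_lock, flags);
	rc = idxd_device_request_int_handle(idxd, idx, &handle);
	spin_unlock_irqrestore(&idxd->dev_lock, flags);
	if (rc < 0)
		return rc;

	/* The handle is what descriptors carry in their int_handle field */
	return handle;
}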

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH RFC 13/15] dmaengine: idxd: add support for VFIO mediated device
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
                   ` (11 preceding siblings ...)
  2020-04-21 23:35 ` [PATCH RFC 12/15] dmaengine: idxd: add device support functions in prep for mdev Dave Jiang
@ 2020-04-21 23:35 ` Dave Jiang
  2020-04-21 23:35 ` [PATCH RFC 14/15] dmaengine: idxd: add error notification from host driver to " Dave Jiang
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:35 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

Add enabling code that provides VFIO mediated device support. A mediated
device allows hardware to export resources to guests with significantly
less dedicated hardware than an SR-IOV implementation. For DSA devices,
mdev enabling lets us emulate a virtual DSA device in the guest by
exporting one or more workqueues to the guest, where they are exposed as
DSA device(s). The software emulates PCI config and MMIO accesses; the
I/O submission path, however, goes directly to the hardware. A submission
portal is mmap'd to the guest in order to allow direct submission of
descriptors.

The creation of a mediated device will generate a UUID. The UUID can be
retrieved from one of the VFIO sysfs attributes. This UUID must be
provided to the idxd driver via sysfs in order to tie the specific mdev to
the relevant workqueue. Given the various ways a wq can be configured and
grouped on a device, this allows the system admin to directly associate a
specifically configured wq with the guest it is to be exported to. The
hope is that this design choice provides maximum configurability and
flexibility.
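
[ A minimal standalone sketch of how user accesses are routed to the
  emulation in this patch, assuming the vfio-pci offset convention from
  vfio_pci_private.h (region index kept in the bits above
  VFIO_PCI_OFFSET_SHIFT == 40); idxd_vdcm_rw() below decodes offsets the
  same way. The constants here are illustrative copies, not part of the
  patch. ]

#include <stdint.h>
#include <stdio.h>

#define OFFSET_SHIFT	40ULL
#define OFFSET_MASK	((1ULL << OFFSET_SHIFT) - 1)
#define CONFIG_REGION	7	/* VFIO_PCI_CONFIG_REGION_INDEX */

int main(void)
{
	/* A config-space access at register 0x10 (BAR0) from user space */
	uint64_t ppos = ((uint64_t)CONFIG_REGION << OFFSET_SHIFT) | 0x10;

	unsigned int index = ppos >> OFFSET_SHIFT;	/* selects region */
	uint64_t pos = ppos & OFFSET_MASK;		/* offset within it */

	/* index == 7 routes to the PCI config emulation, pos == 0x10 */
	printf("index=%u pos=%#llx\n", index, (unsigned long long)pos);
	return 0;
}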

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/Kconfig          |    3 
 drivers/dma/idxd/Makefile    |    2 
 drivers/dma/idxd/device.c    |   36 +
 drivers/dma/idxd/dma.c       |    9 
 drivers/dma/idxd/idxd.h      |   23 +
 drivers/dma/idxd/init.c      |   10 
 drivers/dma/idxd/irq.c       |    2 
 drivers/dma/idxd/mdev.c      | 1558 ++++++++++++++++++++++++++++++++++++++++++
 drivers/dma/idxd/mdev.h      |   23 +
 drivers/dma/idxd/registers.h |   10 
 drivers/dma/idxd/submit.c    |   28 +
 drivers/dma/idxd/sysfs.c     |  143 ++++
 drivers/dma/idxd/vdev.c      |  570 +++++++++++++++
 drivers/dma/idxd/vdev.h      |   42 +
 14 files changed, 2418 insertions(+), 41 deletions(-)
 create mode 100644 drivers/dma/idxd/vdev.c
 create mode 100644 drivers/dma/idxd/vdev.h

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 9e7d9eafb1f5..e39e04309587 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -291,6 +291,9 @@ config INTEL_IDXD
 	select PCI_PASID
 	select PCI_IOV
 	select MSI_IMS
+	select VFIO_PCI
+	select VFIO_MDEV
+	select VFIO_MDEV_DEVICE
 	help
 	  Enable support for the Intel(R) data accelerators present
 	  in Intel Xeon CPU.
diff --git a/drivers/dma/idxd/Makefile b/drivers/dma/idxd/Makefile
index 308e12869f96..bb1fb771f6b5 100644
--- a/drivers/dma/idxd/Makefile
+++ b/drivers/dma/idxd/Makefile
@@ -1,2 +1,2 @@
 obj-$(CONFIG_INTEL_IDXD) += idxd.o
-idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o cdev.o mdev.o
+idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o cdev.o mdev.o vdev.o
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 830aa5859646..b92cb1ca20d3 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -223,11 +223,11 @@ void idxd_wq_free_resources(struct idxd_wq *wq)
 	sbitmap_free(&wq->sbmap);
 }
 
-int idxd_wq_enable(struct idxd_wq *wq)
+int idxd_wq_enable(struct idxd_wq *wq, u32 *status)
 {
 	struct idxd_device *idxd = wq->idxd;
 	struct device *dev = &idxd->pdev->dev;
-	u32 status;
+	u32 stat;
 	int rc;
 
 	lockdep_assert_held(&idxd->dev_lock);
@@ -240,13 +240,16 @@ int idxd_wq_enable(struct idxd_wq *wq)
 	rc = idxd_cmd_send(idxd, IDXD_CMD_ENABLE_WQ, wq->id);
 	if (rc < 0)
 		return rc;
-	rc = idxd_cmd_wait(idxd, &status, IDXD_REG_TIMEOUT);
+	rc = idxd_cmd_wait(idxd, &stat, IDXD_REG_TIMEOUT);
 	if (rc < 0)
 		return rc;
 
-	if (status != IDXD_CMDSTS_SUCCESS &&
-	    status != IDXD_CMDSTS_ERR_WQ_ENABLED) {
-		dev_dbg(dev, "WQ enable failed: %#x\n", status);
+	if (status)
+		*status = stat;
+
+	if (stat != IDXD_CMDSTS_SUCCESS &&
+	    stat != IDXD_CMDSTS_ERR_WQ_ENABLED) {
+		dev_dbg(dev, "WQ enable failed: %#x\n", stat);
 		return -ENXIO;
 	}
 
@@ -255,11 +258,11 @@ int idxd_wq_enable(struct idxd_wq *wq)
 	return 0;
 }
 
-int idxd_wq_disable(struct idxd_wq *wq)
+int idxd_wq_disable(struct idxd_wq *wq, u32 *status)
 {
 	struct idxd_device *idxd = wq->idxd;
 	struct device *dev = &idxd->pdev->dev;
-	u32 status, operand;
+	u32 stat, operand;
 	int rc;
 
 	lockdep_assert_held(&idxd->dev_lock);
@@ -274,12 +277,15 @@ int idxd_wq_disable(struct idxd_wq *wq)
 	rc = idxd_cmd_send(idxd, IDXD_CMD_DISABLE_WQ, operand);
 	if (rc < 0)
 		return rc;
-	rc = idxd_cmd_wait(idxd, &status, IDXD_REG_TIMEOUT);
+	rc = idxd_cmd_wait(idxd, &stat, IDXD_REG_TIMEOUT);
 	if (rc < 0)
 		return rc;
 
-	if (status != IDXD_CMDSTS_SUCCESS) {
-		dev_dbg(dev, "WQ disable failed: %#x\n", status);
+	if (status)
+		*status = stat;
+
+	if (stat != IDXD_CMDSTS_SUCCESS) {
+		dev_dbg(dev, "WQ disable failed: %#x\n", stat);
 		return -ENXIO;
 	}
 
@@ -362,7 +368,7 @@ int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid)
 
 	lockdep_assert_held(&idxd->dev_lock);
 
-	rc = idxd_wq_disable(wq);
+	rc = idxd_wq_disable(wq, NULL);
 	if (rc < 0)
 		return rc;
 
@@ -373,7 +379,7 @@ int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid)
 	wqcfg.pasid = pasid;
 	iowrite32(wqcfg.bits[2], idxd->reg_base + offset);
 
-	rc = idxd_wq_enable(wq);
+	rc = idxd_wq_enable(wq, NULL);
 	if (rc < 0)
 		return rc;
 
@@ -389,7 +395,7 @@ int idxd_wq_disable_pasid(struct idxd_wq *wq)
 
 	lockdep_assert_held(&idxd->dev_lock);
 
-	rc = idxd_wq_disable(wq);
+	rc = idxd_wq_disable(wq, NULL);
 	if (rc < 0)
 		return rc;
 
@@ -399,7 +405,7 @@ int idxd_wq_disable_pasid(struct idxd_wq *wq)
 	wqcfg.pasid = 0;
 	iowrite32(wqcfg.bits[2], idxd->reg_base + offset);
 
-	rc = idxd_wq_enable(wq);
+	rc = idxd_wq_enable(wq, NULL);
 	if (rc < 0)
 		return rc;
 
diff --git a/drivers/dma/idxd/dma.c b/drivers/dma/idxd/dma.c
index 9a4f78519e57..a49d4f303d7d 100644
--- a/drivers/dma/idxd/dma.c
+++ b/drivers/dma/idxd/dma.c
@@ -61,8 +61,6 @@ static inline void idxd_prep_desc_common(struct idxd_wq *wq,
 					 u64 addr_f1, u64 addr_f2, u64 len,
 					 u64 compl, u32 flags)
 {
-	struct idxd_device *idxd = wq->idxd;
-
 	hw->flags = flags;
 	hw->opcode = opcode;
 	hw->src_addr = addr_f1;
@@ -70,13 +68,6 @@ static inline void idxd_prep_desc_common(struct idxd_wq *wq,
 	hw->xfer_size = len;
 	hw->priv = !!(wq->type == IDXD_WQT_KERNEL);
 	hw->completion_addr = compl;
-
-	/*
-	 * Descriptor completion vectors are 1-8 for MSIX. We will round
-	 * robin through the 8 vectors.
-	 */
-	wq->vec_ptr = (wq->vec_ptr % idxd->num_wq_irqs) + 1;
-	hw->int_handle =  wq->vec_ptr;
 }
 
 static struct dma_async_tx_descriptor *
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 9b56a4c7f3fc..92a9718daa15 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -8,6 +8,10 @@
 #include <linux/percpu-rwsem.h>
 #include <linux/wait.h>
 #include <linux/cdev.h>
+#include <linux/pci.h>
+#include <linux/irq.h>
+#include <linux/idxd.h>
+#include <linux/uuid.h>
 #include "registers.h"
 
 #define IDXD_DRIVER_VERSION	"1.00"
@@ -66,6 +70,7 @@ enum idxd_wq_type {
 	IDXD_WQT_NONE = 0,
 	IDXD_WQT_KERNEL,
 	IDXD_WQT_USER,
+	IDXD_WQT_MDEV,
 };
 
 struct idxd_cdev {
@@ -75,6 +80,11 @@ struct idxd_cdev {
 	struct wait_queue_head err_queue;
 };
 
+struct idxd_wq_uuid {
+	guid_t uuid;
+	struct list_head list;
+};
+
 #define IDXD_ALLOCATED_BATCH_SIZE	128U
 #define WQ_NAME_SIZE   1024
 #define WQ_TYPE_SIZE   10
@@ -119,6 +129,9 @@ struct idxd_wq {
 	struct percpu_rw_semaphore submit_lock;
 	wait_queue_head_t submit_waitq;
 	char name[WQ_NAME_SIZE + 1];
+	struct list_head uuid_list;
+	int uuids;
+	struct list_head vdcm_list;
 };
 
 struct idxd_engine {
@@ -200,6 +213,7 @@ struct idxd_device {
 	atomic_t num_allocated_ims;
 	struct sbitmap ims_sbmap;
 	int *int_handles;
+	struct mutex mdev_lock; /* mdev creation lock */
 };
 
 /* IDXD software descriptor */
@@ -282,6 +296,7 @@ void idxd_cleanup_sysfs(struct idxd_device *idxd);
 int idxd_register_driver(void);
 void idxd_unregister_driver(void);
 struct bus_type *idxd_get_bus_type(struct idxd_device *idxd);
+bool is_idxd_wq_mdev(struct idxd_wq *wq);
 
 /* device interrupt control */
 irqreturn_t idxd_irq_handler(int vec, void *data);
@@ -310,8 +325,8 @@ int idxd_device_request_int_handle(struct idxd_device *idxd,
 /* work queue control */
 int idxd_wq_alloc_resources(struct idxd_wq *wq);
 void idxd_wq_free_resources(struct idxd_wq *wq);
-int idxd_wq_enable(struct idxd_wq *wq);
-int idxd_wq_disable(struct idxd_wq *wq);
+int idxd_wq_enable(struct idxd_wq *wq, u32 *status);
+int idxd_wq_disable(struct idxd_wq *wq, u32 *status);
 int idxd_wq_map_portal(struct idxd_wq *wq);
 void idxd_wq_unmap_portal(struct idxd_wq *wq);
 int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid);
@@ -344,4 +359,8 @@ int idxd_cdev_get_major(struct idxd_device *idxd);
 int idxd_wq_add_cdev(struct idxd_wq *wq);
 void idxd_wq_del_cdev(struct idxd_wq *wq);
 
+/* mdev */
+int idxd_mdev_host_init(struct idxd_device *idxd);
+void idxd_mdev_host_release(struct idxd_device *idxd);
+
 #endif
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index babe6e614087..b0f99a794e91 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -218,6 +218,8 @@ static int idxd_setup_internals(struct idxd_device *idxd)
 		mutex_init(&wq->wq_lock);
 		atomic_set(&wq->dq_count, 0);
 		init_waitqueue_head(&wq->submit_waitq);
+		INIT_LIST_HEAD(&wq->uuid_list);
+		INIT_LIST_HEAD(&wq->vdcm_list);
 		wq->idxd_cdev.minor = -1;
 		rc = percpu_init_rwsem(&wq->submit_lock);
 		if (rc < 0) {
@@ -347,6 +349,7 @@ static struct idxd_device *idxd_alloc(struct pci_dev *pdev)
 
 	idxd->pdev = pdev;
 	spin_lock_init(&idxd->dev_lock);
+	mutex_init(&idxd->mdev_lock);
 	atomic_set(&idxd->num_allocated_ims, 0);
 
 	return idxd;
@@ -509,6 +512,12 @@ static int idxd_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		return -ENODEV;
 	}
 
+	rc = idxd_mdev_host_init(idxd);
+	if (rc < 0) {
+		dev_err(dev, "VFIO mdev init failed\n");
+		return rc;
+	}
+
 	rc = idxd_setup_sysfs(idxd);
 	if (rc) {
 		dev_err(dev, "IDXD sysfs setup failed\n");
@@ -584,6 +593,7 @@ static void idxd_remove(struct pci_dev *pdev)
 	dev_dbg(&pdev->dev, "%s called\n", __func__);
 	idxd_cleanup_sysfs(idxd);
 	idxd_shutdown(pdev);
+	idxd_mdev_host_release(idxd);
 	idxd_wqs_free_lock(idxd);
 	idxd_disable_system_pasid(idxd);
 	mutex_lock(&idxd_idr_lock);
diff --git a/drivers/dma/idxd/irq.c b/drivers/dma/idxd/irq.c
index 37ad927d6944..bc634dc4e485 100644
--- a/drivers/dma/idxd/irq.c
+++ b/drivers/dma/idxd/irq.c
@@ -77,7 +77,7 @@ static int idxd_restart(struct idxd_device *idxd)
 		struct idxd_wq *wq = &idxd->wqs[i];
 
 		if (wq->state == IDXD_WQ_ENABLED) {
-			rc = idxd_wq_enable(wq);
+			rc = idxd_wq_enable(wq, NULL);
 			if (rc < 0) {
 				dev_warn(&idxd->pdev->dev,
 					 "Unable to re-enable wq %s\n",
diff --git a/drivers/dma/idxd/mdev.c b/drivers/dma/idxd/mdev.c
index 2cf0cdf149b7..b222ce00a9db 100644
--- a/drivers/dma/idxd/mdev.c
+++ b/drivers/dma/idxd/mdev.c
@@ -1,19 +1,76 @@
 // SPDX-License-Identifier: GPL-2.0
-/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/pci.h>
 #include <linux/device.h>
+#include <linux/sched/task.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
-#include <linux/msi.h>
-#include <linux/mdev.h>
+#include <linux/mm.h>
+#include <linux/mmu_context.h>
 #include <linux/vfio.h>
-#include "../../vfio/pci/vfio_pci_private.h"
+#include <linux/mdev.h>
+#include <linux/msi.h>
+#include <linux/intel-iommu.h>
+#include <linux/intel-svm.h>
+#include <linux/kvm_host.h>
+#include <linux/eventfd.h>
+#include <linux/circ_buf.h>
 #include <uapi/linux/idxd.h>
 #include "registers.h"
 #include "idxd.h"
+#include "../../vfio/pci/vfio_pci_private.h"
 #include "mdev.h"
+#include "vdev.h"
+
+static u64 idxd_pci_config[] = {
+	0x001000000b258086ULL,
+	0x0080000008800000ULL,
+	0x000000000000000cULL,
+	0x000000000000000cULL,
+	0x0000000000000000ULL,
+	0x2010808600000000ULL,
+	0x0000004000000000ULL,
+	0x000000ff00000000ULL,
+	0x0000060000005011ULL, /* MSI-X capability */
+	0x0000070000000000ULL,
+	0x0000000000920010ULL, /* PCIe capability */
+	0x0000000000000000ULL,
+	0x0000000000000000ULL,
+	0x0000000000000000ULL,
+	0x0070001000000000ULL,
+	0x0000000000000000ULL,
+};
+
+static u64 idxd_pci_ext_cap[] = {
+	0x000000611101000fULL, /* ATS capability */
+	0x0000000000000000ULL,
+	0x8100000012010013ULL, /* Page Request capability */
+	0x0000000000000001ULL,
+	0x000014040001001bULL, /* PASID capability */
+	0x0000000000000000ULL,
+	0x0181808600010023ULL, /* Scalable IOV capability */
+	0x0000000100000005ULL,
+	0x0000000000000001ULL,
+	0x0000000000000000ULL,
+};
+
+static u64 idxd_cap_ctrl_reg[] = {
+	0x0000000000000100ULL,
+	0x0000000000000000ULL,
+	0x00000001013f038fULL, /* gencap */
+	0x0000000000000000ULL,
+	0x0000000000000000ULL,
+	0x0000000000000000ULL,
+	0x0000000000004004ULL, /* grpcap */
+	0x0000000000000004ULL, /* engcap */
+	0x00000001003f03ffULL, /* opcap */
+	0x0000000000000000ULL,
+	0x0000000000000000ULL,
+	0x0000000000000000ULL,
+	0x0000000000000000ULL, /* offsets */
+};
 
 static void idxd_free_ims_index(struct idxd_device *idxd,
 				unsigned long ims_idx)
@@ -124,7 +181,11 @@ static struct platform_msi_ops idxd_ims_ops  = {
 
 static irqreturn_t idxd_guest_wq_completion_interrupt(int irq, void *data)
 {
-	/* send virtual interrupt */
+	struct ims_irq_entry *irq_entry = data;
+	struct vdcm_idxd *vidxd = irq_entry->vidxd;
+	int msix_idx = irq_entry->int_src;
+
+	vidxd_send_interrupt(vidxd, msix_idx + 1);
 	return IRQ_HANDLED;
 }
 
@@ -177,3 +238,1490 @@ static int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd)
 
 	return 0;
 }
+
+static inline bool handle_valid(unsigned long handle)
+{
+	return !!(handle & ~0xff);
+}
+
+static void idxd_vdcm_reinit(struct vdcm_idxd *vidxd)
+{
+	struct idxd_wq *wq;
+	struct idxd_device *idxd;
+	unsigned long flags;
+
+	memset(vidxd->cfg, 0, VIDXD_MAX_CFG_SPACE_SZ);
+	memset(&vidxd->bar0, 0, sizeof(struct vdcm_idxd_pci_bar0));
+
+	memcpy(vidxd->cfg, idxd_pci_config, sizeof(idxd_pci_config));
+	memcpy(vidxd->cfg + 0x100, idxd_pci_ext_cap,
+	       sizeof(idxd_pci_ext_cap));
+
+	memcpy(vidxd->bar0.cap_ctrl_regs, idxd_cap_ctrl_reg,
+	       sizeof(idxd_cap_ctrl_reg));
+
+	/* Set the MSI-X table size */
+	vidxd->cfg[VIDXD_MSIX_TBL_SZ_OFFSET] = 1;
+	idxd = vidxd->idxd;
+	wq = vidxd->wq;
+
+	if (wq_dedicated(wq)) {
+		spin_lock_irqsave(&idxd->dev_lock, flags);
+		idxd_wq_disable(wq, NULL);
+		spin_unlock_irqrestore(&idxd->dev_lock, flags);
+	}
+
+	vidxd_mmio_init(vidxd);
+}
+
+struct vfio_region {
+	u32 type;
+	u32 subtype;
+	size_t size;
+	u32 flags;
+};
+
+struct kvmidxd_guest_info {
+	struct kvm *kvm;
+	struct vdcm_idxd *vidxd;
+};
+
+static int kvmidxd_guest_init(struct mdev_device *mdev)
+{
+	struct kvmidxd_guest_info *info;
+	struct vdcm_idxd *vidxd;
+	struct kvm *kvm;
+	struct device *dev = mdev_dev(mdev);
+
+	vidxd = mdev_get_drvdata(mdev);
+	if (handle_valid(vidxd->handle))
+		return -EEXIST;
+
+	kvm = vidxd->vdev.kvm;
+	if (!kvm || kvm->mm != current->mm) {
+		dev_err(dev, "KVM is required to use Intel vIDXD\n");
+		return -ESRCH;
+	}
+
+	info = vzalloc(sizeof(*info));
+	if (!info)
+		return -ENOMEM;
+
+	vidxd->handle = (unsigned long)info;
+	info->vidxd = vidxd;
+	info->kvm = kvm;
+
+	return 0;
+}
+
+static bool kvmidxd_guest_exit(unsigned long handle)
+{
+	if (handle == 0)
+		return false;
+
+	vfree((void *)handle);
+
+	return true;
+}
+
+static void __idxd_vdcm_release(struct vdcm_idxd *vidxd)
+{
+	int rc;
+	struct device *dev = &vidxd->idxd->pdev->dev;
+
+	if (atomic_cmpxchg(&vidxd->vdev.released, 0, 1))
+		return;
+
+	if (!handle_valid(vidxd->handle))
+		return;
+
+	/* Re-initialize the VIDXD to a pristine state for re-use */
+	rc = vfio_unregister_notifier(mdev_dev(vidxd->vdev.mdev),
+				      VFIO_GROUP_NOTIFY,
+				      &vidxd->vdev.group_notifier);
+	if (rc < 0)
+		dev_warn(dev, "vfio_unregister_notifier group failed: %d\n",
+			 rc);
+
+	kvmidxd_guest_exit(vidxd->handle);
+	vidxd_free_ims_entries(vidxd);
+
+	vidxd->vdev.kvm = NULL;
+	vidxd->handle = 0;
+	idxd_vdcm_reinit(vidxd);
+}
+
+static void idxd_vdcm_release(struct mdev_device *mdev)
+{
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	struct device *dev = mdev_dev(mdev);
+
+	dev_dbg(dev, "vdcm_idxd_release %d\n", vidxd->type->type);
+	__idxd_vdcm_release(vidxd);
+}
+
+static void idxd_vdcm_release_work(struct work_struct *work)
+{
+	struct vdcm_idxd *vidxd = container_of(work, struct vdcm_idxd,
+					       vdev.release_work);
+
+	__idxd_vdcm_release(vidxd);
+}
+
+static bool idxd_wq_match_uuid(struct idxd_wq *wq, const guid_t *uuid)
+{
+	struct idxd_wq_uuid *entry;
+	bool found = false;
+
+	list_for_each_entry(entry, &wq->uuid_list, list) {
+		if (guid_equal(&entry->uuid, uuid)) {
+			found = true;
+			break;
+		}
+	}
+
+	return found;
+}
+
+static struct idxd_wq *find_wq_by_uuid(struct idxd_device *idxd,
+				       const guid_t *uuid)
+{
+	int i;
+	struct idxd_wq *wq;
+	bool found = false;
+
+	for (i = 0; i < idxd->max_wqs; i++) {
+		wq = &idxd->wqs[i];
+		found = idxd_wq_match_uuid(wq, uuid);
+		if (found)
+			return wq;
+	}
+
+	return NULL;
+}
+
+static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd,
+					   struct mdev_device *mdev,
+					   struct vdcm_idxd_type *type)
+{
+	struct vdcm_idxd *vidxd;
+	unsigned long flags;
+	struct idxd_wq *wq = NULL;
+	struct device *dev = mdev_dev(mdev);
+
+	wq = find_wq_by_uuid(idxd, mdev_uuid(mdev));
+	if (!wq) {
+		dev_dbg(dev, "No WQ found\n");
+		return NULL;
+	}
+
+	if (wq->state != IDXD_WQ_ENABLED)
+		return NULL;
+
+	vidxd = kzalloc(sizeof(*vidxd), GFP_KERNEL);
+	if (!vidxd)
+		return NULL;
+
+	vidxd->idxd = idxd;
+	vidxd->vdev.mdev = mdev;
+	vidxd->wq = wq;
+	mdev_set_drvdata(mdev, vidxd);
+	vidxd->type = type;
+	vidxd->num_wqs = 1;
+
+	mutex_lock(&wq->wq_lock);
+	if (wq_dedicated(wq)) {
+		/* disable wq. will be enabled by the VM */
+		spin_lock_irqsave(&vidxd->idxd->dev_lock, flags);
+		idxd_wq_disable(vidxd->wq, NULL);
+		spin_unlock_irqrestore(&vidxd->idxd->dev_lock, flags);
+	}
+
+	/* Initialize virtual PCI resources if it is an MDEV type for a VM */
+	memcpy(vidxd->cfg, idxd_pci_config, sizeof(idxd_pci_config));
+	memcpy(vidxd->cfg + 0x100, idxd_pci_ext_cap,
+	       sizeof(idxd_pci_ext_cap));
+	memcpy(vidxd->bar0.cap_ctrl_regs, idxd_cap_ctrl_reg,
+	       sizeof(idxd_cap_ctrl_reg));
+
+	/* Set the MSI-X table size */
+	vidxd->cfg[VIDXD_MSIX_TBL_SZ_OFFSET] = 1;
+	vidxd->bar_size[0] = VIDXD_BAR0_SIZE;
+	vidxd->bar_size[1] = VIDXD_BAR2_SIZE;
+
+	vidxd_mmio_init(vidxd);
+
+	INIT_WORK(&vidxd->vdev.release_work, idxd_vdcm_release_work);
+
+	idxd_wq_get(wq);
+	mutex_unlock(&wq->wq_lock);
+
+	return vidxd;
+}
+
+static struct vdcm_idxd_type idxd_mdev_types[IDXD_MDEV_TYPES] = {
+	{
+		.name = "wq",
+		.description = "IDXD MDEV workqueue",
+		.type = IDXD_MDEV_TYPE_WQ,
+	},
+};
+
+static struct vdcm_idxd_type *idxd_vdcm_find_vidxd_type(struct device *dev,
+							const char *name)
+{
+	int i;
+	char dev_name[IDXD_MDEV_NAME_LEN];
+
+	for (i = 0; i < IDXD_MDEV_TYPES; i++) {
+		snprintf(dev_name, IDXD_MDEV_NAME_LEN, "idxd-%s",
+			 idxd_mdev_types[i].name);
+
+		if (!strncmp(name, dev_name, IDXD_MDEV_NAME_LEN))
+			return &idxd_mdev_types[i];
+	}
+
+	return NULL;
+}
+
+static int idxd_vdcm_create(struct kobject *kobj, struct mdev_device *mdev)
+{
+	struct vdcm_idxd *vidxd;
+	struct vdcm_idxd_type *type;
+	struct device *dev, *parent;
+	struct idxd_device *idxd;
+	int rc = 0;
+
+	parent = mdev_parent_dev(mdev);
+	idxd = dev_get_drvdata(parent);
+	dev = mdev_dev(mdev);
+
+	mdev_set_iommu_device(dev, parent);
+	mutex_lock(&idxd->mdev_lock);
+	type = idxd_vdcm_find_vidxd_type(dev, kobject_name(kobj));
+	if (!type) {
+		dev_err(dev, "failed to find type %s to create\n",
+			kobject_name(kobj));
+		rc = -EINVAL;
+		goto out;
+	}
+
+	vidxd = vdcm_vidxd_create(idxd, mdev, type);
+	if (IS_ERR_OR_NULL(vidxd)) {
+		rc = !vidxd ? -ENOMEM : PTR_ERR(vidxd);
+		dev_err(dev, "failed to create vidxd: %d\n", rc);
+		goto out;
+	}
+
+	list_add(&vidxd->list, &vidxd->wq->vdcm_list);
+	dev_dbg(dev, "mdev creation success: %s\n", dev_name(mdev_dev(mdev)));
+
+ out:
+	mutex_unlock(&idxd->mdev_lock);
+	return rc;
+}
+
+static void vdcm_vidxd_remove(struct vdcm_idxd *vidxd)
+{
+	struct idxd_device *idxd = vidxd->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	struct idxd_wq *wq = vidxd->wq;
+
+	dev_dbg(dev, "%s: removing for wq %d\n", __func__, vidxd->wq->id);
+
+	mutex_lock(&wq->wq_lock);
+	list_del(&vidxd->list);
+	idxd_wq_put(wq);
+	mutex_unlock(&wq->wq_lock);
+	kfree(vidxd);
+}
+
+static int idxd_vdcm_remove(struct mdev_device *mdev)
+{
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+
+	if (handle_valid(vidxd->handle))
+		return -EBUSY;
+
+	vdcm_vidxd_remove(vidxd);
+	return 0;
+}
+
+static int idxd_vdcm_group_notifier(struct notifier_block *nb,
+				    unsigned long action, void *data)
+{
+	struct vdcm_idxd *vidxd = container_of(nb, struct vdcm_idxd,
+			vdev.group_notifier);
+
+	/* The only action we care about */
+	if (action == VFIO_GROUP_NOTIFY_SET_KVM) {
+		vidxd->vdev.kvm = data;
+
+		if (!data)
+			schedule_work(&vidxd->vdev.release_work);
+	}
+
+	return NOTIFY_OK;
+}
+
+static int idxd_vdcm_open(struct mdev_device *mdev)
+{
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	unsigned long events;
+	int rc;
+	struct vdcm_idxd_type *type = vidxd->type;
+	struct device *dev = mdev_dev(mdev);
+
+	dev_dbg(dev, "%s: type: %d\n", __func__, type->type);
+
+	vidxd->vdev.group_notifier.notifier_call = idxd_vdcm_group_notifier;
+	events = VFIO_GROUP_NOTIFY_SET_KVM;
+	rc = vfio_register_notifier(mdev_dev(mdev), VFIO_GROUP_NOTIFY,
+				    &events, &vidxd->vdev.group_notifier);
+	if (rc < 0) {
+		dev_err(dev, "vfio_register_notifier for group failed: %d\n",
+			rc);
+		return rc;
+	}
+
+	/* allocate and setup IMS entries */
+	rc = vidxd_setup_ims_entries(vidxd);
+	if (rc < 0)
+		goto undo_group;
+
+	rc = kvmidxd_guest_init(mdev);
+	if (rc)
+		goto undo_ims;
+
+	atomic_set(&vidxd->vdev.released, 0);
+
+	return rc;
+
+ undo_ims:
+	vidxd_free_ims_entries(vidxd);
+ undo_group:
+	vfio_unregister_notifier(mdev_dev(mdev), VFIO_GROUP_NOTIFY,
+				 &vidxd->vdev.group_notifier);
+	return rc;
+}
+
+static int vdcm_vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf,
+				 unsigned int size)
+{
+	u32 offset = pos & (vidxd->bar_size[0] - 1);
+	struct vdcm_idxd_pci_bar0 *bar0 = &vidxd->bar0;
+	struct device *dev = mdev_dev(vidxd->vdev.mdev);
+
+	dev_WARN_ONCE(dev, (size & (size - 1)) != 0, "%s\n", __func__);
+	dev_WARN_ONCE(dev, size > 8, "%s\n", __func__);
+	dev_WARN_ONCE(dev, (offset & (size - 1)) != 0, "%s\n", __func__);
+
+	dev_dbg(dev, "vidxd mmio W %d %x %x: %llx\n", vidxd->wq->id, size,
+		offset, get_reg_val(buf, size));
+
+	/* Without this limit, we could potentially write out of bounds */
+	if (size > 8)
+		size = 8;
+
+	switch (offset) {
+	case IDXD_GENCFG_OFFSET ... IDXD_GENCFG_OFFSET + 7:
+		/* Write only when device is disabled. */
+		if (vidxd_state(vidxd) == IDXD_DEVICE_STATE_DISABLED)
+			memcpy(&bar0->cap_ctrl_regs[offset], buf, size);
+		break;
+
+	case IDXD_GENCTRL_OFFSET:
+		memcpy(&bar0->cap_ctrl_regs[offset], buf, size);
+		break;
+
+	case IDXD_INTCAUSE_OFFSET:
+		bar0->cap_ctrl_regs[offset] &= ~(get_reg_val(buf, 1) & 0x0f);
+		break;
+
+	case IDXD_CMD_OFFSET:
+		if (size == 4) {
+			u8 *cap_ctrl = &bar0->cap_ctrl_regs[0];
+			unsigned long *cmdsts =
+				(unsigned long *)&cap_ctrl[IDXD_CMDSTS_OFFSET];
+			u32 val = get_reg_val(buf, size);
+
+			/* Check and set device active */
+			if (test_and_set_bit(31, cmdsts) == 0) {
+				*(u32 *)cmdsts = 1U << 31;
+				vidxd_do_command(vidxd, val);
+			}
+		}
+		break;
+
+	case IDXD_SWERR_OFFSET:
+		/* W1C */
+		bar0->cap_ctrl_regs[offset] &= ~(get_reg_val(buf, 1) & 3);
+		break;
+
+	case VIDXD_WQCFG_OFFSET ... VIDXD_WQCFG_OFFSET + VIDXD_WQ_CTRL_SZ - 1: {
+		union wqcfg *wqcfg;
+		int wq_id = (offset - VIDXD_WQCFG_OFFSET) / 0x20;
+		struct idxd_wq *wq;
+		int subreg = offset & 0x1c;
+		u32 new_val;
+
+		if (wq_id >= 1)
+			break;
+		wq = vidxd->wq;
+		wqcfg = (union wqcfg *)&bar0->wq_ctrl_regs[wq_id * 0x20];
+		if (size >= 4) {
+			new_val = get_reg_val(buf, 4);
+		} else {
+			u32 tmp1, tmp2, shift, mask;
+
+			switch (subreg) {
+			case 4:
+				tmp1 = wqcfg->bits[1]; break;
+			case 8:
+				tmp1 = wqcfg->bits[2]; break;
+			case 12:
+				tmp1 = wqcfg->bits[3]; break;
+			case 16:
+				tmp1 = wqcfg->bits[4]; break;
+			case 20:
+				tmp1 = wqcfg->bits[5]; break;
+			default:
+				tmp1 = 0;
+			}
+
+			tmp2 = get_reg_val(buf, size);
+			shift = (offset & 0x03U) * 8;
+			mask = ((1U << size * 8) - 1u) << shift;
+			new_val = (tmp1 & ~mask) | (tmp2 << shift);
+		}
+
+		if (subreg == 8) {
+			if (wqcfg->wq_state == 0) {
+				wqcfg->bits[2] &= 0xfe;
+				wqcfg->bits[2] |= new_val & 0xffffff01;
+			}
+		}
+
+		break;
+	}
+
+	case VIDXD_MSIX_TABLE_OFFSET ...
+		VIDXD_MSIX_TABLE_OFFSET + VIDXD_MSIX_TBL_SZ - 1: {
+		int index = (offset - VIDXD_MSIX_TABLE_OFFSET) / 0x10;
+		u8 *msix_entry = &bar0->msix_table[index * 0x10];
+		u8 *msix_perm = &bar0->msix_perm_table[index * 8];
+		int end;
+
+		/* Upper bound checking to stop overflow */
+		end = VIDXD_MSIX_TABLE_OFFSET + VIDXD_MSIX_TBL_SZ;
+		if (offset + size > end)
+			size = end - offset;
+
+		memcpy(msix_entry + (offset & 0xf), buf, size);
+		/* check mask and pba */
+		if ((msix_entry[12] & 1) == 0) {
+			*(u32 *)msix_perm &= ~3U;
+			if (test_and_clear_bit(index, &bar0->msix_pba))
+				vidxd_send_interrupt(vidxd, index);
+		} else {
+			*(u32 *)msix_perm |= 1;
+		}
+		break;
+	}
+
+	case VIDXD_MSIX_PERM_OFFSET ...
+		VIDXD_MSIX_PERM_OFFSET + VIDXD_MSIX_PERM_TBL_SZ - 1:
+		if ((offset & 7) == 0 && size == 4) {
+			int index = (offset - VIDXD_MSIX_PERM_OFFSET) / 8;
+			u32 *msix_perm =
+				(u32 *)&bar0->msix_perm_table[index * 8];
+			u8 *msix_entry = &bar0->msix_table[index * 0x10];
+			u32 val = get_reg_val(buf, size) & 0xfffff00d;
+
+			if (index > 0)
+				vidxd_setup_ims_entry(vidxd, index - 1, val);
+
+			if (val & 1) {
+				msix_entry[12] |= 1;
+				if (bar0->msix_pba & (1ULL << index))
+					val |= 2;
+			} else {
+				msix_entry[12] &= ~1u;
+				if (test_and_clear_bit(index,
+						       &bar0->msix_pba))
+					vidxd_send_interrupt(vidxd, index);
+			}
+			*msix_perm = val;
+		}
+		break;
+	}
+
+	return 0;
+}
+
+static int vdcm_vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf,
+				unsigned int size)
+{
+	u32 offset = pos & (vidxd->bar_size[0] - 1);
+	struct vdcm_idxd_pci_bar0 *bar0 = &vidxd->bar0;
+	u8 *reg_addr, *msix_table, *msix_perm_table;
+	struct device *dev = mdev_dev(vidxd->vdev.mdev);
+	u32 end;
+
+	dev_WARN_ONCE(dev, (size & (size - 1)) != 0, "%s\n", __func__);
+	dev_WARN_ONCE(dev, size > 8, "%s\n", __func__);
+	dev_WARN_ONCE(dev, (offset & (size - 1)) != 0, "%s\n", __func__);
+
+	/* Without this limit, we could potentially read out of bounds */
+	if (size > 8)
+		size = 8;
+
+	switch (offset) {
+	case 0 ... VIDXD_CAP_CTRL_SZ - 1:
+		end = VIDXD_CAP_CTRL_SZ;
+		if (offset + 8 > end)
+			size = end - offset;
+		reg_addr = &bar0->cap_ctrl_regs[offset];
+		break;
+
+	case VIDXD_GRPCFG_OFFSET ...
+		VIDXD_GRPCFG_OFFSET + VIDXD_GRP_CTRL_SZ - 1:
+		end = VIDXD_GRPCFG_OFFSET + VIDXD_GRP_CTRL_SZ;
+		if (offset + 8 > end)
+			size = end - offset;
+		reg_addr = &bar0->grp_ctrl_regs[offset - VIDXD_GRPCFG_OFFSET];
+		break;
+
+	case VIDXD_WQCFG_OFFSET ... VIDXD_WQCFG_OFFSET + VIDXD_WQ_CTRL_SZ - 1:
+		end = VIDXD_WQCFG_OFFSET + VIDXD_WQ_CTRL_SZ;
+		if (offset + 8 > end)
+			size = end - offset;
+		reg_addr = &bar0->wq_ctrl_regs[offset - VIDXD_WQCFG_OFFSET];
+		break;
+
+	case VIDXD_MSIX_TABLE_OFFSET ...
+		VIDXD_MSIX_TABLE_OFFSET + VIDXD_MSIX_TBL_SZ - 1:
+		end = VIDXD_MSIX_TABLE_OFFSET + VIDXD_MSIX_TBL_SZ;
+		if (offset + 8 > end)
+			size = end - offset;
+		msix_table = &bar0->msix_table[0];
+		reg_addr = &msix_table[offset - VIDXD_MSIX_TABLE_OFFSET];
+		break;
+
+	case VIDXD_MSIX_PBA_OFFSET ... VIDXD_MSIX_PBA_OFFSET + 7:
+		end = VIDXD_MSIX_PBA_OFFSET + 8;
+		if (offset + 8 > end)
+			size = end - offset;
+		reg_addr = (u8 *)&bar0->msix_pba;
+		break;
+
+	case VIDXD_MSIX_PERM_OFFSET ...
+		VIDXD_MSIX_PERM_OFFSET + VIDXD_MSIX_PERM_TBL_SZ - 1:
+		end = VIDXD_MSIX_PERM_OFFSET + VIDXD_MSIX_PERM_TBL_SZ;
+		if (offset + 8 > end)
+			size = end - offset;
+		msix_perm_table = &bar0->msix_perm_table[0];
+		reg_addr = &msix_perm_table[offset - VIDXD_MSIX_PERM_OFFSET];
+		break;
+
+	default:
+		reg_addr = NULL;
+		break;
+	}
+
+	if (reg_addr)
+		memcpy(buf, reg_addr, size);
+	else
+		memset(buf, 0, size);
+
+	dev_dbg(dev, "vidxd mmio R %d %x %x: %llx\n",
+		vidxd->wq->id, size, offset, get_reg_val(buf, size));
+	return 0;
+}
+
+static int vdcm_vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos,
+			       void *buf, unsigned int count)
+{
+	u32 offset = pos & 0xfff;
+	struct device *dev = mdev_dev(vidxd->vdev.mdev);
+
+	memcpy(buf, &vidxd->cfg[offset], count);
+
+	dev_dbg(dev, "vidxd pci R %d %x %x: %llx\n",
+		vidxd->wq->id, count, offset, get_reg_val(buf, count));
+
+	return 0;
+}
+
+static int vdcm_vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos,
+				void *buf, unsigned int size)
+{
+	u32 offset = pos & 0xfff;
+	u64 val;
+	u8 *cfg = vidxd->cfg;
+	u8 *bar0 = vidxd->bar0.cap_ctrl_regs;
+	struct device *dev = mdev_dev(vidxd->vdev.mdev);
+
+	dev_dbg(dev, "vidxd pci W %d %x %x: %llx\n", vidxd->wq->id, size,
+		offset, get_reg_val(buf, size));
+
+	switch (offset) {
+	case PCI_COMMAND: { /* device control */
+		bool bme;
+
+		memcpy(&cfg[offset], buf, size);
+		bme = cfg[offset] & PCI_COMMAND_MASTER;
+		if (!bme &&
+		    ((*(u32 *)&bar0[IDXD_GENSTATS_OFFSET]) & 0x3) != 0) {
+			*(u32 *)(&bar0[IDXD_SWERR_OFFSET]) = 0x51u << 8;
+			*(u32 *)(&bar0[IDXD_GENSTATS_OFFSET]) = 0;
+		}
+
+		if (size < 4)
+			break;
+		offset += 2;
+		buf = buf + 2;
+		size -= 2;
+	}
+	/* fall through */
+
+	case PCI_STATUS: { /* device status */
+		u16 nval = get_reg_val(buf, size) << (offset & 1) * 8;
+
+		nval &= 0xf900;
+		*(u16 *)&cfg[offset] = *((u16 *)&cfg[offset]) & ~nval;
+		break;
+	}
+
+	case PCI_CACHE_LINE_SIZE:
+	case PCI_INTERRUPT_LINE:
+		memcpy(&cfg[offset], buf, size);
+		break;
+
+	case PCI_BASE_ADDRESS_0: /* BAR0 */
+	case PCI_BASE_ADDRESS_1: /* BAR1 */
+	case PCI_BASE_ADDRESS_2: /* BAR2 */
+	case PCI_BASE_ADDRESS_3: /* BAR3 */ {
+		unsigned int bar_id, bar_offset;
+		u64 bar, bar_size;
+
+		bar_id = (offset - PCI_BASE_ADDRESS_0) / 8;
+		bar_size = vidxd->bar_size[bar_id];
+		bar_offset = PCI_BASE_ADDRESS_0 + bar_id * 8;
+
+		val = get_reg_val(buf, size);
+		bar = *(u64 *)&cfg[bar_offset];
+		memcpy((u8 *)&bar + (offset & 0x7), buf, size);
+		bar &= ~(bar_size - 1);
+
+		*(u64 *)&cfg[bar_offset] = bar |
+			PCI_BASE_ADDRESS_MEM_TYPE_64 |
+			PCI_BASE_ADDRESS_MEM_PREFETCH;
+
+		if (val == -1U || val == -1ULL)
+			break;
+		if (bar == 0 || bar == -1ULL - -1U)
+			break;
+		if (bar == (-1U & ~(bar_size - 1)))
+			break;
+		if (bar == (-1ULL & ~(bar_size - 1)))
+			break;
+		if (bar == vidxd->bar_val[bar_id])
+			break;
+
+		vidxd->bar_val[bar_id] = bar;
+		break;
+	}
+
+	case VIDXD_ATS_OFFSET + 4:
+		if (size < 4)
+			break;
+		offset += 2;
+		buf = buf + 2;
+		size -= 2;
+		/* fall through */
+
+	case VIDXD_ATS_OFFSET + 6:
+		memcpy(&cfg[offset], buf, size);
+		break;
+
+	case VIDXD_PRS_OFFSET + 4: {
+		u8 old_val, new_val;
+
+		val = get_reg_val(buf, 1);
+		old_val = cfg[VIDXD_PRS_OFFSET + 4];
+		new_val = val & 1;
+
+		cfg[offset] = new_val;
+		if (old_val == 0 && new_val == 1) {
+			/*
+			 * Clear Stopped, Response Failure,
+			 * and Unexpected Response.
+			 */
+			*(u16 *)&cfg[VIDXD_PRS_OFFSET + 6] &= ~(u16)(0x0103);
+		}
+
+		if (size < 4)
+			break;
+
+		offset += 2;
+		buf = (u8 *)buf + 2;
+		size -= 2;
+	}
+	/* fall through */
+
+	case VIDXD_PRS_OFFSET + 6:
+		cfg[offset] &= ~(get_reg_val(buf, 1) & 3);
+		break;
+	case VIDXD_PRS_OFFSET + 12 ... VIDXD_PRS_OFFSET + 15:
+		memcpy(&cfg[offset], buf, size);
+		break;
+
+	case VIDXD_PASID_OFFSET + 4:
+		if (size < 4)
+			break;
+		offset += 2;
+		buf = buf + 2;
+		size -= 2;
+		/* fall through */
+	case VIDXD_PASID_OFFSET + 6:
+		cfg[offset] = get_reg_val(buf, 1) & 5;
+		break;
+	}
+
+	return 0;
+}
+
+static ssize_t idxd_vdcm_rw(struct mdev_device *mdev, char *buf,
+			    size_t count, loff_t *ppos, enum idxd_vdcm_rw mode)
+{
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	struct device *dev = mdev_dev(mdev);
+	int rc = -EINVAL;
+
+	if (index >= VFIO_PCI_NUM_REGIONS) {
+		dev_err(dev, "invalid index: %u\n", index);
+		return -EINVAL;
+	}
+
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		if (mode == IDXD_VDCM_WRITE)
+			rc = vdcm_vidxd_cfg_write(vidxd, pos, buf, count);
+		else
+			rc = vdcm_vidxd_cfg_read(vidxd, pos, buf, count);
+		break;
+	case VFIO_PCI_BAR0_REGION_INDEX:
+	case VFIO_PCI_BAR1_REGION_INDEX:
+		if (mode == IDXD_VDCM_WRITE)
+			rc = vdcm_vidxd_mmio_write(vidxd,
+						   vidxd->bar_val[0] + pos, buf,
+						   count);
+		else
+			rc = vdcm_vidxd_mmio_read(vidxd,
+						  vidxd->bar_val[0] + pos, buf,
+						  count);
+		break;
+	case VFIO_PCI_BAR2_REGION_INDEX:
+	case VFIO_PCI_BAR3_REGION_INDEX:
+	case VFIO_PCI_BAR4_REGION_INDEX:
+	case VFIO_PCI_BAR5_REGION_INDEX:
+	case VFIO_PCI_VGA_REGION_INDEX:
+	case VFIO_PCI_ROM_REGION_INDEX:
+	default:
+		dev_err(dev, "unsupported region: %u\n", index);
+	}
+
+	return rc == 0 ? count : rc;
+}
+
+static ssize_t idxd_vdcm_read(struct mdev_device *mdev, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	unsigned int done = 0;
+	int rc;
+
+	while (count) {
+		size_t filled;
+
+		if (count >= 8 && !(*ppos % 8)) {
+			u64 val;
+
+			rc = idxd_vdcm_rw(mdev, (char *)&val, sizeof(val),
+					  ppos, IDXD_VDCM_READ);
+			if (rc <= 0)
+				goto read_err;
+
+			if (copy_to_user(buf, &val, sizeof(val)))
+				goto read_err;
+
+			filled = 8;
+		} else if (count >= 4 && !(*ppos % 4)) {
+			u32 val;
+
+			rc = idxd_vdcm_rw(mdev, (char *)&val, sizeof(val),
+					  ppos, IDXD_VDCM_READ);
+			if (rc <= 0)
+				goto read_err;
+
+			if (copy_to_user(buf, &val, sizeof(val)))
+				goto read_err;
+
+			filled = 4;
+		} else if (count >= 2 && !(*ppos % 2)) {
+			u16 val;
+
+			rc = idxd_vdcm_rw(mdev, (char *)&val, sizeof(val),
+					  ppos, IDXD_VDCM_READ);
+			if (rc <= 0)
+				goto read_err;
+
+			if (copy_to_user(buf, &val, sizeof(val)))
+				goto read_err;
+
+			filled = 2;
+		} else {
+			u8 val;
+
+			rc = idxd_vdcm_rw(mdev, &val, sizeof(val), ppos,
+					  IDXD_VDCM_READ);
+			if (rc <= 0)
+				goto read_err;
+
+			if (copy_to_user(buf, &val, sizeof(val)))
+				goto read_err;
+
+			filled = 1;
+		}
+
+		count -= filled;
+		done += filled;
+		*ppos += filled;
+		buf += filled;
+	}
+
+	return done;
+
+ read_err:
+	return -EFAULT;
+}
+
+static ssize_t idxd_vdcm_write(struct mdev_device *mdev,
+			       const char __user *buf, size_t count,
+			       loff_t *ppos)
+{
+	unsigned int done = 0;
+	int rc;
+
+	while (count) {
+		size_t filled;
+
+		if (count >= 8 && !(*ppos % 8)) {
+			u64 val;
+
+			if (copy_from_user(&val, buf, sizeof(val)))
+				goto write_err;
+
+			rc = idxd_vdcm_rw(mdev, (char *)&val, sizeof(val),
+					  ppos, IDXD_VDCM_WRITE);
+			if (rc <= 0)
+				goto write_err;
+
+			filled = 8;
+		} else if (count >= 4 && !(*ppos % 4)) {
+			u32 val;
+
+			if (copy_from_user(&val, buf, sizeof(val)))
+				goto write_err;
+
+			rc = idxd_vdcm_rw(mdev, (char *)&val, sizeof(val),
+					  ppos, IDXD_VDCM_WRITE);
+			if (rc <= 0)
+				goto write_err;
+
+			filled = 4;
+		} else if (count >= 2 && !(*ppos % 2)) {
+			u16 val;
+
+			if (copy_from_user(&val, buf, sizeof(val)))
+				goto write_err;
+
+			rc = idxd_vdcm_rw(mdev, (char *)&val,
+					  sizeof(val), ppos, IDXD_VDCM_WRITE);
+			if (rc <= 0)
+				goto write_err;
+
+			filled = 2;
+		} else {
+			u8 val;
+
+			if (copy_from_user(&val, buf, sizeof(val)))
+				goto write_err;
+
+			rc = idxd_vdcm_rw(mdev, &val, sizeof(val),
+					  ppos, IDXD_VDCM_WRITE);
+			if (rc <= 0)
+				goto write_err;
+
+			filled = 1;
+		}
+
+		count -= filled;
+		done += filled;
+		*ppos += filled;
+		buf += filled;
+	}
+
+	return done;
+write_err:
+	return -EFAULT;
+}
+
+static int check_vma(struct idxd_wq *wq, struct vm_area_struct *vma,
+		     const char *func)
+{
+	if (vma->vm_end < vma->vm_start)
+		return -EINVAL;
+	if (!(vma->vm_flags & VM_SHARED))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int idxd_vdcm_mmap(struct mdev_device *mdev, struct vm_area_struct *vma)
+{
+	unsigned int wq_idx;
+	int rc;
+	unsigned long req_size, pgoff = 0, offset;
+	pgprot_t pg_prot;
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	struct idxd_wq *wq = vidxd->wq;
+	struct idxd_device *idxd = vidxd->idxd;
+	enum idxd_portal_prot virt_limited, phys_limited;
+	phys_addr_t base = pci_resource_start(idxd->pdev, IDXD_WQ_BAR);
+	struct device *dev = mdev_dev(mdev);
+
+	rc = check_vma(wq, vma, __func__);
+	if (rc)
+		return rc;
+
+	pg_prot = vma->vm_page_prot;
+	req_size = vma->vm_end - vma->vm_start;
+	vma->vm_flags |= VM_DONTCOPY;
+
+	offset = (vma->vm_pgoff << PAGE_SHIFT) &
+		 ((1ULL << VFIO_PCI_OFFSET_SHIFT) - 1);
+
+	wq_idx = offset >> (PAGE_SHIFT + 2);
+	if (wq_idx >= 1) {
+		dev_err(dev, "mapping invalid wq %d off %lx\n",
+			wq_idx, offset);
+		return -EINVAL;
+	}
+
+	virt_limited = ((offset >> PAGE_SHIFT) & 0x3) == 1;
+	phys_limited = IDXD_PORTAL_LIMITED;
+
+	if (virt_limited == IDXD_PORTAL_UNLIMITED && wq_dedicated(wq))
+		phys_limited = IDXD_PORTAL_UNLIMITED;
+
+	/* We always map IMS portals to the guest */
+	pgoff = (base +
+		idxd_get_wq_portal_full_offset(wq->id, phys_limited,
+					       IDXD_IRQ_IMS)) >> PAGE_SHIFT;
+
+	dev_dbg(dev, "mmap %lx %lx %lx %lx\n", vma->vm_start, pgoff, req_size,
+		pgprot_val(pg_prot));
+	pg_prot = pgprot_noncached(pg_prot);
+	vma->vm_page_prot = pg_prot;
+	vma->vm_pgoff = pgoff;
+	vma->vm_private_data = mdev;
+
+	return remap_pfn_range(vma, vma->vm_start, pgoff, req_size, pg_prot);
+}
+
+static int idxd_vdcm_get_irq_count(struct vdcm_idxd *vidxd, int type)
+{
+	if (type == VFIO_PCI_MSI_IRQ_INDEX ||
+	    type == VFIO_PCI_MSIX_IRQ_INDEX)
+		return vidxd->num_wqs + 1;
+
+	return 0;
+}
+
+static int vdcm_idxd_set_msix_trigger(struct vdcm_idxd *vidxd,
+				      unsigned int index, unsigned int start,
+				      unsigned int count, uint32_t flags,
+				      void *data)
+{
+	struct eventfd_ctx *trigger;
+	int i, rc = 0;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+
+	if (count > VIDXD_MAX_MSIX_ENTRIES - 1)
+		count = VIDXD_MAX_MSIX_ENTRIES - 1;
+
+	if (count == 0 && (flags & VFIO_IRQ_SET_DATA_NONE)) {
+		/* Disable all MSIX entries */
+		for (i = 0; i < VIDXD_MAX_MSIX_ENTRIES; i++) {
+			if (vidxd->vdev.msix_trigger[i]) {
+				dev_dbg(dev, "disable MSIX entry %d\n", i);
+				eventfd_ctx_put(vidxd->vdev.msix_trigger[i]);
+				vidxd->vdev.msix_trigger[i] = 0;
+
+				if (i) {
+					rc = vidxd_free_ims_entry(vidxd, i - 1);
+					if (rc)
+						return rc;
+				}
+			}
+		}
+		return 0;
+	}
+
+	for (i = 0; i < count; i++) {
+		if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+			u32 fd = *(u32 *)(data + i * sizeof(u32));
+
+			dev_dbg(dev, "enable MSIX entry %d\n", i);
+			trigger = eventfd_ctx_fdget(fd);
+			if (IS_ERR(trigger)) {
+				pr_err("eventfd_ctx_fdget failed %d\n", i);
+				return PTR_ERR(trigger);
+			}
+			vidxd->vdev.msix_trigger[i] = trigger;
+			/*
+			 * Allocate a vector from the OS and set in the IMS
+			 * entry
+			 */
+			if (i) {
+				rc = vidxd_setup_ims_entry(vidxd, i - 1, 0);
+				if (rc)
+					return rc;
+			}
+		} else if (flags & VFIO_IRQ_SET_DATA_NONE) {
+			dev_dbg(dev, "disable MSIX entry %d\n", i);
+			eventfd_ctx_put(vidxd->vdev.msix_trigger[i]);
+			vidxd->vdev.msix_trigger[i] = 0;
+
+			if (i) {
+				rc = vidxd_free_ims_entry(vidxd, i - 1);
+				if (rc)
+					return rc;
+			}
+		}
+	}
+	return rc;
+}
+
+static int idxd_vdcm_set_irqs(struct vdcm_idxd *vidxd, uint32_t flags,
+			      unsigned int index, unsigned int start,
+			      unsigned int count, void *data)
+{
+	int (*func)(struct vdcm_idxd *vidxd, unsigned int index,
+		    unsigned int start, unsigned int count, uint32_t flags,
+		    void *data) = NULL;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	int msixcnt = pci_msix_vec_count(vidxd->idxd->pdev);
+
+	if (msixcnt < 0)
+		return -ENXIO;
+
+	switch (index) {
+	case VFIO_PCI_INTX_IRQ_INDEX:
+		dev_warn(dev, "intx interrupts not supported.\n");
+		break;
+	case VFIO_PCI_MSI_IRQ_INDEX:
+		dev_dbg(dev, "msi interrupt.\n");
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			func = vdcm_idxd_set_msix_trigger;
+			break;
+		}
+		break;
+	case VFIO_PCI_MSIX_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			func = vdcm_idxd_set_msix_trigger;
+			break;
+		}
+		break;
+	default:
+		return -ENOTTY;
+	}
+
+	if (!func)
+		return -ENOTTY;
+
+	return func(vidxd, index, start, count, flags, data);
+}
+
+static void vidxd_vdcm_reset(struct vdcm_idxd *vidxd)
+{
+	vidxd_reset(vidxd);
+}
+
+static long idxd_vdcm_ioctl(struct mdev_device *mdev, unsigned int cmd,
+			    unsigned long arg)
+{
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	unsigned long minsz;
+	int rc = -EINVAL;
+	struct device *dev = mdev_dev(mdev);
+
+	dev_dbg(dev, "vidxd %lx ioctl, cmd: %d\n", vidxd->handle, cmd);
+
+	if (cmd == VFIO_DEVICE_GET_INFO) {
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+		info.flags |= VFIO_DEVICE_FLAGS_RESET;
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+		struct vfio_region_info info;
+		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+		int i;
+		struct vfio_region_info_cap_sparse_mmap *sparse = NULL;
+		size_t size;
+		int nr_areas = 1;
+		int cap_type_id = 0;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = VIDXD_MAX_CFG_SPACE_SZ;
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+			break;
+		case VFIO_PCI_BAR0_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vidxd->bar_size[info.index];
+			if (!info.size) {
+				info.flags = 0;
+				break;
+			}
+
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+			break;
+		case VFIO_PCI_BAR1_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = 0;
+			info.flags = 0;
+			break;
+		case VFIO_PCI_BAR2_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.flags = VFIO_REGION_INFO_FLAG_CAPS |
+				VFIO_REGION_INFO_FLAG_MMAP |
+				VFIO_REGION_INFO_FLAG_READ |
+				VFIO_REGION_INFO_FLAG_WRITE;
+			info.size = vidxd->bar_size[1];
+
+			/*
+			 * Every WQ has two areas for unlimited and limited
+			 * MSI-X portals. IMS portals are not reported
+			 */
+			nr_areas = 2;
+
+			size = sizeof(*sparse) +
+				(nr_areas * sizeof(*sparse->areas));
+			sparse = kzalloc(size, GFP_KERNEL);
+			if (!sparse)
+				return -ENOMEM;
+
+			sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+			sparse->header.version = 1;
+			sparse->nr_areas = nr_areas;
+			cap_type_id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+
+			sparse->areas[0].offset = 0;
+			sparse->areas[0].size = PAGE_SIZE;
+
+			sparse->areas[1].offset = PAGE_SIZE;
+			sparse->areas[1].size = PAGE_SIZE;
+			break;
+
+		case VFIO_PCI_BAR3_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = 0;
+			info.flags = 0;
+			dev_dbg(dev, "get region info bar:%d\n", info.index);
+			break;
+
+		case VFIO_PCI_ROM_REGION_INDEX:
+		case VFIO_PCI_VGA_REGION_INDEX:
+			dev_dbg(dev, "get region info index:%d\n",
+				info.index);
+			break;
+		default: {
+			struct vfio_region_info_cap_type cap_type = {
+				.header.id = VFIO_REGION_INFO_CAP_TYPE,
+				.header.version = 1
+			};
+
+			if (info.index >= VFIO_PCI_NUM_REGIONS +
+					vidxd->vdev.num_regions)
+				return -EINVAL;
+
+			i = info.index - VFIO_PCI_NUM_REGIONS;
+
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vidxd->vdev.region[i].size;
+			info.flags = vidxd->vdev.region[i].flags;
+
+			cap_type.type = vidxd->vdev.region[i].type;
+			cap_type.subtype = vidxd->vdev.region[i].subtype;
+
+			rc = vfio_info_add_capability(&caps, &cap_type.header,
+						      sizeof(cap_type));
+			if (rc)
+				return rc;
+		} /* default */
+		} /* info.index switch */
+
+		if ((info.flags & VFIO_REGION_INFO_FLAG_CAPS) && sparse) {
+			if (cap_type_id == VFIO_REGION_INFO_CAP_SPARSE_MMAP) {
+				rc = vfio_info_add_capability(&caps,
+							      &sparse->header,
+							      sizeof(*sparse) +
+							      (sparse->nr_areas *
+							      sizeof(*sparse->areas)));
+				kfree(sparse);
+				if (rc)
+					return rc;
+			}
+		}
+
+		if (caps.size) {
+			if (info.argsz < sizeof(info) + caps.size) {
+				info.argsz = sizeof(info) + caps.size;
+				info.cap_offset = 0;
+			} else {
+				vfio_info_cap_shift(&caps, sizeof(info));
+				if (copy_to_user((void __user *)arg +
+						 sizeof(info), caps.buf,
+						 caps.size)) {
+					kfree(caps.buf);
+					return -EFAULT;
+				}
+				info.cap_offset = sizeof(info);
+			}
+
+			kfree(caps.buf);
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+				    -EFAULT : 0;
+	} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_MSI_IRQ_INDEX:
+		case VFIO_PCI_MSIX_IRQ_INDEX:
+			break;
+		default:
+			return -EINVAL;
+		} /* switch(info.index) */
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD | VFIO_IRQ_INFO_NORESIZE;
+		info.count = idxd_vdcm_get_irq_count(vidxd, info.index);
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
+		struct vfio_irq_set hdr;
+		u8 *data = NULL;
+		size_t data_size = 0;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			int max = idxd_vdcm_get_irq_count(vidxd, hdr.index);
+
+			rc = vfio_set_irqs_validate_and_prepare(&hdr, max,
+								VFIO_PCI_NUM_IRQS,
+								&data_size);
+			if (rc) {
+				dev_err(dev, "intel:vfio_set_irqs_validate_and_prepare failed\n");
+				return -EINVAL;
+			}
+			if (data_size) {
+				data = memdup_user((void __user *)(arg + minsz),
+						   data_size);
+				if (IS_ERR(data))
+					return PTR_ERR(data);
+			}
+		}
+
+		rc = idxd_vdcm_set_irqs(vidxd, hdr.flags, hdr.index,
+					hdr.start, hdr.count, data);
+		kfree(data);
+		return rc;
+	} else if (cmd == VFIO_DEVICE_RESET) {
+		vidxd_vdcm_reset(vidxd);
+		return 0;
+	}
+
+	return rc;
+}
+
+static ssize_t name_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	struct vdcm_idxd_type *type;
+
+	type = idxd_vdcm_find_vidxd_type(dev, kobject_name(kobj));
+
+	if (type)
+		return sprintf(buf, "%s\n", type->description);
+
+	return -EINVAL;
+}
+static MDEV_TYPE_ATTR_RO(name);
+
+static int find_available_mdev_instances(struct idxd_device *idxd)
+{
+	int count = 0, i;
+
+	for (i = 0; i < idxd->max_wqs; i++) {
+		struct idxd_wq *wq;
+
+		wq = &idxd->wqs[i];
+		if (!is_idxd_wq_mdev(wq))
+			continue;
+
+		if ((idxd_wq_refcount(wq) <= 1 && wq_dedicated(wq)) ||
+		    !wq_dedicated(wq))
+			count++;
+	}
+
+	return count;
+}
+
+static ssize_t available_instances_show(struct kobject *kobj,
+					struct device *dev, char *buf)
+{
+	int count;
+	struct idxd_device *idxd = dev_get_drvdata(dev);
+	struct vdcm_idxd_type *type;
+
+	type = idxd_vdcm_find_vidxd_type(dev, kobject_name(kobj));
+	if (!type)
+		return -EINVAL;
+
+	count = find_available_mdev_instances(idxd);
+
+	return sprintf(buf, "%d\n", count);
+}
+static MDEV_TYPE_ATTR_RO(available_instances);
+
+static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
+			       char *buf)
+{
+	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
+}
+static MDEV_TYPE_ATTR_RO(device_api);
+
+static struct attribute *idxd_mdev_types_attrs[] = {
+	&mdev_type_attr_name.attr,
+	&mdev_type_attr_device_api.attr,
+	&mdev_type_attr_available_instances.attr,
+	NULL,
+};
+
+static struct attribute_group idxd_mdev_type_group0 = {
+	.name  = "wq",
+	.attrs = idxd_mdev_types_attrs,
+};
+
+static struct attribute_group *idxd_mdev_type_groups[] = {
+	&idxd_mdev_type_group0,
+	NULL,
+};
+
+static const struct mdev_parent_ops idxd_vdcm_ops = {
+	.supported_type_groups	= idxd_mdev_type_groups,
+	.create			= idxd_vdcm_create,
+	.remove			= idxd_vdcm_remove,
+	.open			= idxd_vdcm_open,
+	.release		= idxd_vdcm_release,
+	.read			= idxd_vdcm_read,
+	.write			= idxd_vdcm_write,
+	.mmap			= idxd_vdcm_mmap,
+	.ioctl			= idxd_vdcm_ioctl,
+};
+
+int idxd_mdev_host_init(struct idxd_device *idxd)
+{
+	struct device *dev = &idxd->pdev->dev;
+	int rc;
+
+	if (iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX)) {
+		rc = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_AUX);
+		if (rc < 0)
+			dev_warn(dev, "Failed to enable aux-domain: %d\n",
+				 rc);
+	} else {
+		dev_dbg(dev, "No aux-domain feature.\n");
+	}
+
+	return mdev_register_device(dev, &idxd_vdcm_ops);
+}
+
+void idxd_mdev_host_release(struct idxd_device *idxd)
+{
+	struct device *dev = &idxd->pdev->dev;
+	int rc;
+
+	if (iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX)) {
+		rc = iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
+		if (rc < 0)
+			dev_warn(dev, "Failed to disable aux-domain: %d\n",
+				 rc);
+	}
+
+	mdev_unregister_device(dev);
+}
diff --git a/drivers/dma/idxd/mdev.h b/drivers/dma/idxd/mdev.h
index 5b05b6cb2b7b..0b3a4c9822d4 100644
--- a/drivers/dma/idxd/mdev.h
+++ b/drivers/dma/idxd/mdev.h
@@ -48,6 +48,8 @@ struct ims_irq_entry {
 
 struct idxd_vdev {
 	struct mdev_device *mdev;
+	struct vfio_region *region;
+	int num_regions;
 	struct eventfd_ctx *msix_trigger[VIDXD_MAX_MSIX_ENTRIES];
 	struct notifier_block group_notifier;
 	struct kvm *kvm;
@@ -79,4 +81,25 @@ static inline struct vdcm_idxd *to_vidxd(struct idxd_vdev *vdev)
 	return container_of(vdev, struct vdcm_idxd, vdev);
 }
 
+#define IDXD_MDEV_NAME_LEN 16
+#define IDXD_MDEV_DESCRIPTION_LEN 64
+
+enum idxd_mdev_type {
+	IDXD_MDEV_TYPE_WQ = 0,
+};
+
+#define IDXD_MDEV_TYPES 1
+
+struct vdcm_idxd_type {
+	char name[IDXD_MDEV_NAME_LEN];
+	char description[IDXD_MDEV_DESCRIPTION_LEN];
+	enum idxd_mdev_type type;
+	unsigned int avail_instance;
+};
+
+enum idxd_vdcm_rw {
+	IDXD_VDCM_READ = 0,
+	IDXD_VDCM_WRITE,
+};
+
 #endif
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index a39e7ae6b3d9..043cf825a71f 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -137,6 +137,8 @@ enum idxd_device_status_state {
 	IDXD_DEVICE_STATE_HALT,
 };
 
+#define IDXD_GENSTATS_MASK		0x03
+
 enum idxd_device_reset_type {
 	IDXD_DEVICE_RESET_SOFTWARE = 0,
 	IDXD_DEVICE_RESET_FLR,
@@ -160,6 +162,7 @@ union idxd_command_reg {
 	};
 	u32 bits;
 } __packed;
+#define IDXD_CMD_INT_MASK		0x80000000
 
 enum idxd_cmd {
 	IDXD_CMD_ENABLE_DEVICE = 1,
@@ -333,4 +336,11 @@ union wqcfg {
 	};
 	u32 bits[8];
 } __packed;
+
+enum idxd_wq_hw_state {
+	IDXD_WQ_DEV_DISABLED = 0,
+	IDXD_WQ_DEV_ENABLED,
+	IDXD_WQ_DEV_BUSY,
+};
+
 #endif
diff --git a/drivers/dma/idxd/submit.c b/drivers/dma/idxd/submit.c
index bdcac933bb28..ee976b51b88d 100644
--- a/drivers/dma/idxd/submit.c
+++ b/drivers/dma/idxd/submit.c
@@ -57,6 +57,21 @@ struct idxd_desc *idxd_alloc_desc(struct idxd_wq *wq,
 	desc = wq->descs[idx];
 	memset(desc->hw, 0, sizeof(struct dsa_hw_desc));
 	memset(desc->completion, 0, sizeof(struct dsa_completion_record));
+
+	if (idxd->pasid_enabled)
+		desc->hw->pasid = idxd->pasid;
+
+	/*
+	 * Descriptor completion vectors are 1-8 for MSIX. We will round
+	 * robin through the 8 vectors.
+	 */
+	if (!idxd->int_handles) {
+		wq->vec_ptr = (wq->vec_ptr % idxd->num_wq_irqs) + 1;
+		desc->hw->int_handle =  wq->vec_ptr;
+	} else {
+		desc->hw->int_handle = idxd->int_handles[wq->id];
+	}
+
 	return desc;
 }
 
@@ -115,7 +130,6 @@ int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc,
 		     enum idxd_op_type optype)
 {
 	struct idxd_device *idxd = wq->idxd;
-	int vec = desc->hw->int_handle;
 	int rc;
 	void __iomem *portal;
 
@@ -143,9 +157,19 @@ int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc,
 	 * Pending the descriptor to the lockless list for the irq_entry
 	 * that we designated the descriptor to.
 	 */
-	if (desc->hw->flags & IDXD_OP_FLAG_RCI)
+	if (desc->hw->flags & IDXD_OP_FLAG_RCI) {
+		int vec;
+
+		/*
+		 * If the driver is on host kernel, it would be the value
+		 * assigned to interrupt handle, which is index for MSIX
+		 * vector. If it's guest then we'll set it to 1 for now
+		 * since only 1 workqueue is exported.
+		 */
+		vec = !idxd->int_handles ? desc->hw->int_handle : 1;
 		llist_add(&desc->llnode,
 			  &idxd->irq_entries[vec].pending_llist);
+	}
 
 	return 0;
 }
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index 07bad4f6c7fb..a175c2381e0e 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -4,6 +4,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/pci.h>
+#include <linux/uuid.h>
 #include <linux/device.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <uapi/linux/idxd.h>
@@ -14,6 +15,7 @@ static char *idxd_wq_type_names[] = {
 	[IDXD_WQT_NONE]		= "none",
 	[IDXD_WQT_KERNEL]	= "kernel",
 	[IDXD_WQT_USER]		= "user",
+	[IDXD_WQT_MDEV]		= "mdev",
 };
 
 static void idxd_conf_device_release(struct device *dev)
@@ -69,6 +71,11 @@ static inline bool is_idxd_wq_cdev(struct idxd_wq *wq)
 	return wq->type == IDXD_WQT_USER;
 }
 
+inline bool is_idxd_wq_mdev(struct idxd_wq *wq)
+{
+	return wq->type == IDXD_WQT_MDEV;
+}
+
 static int idxd_config_bus_match(struct device *dev,
 				 struct device_driver *drv)
 {
@@ -205,6 +212,13 @@ static int idxd_config_bus_probe(struct device *dev)
 				mutex_unlock(&wq->wq_lock);
 				return -EINVAL;
 			}
+
+			/* This check is added until we have SVM support for mdev */
+			if (wq->type == IDXD_WQT_MDEV) {
+				dev_warn(dev, "Shared MDEV unsupported.");
+				mutex_unlock(&wq->wq_lock);
+				return -EINVAL;
+			}
 		}
 
 		rc = idxd_wq_alloc_resources(wq);
@@ -237,7 +251,7 @@ static int idxd_config_bus_probe(struct device *dev)
 			}
 		}
 
-		rc = idxd_wq_enable(wq);
+		rc = idxd_wq_enable(wq, NULL);
 		if (rc < 0) {
 			spin_unlock_irqrestore(&idxd->dev_lock, flags);
 			mutex_unlock(&wq->wq_lock);
@@ -250,7 +264,7 @@ static int idxd_config_bus_probe(struct device *dev)
 		rc = idxd_wq_map_portal(wq);
 		if (rc < 0) {
 			dev_warn(dev, "wq portal mapping failed: %d\n", rc);
-			rc = idxd_wq_disable(wq);
+			rc = idxd_wq_disable(wq, NULL);
 			if (rc < 0)
 				dev_warn(dev, "IDXD wq disable failed\n");
 			spin_unlock_irqrestore(&idxd->dev_lock, flags);
@@ -311,7 +325,7 @@ static void disable_wq(struct idxd_wq *wq)
 	idxd_wq_unmap_portal(wq);
 
 	spin_lock_irqsave(&idxd->dev_lock, flags);
-	rc = idxd_wq_disable(wq);
+	rc = idxd_wq_disable(wq, NULL);
 	spin_unlock_irqrestore(&idxd->dev_lock, flags);
 
 	idxd_wq_free_resources(wq);
@@ -1106,6 +1120,100 @@ static ssize_t wq_threshold_store(struct device *dev,
 static struct device_attribute dev_attr_wq_threshold =
 		__ATTR(threshold, 0644, wq_threshold_show, wq_threshold_store);
 
+static ssize_t wq_uuid_store(struct device *dev,
+			     struct device_attribute *attr, const char *buf,
+			     size_t count)
+{
+	char *str;
+	int rc;
+	struct idxd_wq_uuid *entry, *n;
+	struct idxd_wq_uuid *wq_uuid;
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+	struct device *ddev = &wq->idxd->pdev->dev;
+
+	if (wq->type != IDXD_WQT_MDEV)
+		return -EPERM;
+
+	if (count < UUID_STRING_LEN || (count > UUID_STRING_LEN + 1))
+		return -EINVAL;
+
+	str = kstrndup(buf, count, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	wq_uuid = devm_kzalloc(ddev, sizeof(struct idxd_wq_uuid), GFP_KERNEL);
+	if (!wq_uuid) {
+		kfree(str);
+		return -ENOMEM;
+	}
+
+	rc = guid_parse(str, &wq_uuid->uuid);
+	kfree(str);
+	if (rc)
+		return rc;
+
+	mutex_lock(&wq->wq_lock);
+	/* If user writes 0, erase entire list. */
+	if (guid_is_null(&wq_uuid->uuid)) {
+		list_for_each_entry_safe(entry, n, &wq->uuid_list, list) {
+			list_del(&entry->list);
+			devm_kfree(ddev, entry);
+			wq->uuids--;
+		}
+
+		mutex_unlock(&wq->wq_lock);
+		return count;
+	}
+
+	/* If uuid already exists, remove the old uuid. */
+	list_for_each_entry_safe(entry, n, &wq->uuid_list, list) {
+		if (guid_equal(&wq_uuid->uuid, &entry->uuid)) {
+			list_del(&entry->list);
+			devm_kfree(ddev, entry);
+			wq->uuids--;
+			mutex_unlock(&wq->wq_lock);
+			return count;
+		}
+	}
+
+	/*
+	 * At this point we are only adding, and the wq must be enabled
+	 * to do so; the type of a disabled wq is ambiguous.
+	 */
+	if (wq->state != IDXD_WQ_ENABLED) {
+		mutex_unlock(&wq->wq_lock);
+		return -EPERM;
+	}
+	/*
+	 * If wq is shared or wq is dedicated and list empty,
+	 * put uuid into list.
+	 */
+	if (!wq_dedicated(wq) || list_empty(&wq->uuid_list)) {
+		wq->uuids++;
+		list_add(&wq_uuid->list, &wq->uuid_list);
+	} else {
+		mutex_unlock(&wq->wq_lock);
+		return -EPERM;
+	}
+
+	mutex_unlock(&wq->wq_lock);
+	return count;
+}
+
+static ssize_t wq_uuid_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);
+	struct idxd_wq_uuid *entry;
+	int out = 0;
+
+	mutex_lock(&wq->wq_lock);
+	list_for_each_entry(entry, &wq->uuid_list, list)
+		out += sprintf(buf + out, "%pUl\n", &entry->uuid);
+	mutex_unlock(&wq->wq_lock);
+
+	return out;
+}
+
+static struct device_attribute dev_attr_wq_uuid =
+		__ATTR(uuid, 0644, wq_uuid_show, wq_uuid_store);
+
 static ssize_t wq_type_show(struct device *dev,
 			    struct device_attribute *attr, char *buf)
 {
@@ -1116,8 +1224,9 @@ static ssize_t wq_type_show(struct device *dev,
 		return sprintf(buf, "%s\n",
 			       idxd_wq_type_names[IDXD_WQT_KERNEL]);
 	case IDXD_WQT_USER:
-		return sprintf(buf, "%s\n",
-			       idxd_wq_type_names[IDXD_WQT_USER]);
+		return sprintf(buf, "%s\n", idxd_wq_type_names[IDXD_WQT_USER]);
+	case IDXD_WQT_MDEV:
+		return sprintf(buf, "%s\n", idxd_wq_type_names[IDXD_WQT_MDEV]);
 	case IDXD_WQT_NONE:
 	default:
 		return sprintf(buf, "%s\n",
@@ -1127,6 +1236,20 @@ static ssize_t wq_type_show(struct device *dev,
 	return -EINVAL;
 }
 
+static void wq_clear_uuids(struct idxd_wq *wq)
+{
+	struct idxd_wq_uuid *entry, *n;
+	struct device *dev = &wq->idxd->pdev->dev;
+
+	mutex_lock(&wq->wq_lock);
+	list_for_each_entry_safe(entry, n, &wq->uuid_list, list) {
+		list_del(&entry->list);
+		devm_kfree(dev, entry);
+		wq->uuids--;
+	}
+	mutex_unlock(&wq->wq_lock);
+}
+
 static ssize_t wq_type_store(struct device *dev,
 			     struct device_attribute *attr, const char *buf,
 			     size_t count)
@@ -1144,13 +1267,20 @@ static ssize_t wq_type_store(struct device *dev,
 		wq->type = IDXD_WQT_KERNEL;
 	else if (sysfs_streq(buf, idxd_wq_type_names[IDXD_WQT_USER]))
 		wq->type = IDXD_WQT_USER;
+	else if (sysfs_streq(buf, idxd_wq_type_names[IDXD_WQT_MDEV]))
+		wq->type = IDXD_WQT_MDEV;
 	else
 		return -EINVAL;
 
 	/* If we are changing queue type, clear the name */
-	if (wq->type != old_type)
+	if (wq->type != old_type) {
 		memset(wq->name, 0, WQ_NAME_SIZE + 1);
 
+		/* If changed out of MDEV type, clear uuids */
+		if (wq->type != IDXD_WQT_MDEV)
+			wq_clear_uuids(wq);
+	}
+
 	return count;
 }
 
@@ -1218,6 +1348,7 @@ static struct attribute *idxd_wq_attributes[] = {
 	&dev_attr_wq_type.attr,
 	&dev_attr_wq_name.attr,
 	&dev_attr_wq_cdev_minor.attr,
+	&dev_attr_wq_uuid.attr,
 	NULL,
 };
 
diff --git a/drivers/dma/idxd/vdev.c b/drivers/dma/idxd/vdev.c
new file mode 100644
index 000000000000..d2a15f1dae6a
--- /dev/null
+++ b/drivers/dma/idxd/vdev.c
@@ -0,0 +1,570 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/sched/task.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/mm.h>
+#include <linux/mmu_context.h>
+#include <linux/vfio.h>
+#include <linux/mdev.h>
+#include <linux/msi.h>
+#include <linux/intel-iommu.h>
+#include <linux/intel-svm.h>
+#include <linux/kvm_host.h>
+#include <linux/eventfd.h>
+#include <uapi/linux/idxd.h>
+#include "registers.h"
+#include "idxd.h"
+#include "../../vfio/pci/vfio_pci_private.h"
+#include "mdev.h"
+#include "vdev.h"
+
+static int idxd_get_mdev_pasid(struct mdev_device *mdev)
+{
+	struct iommu_domain *domain;
+	struct device *dev = mdev_dev(mdev);
+
+	domain = mdev_get_iommu_domain(dev);
+	if (!domain)
+		return -EINVAL;
+
+	return iommu_aux_get_pasid(domain, dev->parent);
+}
+
+int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx)
+{
+	int rc = -1;
+	struct device *dev = &vidxd->idxd->pdev->dev;
+
+	/*
+	 * We need to check MSIX mask bit only for entry 0 because that is
+	 * the only virtual interrupt. Other interrupts are physical
+	 * interrupts, and they are setup such that we receive them only
+	 * when guest wants to receive them.
+	 */
+	if (msix_idx == 0) {
+		u8 *msix_perm = &vidxd->bar0.msix_perm_table[0];
+
+		if (msix_perm[0] & 1) {
+			set_bit(0, (unsigned long *)&vidxd->bar0.msix_pba);
+			set_bit(1, (unsigned long *)msix_perm);
+		}
+		return 1;
+	}
+
+	if (!vidxd->vdev.msix_trigger[msix_idx]) {
+		dev_warn(dev, "%s: intr evtfd not found %d\n",
+			 __func__, msix_idx);
+		return -EINVAL;
+	}
+
+	rc = eventfd_signal(vidxd->vdev.msix_trigger[msix_idx], 1);
+	if (rc != 1)
+		dev_err(dev, "eventfd signal failed (%d)\n", rc);
+	else
+		dev_dbg(dev, "vidxd interrupt triggered wq(%d) %d\n",
+			vidxd->wq->id, msix_idx);
+
+	return rc;
+}
+
+static void vidxd_mmio_init_grpcfg(struct vdcm_idxd *vidxd,
+				   struct grpcfg *grpcfg)
+{
+	struct idxd_wq *wq = vidxd->wq;
+	struct idxd_group *group = wq->group;
+	int i;
+
+	/*
+	 * At this point, we are only exporting a single workqueue for
+	 * each mdev. So we need to just fake it as first workqueue
+	 * and also mark the available engines in this group.
+	 */
+
+	/* Set single workqueue and the first one */
+	grpcfg->wqs[0] = 0x1;
+	grpcfg->engines = 0;
+	for (i = 0; i < group->num_engines; i++)
+		grpcfg->engines |= BIT(i);
+	grpcfg->flags.bits = group->grpcfg.flags.bits;
+}
+
+void vidxd_mmio_init(struct vdcm_idxd *vidxd)
+{
+	struct vdcm_idxd_pci_bar0 *bar0 = &vidxd->bar0;
+	struct idxd_device *idxd = vidxd->idxd;
+	struct idxd_wq *wq = vidxd->wq;
+	union wqcfg *wqcfg;
+	struct grpcfg *grpcfg;
+	union wq_cap_reg *wq_cap;
+	union offsets_reg *offsets;
+
+	/* setup wqcfg */
+	wqcfg = (union wqcfg *)&bar0->wq_ctrl_regs[0];
+	grpcfg = (struct grpcfg *)&bar0->grp_ctrl_regs[0];
+
+	wqcfg->wq_size = wq->size;
+	wqcfg->wq_thresh = wq->threshold;
+
+	if (wq_dedicated(wq))
+		wqcfg->mode = 1;
+
+	if (idxd->hw.gen_cap.block_on_fault &&
+	    test_bit(WQ_FLAG_BOF, &wq->flags))
+		wqcfg->bof = 1;
+
+	wqcfg->priority = wq->priority;
+	wqcfg->max_xfer_shift = idxd->hw.gen_cap.max_xfer_shift;
+	wqcfg->max_batch_shift = idxd->hw.gen_cap.max_batch_shift;
+	/* make mode change read-only */
+	wqcfg->mode_support = 0;
+
+	/* setup grpcfg */
+	vidxd_mmio_init_grpcfg(vidxd, grpcfg);
+
+	/* setup wqcap */
+	wq_cap = (union wq_cap_reg *)&bar0->cap_ctrl_regs[IDXD_WQCAP_OFFSET];
+	memset(wq_cap, 0, sizeof(union wq_cap_reg));
+	wq_cap->total_wq_size = wq->size;
+	wq_cap->num_wqs = 1;
+	if (wq_dedicated(wq))
+		wq_cap->dedicated_mode = 1;
+	else
+		wq_cap->shared_mode = 1;
+
+	offsets = (union offsets_reg *)&bar0->cap_ctrl_regs[IDXD_TABLE_OFFSET];
+	offsets->grpcfg = VIDXD_GRPCFG_OFFSET / 0x100;
+	offsets->wqcfg = VIDXD_WQCFG_OFFSET / 0x100;
+	offsets->msix_perm = VIDXD_MSIX_PERM_OFFSET / 0x100;
+
+	/* Clear MSI-X permissions table */
+	memset(bar0->msix_perm_table, 0, 2 * 8);
+}
+
+static void idxd_complete_command(struct vdcm_idxd *vidxd,
+				  enum idxd_cmdsts_err val)
+{
+	struct vdcm_idxd_pci_bar0 *bar0 = &vidxd->bar0;
+	u32 *cmd = (u32 *)&bar0->cap_ctrl_regs[IDXD_CMD_OFFSET];
+	u32 *cmdsts = (u32 *)&bar0->cap_ctrl_regs[IDXD_CMDSTS_OFFSET];
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+
+	*cmdsts = val;
+	dev_dbg(dev, "%s: cmd: %#x  status: %#x\n", __func__, *cmd, val);
+
+	if (*cmd & IDXD_CMD_INT_MASK) {
+		bar0->cap_ctrl_regs[IDXD_INTCAUSE_OFFSET] |= IDXD_INTC_CMD;
+		vidxd_send_interrupt(vidxd, 0);
+	}
+}
+
+static void vidxd_enable(struct vdcm_idxd *vidxd)
+{
+	struct vdcm_idxd_pci_bar0 *bar0 = &vidxd->bar0;
+	bool ats = (*(u16 *)&vidxd->cfg[VIDXD_ATS_OFFSET + 6]) & (1U << 15);
+	bool prs = (*(u16 *)&vidxd->cfg[VIDXD_PRS_OFFSET + 4]) & 1U;
+	bool pasid = (*(u16 *)&vidxd->cfg[VIDXD_PASID_OFFSET + 6]) & 1U;
+	u32 vdev_state = *(u32 *)&bar0->cap_ctrl_regs[IDXD_GENSTATS_OFFSET] &
+			 IDXD_GENSTATS_MASK;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+
+	dev_dbg(dev, "%s\n", __func__);
+
+	if (vdev_state == IDXD_DEVICE_STATE_ENABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_ENABLED);
+		return;
+	}
+
+	/* Check PCI configuration */
+	if (!(vidxd->cfg[PCI_COMMAND] & PCI_COMMAND_MASTER)) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_BUSMASTER_EN);
+		return;
+	}
+
+	if (pasid != prs || (pasid && !ats)) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_BUSMASTER_EN);
+		return;
+	}
+
+	bar0->cap_ctrl_regs[IDXD_GENSTATS_OFFSET] = IDXD_DEVICE_STATE_ENABLED;
+
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_disable(struct vdcm_idxd *vidxd)
+{
+	int rc;
+	struct idxd_wq *wq;
+	union wqcfg *wqcfg;
+	struct vdcm_idxd_pci_bar0 *bar0 = &vidxd->bar0;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	u32 vdev_state = *(u32 *)&bar0->cap_ctrl_regs[IDXD_GENSTATS_OFFSET] &
+			 IDXD_GENSTATS_MASK;
+
+	dev_dbg(dev, "%s\n", __func__);
+
+	if (vdev_state == IDXD_DEVICE_STATE_DISABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DIS_DEV_EN);
+		return;
+	}
+
+	wqcfg = (union wqcfg *)&bar0->wq_ctrl_regs[0];
+	wq = vidxd->wq;
+
+	/* If it is a DWQ, need to disable the DWQ as well */
+	rc = idxd_wq_drain(wq);
+	if (rc < 0)
+		dev_warn(dev, "vidxd drain wq %d failed: %d\n",
+			 wq->id, rc);
+
+	if (wq_dedicated(wq)) {
+		rc = idxd_wq_disable(wq, NULL);
+		if (rc < 0)
+			dev_warn(dev, "vidxd disable wq %d failed: %d\n",
+				 wq->id, rc);
+	}
+
+	wqcfg->wq_state = 0;
+	bar0->cap_ctrl_regs[IDXD_GENSTATS_OFFSET] = IDXD_DEVICE_STATE_DISABLED;
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_drain(struct vdcm_idxd *vidxd)
+{
+	int rc;
+	struct idxd_wq *wq;
+	union wqcfg *wqcfg;
+	struct vdcm_idxd_pci_bar0 *bar0 = &vidxd->bar0;
+	u32 vdev_state = *(u32 *)&bar0->cap_ctrl_regs[IDXD_GENSTATS_OFFSET] &
+			 IDXD_GENSTATS_MASK;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+
+	dev_dbg(dev, "%s\n", __func__);
+
+	if (vdev_state == IDXD_DEVICE_STATE_DISABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_NOT_EN);
+		return;
+	}
+
+	wqcfg = (union wqcfg *)&bar0->wq_ctrl_regs[0];
+	wq = vidxd->wq;
+
+	rc = idxd_wq_drain(wq);
+	if (rc < 0)
+		dev_warn(dev, "wq %d drain failed: %d\n", wq->id, rc);
+
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_abort(struct vdcm_idxd *vidxd)
+{
+	int rc;
+	struct idxd_wq *wq;
+	union wqcfg *wqcfg;
+	struct vdcm_idxd_pci_bar0 *bar0 = &vidxd->bar0;
+	u32 vdev_state = *(u32 *)&bar0->cap_ctrl_regs[IDXD_GENSTATS_OFFSET] &
+			 IDXD_GENSTATS_MASK;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+
+	dev_dbg(dev, "%s\n", __func__);
+
+	if (vdev_state == IDXD_DEVICE_STATE_DISABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_NOT_EN);
+		return;
+	}
+
+	wqcfg = (union wqcfg *)&bar0->wq_ctrl_regs[0];
+	wq = vidxd->wq;
+
+	rc = idxd_wq_abort(wq);
+	if (rc < 0)
+		dev_warn(dev, "wq %d drain failed: %d\n", wq->id, rc);
+
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_drain(struct vdcm_idxd *vidxd, int val)
+{
+	vidxd_drain(vidxd);
+}
+
+static void vidxd_wq_abort(struct vdcm_idxd *vidxd, int val)
+{
+	vidxd_abort(vidxd);
+}
+
+void vidxd_reset(struct vdcm_idxd *vidxd)
+{
+	struct vdcm_idxd_pci_bar0 *bar0 = &vidxd->bar0;
+	int rc;
+	struct idxd_wq *wq;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+
+	*(u32 *)&bar0->cap_ctrl_regs[IDXD_GENSTATS_OFFSET] =
+		IDXD_DEVICE_STATE_DRAIN;
+
+	wq = vidxd->wq;
+
+	rc = idxd_wq_drain(wq);
+	if (rc < 0)
+		dev_warn(dev, "wq %d drain failed: %d\n", wq->id, rc);
+
+	/* If it is a DWQ, need to disable the DWQ as well */
+	if (wq_dedicated(wq)) {
+		rc = idxd_wq_disable(wq, NULL);
+		if (rc < 0)
+			dev_warn(dev, "vidxd disable wq %d failed: %d\n",
+				 wq->id, rc);
+	}
+
+	vidxd_mmio_init(vidxd);
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_alloc_int_handle(struct vdcm_idxd *vidxd, int vidx)
+{
+	bool ims = (vidx >> 16) & 1;
+	u32 cmdsts;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+
+	vidx = vidx & 0xffff;
+
+	dev_dbg(dev, "allocating int handle for %x\n", vidx);
+
+	if (vidx != 1) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_INVAL_INT_IDX);
+		return;
+	}
+
+	if (ims) {
+		dev_warn(dev, "IMS allocation is not implemented yet\n");
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_NO_HANDLE);
+	} else {
+		vidx--; /* MSIX idx 0 is a slow path interrupt */
+		cmdsts = vidxd->ims_index[vidx] << 8;
+		dev_dbg(dev, "int handle %d:%lld\n", vidx,
+			vidxd->ims_index[vidx]);
+		idxd_complete_command(vidxd, cmdsts);
+	}
+}
+
+static void vidxd_wq_enable(struct vdcm_idxd *vidxd, int wq_id)
+{
+	struct idxd_wq *wq;
+	struct vdcm_idxd_pci_bar0 *bar0 = &vidxd->bar0;
+	union wq_cap_reg *wqcap;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	struct idxd_device *idxd;
+	union wqcfg *vwqcfg, *wqcfg;
+	unsigned long flags;
+	int rc;
+
+	dev_dbg(dev, "%s\n", __func__);
+
+	if (wq_id >= 1) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_INVAL_WQIDX);
+		return;
+	}
+
+	idxd = vidxd->idxd;
+	wq = vidxd->wq;
+
+	dev_dbg(dev, "%s: wq %u:%u\n", __func__, wq_id, wq->id);
+
+	vwqcfg = (union wqcfg *)&bar0->wq_ctrl_regs[wq_id];
+	wqcap = (union wq_cap_reg *)&bar0->cap_ctrl_regs[IDXD_WQCAP_OFFSET];
+	wqcfg = &wq->wqcfg;
+
+	if (vidxd_state(vidxd) != IDXD_DEVICE_STATE_ENABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_NOTEN);
+		return;
+	}
+
+	if (vwqcfg->wq_state != IDXD_WQ_DEV_DISABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_ENABLED);
+		return;
+	}
+
+	if (vwqcfg->wq_size == 0) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_SIZE);
+		return;
+	}
+
+	if ((!wq_dedicated(wq) && wqcap->shared_mode == 0) ||
+	    (wq_dedicated(wq) && wqcap->dedicated_mode == 0)) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_MODE);
+		return;
+	}
+
+	if (wq_dedicated(wq)) {
+		int wq_pasid;
+		u32 status;
+		int priv;
+
+		wq_pasid = idxd_get_mdev_pasid(mdev);
+		priv = 1;
+
+		if (wq_pasid >= 0) {
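+			/*
+			 * Clear the previously programmed PASID and
+			 * privilege fields in WQCFG word 2 before
+			 * setting the new values below.
+			 */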
+			wqcfg->bits[2] &= ~0x3fffff00;
+			wqcfg->priv = priv;
+			wqcfg->pasid_en = 1;
+			wqcfg->pasid = wq_pasid;
+			dev_dbg(dev, "program pasid %d in wq %d\n",
+				wq_pasid, wq->id);
+			spin_lock_irqsave(&idxd->dev_lock, flags);
+			idxd_wq_update_pasid(wq, wq_pasid);
+			idxd_wq_update_priv(wq, priv);
+			rc = idxd_wq_enable(wq, &status);
+			spin_unlock_irqrestore(&idxd->dev_lock, flags);
+			if (rc < 0) {
+				dev_err(dev, "vidxd enable wq %d failed\n", wq->id);
+				idxd_complete_command(vidxd, status);
+				return;
+			}
+		} else {
+			dev_err(dev, "idxd pasid setup failed wq %d wq_pasid %d\n",
+				wq->id, wq_pasid);
+			idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_PASID_EN);
+			return;
+		}
+	}
+
+	vwqcfg->wq_state = IDXD_WQ_DEV_ENABLED;
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_disable(struct vdcm_idxd *vidxd, int wq_id_mask)
+{
+	struct idxd_wq *wq;
+	union wqcfg *wqcfg;
+	struct vdcm_idxd_pci_bar0 *bar0 = &vidxd->bar0;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	int rc;
+
+	wq = vidxd->wq;
+
+	if (!(wq_id_mask & BIT(0))) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_INVAL_WQIDX);
+		return;
+	}
+
+	dev_dbg(dev, "vidxd disable wq %u:%u\n", 0, wq->id);
+
+	wqcfg = (union wqcfg *)&bar0->wq_ctrl_regs[0];
+	if (wqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_NOT_EN);
+		return;
+	}
+
+	if (wq_dedicated(wq)) {
+		u32 status;
+
+		rc = idxd_wq_disable(wq, &status);
+		if (rc < 0) {
+			dev_err(dev, "vidxd disable wq %d failed\n", wq->id);
+			idxd_complete_command(vidxd, status);
+			return;
+		}
+	}
+
+	wqcfg->wq_state = IDXD_WQ_DEV_DISABLED;
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val)
+{
+	union idxd_command_reg *reg =
+		(union idxd_command_reg *)&vidxd->bar0.cap_ctrl_regs[IDXD_CMD_OFFSET];
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+
+	reg->bits = val;
+
+	dev_dbg(dev, "%s: cmd code: %u reg: %x\n", __func__, reg->cmd,
+		reg->bits);
+
+	switch (reg->cmd) {
+	case IDXD_CMD_ENABLE_DEVICE:
+		vidxd_enable(vidxd);
+		break;
+	case IDXD_CMD_DISABLE_DEVICE:
+		vidxd_disable(vidxd);
+		break;
+	case IDXD_CMD_DRAIN_ALL:
+		vidxd_drain(vidxd);
+		break;
+	case IDXD_CMD_ABORT_ALL:
+		vidxd_abort(vidxd);
+		break;
+	case IDXD_CMD_RESET_DEVICE:
+		vidxd_reset(vidxd);
+		break;
+	case IDXD_CMD_ENABLE_WQ:
+		vidxd_wq_enable(vidxd, reg->operand);
+		break;
+	case IDXD_CMD_DISABLE_WQ:
+		vidxd_wq_disable(vidxd, reg->operand);
+		break;
+	case IDXD_CMD_DRAIN_WQ:
+		vidxd_wq_drain(vidxd, reg->operand);
+		break;
+	case IDXD_CMD_ABORT_WQ:
+		vidxd_wq_abort(vidxd, reg->operand);
+		break;
+	case IDXD_CMD_REQUEST_INT_HANDLE:
+		vidxd_alloc_int_handle(vidxd, reg->operand);
+		break;
+	default:
+		idxd_complete_command(vidxd, IDXD_CMDSTS_INVAL_CMD);
+		break;
+	}
+}
+
+int vidxd_setup_ims_entry(struct vdcm_idxd *vidxd, int ims_idx, u32 val)
+{
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	int pasid;
+	unsigned int ims_offset;
+
+	/*
+	 * Current implementation limits to 1 WQ for the vdev and therefore
+	 * also only 1 IMS interrupt for that vdev.
+	 */
+	if (ims_idx >= VIDXD_MAX_WQS) {
+		dev_warn(dev, "ims_idx greater than vidxd allowed: %d\n",
+			 ims_idx);
+		return -EINVAL;
+	}
+
+	/* Setup the PASID filtering */
+	pasid = idxd_get_mdev_pasid(mdev);
+
+	if (pasid >= 0) {
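+		/*
+		 * Compose the IMS entry control word: keep the low
+		 * three bits of the caller-provided value, set bit 3
+		 * to enable PASID filtering, and place the PASID in
+		 * the field starting at bit 12.
+		 */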
+		val = (1 << 3) | (pasid << 12) | (val & 7);
+		ims_offset = vidxd->idxd->ims_offset +
+			     vidxd->ims_index[ims_idx] * 0x10;
+		iowrite32(val, vidxd->idxd->reg_base + ims_offset + 12);
+	} else {
+		dev_warn(dev, "pasid setup failed for ims entry %lld\n",
+			 vidxd->ims_index[ims_idx]);
+	}
+
+	return 0;
+}
+
+int vidxd_free_ims_entry(struct vdcm_idxd *vidxd, int msix_idx)
+{
+	return 0;
+}
diff --git a/drivers/dma/idxd/vdev.h b/drivers/dma/idxd/vdev.h
new file mode 100644
index 000000000000..3dfff6d0f641
--- /dev/null
+++ b/drivers/dma/idxd/vdev.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+
+#ifndef _IDXD_VDEV_H_
+#define _IDXD_VDEV_H_
+
+static inline u64 get_reg_val(void *buf, int size)
+{
+	u64 val = 0;
+
+	switch (size) {
+	case 8:
+		val = *(uint64_t *)buf;
+		break;
+	case 4:
+		val = *(uint32_t *)buf;
+		break;
+	case 2:
+		val = *(uint16_t *)buf;
+		break;
+	case 1:
+		val = *(uint8_t *)buf;
+		break;
+	}
+
+	return val;
+}
+
+static inline u8 vidxd_state(struct vdcm_idxd *vidxd)
+{
+	return vidxd->bar0.cap_ctrl_regs[IDXD_GENSTATS_OFFSET]
+		& IDXD_GENSTATS_MASK;
+}
+
+void vidxd_mmio_init(struct vdcm_idxd *vidxd);
+int vidxd_free_ims_entry(struct vdcm_idxd *vidxd, int msix_idx);
+int vidxd_setup_ims_entry(struct vdcm_idxd *vidxd, int ims_idx, u32 val);
+int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx);
+void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val);
+void vidxd_reset(struct vdcm_idxd *vidxd);
+
+#endif



* [PATCH RFC 14/15] dmaengine: idxd: add error notification from host driver to mediated device
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
                   ` (12 preceding siblings ...)
  2020-04-21 23:35 ` [PATCH RFC 13/15] dmaengine: idxd: add support for VFIO mediated device Dave Jiang
@ 2020-04-21 23:35 ` Dave Jiang
  2020-04-21 23:35 ` [PATCH RFC 15/15] dmaengine: idxd: add ABI documentation for mediated device support Dave Jiang
  2020-04-21 23:54 ` [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Jason Gunthorpe
  15 siblings, 0 replies; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:35 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

When a device error occurs, the mediated device needs to be notified so
that it can relay the error to the guest. Add support to notify the
specific mdev when an error is wq specific, and to broadcast errors to
all mdevs when it is a generic device error.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/idxd.h |    2 ++
 drivers/dma/idxd/irq.c  |    4 ++++
 drivers/dma/idxd/vdev.c |   33 +++++++++++++++++++++++++++++++++
 drivers/dma/idxd/vdev.h |    1 +
 4 files changed, 40 insertions(+)

diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 92a9718daa15..651196514ad5 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -362,5 +362,7 @@ void idxd_wq_del_cdev(struct idxd_wq *wq);
 /* mdev */
 int idxd_mdev_host_init(struct idxd_device *idxd);
 void idxd_mdev_host_release(struct idxd_device *idxd);
+void idxd_wq_vidxd_send_errors(struct idxd_wq *wq);
+void idxd_vidxd_send_errors(struct idxd_device *idxd);
 
 #endif
diff --git a/drivers/dma/idxd/irq.c b/drivers/dma/idxd/irq.c
index bc634dc4e485..256ef7d8a5c9 100644
--- a/drivers/dma/idxd/irq.c
+++ b/drivers/dma/idxd/irq.c
@@ -167,6 +167,8 @@ irqreturn_t idxd_misc_thread(int vec, void *data)
 
 			if (wq->type == IDXD_WQT_USER)
 				wake_up_interruptible(&wq->idxd_cdev.err_queue);
+			else if (wq->type == IDXD_WQT_MDEV)
+				idxd_wq_vidxd_send_errors(wq);
 		} else {
 			int i;
 
@@ -175,6 +177,8 @@ irqreturn_t idxd_misc_thread(int vec, void *data)
 
 				if (wq->type == IDXD_WQT_USER)
 					wake_up_interruptible(&wq->idxd_cdev.err_queue);
+				else if (wq->type == IDXD_WQT_MDEV)
+					idxd_vidxd_send_errors(idxd);
 			}
 		}
 
diff --git a/drivers/dma/idxd/vdev.c b/drivers/dma/idxd/vdev.c
index d2a15f1dae6a..83985f0a336e 100644
--- a/drivers/dma/idxd/vdev.c
+++ b/drivers/dma/idxd/vdev.c
@@ -568,3 +568,36 @@ int vidxd_free_ims_entry(struct vdcm_idxd *vidxd, int msix_idx)
 {
 	return 0;
 }
+
+static void vidxd_send_errors(struct vdcm_idxd *vidxd)
+{
+	struct idxd_device *idxd = vidxd->idxd;
+	struct vdcm_idxd_pci_bar0 *bar0 = &vidxd->bar0;
+	u64 *swerr = (u64 *)&bar0->cap_ctrl_regs[IDXD_SWERR_OFFSET];
+	int i;
+
+	for (i = 0; i < 4; i++) {
+		*swerr = idxd->sw_err.bits[i];
+		swerr++;
+	}
+	vidxd_send_interrupt(vidxd, 0);
+}
+
+void idxd_wq_vidxd_send_errors(struct idxd_wq *wq)
+{
+	struct vdcm_idxd *vidxd;
+
+	list_for_each_entry(vidxd, &wq->vdcm_list, list)
+		vidxd_send_errors(vidxd);
+}
+
+void idxd_vidxd_send_errors(struct idxd_device *idxd)
+{
+	int i;
+
+	for (i = 0; i < idxd->max_wqs; i++) {
+		struct idxd_wq *wq = &idxd->wqs[i];
+
+		idxd_wq_vidxd_send_errors(wq);
+	}
+}
diff --git a/drivers/dma/idxd/vdev.h b/drivers/dma/idxd/vdev.h
index 3dfff6d0f641..14c6631e670c 100644
--- a/drivers/dma/idxd/vdev.h
+++ b/drivers/dma/idxd/vdev.h
@@ -38,5 +38,6 @@ int vidxd_setup_ims_entry(struct vdcm_idxd *vidxd, int ims_idx, u32 val);
 int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx);
 void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val);
 void vidxd_reset(struct vdcm_idxd *vidxd);
+void idxd_wq_vidxd_send_errors(struct idxd_wq *wq);
 
 #endif



* [PATCH RFC 15/15] dmaengine: idxd: add ABI documentation for mediated device support
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
                   ` (13 preceding siblings ...)
  2020-04-21 23:35 ` [PATCH RFC 14/15] dmaengine: idxd: add error notification from host driver to " Dave Jiang
@ 2020-04-21 23:35 ` Dave Jiang
  2020-04-21 23:54 ` [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Jason Gunthorpe
  15 siblings, 0 replies; 89+ messages in thread
From: Dave Jiang @ 2020-04-21 23:35 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

From: Jing Lin <jing.lin@intel.com>

Add the sysfs attribute bits in ABI/stable for mediated device and guest
support.

Signed-off-by: Jing Lin <jing.lin@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 Documentation/ABI/stable/sysfs-driver-dma-idxd |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-driver-dma-idxd b/Documentation/ABI/stable/sysfs-driver-dma-idxd
index c1adddde23c2..b04cbc5a1827 100644
--- a/Documentation/ABI/stable/sysfs-driver-dma-idxd
+++ b/Documentation/ABI/stable/sysfs-driver-dma-idxd
@@ -76,6 +76,12 @@ Date:		Jan 30, 2020
 KernelVersion:	5.7.0
 Contact:	dmaengine@vger.kernel.org
 Description:	To indicate if PASID (process address space identifier) is
 		enabled or not for this device.
+
+What:           sys/bus/dsa/devices/dsa<m>/ims_size
+Date:           Apr 13, 2020
+KernelVersion:  5.8.0
+Contact:        dmaengine@vger.kernel.org
+Description:	Number of entries in the interrupt message storage table.
 
 What:           sys/bus/dsa/devices/dsa<m>/state
@@ -141,8 +147,16 @@ Date:           Oct 25, 2019
 KernelVersion:  5.6.0
 Contact:        dmaengine@vger.kernel.org
 Description:    The type of this work queue, it can be "kernel" type for work
-		queue usages in the kernel space or "user" type for work queue
-		usages by applications in user space.
+		queue usages in the kernel space, "user" type for work queue
+		usages by applications in user space, or "mdev" type for
+		VFIO mediated devices.
+
+What:           sys/bus/dsa/devices/wq<m>.<n>/uuid
+Date:           Apr 13, 2020
+KernelVersion:  5.8.0
+Contact:        dmaengine@vger.kernel.org
+Description:    The uuid attached to this work queue when the mediated device is
+		created.
 
 What:           sys/bus/dsa/devices/wq<m>.<n>/cdev_minor
 Date:           Oct 25, 2019



* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
                   ` (14 preceding siblings ...)
  2020-04-21 23:35 ` [PATCH RFC 15/15] dmaengine: idxd: add ABI documentation for mediated device support Dave Jiang
@ 2020-04-21 23:54 ` Jason Gunthorpe
  2020-04-22  0:53   ` Tian, Kevin
                     ` (3 more replies)
  15 siblings, 4 replies; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-21 23:54 UTC (permalink / raw)
  To: Dave Jiang
  Cc: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm

On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
> The actual code is independent of the stage 2 driver code submission that adds
> support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will
> support dedicated workqueue on a guest with no vIOMMU.
>   
> A new device type "mdev" is introduced for the idxd driver. This allows the wq
> to be dedicated to the usage of a VFIO mediated device (mdev). Once the work
> queue (wq) is enabled, an uuid generated by the user can be added to the wq
> through the uuid sysfs attribute for the wq.  After the association, a mdev can
> be created using this UUID. The mdev driver code will associate the uuid and
> setup the mdev on the driver side. When the create operation is successful, the
> uuid can be passed to qemu. When the guest boots up, it should discover a DSA
> device when doing PCI discovery.

I'm feeling really skeptical that adding all this PCI config space and
MMIO BAR emulation to the kernel just to cram this into a VFIO
interface is a good idea, that kind of stuff is much safer in
userspace.

Particularly since vfio is not really needed once a driver is using
the PASID stuff. We already have general code for drivers to use to
attach a PASID to a mm_struct - and using vfio while disabling all the
DMA/iommu config really seems like an abuse.

A /dev/idxd char dev that mmaps a bar page and links it to a PASID
seems a lot simpler and saner kernel wise.
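
Something like this (a rough sketch only - the char dev plumbing and
the inode_wq()/wq_portal_phys() helpers are made up here, only
iommu_sva_bind_device() is the real generic API):

	struct idxd_user_ctx {
		struct idxd_wq *wq;
		struct iommu_sva *sva;	/* PASID bound to current->mm */
	};

	static int idxd_cdev_open(struct inode *inode, struct file *filp)
	{
		struct idxd_wq *wq = inode_wq(inode);	/* made up */
		struct idxd_user_ctx *ctx;

		ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
		if (!ctx)
			return -ENOMEM;

		/* Attach the caller's address space; no vfio needed */
		ctx->sva = iommu_sva_bind_device(&wq->idxd->pdev->dev,
						 current->mm, NULL);
		if (IS_ERR(ctx->sva)) {
			kfree(ctx);
			return PTR_ERR(ctx->sva);
		}

		ctx->wq = wq;
		filp->private_data = ctx;
		return 0;
	}

	static int idxd_cdev_mmap(struct file *filp,
				  struct vm_area_struct *vma)
	{
		struct idxd_user_ctx *ctx = filp->private_data;
		phys_addr_t portal = wq_portal_phys(ctx->wq); /* made up */

		if (vma->vm_end - vma->vm_start != PAGE_SIZE)
			return -EINVAL;

		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
		return remap_pfn_range(vma, vma->vm_start,
				       portal >> PAGE_SHIFT, PAGE_SIZE,
				       vma->vm_page_prot);
	}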

> The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
> interrupts for the guest. This preserves MSIX for host usages and also allows a
> significantly larger number of interrupt vectors for guest usage.

I never did get a reply to my earlier remarks on the IMS patches.

The concept of a device specific addr/data table format for MSI is not
Intel specific. This should be general code. We have a device that can
use this kind of kernel capability today.
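
For instance with the existing platform-msi machinery the whole
device specific part can be a single callback that writes the
addr/data pair into the device's table (sketch only, the 'foo'
names are made up):

	static void foo_write_msi_msg(struct msi_desc *desc,
				      struct msi_msg *msg)
	{
		struct foo_dev *foo =
			dev_get_drvdata(msi_desc_to_dev(desc));
		void __iomem *slot = foo->ims_table +
			desc->platform.msi_index * foo->ims_entry_size;

		/* Store the message in the device-defined format */
		writel(msg->address_lo, slot + 0);
		writel(msg->address_hi, slot + 4);
		writel(msg->data, slot + 8);
	}

	/* nvec interrupts backed by the on-device store */
	rc = platform_msi_domain_alloc_irqs(foo->dev, nvec,
					    foo_write_msi_msg);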

Jason


* RE: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-21 23:54 ` [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Jason Gunthorpe
@ 2020-04-22  0:53   ` Tian, Kevin
  2020-04-22 11:50     ` Jason Gunthorpe
  2020-04-22 21:24   ` Dan Williams
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 89+ messages in thread
From: Tian, Kevin @ 2020-04-22  0:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Jiang, Dave
  Cc: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, Pan, Jacob jun, Raj, Ashok, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams, Dan J,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm

> From: Jason Gunthorpe
> Sent: Wednesday, April 22, 2020 7:55 AM
> 
> On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
> > The actual code is independent of the stage 2 driver code submission that
> adds
> > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code
> series will
> > support dedicated workqueue on a guest with no vIOMMU.
> >
> > A new device type "mdev" is introduced for the idxd driver. This allows the
> wq
> > to be dedicated to the usage of a VFIO mediated device (mdev). Once the
> work
> > queue (wq) is enabled, an uuid generated by the user can be added to the
> wq
> > through the uuid sysfs attribute for the wq.  After the association, a mdev
> can
> > be created using this UUID. The mdev driver code will associate the uuid
> and
> > setup the mdev on the driver side. When the create operation is successful,
> the
> > uuid can be passed to qemu. When the guest boots up, it should discover a
> DSA
> > device when doing PCI discovery.
> 
> I'm feeling really skeptical that adding all this PCI config space and
> MMIO BAR emulation to the kernel just to cram this into a VFIO
> interface is a good idea, that kind of stuff is much safer in
> userspace.
> 
> Particularly since vfio is not really needed once a driver is using
> the PASID stuff. We already have general code for drivers to use to
> attach a PASID to a mm_struct - and using vfio while disabling all the
> DMA/iommu config really seems like an abuse.

Well, this series is for virtualizing idxd device to VMs, instead of supporting
SVA for bare metal processes. idxd implements a hardware-assisted 
mediated device technique called Intel Scalable I/O Virtualization, which
allows each Assignable Device Interface (ADI, e.g. a work queue) tagged 
with a unique PASID to ensure fine-grained DMA isolation when those 
ADIs are assigned to different VMs. For this purpose idxd utilizes the VFIO 
mdev framework and IOMMU aux-domain extension. Bare metal SVA will
be enabled for idxd later by using the general SVA code that you mentioned.
Both paths will co-exist in the end so there is no such case of disabling
DMA/iommu config.
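
In code terms the per-ADI isolation boils down to something like this
(rough sketch, error unwinding trimmed; adi_attach() is an illustrative
name, the iommu_* calls are the existing aux-domain API):

	static int adi_attach(struct device *parent)
	{
		struct iommu_domain *domain;
		int rc, pasid;

		rc = iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_AUX);
		if (rc)
			return rc;

		/* One unmanaged domain per mdev/ADI */
		domain = iommu_domain_alloc(&pci_bus_type);
		if (!domain)
			return -ENOMEM;

		rc = iommu_aux_attach_device(domain, parent);
		if (rc) {
			iommu_domain_free(domain);
			return rc;
		}

		/* This PASID is what gets programmed into the WQ */
		pasid = iommu_aux_get_pasid(domain, parent);
		return pasid < 0 ? pasid : 0;
	}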

Thanks
Kevin

> 
> A /dev/idxd char dev that mmaps a bar page and links it to a PASID
> seems a lot simpler and saner kernel wise.
> 
> > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
> > interrupts for the guest. This preserves MSIX for host usages and also
> allows a
> > significantly larger number of interrupt vectors for guest usage.
> 
> I never did get a reply to my earlier remarks on the IMS patches.
> 
> The concept of a device specific addr/data table format for MSI is not
> Intel specific. This should be general code. We have a device that can
> use this kind of kernel capability today.
> 
> Jason


* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-22  0:53   ` Tian, Kevin
@ 2020-04-22 11:50     ` Jason Gunthorpe
  2020-04-22 21:14       ` Raj, Ashok
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-22 11:50 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jiang, Dave, vkoul, megha.dey, maz, bhelgaas, rafael, gregkh,
	tglx, hpa, alex.williamson, Pan, Jacob jun, Raj, Ashok, Liu,
	Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing,
	Williams, Dan J, kwankhede, eric.auger, parav, dmaengine,
	linux-kernel, x86, linux-pci, kvm

On Wed, Apr 22, 2020 at 12:53:25AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Wednesday, April 22, 2020 7:55 AM
> > 
> > On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
> > > The actual code is independent of the stage 2 driver code submission that
> > adds
> > > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code
> > series will
> > > support dedicated workqueue on a guest with no vIOMMU.
> > >
> > > A new device type "mdev" is introduced for the idxd driver. This allows the
> > wq
> > > to be dedicated to the usage of a VFIO mediated device (mdev). Once the
> > work
> > > queue (wq) is enabled, an uuid generated by the user can be added to the
> > wq
> > > through the uuid sysfs attribute for the wq.  After the association, a mdev
> > can
> > > be created using this UUID. The mdev driver code will associate the uuid
> > and
> > > setup the mdev on the driver side. When the create operation is successful,
> > the
> > > uuid can be passed to qemu. When the guest boots up, it should discover a
> > DSA
> > > device when doing PCI discovery.
> > 
> > I'm feeling really skeptical that adding all this PCI config space and
> > MMIO BAR emulation to the kernel just to cram this into a VFIO
> > interface is a good idea, that kind of stuff is much safer in
> > userspace.
> > 
> > Particularly since vfio is not really needed once a driver is using
> > the PASID stuff. We already have general code for drivers to use to
> > attach a PASID to a mm_struct - and using vfio while disabling all the
> > DMA/iommu config really seems like an abuse.
> 
> Well, this series is for virtualizing idxd device to VMs, instead of
> supporting SVA for bare metal processes. idxd implements a
> hardware-assisted mediated device technique called Intel Scalable
> I/O Virtualization,

I'm familiar with the intel naming scheme.

> which allows each Assignable Device Interface (ADI, e.g. a work
> queue) tagged with an unique PASID to ensure fine-grained DMA
> isolation when those ADIs are assigned to different VMs. For this
> purpose idxd utilizes the VFIO mdev framework and IOMMU aux-domain
> extension. Bare metal SVA will be enabled for idxd later by using
> the general SVA code that you mentioned.  Both paths will co-exist
> in the end so there is no such case of disabling DMA/iommu config.
 
Again, if you will have a normal SVA interface, there is no need for a
VFIO version, just use normal SVA for both.

PCI emulation should try to be in userspace, not the kernel, for
security.

Jason

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-22 11:50     ` Jason Gunthorpe
@ 2020-04-22 21:14       ` Raj, Ashok
  2020-04-23 19:12         ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Raj, Ashok @ 2020-04-22 21:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jiang, Dave, vkoul, megha.dey, maz, bhelgaas,
	rafael, gregkh, tglx, hpa, alex.williamson, Pan, Jacob jun, Liu,
	Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing,
	Williams, Dan J, kwankhede, eric.auger, parav, dmaengine,
	linux-kernel, x86, linux-pci, kvm, Ashok Raj

Hi Jason

> > > 
> > > I'm feeling really skeptical that adding all this PCI config space and
> > > MMIO BAR emulation to the kernel just to cram this into a VFIO
> > > interface is a good idea, that kind of stuff is much safer in
> > > userspace.
> > > 
> > > Particularly since vfio is not really needed once a driver is using
> > > the PASID stuff. We already have general code for drivers to use to
> > > attach a PASID to a mm_struct - and using vfio while disabling all the
> > > DMA/iommu config really seems like an abuse.
> > 
> > Well, this series is for virtualizing idxd device to VMs, instead of
> > supporting SVA for bare metal processes. idxd implements a
> > hardware-assisted mediated device technique called Intel Scalable
> > I/O Virtualization,
> 
> I'm familiar with the intel naming scheme.
> 
> > which allows each Assignable Device Interface (ADI, e.g. a work
> > queue) tagged with an unique PASID to ensure fine-grained DMA
> > isolation when those ADIs are assigned to different VMs. For this
> > purpose idxd utilizes the VFIO mdev framework and IOMMU aux-domain
> > extension. Bare metal SVA will be enabled for idxd later by using
> > the general SVA code that you mentioned.  Both paths will co-exist
> > in the end so there is no such case of disabling DMA/iommu config.
>  
> Again, if you will have a normal SVA interface, there is no need for a
> VFIO version, just use normal SVA for both.
> 
> PCI emulation should try to be in userspace, not the kernel, for
> security.

Not sure we completely understand your proposal. Mediated devices
are software constructed and they have protected resources like
interrupts and stuff and VFIO already provides abstractions to export
to user space.

Native SVA is simply passing the process CR3 handle to the IOMMU so the
IOMMU knows how to walk process page tables; the kernel handles things
like page faults, device TLB invalidations and such.
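In kernel API terms that path is roughly the sketch below (error
handling elided; exact signatures per the current iommu SVA API):

    struct iommu_sva *handle;
    int pasid;

    /* bind the current process address space to the device */
    handle = iommu_sva_bind_device(dev, current->mm, NULL);
    pasid = iommu_sva_get_pasid(handle);

    /* program 'pasid' into the device context, submit work ... */

    iommu_sva_unbind_device(handle);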

That by itself doesn't translate to what a guest typically does
with a VDEV. There are other control paths that need to be serviced
from the kernel code via VFIO. Fast-path operations like ringing
doorbells are managed directly from the guest.

How do you propose to use the existing SVA APIs to also provide
full device emulation as opposed to using an existing infrastructure 
that's already in place?

Perhaps Alex can ease Jason's concerns?

Cheers,
Ashok


* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-21 23:54 ` [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Jason Gunthorpe
  2020-04-22  0:53   ` Tian, Kevin
@ 2020-04-22 21:24   ` Dan Williams
  2020-04-23 19:17     ` Dan Williams
  2020-04-23 19:18     ` Jason Gunthorpe
  2020-04-22 23:04   ` Dey, Megha
  2020-04-24  6:31   ` Jason Wang
  3 siblings, 2 replies; 89+ messages in thread
From: Dan Williams @ 2020-04-22 21:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Jiang, Vinod Koul, Megha Dey, maz, Bjorn Helgaas,
	Rafael J. Wysocki, Greg KH, Thomas Gleixner, H. Peter Anvin,
	Alex Williamson, Jacob jun Pan, Raj, Ashok, Yi L Liu, baolu.lu,
	Tian, Kevin, Sanjay K Kumar, Luck, Tony, Jing Lin, kwankhede,
	eric.auger, parav, dmaengine, Linux Kernel Mailing List, X86 ML,
	linux-pci, KVM list

On Tue, Apr 21, 2020 at 4:55 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
> > The actual code is independent of the stage 2 driver code submission that adds
> > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will
> > support dedicated workqueue on a guest with no vIOMMU.
> >
> > A new device type "mdev" is introduced for the idxd driver. This allows the wq
> > to be dedicated to the usage of a VFIO mediated device (mdev). Once the work
> > queue (wq) is enabled, an uuid generated by the user can be added to the wq
> > through the uuid sysfs attribute for the wq.  After the association, a mdev can
> > be created using this UUID. The mdev driver code will associate the uuid and
> > setup the mdev on the driver side. When the create operation is successful, the
> > uuid can be passed to qemu. When the guest boots up, it should discover a DSA
> > device when doing PCI discovery.
>
> I'm feeling really skeptical that adding all this PCI config space and
> MMIO BAR emulation to the kernel just to cram this into a VFIO
> interface is a good idea, that kind of stuff is much safer in
> userspace.
>
> Particularly since vfio is not really needed once a driver is using
> the PASID stuff. We already have general code for drivers to use to
> attach a PASID to a mm_struct - and using vfio while disabling all the
> DMA/iommu config really seems like an abuse.
>
> A /dev/idxd char dev that mmaps a bar page and links it to a PASID
> seems a lot simpler and saner kernel wise.
>
> > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
> > interrupts for the guest. This preserves MSIX for host usages and also allows a
> > significantly larger number of interrupt vectors for guest usage.
>
> I never did get a reply to my earlier remarks on the IMS patches.
>
> The concept of a device specific addr/data table format for MSI is not
> Intel specific. This should be general code. We have a device that can
> use this kind of kernel capability today.

This has been my concern reviewing the implementation. IMS needs more
than one in-tree user to validate degrees of freedom in the api. I had
been missing a second "in-tree user" to validate the scope of the
flexibility that was needed.

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-21 23:54 ` [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Jason Gunthorpe
  2020-04-22  0:53   ` Tian, Kevin
  2020-04-22 21:24   ` Dan Williams
@ 2020-04-22 23:04   ` Dey, Megha
  2020-04-23 19:44     ` Jason Gunthorpe
  2020-04-24  6:31   ` Jason Wang
  3 siblings, 1 reply; 89+ messages in thread
From: Dey, Megha @ 2020-04-22 23:04 UTC (permalink / raw)
  To: Jason Gunthorpe, Dave Jiang
  Cc: vkoul, maz, bhelgaas, rafael, gregkh, tglx, hpa, alex.williamson,
	jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, dmaengine, linux-kernel, x86, linux-pci, kvm



On 4/21/2020 4:54 PM, Jason Gunthorpe wrote:
> On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
>> The actual code is independent of the stage 2 driver code submission that adds
>> support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will
>> support dedicated workqueue on a guest with no vIOMMU.
>>    
>> A new device type "mdev" is introduced for the idxd driver. This allows the wq
>> to be dedicated to the usage of a VFIO mediated device (mdev). Once the work
>> queue (wq) is enabled, an uuid generated by the user can be added to the wq
>> through the uuid sysfs attribute for the wq.  After the association, a mdev can
>> be created using this UUID. The mdev driver code will associate the uuid and
>> setup the mdev on the driver side. When the create operation is successful, the
>> uuid can be passed to qemu. When the guest boots up, it should discover a DSA
>> device when doing PCI discovery.
> 
> I'm feeling really skeptical that adding all this PCI config space and
> MMIO BAR emulation to the kernel just to cram this into a VFIO
> interface is a good idea, that kind of stuff is much safer in
> userspace.
> 
> Particularly since vfio is not really needed once a driver is using
> the PASID stuff. We already have general code for drivers to use to
> attach a PASID to a mm_struct - and using vfio while disabling all the
> DMA/iommu config really seems like an abuse.
> 
> A /dev/idxd char dev that mmaps a bar page and links it to a PASID
> seems a lot simpler and saner kernel wise.
> 
>> The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
>> interrupts for the guest. This preserves MSIX for host usages and also allows a
>> significantly larger number of interrupt vectors for guest usage.
> 
> I never did get a reply to my earlier remarks on the IMS patches.
> 
> The concept of a device specific addr/data table format for MSI is not
> Intel specific. This should be general code. We have a device that can
> use this kind of kernel capability today.
> 

<resending to the mailing list, I had incorrect email options set>

Hi Jason,

I am sorry if I did not address your comments earlier.

The present IMS code is quite generic; most of the code is in the
drivers/ folder. We basically introduce 2 APIs: allocate and free IMS
interrupts, and an IMS IRQ domain to allocate these interrupts from.
These APIs are architecture agnostic.

We also introduce a new IMS IRQ domain which is architecture specific. 
This is because IMS generates interrupts only in the remappable format, 
hence interrupt remapping should be enabled for IMS. Currently, the 
interrupt remapping code is only available for Intel and AMD and I don’t 
see anything for ARM.

If a new architecture wants to use IMS, it must simply introduce a new
IMS IRQ domain. I am not sure if there is any other way around this. If
you have any ideas, please let me know.

Also, could you give more details on the device that could use IMS? Do 
you have some driver code already? We could then see if and how the 
current IMS code could be made more generic.
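For concreteness, a driver consuming the two APIs from this series looks
roughly like this (callback bodies omitted):

    static struct platform_msi_ops ims_ops = {
            .irq_mask   = ims_irq_mask,
            .irq_unmask = ims_irq_unmask,
            .write_msg  = ims_write_msg,
    };

    int group, ret;

    ret = platform_msi_domain_alloc_irqs_group(dev, nvec, &ims_ops, &group);
    if (ret)
            return ret;

    /* each allocated vector is a normal Linux IRQ; later: */
    platform_msi_domain_free_irqs_group(dev, group);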

> Jason
> 

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-22 21:14       ` Raj, Ashok
@ 2020-04-23 19:12         ` Jason Gunthorpe
  2020-04-24  3:27           ` Tian, Kevin
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-23 19:12 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Tian, Kevin, Jiang, Dave, vkoul, megha.dey, maz, bhelgaas,
	rafael, gregkh, tglx, hpa, alex.williamson, Pan, Jacob jun, Liu,
	Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing,
	Williams, Dan J, kwankhede, eric.auger, parav, dmaengine,
	linux-kernel, x86, linux-pci, kvm

On Wed, Apr 22, 2020 at 02:14:36PM -0700, Raj, Ashok wrote:
> Hi Jason
> 
> > > > 
> > > > I'm feeling really skeptical that adding all this PCI config space and
> > > > MMIO BAR emulation to the kernel just to cram this into a VFIO
> > > > interface is a good idea, that kind of stuff is much safer in
> > > > userspace.
> > > > 
> > > > Particularly since vfio is not really needed once a driver is using
> > > > the PASID stuff. We already have general code for drivers to use to
> > > > attach a PASID to a mm_struct - and using vfio while disabling all the
> > > > DMA/iommu config really seems like an abuse.
> > > 
> > > Well, this series is for virtualizing idxd device to VMs, instead of
> > > supporting SVA for bare metal processes. idxd implements a
> > > hardware-assisted mediated device technique called Intel Scalable
> > > I/O Virtualization,
> > 
> > I'm familiar with the intel naming scheme.
> > 
> > > which allows each Assignable Device Interface (ADI, e.g. a work
> > > queue) tagged with an unique PASID to ensure fine-grained DMA
> > > isolation when those ADIs are assigned to different VMs. For this
> > > purpose idxd utilizes the VFIO mdev framework and IOMMU aux-domain
> > > extension. Bare metal SVA will be enabled for idxd later by using
> > > the general SVA code that you mentioned.  Both paths will co-exist
> > > in the end so there is no such case of disabling DMA/iommu config.
> >  
> > Again, if you will have a normal SVA interface, there is no need for a
> > VFIO version, just use normal SVA for both.
> > 
> > PCI emulation should try to be in userspace, not the kernel, for
> > security.
> 
> Not sure we completely understand your proposal. Mediated devices
> are software constructed and they have protected resources like
> interrupts and stuff and VFIO already provides abstractions to export
> to user space.
> 
> Native SVA is simply passing the process CR3 handle to IOMMU so
> IOMMU knows how to walk process page tables, kernel handles things
> like page-faults, doing device tlb invalidations and such.

> That by itself doesn't translate to what a guest typically does
> with a VDEV. There are other control paths that need to be serviced
> from the kernel code via VFIO. For speed path operations like
> ringing doorbells and such they are directly managed from guest.

You don't need vfio to mmap BAR pages to userspace. The unique thing
that vfio gives is it provides a way to program the classic non-PASID
iommu, which you are not using here.

> How do you propose to use the existing SVA APIs to also provide
> full device emulation as opposed to using an existing infrastructure 
> that's already in place?

You'd provide the 'full device emulation' in userspace (eg qemu),
along side all the other device emulation. Device emulation does not
belong in the kernel without a very good reason.

You get the doorbell BAR page from your own char dev

You setup a PASID IOMMU configuration over your own char dev

Interrupt delivery is triggering a generic event fd

What is VFIO needed for?
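The char dev side is small; a hedged sketch, with names illustrative
rather than taken from this series:

    /* Expose one doorbell BAR page via a plain char dev mmap;
     * 'doorbell_phys' is the page's physical address (hypothetical).
     */
    static int idxd_cdev_mmap(struct file *filp, struct vm_area_struct *vma)
    {
            vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
            return io_remap_pfn_range(vma, vma->vm_start,
                                      doorbell_phys >> PAGE_SHIFT,
                                      PAGE_SIZE, vma->vm_page_prot);
    }

    static const struct file_operations idxd_cdev_fops = {
            .owner = THIS_MODULE,
            .mmap  = idxd_cdev_mmap,
            /* plus ioctls to bind a PASID and register an eventfd */
    };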
 
> Perhaps Alex can ease Jason's concerns?

Last we talked Alex also had doubts on what mdev should be used
for. It is a feature that seems to lack boundaries, and I'll note that
when the discussion came up for VDPA, they eventually choose not to
use VFIO.

Jason

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-22 21:24   ` Dan Williams
@ 2020-04-23 19:17     ` Dan Williams
  2020-04-23 19:49       ` Jason Gunthorpe
  2020-04-23 19:18     ` Jason Gunthorpe
  1 sibling, 1 reply; 89+ messages in thread
From: Dan Williams @ 2020-04-23 19:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Jiang, Vinod Koul, Megha Dey, maz, Bjorn Helgaas,
	Rafael J. Wysocki, Greg KH, Thomas Gleixner, H. Peter Anvin,
	Alex Williamson, Jacob jun Pan, Raj, Ashok, Yi L Liu, Baolu Lu,
	Tian, Kevin, Sanjay K Kumar, Luck, Tony, Jing Lin, kwankhede,
	eric.auger, parav, dmaengine, Linux Kernel Mailing List, X86 ML,
	linux-pci, KVM list

On Wed, Apr 22, 2020 at 2:24 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Tue, Apr 21, 2020 at 4:55 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
> >
> > On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
> > > The actual code is independent of the stage 2 driver code submission that adds
> > > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will
> > > support dedicated workqueue on a guest with no vIOMMU.
> > >
> > > A new device type "mdev" is introduced for the idxd driver. This allows the wq
> > > to be dedicated to the usage of a VFIO mediated device (mdev). Once the work
> > > queue (wq) is enabled, an uuid generated by the user can be added to the wq
> > > through the uuid sysfs attribute for the wq.  After the association, a mdev can
> > > be created using this UUID. The mdev driver code will associate the uuid and
> > > setup the mdev on the driver side. When the create operation is successful, the
> > > uuid can be passed to qemu. When the guest boots up, it should discover a DSA
> > > device when doing PCI discovery.
> >
> > I'm feeling really skeptical that adding all this PCI config space and
> > MMIO BAR emulation to the kernel just to cram this into a VFIO
> > interface is a good idea, that kind of stuff is much safer in
> > userspace.
> >
> > Particularly since vfio is not really needed once a driver is using
> > the PASID stuff. We already have general code for drivers to use to
> > attach a PASID to a mm_struct - and using vfio while disabling all the
> > DMA/iommu config really seems like an abuse.
> >
> > A /dev/idxd char dev that mmaps a bar page and links it to a PASID
> > seems a lot simpler and saner kernel wise.
> >
> > > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
> > > interrupts for the guest. This preserves MSIX for host usages and also allows a
> > > significantly larger number of interrupt vectors for guest usage.
> >
> > I never did get a reply to my earlier remarks on the IMS patches.
> >
> > The concept of a device specific addr/data table format for MSI is not
> > Intel specific. This should be general code. We have a device that can
> > use this kind of kernel capability today.
>
> This has been my concern reviewing the implementation. IMS needs more
> than one in-tree user to validate degrees of freedom in the api. I had
> been missing a second "in-tree user" to validate the scope of the
> flexibility that was needed.

Hey Jason,

Per Megha's follow-up, can you send the details about that other device
and help clear a path for a device-specific MSI addr/data table
format? Ever since HMM I've been sensitive, perhaps overly-sensitive,
to claims about future upstream users. The fact that you have an
additional use case is golden for pushing this into a common area and
validating the scope of the proposed API.

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-22 21:24   ` Dan Williams
  2020-04-23 19:17     ` Dan Williams
@ 2020-04-23 19:18     ` Jason Gunthorpe
  2020-05-01 22:31       ` Dey, Megha
  1 sibling, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-23 19:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Jiang, Vinod Koul, Megha Dey, maz, Bjorn Helgaas,
	Rafael J. Wysocki, Greg KH, Thomas Gleixner, H. Peter Anvin,
	Alex Williamson, Jacob jun Pan, Raj, Ashok, Yi L Liu, baolu.lu,
	Tian, Kevin, Sanjay K Kumar, Luck, Tony, Jing Lin, kwankhede,
	eric.auger, parav, dmaengine, Linux Kernel Mailing List, X86 ML,
	linux-pci, KVM list

On Wed, Apr 22, 2020 at 02:24:11PM -0700, Dan Williams wrote:
> On Tue, Apr 21, 2020 at 4:55 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
> >
> > On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
> > > The actual code is independent of the stage 2 driver code submission that adds
> > > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will
> > > support dedicated workqueue on a guest with no vIOMMU.
> > >
> > > A new device type "mdev" is introduced for the idxd driver. This allows the wq
> > > to be dedicated to the usage of a VFIO mediated device (mdev). Once the work
> > > queue (wq) is enabled, an uuid generated by the user can be added to the wq
> > > through the uuid sysfs attribute for the wq.  After the association, a mdev can
> > > be created using this UUID. The mdev driver code will associate the uuid and
> > > setup the mdev on the driver side. When the create operation is successful, the
> > > uuid can be passed to qemu. When the guest boots up, it should discover a DSA
> > > device when doing PCI discovery.
> >
> > I'm feeling really skeptical that adding all this PCI config space and
> > MMIO BAR emulation to the kernel just to cram this into a VFIO
> > interface is a good idea, that kind of stuff is much safer in
> > userspace.
> >
> > Particularly since vfio is not really needed once a driver is using
> > the PASID stuff. We already have general code for drivers to use to
> > attach a PASID to a mm_struct - and using vfio while disabling all the
> > DMA/iommu config really seems like an abuse.
> >
> > A /dev/idxd char dev that mmaps a bar page and links it to a PASID
> > seems a lot simpler and saner kernel wise.
> >
> > > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
> > > interrupts for the guest. This preserves MSIX for host usages and also allows a
> > > significantly larger number of interrupt vectors for guest usage.
> >
> > I never did get a reply to my earlier remarks on the IMS patches.
> >
> > The concept of a device specific addr/data table format for MSI is not
> > Intel specific. This should be general code. We have a device that can
> > use this kind of kernel capability today.
> 
> This has been my concern reviewing the implementation. IMS needs more
> than one in-tree user to validate degrees of freedom in the api. I had
> been missing a second "in-tree user" to validate the scope of the
> flexibility that was needed.

IMS is too narrowly specified.

All platforms that support MSI today can support IMS. It is simply a
way for the platform to give the driver an addr/data pair that triggers
an interrupt when a posted write is performed to that pair.

This is different from the other interrupt setup flows which are
tightly tied to the PCI layer. Here the driver should simply ask for
interrupts.

Ie the entire IMS API to the driver should be something very simple
like:

 struct message_irq
 {
   uint64_t addr;
   uint32_t data;
 };

 struct message_irq *request_message_irq(
    struct device *, irq_handler_t handler, unsigned long flags,
    const char *name, void *dev);

And the plumbing underneath should setup the irq chips and so forth as
required.
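A driver would then use it roughly as below; the context registers are
hypothetical, just to show the flow:

    struct message_irq *mirq;

    mirq = request_message_irq(&pdev->dev, my_handler, 0, "my-queue", drv);

    /* store the pair wherever the device expects it */
    writeq(mirq->addr, ctx_mmio + CTX_MSG_ADDR);
    writel(mirq->data, ctx_mmio + CTX_MSG_DATA);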

Jason

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-22 23:04   ` Dey, Megha
@ 2020-04-23 19:44     ` Jason Gunthorpe
  2020-05-01 22:32       ` Dey, Megha
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-23 19:44 UTC (permalink / raw)
  To: Dey, Megha
  Cc: Dave Jiang, vkoul, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm

> > > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
> > > interrupts for the guest. This preserves MSIX for host usages and also allows a
> > > significantly larger number of interrupt vectors for guest usage.
> > 
> > I never did get a reply to my earlier remarks on the IMS patches.
> > 
> > The concept of a device specific addr/data table format for MSI is not
> > Intel specific. This should be general code. We have a device that can
> > use this kind of kernel capability today.
> 
> I am sorry if I did not address your comments earlier.

It appears nobody from Intel bothered to answer anyone else on that RFC
thread:

https://lore.kernel.org/lkml/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/

However, it seems kind of moot as I see now that this version of IMS
bears almost no resemblance to the original RFC.

That said, the similarity to platform-msi was striking; does this new
version harmonize with that?

> The present IMS code is quite generic, most of the code is in the drivers/
> folder. We basically introduce 2 APIs: allocate and free IMS interrupts,
> and an IMS IRQ domain to allocate these interrupts from. These APIs are
> architecture agnostic.
>
> We also introduce a new IMS IRQ domain which is architecture specific. This
> is because IMS generates interrupts only in the remappable format, hence
> interrupt remapping should be enabled for IMS. Currently, the interrupt
> remapping code is only available for Intel and AMD and I don’t see anything
> for ARM.

I don't understand these remarks though - IMS is simply the mapping of
a MemWr addr/data pair to a Linux IRQ number? Why does this intersect
with remapping?

AFAIK, any platform that supports MSI today should have the inherent
HW capability to support IMS.

> Also, could you give more details on the device that could use IMS? Do you
> have some driver code already? We could then see if and how the current IMS
> code could be made more generic.

We have several devices of interest; our NICs have very flexible PCI,
so it is no problem to take the MemWr addr/data from someplace other
than the MSI tables.

For this we want to have some way to allocate Linux IRQs dynamically
and get an addr/data pair to trigger them.

Our NIC devices are also linked to our ARM SoC family, so I'd expect
our ARMs to also be able to provide these APIs as the platform.

Jason

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-23 19:17     ` Dan Williams
@ 2020-04-23 19:49       ` Jason Gunthorpe
  2020-05-01 22:31         ` Dey, Megha
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-23 19:49 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Jiang, Vinod Koul, Megha Dey, maz, Bjorn Helgaas,
	Rafael J. Wysocki, Greg KH, Thomas Gleixner, H. Peter Anvin,
	Alex Williamson, Jacob jun Pan, Raj, Ashok, Yi L Liu, Baolu Lu,
	Tian, Kevin, Sanjay K Kumar, Luck, Tony, Jing Lin, kwankhede,
	eric.auger, parav, dmaengine, Linux Kernel Mailing List, X86 ML,
	linux-pci, KVM list

On Thu, Apr 23, 2020 at 12:17:50PM -0700, Dan Williams wrote:

> Per Megha's follow-up can you send the details about that other device
> and help clear a path for a device-specific MSI addr/data table
> format. Ever since HMM I've been sensitive, perhaps overly-sensitive,
> to claims about future upstream users. The fact that you have an
> additional use case is golden for pushing this into a common area and
> validating the scope of the proposed API.

I think I said it at plumbers, but yes, we are interested in this, and
would like dynamic MSI-like interrupts available to the driver (what
Intel calls IMS)

It is something easy enough to illustrate with any RDMA device really:
just open an MR against the addr and use RDMA_WRITE to trigger the
data. It should trigger a Linux IRQ. Nothing else should be needed.
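With verbs that experiment is roughly the following sketch (error
handling elided; 'msg_addr'/'msg_data' are the pair the platform handed
out, and the writer's own source MR is assumed):

    /* local side: expose the addr as a remotely writable MR */
    struct ibv_mr *mr = ibv_reg_mr(pd, (void *)msg_addr, sizeof(uint32_t),
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* remote side: one RDMA_WRITE of 'msg_data' should fire the IRQ */
    struct ibv_sge sge = {
            .addr   = (uintptr_t)&msg_data,
            .length = sizeof(msg_data),
            .lkey   = src_mr->lkey,   /* the writer's own MR (assumed) */
    };
    struct ibv_send_wr wr = {
            .opcode  = IBV_WR_RDMA_WRITE,
            .sg_list = &sge,
            .num_sge = 1,
            .wr.rdma = { .remote_addr = msg_addr, .rkey = mr->rkey },
    }, *bad_wr;

    ibv_post_send(qp, &wr, &bad_wr);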

Jason

* Re: [PATCH RFC 07/15] Documentation: Interrupt Message store
  2020-04-21 23:34 ` [PATCH RFC 07/15] Documentation: Interrupt Message store Dave Jiang
@ 2020-04-23 20:04   ` Jason Gunthorpe
  2020-05-01 22:32     ` Dey, Megha
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-23 20:04 UTC (permalink / raw)
  To: Dave Jiang
  Cc: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm

On Tue, Apr 21, 2020 at 04:34:30PM -0700, Dave Jiang wrote:

> diff --git a/Documentation/ims-howto.rst b/Documentation/ims-howto.rst
> new file mode 100644
> index 000000000000..a18de152b393
> +++ b/Documentation/ims-howto.rst
> @@ -0,0 +1,210 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. include:: <isonum.txt>
> +
> +==========================
> +The IMS Driver Guide HOWTO
> +==========================
> +
> +:Authors: Megha Dey
> +
> +:Copyright: 2020 Intel Corporation
> +
> +About this guide
> +================
> +
> +This guide describes the basics of Interrupt Message Store (IMS), the
> +need to introduce a new interrupt mechanism, implementation details of
> +IMS in the kernel, driver changes required to support IMS and the general
> +misconceptions and FAQs associated with IMS.

I'm not sure why we need to call this IMS in kernel documentation? I know
Intel is using this term, but this document is really only talking
about extending the existing platform_msi stuff, which looks pretty
good actually.

A lot of this is good for the cover letter...

> +Implementation of IMS in the kernel
> +===================================
> +
> +The Linux kernel today already provides a generic mechanism to support
> +non-PCI compliant MSI interrupts for platform devices (platform-msi.c).
> +To support IMS interrupts, we create a new IMS IRQ domain and extend the
> +existing infrastructure. Dynamic allocation of IMS vectors is a requirement
> +for devices which support Scalable I/O Virtualization. A driver can allocate
> +and free vectors not just once during probe (as was the case with MSI/MSI-X)
> +but also in the post probe phase where actual demand is available. Thus, a
> +new API, platform_msi_domain_alloc_irqs_group is introduced which drivers
> +using IMS would be able to call multiple times. The vectors allocated each
> +time this API is called are associated with a group ID. To free the vectors
> +associated with a particular group, the platform_msi_domain_free_irqs_group
> +API can be called. The existing drivers using platform-msi infrastructure
> +will continue to use the existing alloc (platform_msi_domain_alloc_irqs)
> +and free (platform_msi_domain_free_irqs) APIs and are assigned a default
> +group ID of 0.
> +
> +Thus, platform-msi.c provides the generic methods which can be used by any
> +non-pci MSI interrupt type while the newly created ims-msi.c provides IMS
> +specific callbacks that can be used by drivers capable of generating IMS
> +interrupts. 

How exactly is an IMS interrupt different from a platform MSI?

It looks like it is just some thin wrapper around msi_domain - what is
it for?

> +FAQs and general misconceptions:
> +================================
> +
> +** There were some concerns raised by Thomas Gleixner and Marc Zyngier
> +during Linux plumbers conference 2019:
> +
> +1. Enumeration of IMS needs to be done by PCI core code and not by
> +   individual device drivers:
> +
> +   Currently, if the kernel needs a generic way to discover IMS capability
> +   without host driver dependency, the PCIE Designated Vendor specific
> +
> +   However, we cannot have a standard way of enumerating the IMS size
> +   because for context based devices, the interrupt message is part of
> +   the context itself which is managed entirely by the driver. Since
> +   context creation is done on demand, there is no way to tell at boot
> +   time the maximum number of contexts (and hence the number of interrupt
> +   messages) that the device can support.

FWIW, I agree with this.

Like platform-msi, IMS should be controlled entirely by the driver.

> +2. Why is Intel designing a new interrupt mechanism rather than extending
> +   MSI-X to address its limitations? Isn't 2048 device interrupts enough?
> +
> +   MSI-X has a rigid definition of one-table and on-device storage and does
> +   not provide the full flexibility required for future multi-tile
> +   accelerator designs.
> +   IMS was envisioned to be used with a large number of ADIs in devices
> +   where each will need unique interrupt resources. For example, a DSA
> +   shared work queue can support a large number of clients where each
> +   client can have its own interrupt. In future, with user interrupts,
> +   we expect the demand for messages to increase further.

Generally agree

> +Device Driver Changes:
> +=====================
> +
> +1. platform_msi_domain_alloc_irqs_group (struct device *dev, unsigned int
> +   nvec, const struct platform_msi_ops *platform_ops, int *group_id)
> +   to allocate IMS interrupts, where:
> +
> +   dev: The device for which to allocate interrupts
> +   nvec: The number of interrupts to allocate
> +   platform_ops: Callbacks for platform MSI ops (to be provided by driver)
> +   group_id: returned by the call, to be used to free IRQs of a certain type
> +
> +   eg: static struct platform_msi_ops ims_ops  = {
> +        .irq_mask               = ims_irq_mask,
> +        .irq_unmask             = ims_irq_unmask,
> +        .write_msg              = ims_write_msg,
> +        };
> +
> +        int group;
> +        platform_msi_domain_alloc_irqs_group (dev, nvec, platform_ops, &group)
> +
> +   where, struct platform_msi_ops:
> +   irq_mask:   mask an interrupt source
> +   irq_unmask: unmask an interrupt source
> +   write_msg: write message content
> +
> +   This API can be called multiple times. Every time a new group will be
> +   associated with the allocated vectors. Group ID starts from 0.

Needs a much closer look, but this seems conceptually fine to me.

As above, the API here is called platform_msi, which seems good to
me. Again, not sure why the word IMS is needed.
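For reference, the driver-side write_msg for that ops table could be
little more than the sketch below; the slot layout is hypothetical,
following the unified on-device storage the cover letter describes:

    static void ims_write_msg(struct msi_desc *desc, struct msi_msg *msg)
    {
            /* per-vector slot in the device's message store (assumed) */
            void __iomem *slot = ims_base + desc->platform.msi_index * 16;

            iowrite32(msg->address_lo, slot + 0);
            iowrite32(msg->address_hi, slot + 4);
            iowrite32(msg->data, slot + 8);
    }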

Jason

* Re: [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain
  2020-04-21 23:34 ` [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain Dave Jiang
@ 2020-04-23 20:11   ` Jason Gunthorpe
  2020-05-01 22:30     ` Dey, Megha
  2020-04-25 21:38   ` Thomas Gleixner
  1 sibling, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-23 20:11 UTC (permalink / raw)
  To: Dave Jiang
  Cc: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm

On Tue, Apr 21, 2020 at 04:34:11PM -0700, Dave Jiang wrote:
> diff --git a/drivers/base/ims-msi.c b/drivers/base/ims-msi.c
> new file mode 100644
> index 000000000000..738f6d153155
> +++ b/drivers/base/ims-msi.c
> @@ -0,0 +1,100 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Support for Device Specific IMS interrupts.
> + *
> + * Copyright © 2019 Intel Corporation.
> + *
> + * Author: Megha Dey <megha.dey@intel.com>
> + */
> +
> +#include <linux/dmar.h>
> +#include <linux/irq.h>
> +#include <linux/mdev.h>
> +#include <linux/pci.h>
> +
> +/*
> + * Determine if a dev is mdev or not. Return NULL if not mdev device.
> + * Return mdev's parent dev if success.
> + */
> +static inline struct device *mdev_to_parent(struct device *dev)
> +{
> +	struct device *ret = NULL;
> +	struct device *(*fn)(struct device *dev);
> +	struct bus_type *bus = symbol_get(mdev_bus_type);
> +
> +	if (bus && dev->bus == bus) {
> +		fn = symbol_get(mdev_dev_to_parent_dev);
> +		ret = fn(dev);
> +		symbol_put(mdev_dev_to_parent_dev);
> +		symbol_put(mdev_bus_type);

No, things like this are not OK in drivers/base.

Whatever this is doing needs to be properly architected in some
generic way.

> +static int dev_ims_prepare(struct irq_domain *domain, struct device *dev,
> +			   int nvec, msi_alloc_info_t *arg)
> +{
> +	if (dev_is_mdev(dev))
> +		dev = mdev_to_parent(dev);

Like maybe the caller shouldn't be passing in a mdev in the first
place, or some generic driver layer scheme is needed to go from a
child device (eg a mdev or one of these new virtual bus things) to the
struct device that owns the IRQ interface.
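Something as simple as walking the device hierarchy would do; purely
illustrative:

    /* an mdev's parent is the physical device that owns the interrupt
     * hardware, so no bus-type symbol games are needed
     */
    static struct device *irq_owner(struct device *dev)
    {
            return dev->parent ? dev->parent : dev;
    }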

> +	init_irq_alloc_info(arg, NULL);
> +	arg->dev = dev;
> +	arg->type = X86_IRQ_ALLOC_TYPE_IMS;

Also very bewildering to see X86_* in drivers/base

> +struct irq_domain *arch_create_ims_irq_domain(struct irq_domain *parent,
> +					      const char *name)
> +{
> +	struct fwnode_handle *fn;
> +	struct irq_domain *domain;
> +
> +	fn = irq_domain_alloc_named_fwnode(name);
> +	if (!fn)
> +		return NULL;
> +
> +	domain = msi_create_irq_domain(fn, &ims_ir_domain_info, parent);
> +	if (!domain)
> +		return NULL;
> +
> +	irq_domain_update_bus_token(domain, DOMAIN_BUS_PLATFORM_MSI);
> +	irq_domain_free_fwnode(fn);
> +
> +	return domain;
> +}

I'm still not really clear why all this is called IMS... This looks
like the normal boilerplate to set up an IRQ domain? What is actually
'ims' in here?

> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> index 7d922950caaf..c21f1305a76b 100644
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -36,7 +36,6 @@ struct mdev_device {
>  };
>  
>  #define to_mdev_device(dev)	container_of(dev, struct mdev_device, dev)
> -#define dev_is_mdev(d)		((d)->bus == &mdev_bus_type)
>  
>  struct mdev_type {
>  	struct kobject kobj;
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> index 0ce30ca78db0..fa2344e239ef 100644
> +++ b/include/linux/mdev.h
> @@ -144,5 +144,8 @@ void mdev_unregister_driver(struct mdev_driver *drv);
>  struct device *mdev_parent_dev(struct mdev_device *mdev);
>  struct device *mdev_dev(struct mdev_device *mdev);
>  struct mdev_device *mdev_from_dev(struct device *dev);
> +struct device *mdev_dev_to_parent_dev(struct device *dev);
> +
> +#define dev_is_mdev(dev) ((dev)->bus == symbol_get(mdev_bus_type))

NAK on the symbol_get

Jason

* RE: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-23 19:12         ` Jason Gunthorpe
@ 2020-04-24  3:27           ` Tian, Kevin
  2020-04-24 12:44             ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Tian, Kevin @ 2020-04-24  3:27 UTC (permalink / raw)
  To: Jason Gunthorpe, Raj, Ashok
  Cc: Jiang, Dave, vkoul, megha.dey, maz, bhelgaas, rafael, gregkh,
	tglx, hpa, alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu,
	Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams, Dan J,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm

> From: Jason Gunthorpe <jgg@mellanox.com>
> Sent: Friday, April 24, 2020 3:12 AM
> 
> On Wed, Apr 22, 2020 at 02:14:36PM -0700, Raj, Ashok wrote:
> > Hi Jason
> >
> > > > >
> > > > > I'm feeling really skeptical that adding all this PCI config space and
> > > > > MMIO BAR emulation to the kernel just to cram this into a VFIO
> > > > > interface is a good idea, that kind of stuff is much safer in
> > > > > userspace.
> > > > >
> > > > > Particularly since vfio is not really needed once a driver is using
> > > > > the PASID stuff. We already have general code for drivers to use to
> > > > > attach a PASID to a mm_struct - and using vfio while disabling all the
> > > > > DMA/iommu config really seems like an abuse.
> > > >
> > > > Well, this series is for virtualizing idxd device to VMs, instead of
> > > > supporting SVA for bare metal processes. idxd implements a
> > > > hardware-assisted mediated device technique called Intel Scalable
> > > > I/O Virtualization,
> > >
> > > I'm familiar with the intel naming scheme.
> > >
> > > > which allows each Assignable Device Interface (ADI, e.g. a work
> > > > queue) tagged with an unique PASID to ensure fine-grained DMA
> > > > isolation when those ADIs are assigned to different VMs. For this
> > > > purpose idxd utilizes the VFIO mdev framework and IOMMU aux-
> domain
> > > > extension. Bare metal SVA will be enabled for idxd later by using
> > > > the general SVA code that you mentioned.  Both paths will co-exist
> > > > in the end so there is no such case of disabling DMA/iommu config.
> > >
> > > Again, if you will have a normal SVA interface, there is no need for a
> > > VFIO version, just use normal SVA for both.
> > >
> > > PCI emulation should try to be in userspace, not the kernel, for
> > > security.
> >
> > Not sure we completely understand your proposal. Mediated devices
> > are software constructed and they have protected resources like
> > interrupts and stuff and VFIO already provides abstractions to export
> > to user space.
> >
> > Native SVA is simply passing the process CR3 handle to IOMMU so
> > IOMMU knows how to walk process page tables, kernel handles things
> > like page-faults, doing device tlb invalidations and such.
> 
> > That by itself doesn't translate to what a guest typically does
> > with a VDEV. There are other control paths that need to be serviced
> > from the kernel code via VFIO. For speed path operations like
> > ringing doorbells and such they are directly managed from guest.
> 
> You don't need vfio to mmap BAR pages to userspace. The unique thing
> that vfio gives is it provides a way to program the classic non-PASID
> iommu, which you are not using here.

That unique thing is indeed used here. Please note that sharing the CPU
virtual address space with the device (what the SVA API is invented for)
is not the purpose of this series. We still rely on classic non-PASID
iommu programming, i.e. mapping/unmapping IOVA->HPA per iommu_domain.
Although we do use a PASID to tag the ADI, the PASID is contained within
the iommu_domain and invisible to VFIO. From the userspace p.o.v., this
is a device passthrough usage instead of PASID-based address space
binding.
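The userspace flow is the classic one; a minimal sketch of the map call
(addresses and sizes illustrative):

    struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uintptr_t)buf,    /* process memory, gets pinned */
            .iova  = 0x100000,          /* guest IOVA */
            .size  = len,
    };

    ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);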

> 
> > How do you propose to use the existing SVA APIs to also provide
> > full device emulation as opposed to using an existing infrastructure
> > that's already in place?
> 
> You'd provide the 'full device emulation' in userspace (eg qemu),
> along side all the other device emulation. Device emulation does not
> belong in the kernel without a very good reason.

The problem is that we are not doing full device emulation. It's based
on mediated passthrough. Some emulation logic requires close
engagement with the kernel device driver, e.g. resource allocation, WQ
configuration, fault reporting, etc., while the detailed interface is
very vendor/device specific (just like between PF and VF). idxd is just
the first device that supports Scalable IOV. We have a lot more coming
later, in different types. Then putting such emulation in user space
means that Qemu needs to support all those vendor specific interfaces
for every new device which supports Scalable IOV. This is in contrast
to our goal of using Scalable IOV as an alternative to SR-IOV. For
SR-IOV, Qemu only needs to support one VFIO API and then any VF type
simply works. We want to sustain the same user experience through VFIO
mdev.

Specifically for PCI config space emulation, it's already done today
in multiple kernel places, e.g. vfio-pci, kvmgt, etc. We do plan to
consolidate them later.

> 
> You get the doorbell BAR page from your own char dev
> 
> You setup a PASID IOMMU configuration over your own char dev
> 
> Interrupt delivery is triggering a generic event fd
> 
> What is VFIO needed for?

Based on the above explanation, VFIO mdev already meets all of our
requirements, so why bother inventing a new interface...

> 
> > Perhaps Alex can ease Jason's concerns?
> 
> Last we talked Alex also had doubts on what mdev should be used
> for. It is a feature that seems to lack boundaries, and I'll note that
> when the discussion came up for VDPA, they eventually choose not to
> use VFIO.
> 

Is there a link to Alex's doubt? I'm not sure why vDPA didn't go
for VFIO, but imho it is a different story. vDPA is specifically for
devices which implement the standard vhost/virtio interface, thus
it's reasonable that inventing a new mechanism might be more
efficient for all vDPA type devices. However Scalable IOV is
similar to SR-IOV, only for resource partitioning. It doesn't change
the device programming interface, which could be in any vendor
specific form. Here VFIO mdev is good for providing a unified
interface for managing resource multiplexing of all such devices.

Thanks
Kevin

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-21 23:54 ` [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Jason Gunthorpe
                     ` (2 preceding siblings ...)
  2020-04-22 23:04   ` Dey, Megha
@ 2020-04-24  6:31   ` Jason Wang
  3 siblings, 0 replies; 89+ messages in thread
From: Jason Wang @ 2020-04-24  6:31 UTC (permalink / raw)
  To: Jason Gunthorpe, Dave Jiang
  Cc: vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm, Michael S. Tsirkin, Zha Bin, Liu, Jing2


On 2020/4/22 上午7:54, Jason Gunthorpe wrote:
>> The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
>> interrupts for the guest. This preserves MSIX for host usages and also allows a
>> significantly larger number of interrupt vectors for guest usage.
> I never did get a reply to my earlier remarks on the IMS patches.
>
> The concept of a device specific addr/data table format for MSI is not
> Intel specific. This should be general code. We have a device that can
> use this kind of kernel capability today.
>
> Jason
>

+1.

Another example is to extend virtio MMIO to support MSI[1].

Thanks

[1] https://lkml.org/lkml/2020/2/10/127



* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-24  3:27           ` Tian, Kevin
@ 2020-04-24 12:44             ` Jason Gunthorpe
  2020-04-24 16:25               ` Tian, Kevin
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-24 12:44 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz, bhelgaas, rafael,
	gregkh, tglx, hpa, alex.williamson, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

On Fri, Apr 24, 2020 at 03:27:41AM +0000, Tian, Kevin wrote:

> > > That by itself doesn't translate to what a guest typically does
> > > with a VDEV. There are other control paths that need to be serviced
> > > from the kernel code via VFIO. For speed path operations like
> > > ringing doorbells and such they are directly managed from guest.
> > 
> > You don't need vfio to mmap BAR pages to userspace. The unique thing
> > that vfio gives is it provides a way to program the classic non-PASID
> > iommu, which you are not using here.
> 
> That unique thing is indeed used here. Please note sharing CPU virtual 
> address space with device (what SVA API is invented for) is not the
> purpose of this series. We still rely on classic non-PASID iommu programming, 
> i.e. mapping/unmapping IOVA->HPA per iommu_domain. Although 
> we do use PASID to tag ADI, the PASID is contained within iommu_domain 
> and invisible to VFIO. From userspace p.o.v, this is a device passthrough
> usage instead of PASID-based address space binding.

So you have PASID support but don't use it? Why? PASID is much better
than the classic VFIO iommu; it doesn't require page pinning...

> > > How do you propose to use the existing SVA APIs to also provide
> > > full device emulation as opposed to using an existing infrastructure
> > > that's already in place?
> > 
> > You'd provide the 'full device emulation' in userspace (eg qemu),
> > along side all the other device emulation. Device emulation does not
> > belong in the kernel without a very good reason.
> 
> The problem is that we are not doing full device emulation. It's based
> on mediated passthrough. Some emulation logic requires close
> engagement with kernel device driver, e.g. resource allocation, WQ
> configuration, fault report, etc., while the detail interface is very vendor/
> device specific (just like between PF and VF).

Which sounds like the fairly classic case of device emulation to me.

> idxd is just the first device that supports Scalable IOV. We have a
> lot more coming later, in different types. Then putting such
> emulation in user space means that Qemu needs to support all those
> vendor specific interfaces for every new device which supports

It would be very sad to see an endless amount of device emulation code
crammed into the kernel. Userspace is where device emulation is
supposed to live, for security.

qemu is the right place to put this stuff.

> > > Perhaps Alex can ease Jason's concerns?
> > 
> > Last we talked Alex also had doubts on what mdev should be used
> > for. It is a feature that seems to lack boundaries, and I'll note that
> > when the discussion came up for VDPA, they eventually choose not to
> > use VFIO.
> > 
> 
> Is there a link to Alex's doubt? I'm not sure why vDPA didn't go 
> for VFIO, but imho it is a different story.

No, not at all. VDPA HW today is using what Intel has been calling
ADI. But qemu already had the device emulation part in userspace (all
of the virtio emulation parts are in userspace), so they didn't try to
put it in the kernel.

This is the pattern. User space is supposed to do the emulation parts,
the kernel provides the raw elements to manage queues/etc - and it is
not done through mdev.

> efficient for all vDPA type devices. However Scalable IOV is
> similar to SR-IOV, only for resource partitioning. It doesn't change
> the device programming interface, which could be in any vendor
> specific form. Here VFIO mdev is good for providing an unified 
> interface for managing resource multiplexing of all such devices.

SIOV doesn't have a HW config space, and for some reason in these
patches there is BAR emulation too. So, no, it is not like SR-IOV at
all.

This is more like classic device emulation, presumably with some fast
path for the data plane. ie just like VDPA :)

Jason

* RE: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-24 12:44             ` Jason Gunthorpe
@ 2020-04-24 16:25               ` Tian, Kevin
  2020-04-24 18:12                 ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Tian, Kevin @ 2020-04-24 16:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz, bhelgaas, rafael,
	gregkh, tglx, hpa, alex.williamson, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

> From: Jason Gunthorpe
> Sent: Friday, April 24, 2020 8:45 PM
> 
> On Fri, Apr 24, 2020 at 03:27:41AM +0000, Tian, Kevin wrote:
> 
> > > > That by itself doesn't translate to what a guest typically does
> > > > with a VDEV. There are other control paths that need to be serviced
> > > > from the kernel code via VFIO. For speed path operations like
> > > > ringing doorbells and such they are directly managed from guest.
> > >
> > > You don't need vfio to mmap BAR pages to userspace. The unique thing
> > > that vfio gives is it provides a way to program the classic non-PASID
> > > iommu, which you are not using here.
> >
> > That unique thing is indeed used here. Please note sharing CPU virtual
> > address space with device (what SVA API is invented for) is not the
> > purpose of this series. We still rely on classic non-PASID iommu
> programming,
> > i.e. mapping/unmapping IOVA->HPA per iommu_domain. Although
> > we do use PASID to tag ADI, the PASID is contained within iommu_domain
> > and invisible to VFIO. From userspace p.o.v, this is a device passthrough
> > usage instead of PASID-based address space binding.
> 
> So you have PASID support but don't use it? Why? PASID is much better
> than classic VFIO iommu, it doesn't require page pinning...

PASID and I/O page fault (through ATS/PRI) are orthogonal things; don't
equate them. The host driver can tag a PASID to an ADI so every DMA
request out of that ADI has a PASID prefix, allowing VT-d to do
PASID-granular DMA isolation. However, I/O page faults cannot be taken
for granted. A Scalable IOV device may support PASID while lacking
ATS/PRI. Even when ATS/PRI is supported, the tolerance of I/O page
faults is decided by the work queue mode that is configured by the
guest. For example, if the guest puts the work queue in non-faultable
transaction mode, the device doesn't do PRI and simply reports an error
if there is no valid IOMMU mapping.
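In PCI terms the two are separate capabilities; a hedged sketch
(PRQ_DEPTH is illustrative):

    if (pci_enable_pasid(pdev, 0))
            return -ENODEV;                 /* no PASID at all */

    if (pci_enable_pri(pdev, PRQ_DEPTH))
            dev_info(&pdev->dev,
                     "PASID without PRI: DMA must hit pre-mapped IOVA\n");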

So in this series we support only the basic form, for non-faultable
transactions, using the classic VFIO iommu interface plus PASID-granular
translation. We are working on virtual SVA support in parallel. Once
that feature is ready, I/O page fault could be CONDITIONALLY enabled
according to the guest vIOMMU setting, e.g. when the virtual context
entry has page requests enabled, we enable nested translation in the
physical PASID entry, with the 1st level linking to the guest page
table (GVA->GPA) and the 2nd level carrying GPA->HPA.

> 
> > > > How do you propose to use the existing SVA api's  to also provide
> > > > full device emulation as opposed to using an existing infrastructure
> > > > that's already in place?
> > >
> > > You'd provide the 'full device emulation' in userspace (eg qemu),
> > > along side all the other device emulation. Device emulation does not
> > > belong in the kernel without a very good reason.
> >
> > The problem is that we are not doing full device emulation. It's based
> > on mediated passthrough. Some emulation logic requires close
> > engagement with kernel device driver, e.g. resource allocation, WQ
> > configuration, fault report, etc., while the detail interface is very vendor/
> > device specific (just like between PF and VF).
> 
> Which sounds like the fairly classic case of device emulation to me.
> 
> > idxd is just the first device that supports Scalable IOV. We have a
> > lot more coming later, in different types. Then putting such
> > emulation in user space means that Qemu needs to support all those
> > vendor specific interfaces for every new device which supports
> 
> It would be very sad to see an endless amount of device emulation code
> crammed into the kernel. Userspace is where device emulation is
> supposed to live. For security

I think providing a unified abstraction to userspace is also important,
which is what VFIO provides today. The merit of using one set of VFIO
APIs to manage all kinds of mediated devices and VF devices is a major
gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV
or equivalent device is just overkill and doesn't scale. Also, the
emulation code in the idxd driver is actually small, putting aside the
PCI config space part, for which I already explained that most logic
could be shared between mdev device drivers.

> 
> qemu is the right place to put this stuff.
> 
> > > > Perhaps Alex can ease Jason's concerns?
> > >
> > > Last we talked Alex also had doubts on what mdev should be used
> > > for. It is a feature that seems to lack boundaries, and I'll note that
> > > when the discussion came up for VDPA, they eventually choose not to
> > > use VFIO.
> > >
> >
> > Is there a link to Alex's doubt? I'm not sure why vDPA didn't go
> > for VFIO, but imho it is a different story.
> 
> No, not at all. VDPA HW today is using what Intel has been calling
> ADI. But qemu already had the device emulation part in userspace, (all
> of the virtio emulation parts are in userspace) so they didn't try to
> put it in the kernel.
> 
> This is the pattern. User space is supposed to do the emulation parts,
> the kernel provides the raw elements to manage queues/etc - and it is
> not done through mdev.
> 
> > efficient for all vDPA type devices. However Scalable IOV is
> > similar to SR-IOV, only for resource partitioning. It doesn't change
> > the device programming interface, which could be in any vendor
> > specific form. Here VFIO mdev is good for providing an unified
> > interface for managing resource multiplexing of all such devices.
> 
> SIOV doesn't have a HW config space, and for some reason in these
> patches there is BAR emulation too. So, no, it is not like SR-IOV at
> all.
> 
> This is more like classic device emulation, presumably with some fast
> path for the data plane. ie just like VDPA :)
> 
> Jason

Thanks
Kevin

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-24 16:25               ` Tian, Kevin
@ 2020-04-24 18:12                 ` Jason Gunthorpe
  2020-04-26  5:18                   ` Tian, Kevin
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-24 18:12 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz, bhelgaas, rafael,
	gregkh, tglx, hpa, alex.williamson, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

On Fri, Apr 24, 2020 at 04:25:56PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Friday, April 24, 2020 8:45 PM
> > 
> > On Fri, Apr 24, 2020 at 03:27:41AM +0000, Tian, Kevin wrote:
> > 
> > > > > That by itself doesn't translate to what a guest typically does
> > > > > with a VDEV. There are other control paths that need to be serviced
> > > > > from the kernel code via VFIO. For speed path operations like
> > > > > ringing doorbells and such they are directly managed from guest.
> > > >
> > > > You don't need vfio to mmap BAR pages to userspace. The unique thing
> > > > that vfio gives is it provides a way to program the classic non-PASID
> > > > iommu, which you are not using here.
> > >
> > > That unique thing is indeed used here. Please note sharing the CPU virtual
> > > address space with the device (what the SVA API was invented for) is not
> > > the purpose of this series. We still rely on classic non-PASID iommu
> > > programming, i.e. mapping/unmapping IOVA->HPA per iommu_domain. Although
> > > we do use PASID to tag the ADI, the PASID is contained within the
> > > iommu_domain and invisible to VFIO. From the userspace p.o.v., this is a
> > > device passthrough usage instead of PASID-based address space binding.
> > 
> > So you have PASID support but don't use it? Why? PASID is much better
> > than classic VFIO iommu, it doesn't require page pinning...
> 
> PASID and I/O page faults (through ATS/PRI) are orthogonal things. Don't
> equate them. The host driver can tag a PASID to an
> ADI so every DMA request out of that ADI has a PASID prefix, allowing VT-d
> to do PASID-granular DMA isolation. However, I/O page faults cannot be
> taken for granted. A Scalable IOV device may support PASID without
> ATS/PRI. Even when ATS/PRI is supported, the tolerance of I/O page faults
> is decided by the work queue mode that is configured by the guest. For
> example, if the guest puts the work queue in non-faultable transaction
> mode, the device doesn't do PRI and simply reports an error if there is
> no valid IOMMU mapping.

Okay, that makes sense. I wasn't aware people were doing PASID without
ATS at this point...

> > > idxd is just the first device that supports Scalable IOV. We have a
> > > lot more coming later, in different types. Then putting such
> > > emulation in user space means that Qemu needs to support all those
> > > vendor specific interfaces for every new device which supports
> > 
> > It would be very sad to see an endless amount of device emulation code
> > crammed into the kernel. Userspace is where device emulation is
> > supposed to live. For security
> 
> I think providing a unified abstraction to userspace is also important,
> which is what VFIO provides today. The merit of using one set of VFIO
> APIs to manage all kinds of mediated devices and VF devices is a major
> gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV
> or equivalent device is just overkill and doesn't scale. Also, the
> emulation code in the idxd driver is actually small, putting aside the PCI
> config space part, for which I already explained that most logic could be
> shared between mdev device drivers.

If it was just config space you might have an argument, VFIO already
does some config space mangling, but emulating BAR space is out of
scope of VFIO, IMHO.

I also think it is disingenuous to pretend this is similar to
SR-IOV. SR-IOV is self-contained and the BAR does not require
emulation. What you have here sounds like it is just an ordinary
multi-queue device with the ability to PASID-tag queues for IOMMU
handling. This is absolutely not SR-IOV - it is much closer to VDPA,
which isn't using mdev.

Further, I disagree with your assessment that this doesn't scale. You
already said you plan a normal user interface for idxd, so instead of
having a single sane user interface (ala VDPA) idxd now needs *two*. If
this is the general pattern of things to come, it is a bad path.

The only thing we get out of this is someone doesn't have to write an
idxd emulation driver in qemu, instead they have to write it in the
kernel. I don't see how that is a win for the ecosystem.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 02/15] drivers/base: Introduce a new platform-msi list
  2020-04-21 23:33 ` [PATCH RFC 02/15] drivers/base: Introduce a new platform-msi list Dave Jiang
@ 2020-04-25 21:13   ` Thomas Gleixner
  2020-05-04  0:08     ` Dey, Megha
  0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2020-04-25 21:13 UTC (permalink / raw)
  To: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

Dave Jiang <dave.jiang@intel.com> writes:

> From: Megha Dey <megha.dey@linux.intel.com>
>
> This is a preparatory patch to introduce Interrupt Message Store (IMS).
>
> The struct device has a linked list ('msi_list') of the MSI (msi/msi-x,
> platform-msi) descriptors of that device. This list holds only 1 type
> of descriptor since it is not possible for a device to support more
> than one of these descriptors concurrently.
>
> However, with the introduction of IMS, a device can support IMS as well
> as MSI-X at the same time. Instead of sharing this list between IMS (a
> type of platform-msi) and MSI-X descriptors, introduce a new linked list,
> platform_msi_list, which will hold all the platform-msi descriptors.
>
> Thus, msi_list will point to the MSI/MSIX descriptors of a device, while
> platform_msi_list will point to the platform-msi descriptors of a
> device.

Will point?

You're failing to explain that this actually converts the existing
platform code over to this new list. This also lacks an explanation why
this is not a functional change.

> Signed-off-by: Megha Dey <megha.dey@linux.intel.com>

Lacks an SOB from you.... 

> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 139cdf7e7327..5a0116d1a8d0 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -1984,6 +1984,7 @@ void device_initialize(struct device *dev)
>  	set_dev_node(dev, -1);
>  #ifdef CONFIG_GENERIC_MSI_IRQ
>  	INIT_LIST_HEAD(&dev->msi_list);
> +	INIT_LIST_HEAD(&dev->platform_msi_list);

> --- a/drivers/base/platform-msi.c
> +++ b/drivers/base/platform-msi.c
> @@ -110,7 +110,8 @@ static void platform_msi_free_descs(struct device *dev, int base, int nvec)
>  {
>  	struct msi_desc *desc, *tmp;
>  
> -	list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
> +	list_for_each_entry_safe(desc, tmp, dev_to_platform_msi_list(dev),
> +				 list) {
>  		if (desc->platform.msi_index >= base &&
>  		    desc->platform.msi_index < (base + nvec)) {
>  			list_del(&desc->list);
> @@ -255,6 +256,8 @@ int platform_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
>  	struct platform_msi_priv_data *priv_data;
>  	int err;
>  
> +	dev->platform_msi_type = GEN_PLAT_MSI;

What the heck is GEN_PLAT_MSI? Can you please use

   1) A proper name space starting with PLATFORM_MSI_ or such

   2) A proper suffix which is self explaining.

instead of coming up with nonsensical garbage which even lacks any
explanation at the place where it is defined.

> diff --git a/include/linux/device.h b/include/linux/device.h
> index ac8e37cd716a..cbcecb14584e 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -567,6 +567,8 @@ struct device {
>  #endif
>  #ifdef CONFIG_GENERIC_MSI_IRQ
>  	struct list_head	msi_list;
> +	struct list_head	platform_msi_list;
> +	unsigned int		platform_msi_type;

You use an enum for the types so why are you not using an enum for the
struct member which stores it?

>  
> +/**
> + * list_entry_select - get the correct struct for this entry based on condition
> + * @condition:	the condition to choose a particular &struct list head pointer
> + * @ptr_a:      the &struct list_head pointer if @condition is not met.
> + * @ptr_b:      the &struct list_head pointer if @condition is met.
> + * @type:       the type of the struct this is embedded in.
> + * @member:     the name of the list_head within the struct.
> + */
> +#define list_entry_select(condition, ptr_a, ptr_b, type, member)\
> +	(condition) ? list_entry(ptr_a, type, member) :		\
> +		      list_entry(ptr_b, type, member)

This is related to $Subject in which way? It's not an entirely new
process rule that infrastructure changes which touch a completely
different subsystem have to be separate and explained and justified on
their own.

>  
> +enum platform_msi_type {
> +	NOT_PLAT_MSI = 0,

NOT_PLAT_MSI? Not used anywhere and of course equally self explaining as
the other one.

> +	GEN_PLAT_MSI = 1,
> +};
> +
>  /* Helpers to hide struct msi_desc implementation details */
>  #define msi_desc_to_dev(desc)		((desc)->dev)
>  #define dev_to_msi_list(dev)		(&(dev)->msi_list)
> @@ -140,6 +145,22 @@ struct msi_desc {
>  #define for_each_msi_entry_safe(desc, tmp, dev)	\
>  	list_for_each_entry_safe((desc), (tmp), dev_to_msi_list((dev)), list)
>  
> +#define dev_to_platform_msi_list(dev)	(&(dev)->platform_msi_list)
> +#define first_platform_msi_entry(dev)		\
> +	list_first_entry(dev_to_platform_msi_list((dev)), struct msi_desc, list)
> +#define for_each_platform_msi_entry(desc, dev)	\
> +	list_for_each_entry((desc), dev_to_platform_msi_list((dev)), list)
> +#define for_each_platform_msi_entry_safe(desc, tmp, dev)	\
> +	list_for_each_entry_safe((desc), (tmp), dev_to_platform_msi_list((dev)), list)

New lines to separate macros are bad for readability, right? 

> +#define first_msi_entry_common(dev)	\
> +	list_first_entry_select((dev)->platform_msi_type, dev_to_platform_msi_list((dev)),	\
> +				dev_to_msi_list((dev)), struct msi_desc, list)
> +
> +#define for_each_msi_entry_common(desc, dev)	\
> +	list_for_each_entry_select((dev)->platform_msi_type, desc, dev_to_platform_msi_list((dev)), \
> +				   dev_to_msi_list((dev)), list)	\
> +
>  #ifdef CONFIG_IRQ_MSI_IOMMU
>  static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
>  {
> diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
> index eb95f6106a1e..bc5f9e32387f 100644
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -320,7 +320,7 @@ int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
>  	struct msi_desc *desc;
>  	int ret = 0;
>  
> -	for_each_msi_entry(desc, dev) {
> +	for_each_msi_entry_common(desc, dev) {

This is absolutely unreadable. What's common here? You hide the decision
of which list to iterate behind a misnamed macro. 

And looking at the implementation:

> +#define for_each_msi_entry_common(desc, dev)	\
> +	list_for_each_entry_select((dev)->platform_msi_type, desc, dev_to_platform_msi_list((dev)), \
> +				   dev_to_msi_list((dev)), list)	\

So you implicitly make the decision based on:

   (dev)->platform_msi_type != 0

What? How is that ever supposed to work? The changelog says:

> However, with the introduction of IMS, a device can support IMS as well
> as MSI-X at the same time. Instead of sharing this list between IMS (a
> type of platform-msi) and MSI-X descriptors, introduce a new linked list,
> platform_msi_list, which will hold all the platform-msi descriptors.

So you are not serious about storing the decision in the device struct
and then calling into common code?

That's insane at best. There is absolutely ZERO explanation of how this is
supposed to work and why this could even be remotely correct and safe.

Ever heard of the existence of function arguments?

Sorry, this is just voodoo programming and not going anywhere.

Thanks,

        tglx
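
For reference, the function-argument alternative being asked for here
can be sketched in a few lines (illustrative only, not code from the
posted series): the caller names the list explicitly instead of the
core code deriving it from dev->platform_msi_type:

	static struct msi_desc *first_msi_entry_in(struct list_head *msi_list)
	{
		return list_first_entry(msi_list, struct msi_desc, list);
	}

	/* Callers state which list they mean: */
	desc = first_msi_entry_in(dev_to_msi_list(dev));
	desc = first_msi_entry_in(dev_to_platform_msi_list(dev));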

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 03/15] drivers/base: Allocate/free platform-msi interrupts by group
  2020-04-21 23:34 ` [PATCH RFC 03/15] drivers/base: Allocate/free platform-msi interrupts by group Dave Jiang
@ 2020-04-25 21:23   ` Thomas Gleixner
  2020-05-04  0:08     ` Dey, Megha
  0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2020-04-25 21:23 UTC (permalink / raw)
  To: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

Dave Jiang <dave.jiang@intel.com> writes:
> From: Megha Dey <megha.dey@linux.intel.com>
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -135,6 +135,12 @@ enum platform_msi_type {
>  	GEN_PLAT_MSI = 1,
>  };
>  
> +struct platform_msi_group_entry {
> +	unsigned int group_id;
> +	struct list_head group_list;
> +	struct list_head entry_list;

I surely told you before that struct members want to be written in
tabular style.

> +};
> +
>  /* Helpers to hide struct msi_desc implementation details */
>  #define msi_desc_to_dev(desc)		((desc)->dev)
>  #define dev_to_msi_list(dev)		(&(dev)->msi_list)
> @@ -145,21 +151,31 @@ enum platform_msi_type {
>  #define for_each_msi_entry_safe(desc, tmp, dev)	\
>  	list_for_each_entry_safe((desc), (tmp), dev_to_msi_list((dev)), list)
>  
> -#define dev_to_platform_msi_list(dev)	(&(dev)->platform_msi_list)
> -#define first_platform_msi_entry(dev)		\
> -	list_first_entry(dev_to_platform_msi_list((dev)), struct msi_desc, list)
> -#define for_each_platform_msi_entry(desc, dev)	\
> -	list_for_each_entry((desc), dev_to_platform_msi_list((dev)), list)
> -#define for_each_platform_msi_entry_safe(desc, tmp, dev)	\
> -	list_for_each_entry_safe((desc), (tmp), dev_to_platform_msi_list((dev)), list)
> +#define dev_to_platform_msi_group_list(dev)    (&(dev)->platform_msi_list)
> +
> +#define first_platform_msi_group_entry(dev)				\
> +	list_first_entry(dev_to_platform_msi_group_list((dev)),		\
> +			 struct platform_msi_group_entry, group_list)
>  
> -#define first_msi_entry_common(dev)	\
> -	list_first_entry_select((dev)->platform_msi_type, dev_to_platform_msi_list((dev)),	\
> +#define platform_msi_current_group_entry_list(dev)			\
> +	(&((list_last_entry(dev_to_platform_msi_group_list((dev)),	\
> +			    struct platform_msi_group_entry,		\
> +			    group_list))->entry_list))
> +
> +#define first_msi_entry_current_group(dev)				\
> +	list_first_entry_select((dev)->platform_msi_type,		\
> +				platform_msi_current_group_entry_list((dev)),	\
>  				dev_to_msi_list((dev)), struct msi_desc, list)
>  
> -#define for_each_msi_entry_common(desc, dev)	\
> -	list_for_each_entry_select((dev)->platform_msi_type, desc, dev_to_platform_msi_list((dev)), \
> -				   dev_to_msi_list((dev)), list)	\
> +#define for_each_msi_entry_current_group(desc, dev)			\
> +	list_for_each_entry_select((dev)->platform_msi_type, desc,	\
> +				   platform_msi_current_group_entry_list((dev)),\
> +				   dev_to_msi_list((dev)), list)
> +
> +#define for_each_platform_msi_entry_in_group(desc, platform_msi_group, group, dev)	\
> +	list_for_each_entry((platform_msi_group), dev_to_platform_msi_group_list((dev)), group_list)	\
> +		if (((platform_msi_group)->group_id) == (group))			\
> +			list_for_each_entry((desc), (&(platform_msi_group)->entry_list), list)

Yet more unreadable macro maze to obfuscate what the code is actually
doing. 

>  /* When an MSI domain is used as an intermediate domain */
>  int msi_domain_prepare_irqs(struct irq_domain *domain, struct device *dev,
> diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
> index bc5f9e32387f..899ade394ec8 100644
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -320,7 +320,7 @@ int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
>  	struct msi_desc *desc;
>  	int ret = 0;
>  
> -	for_each_msi_entry_common(desc, dev) {
> +	for_each_msi_entry_current_group(desc, dev) {

How is anyone supposed to figure out what the heck this means without
going through several layers of macro maze and some magic type/group
storage in struct device?

Again, function arguments exist for a reason.

Thanks,

        tglx
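
For reference, the lookup hidden behind
for_each_platform_msi_entry_in_group() can be written as a plain helper
with the group passed in explicitly; a minimal sketch assuming the
structures introduced by this patch:

	static struct platform_msi_group_entry *
	platform_msi_find_group(struct device *dev, unsigned int group_id)
	{
		struct platform_msi_group_entry *grp;

		/* Walk the per-device list of groups */
		list_for_each_entry(grp, dev_to_platform_msi_group_list(dev),
				    group_list) {
			if (grp->group_id == group_id)
				return grp;
		}
		return NULL;
	}

The descriptors of that group are then walked with a plain
list_for_each_entry(desc, &grp->entry_list, list).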

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain
  2020-04-21 23:34 ` [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain Dave Jiang
  2020-04-23 20:11   ` Jason Gunthorpe
@ 2020-04-25 21:38   ` Thomas Gleixner
  2020-05-04  0:11     ` Dey, Megha
  1 sibling, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2020-04-25 21:38 UTC (permalink / raw)
  To: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

Dave Jiang <dave.jiang@intel.com> writes:
> From: Megha Dey <megha.dey@linux.intel.com>
>
> Add support for the creation of a new IMS irq domain. It creates a new
> irq chip associated with the IMS domain and adds the necessary domain
> operations to it.

And how is an X86-specific thingy related to drivers/base?

> diff --git a/drivers/base/ims-msi.c b/drivers/base/ims-msi.c

This sits in drivers/base because IMS is architecture independent, right?

> new file mode 100644
> index 000000000000..738f6d153155
> --- /dev/null
> +++ b/drivers/base/ims-msi.c
> @@ -0,0 +1,100 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Support for Device Specific IMS interrupts.
> + *
> + * Copyright © 2019 Intel Corporation.
> + *
> + * Author: Megha Dey <megha.dey@intel.com>
> + */
> +
> +#include <linux/dmar.h>
> +#include <linux/irq.h>
> +#include <linux/mdev.h>
> +#include <linux/pci.h>
> +
> +/*
> + * Determine if a dev is mdev or not. Return NULL if not mdev device.
> + * Return mdev's parent dev if success.
> + */
> +static inline struct device *mdev_to_parent(struct device *dev)
> +{
> +	struct device *ret = NULL;
> +	struct device *(*fn)(struct device *dev);
> +	struct bus_type *bus = symbol_get(mdev_bus_type);

symbol_get()?

> +
> +	if (bus && dev->bus == bus) {
> +		fn = symbol_get(mdev_dev_to_parent_dev);

What's wrong with simple function calls?

> +		ret = fn(dev);
> +		symbol_put(mdev_dev_to_parent_dev);
> +		symbol_put(mdev_bus_type);
> +	}
> +
> +	return ret;
> +}
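
For comparison, the plain-call variant the review is asking about would
look something like this (a sketch only, assuming mdev_bus_type and
mdev_dev_to_parent_dev() are referenced directly instead of via
symbol_get()):

	static inline struct device *mdev_to_parent(struct device *dev)
	{
		/* Return the mdev's parent device, NULL for non-mdev devices */
		if (dev->bus == &mdev_bus_type)
			return mdev_dev_to_parent_dev(dev);
		return NULL;
	}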
> +
> +static irq_hw_number_t dev_ims_get_hwirq(struct msi_domain_info *info,
> +					 msi_alloc_info_t *arg)
> +{
> +	return arg->ims_hwirq;
> +}
> +
> +static int dev_ims_prepare(struct irq_domain *domain, struct device *dev,
> +			   int nvec, msi_alloc_info_t *arg)
> +{
> +	if (dev_is_mdev(dev))
> +		dev = mdev_to_parent(dev);

This makes absolutely no sense. Somewhere you claimed that this is
solely for mdev. Now this interface takes both a regular device and mdev.

Lack of explanation seems to be a common theme here.

> +	init_irq_alloc_info(arg, NULL);
> +	arg->dev = dev;
> +	arg->type = X86_IRQ_ALLOC_TYPE_IMS;
> +
> +	return 0;
> +}
> +
> +static void dev_ims_set_desc(msi_alloc_info_t *arg, struct msi_desc *desc)
> +{
> +	arg->ims_hwirq = platform_msi_calc_hwirq(desc);
> +}
> +
> +static struct msi_domain_ops dev_ims_domain_ops = {
> +	.get_hwirq	= dev_ims_get_hwirq,
> +	.msi_prepare	= dev_ims_prepare,
> +	.set_desc	= dev_ims_set_desc,
> +};
> +
> +static struct irq_chip dev_ims_ir_controller = {
> +	.name			= "IR-DEV-IMS",
> +	.irq_ack		= irq_chip_ack_parent,
> +	.irq_retrigger		= irq_chip_retrigger_hierarchy,
> +	.irq_set_vcpu_affinity	= irq_chip_set_vcpu_affinity_parent,
> +	.flags			= IRQCHIP_SKIP_SET_WAKE,
> +	.irq_write_msi_msg	= platform_msi_write_msg,
> +};
> +
> +static struct msi_domain_info ims_ir_domain_info = {
> +	.flags		= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
> +	.ops		= &dev_ims_domain_ops,
> +	.chip		= &dev_ims_ir_controller,
> +	.handler	= handle_edge_irq,
> +	.handler_name	= "edge",
> +};
> +
> +struct irq_domain *arch_create_ims_irq_domain(struct irq_domain *parent,
> +					      const char *name)

arch_create_ ???? In drivers/base ??? 

> +{
> +	struct fwnode_handle *fn;
> +	struct irq_domain *domain;
> +
> +	fn = irq_domain_alloc_named_fwnode(name);
> +	if (!fn)
> +		return NULL;
> +
> +	domain = msi_create_irq_domain(fn, &ims_ir_domain_info, parent);
> +	if (!domain)
> +		return NULL;
> +
> +	irq_domain_update_bus_token(domain, DOMAIN_BUS_PLATFORM_MSI);
> +	irq_domain_free_fwnode(fn);
> +
> +	return domain;
> +}
> diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
> index 2696aa75983b..59160e8cbfb1 100644
> --- a/drivers/base/platform-msi.c
> +++ b/drivers/base/platform-msi.c
> @@ -31,12 +31,11 @@ struct platform_msi_priv_data {
>  /* The devid allocator */
>  static DEFINE_IDA(platform_msi_devid_ida);
>  
> -#ifdef GENERIC_MSI_DOMAIN_OPS
>  /*
>   * Convert an msi_desc to a globaly unique identifier (per-device
>   * devid + msi_desc position in the msi_list).
>   */
> -static irq_hw_number_t platform_msi_calc_hwirq(struct msi_desc *desc)
> +irq_hw_number_t platform_msi_calc_hwirq(struct msi_desc *desc)
>  {
>  	u32 devid;
>  
> @@ -45,6 +44,7 @@ static irq_hw_number_t platform_msi_calc_hwirq(struct msi_desc *desc)
>  	return (devid << (32 - DEV_ID_SHIFT)) | desc->platform.msi_index;
>  }
>  
> +#ifdef GENERIC_MSI_DOMAIN_OPS
>  static void platform_msi_set_desc(msi_alloc_info_t *arg, struct msi_desc *desc)
>  {
>  	arg->desc = desc;
> @@ -76,7 +76,7 @@ static void platform_msi_update_dom_ops(struct msi_domain_info *info)
>  		ops->set_desc = platform_msi_set_desc;
>  }
>  
> -static void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
> +void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
>  {
>  	struct msi_desc *desc = irq_data_get_msi_desc(data);
>  	struct platform_msi_priv_data *priv_data;
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index b558d4cfd082..cecc6a6bdbef 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -33,6 +33,12 @@ struct device *mdev_parent_dev(struct mdev_device *mdev)
>  }
>  EXPORT_SYMBOL(mdev_parent_dev);
>  
> +struct device *mdev_dev_to_parent_dev(struct device *dev)
> +{
> +	return to_mdev_device(dev)->parent->dev;
> +}
> +EXPORT_SYMBOL(mdev_dev_to_parent_dev);

And this needs to be EXPORT_SYMBOL because this is designed to support
non-GPL drivers from the very beginning, right? Ditto for the other
exports in this file.

> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> index 7d922950caaf..c21f1305a76b 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -36,7 +36,6 @@ struct mdev_device {
>  };
>  
>  #define to_mdev_device(dev)	container_of(dev, struct mdev_device, dev)
> -#define dev_is_mdev(d)		((d)->bus == &mdev_bus_type)

Moving stuff around 3 patches later makes tons of sense.
  
Thanks,

        tglx

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 05/15] ims-msi: Add mask/unmask routines
  2020-04-21 23:34 ` [PATCH RFC 05/15] ims-msi: Add mask/unmask routines Dave Jiang
@ 2020-04-25 21:49   ` Thomas Gleixner
  2020-05-04  0:16     ` Dey, Megha
  0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2020-04-25 21:49 UTC (permalink / raw)
  To: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

Dave Jiang <dave.jiang@intel.com> writes:
>  
> +static u32 __dev_ims_desc_mask_irq(struct msi_desc *desc, u32 flag)

...mask_irq()? This is doing both mask and unmask depending on the
availability of the ops callbacks. 

> +{
> +	u32 mask_bits = desc->platform.masked;
> +	const struct platform_msi_ops *ops;
> +
> +	ops = desc->platform.msi_priv_data->ops;
> +	if (!ops)
> +		return 0;
> +
> +	if (flag) {

flag? Darn, this has a clear boolean meaning of mask or unmask and 'u32
flag' is the most natural and obvious self explaining expression for
this, right?

> +		if (ops->irq_mask)
> +			mask_bits = ops->irq_mask(desc);
> +	} else {
> +		if (ops->irq_unmask)
> +			mask_bits = ops->irq_unmask(desc);
> +	}
> +
> +	return mask_bits;

What's mask_bits? This is about _ONE_ IMS interrupt. Can it have
multiple mask bits and if so then the explanation which I decoded by
crystal ball probably looks like this:

Bit  0:  Don't know whether it's masked
Bit  1:  Perhaps it's masked
Bit  2:  Probably it's masked
Bit  3:  Mostly masked
...
Bit 31:  Fully masked

Or something like that. Makes a lot of sense in a XKCD cartoon at least.

> +}
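
For reference, the boolean shape being asked for can be sketched as
follows (illustrative only, not code from the posted series; it also
avoids the ambiguous mask_bits return value):

	static void __dev_ims_desc_mask_irq(struct msi_desc *desc, bool mask)
	{
		const struct platform_msi_ops *ops;

		ops = desc->platform.msi_priv_data->ops;
		if (!ops)
			return;

		/* Mask or unmask exactly one IMS interrupt */
		if (mask && ops->irq_mask)
			ops->irq_mask(desc);
		else if (!mask && ops->irq_unmask)
			ops->irq_unmask(desc);
	}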
> +
> +/**
> + * dev_ims_mask_irq - Generic irq chip callback to mask IMS interrupts
> + * @data: pointer to irqdata associated to that interrupt
> + */
> +static void dev_ims_mask_irq(struct irq_data *data)
> +{
> +	struct msi_desc *desc = irq_data_get_msi_desc(data);
> +
> +	desc->platform.masked = __dev_ims_desc_mask_irq(desc, 1);

The purpose of this masked information is?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 06/15] ims-msi: Enable IMS interrupts
  2020-04-21 23:34 ` [PATCH RFC 06/15] ims-msi: Enable IMS interrupts Dave Jiang
@ 2020-04-25 22:13   ` Thomas Gleixner
  2020-05-04  0:17     ` Dey, Megha
  0 siblings, 1 reply; 89+ messages in thread
From: Thomas Gleixner @ 2020-04-25 22:13 UTC (permalink / raw)
  To: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, rafael, gregkh, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

Dave Jiang <dave.jiang@intel.com> writes:
>  
> +struct irq_domain *dev_get_ims_domain(struct device *dev)
> +{
> +	struct irq_alloc_info info;
> +
> +	if (dev_is_mdev(dev))
> +		dev = mdev_to_parent(dev);
> +
> +	init_irq_alloc_info(&info, NULL);
> +	info.type = X86_IRQ_ALLOC_TYPE_IMS;

So all IMS-capable devices run on X86? I thought these things are PCIe
cards which can be plugged into any platform that supports PCIe.

> +	info.dev = dev;
> +
> +	return irq_remapping_get_irq_domain(&info);
> +}
> +
>  static struct msi_domain_ops dev_ims_domain_ops = {
>  	.get_hwirq	= dev_ims_get_hwirq,
>  	.msi_prepare	= dev_ims_prepare,
> diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
> index 6d8840db4a85..204ce8041c17 100644
> --- a/drivers/base/platform-msi.c
> +++ b/drivers/base/platform-msi.c
> @@ -118,6 +118,8 @@ static void platform_msi_free_descs(struct device *dev, int base, int nvec,
>  			kfree(platform_msi_group);
>  		}
>  	}
> +
> +	dev->platform_msi_type = 0;

I can clearly see the advantage of using '0' over 'NOT_PLAT_MSI'
here. '0' is definitely more intuitive.

>  }
>  
>  static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
> @@ -205,18 +207,22 @@ platform_msi_alloc_priv_data(struct device *dev, unsigned int nvec,
>  	 * accordingly (which would impact the max number of MSI
>  	 * capable devices).
>  	 */
> -	if (!dev->msi_domain || !platform_ops->write_msg || !nvec ||
> -	    nvec > MAX_DEV_MSIS)
> +	if (!platform_ops->write_msg || !nvec || nvec > MAX_DEV_MSIS)
>  		return ERR_PTR(-EINVAL);
> -	if (dev->msi_domain->bus_token != DOMAIN_BUS_PLATFORM_MSI) {
> -		dev_err(dev, "Incompatible msi_domain, giving up\n");
> -		return ERR_PTR(-EINVAL);
> -	}
> +	if (dev->platform_msi_type == GEN_PLAT_MSI) {
> +		if (!dev->msi_domain)
> +			return ERR_PTR(-EINVAL);
> +
> +		if (dev->msi_domain->bus_token != DOMAIN_BUS_PLATFORM_MSI) {
> +			dev_err(dev, "Incompatible msi_domain, giving up\n");
> +			return ERR_PTR(-EINVAL);
> +		}
>  
> -	/* Already had a helping of MSI? Greed... */
> -	if (!list_empty(platform_msi_current_group_entry_list(dev)))
> -		return ERR_PTR(-EBUSY);
> +		/* Already had a helping of MSI? Greed... */
> +		if (!list_empty(platform_msi_current_group_entry_list(dev)))
> +			return ERR_PTR(-EBUSY);
> +	}
>  
>  	datap = kzalloc(sizeof(*datap), GFP_KERNEL);
>  	if (!datap)
> @@ -254,6 +260,7 @@ static void platform_msi_free_priv_data(struct platform_msi_priv_data *data)
>  int platform_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
>  				   const struct platform_msi_ops *platform_ops)
>  {
> +	dev->platform_msi_type = GEN_PLAT_MSI;
>  	return platform_msi_domain_alloc_irqs_group(dev, nvec, platform_ops,
>  									NULL);
>  }
> @@ -265,12 +272,18 @@ int platform_msi_domain_alloc_irqs_group(struct device *dev, unsigned int nvec,
>  {
>  	struct platform_msi_group_entry *platform_msi_group;
>  	struct platform_msi_priv_data *priv_data;
> +	struct irq_domain *domain;
>  	int err;
>  
> -	dev->platform_msi_type = GEN_PLAT_MSI;

Groan. If you move the type assignment to the caller then do so in a
separate patch. These all-in-one combo changes are simply not reviewable
without going nuts.

> -	if (group_id)
> +	if (!dev->platform_msi_type) {

That's really consistent. If the caller does not store a type upfront
then it becomes IMS automagically. Can you pretty please stop assuming
that this IMS stuff is the center of the universe? To be clear, it's
just another variant of half-thought-out hardware design fail, like all
the other stuff we already have to support.

Abusing dev->platform_msi_type to decide about the nature of the call,
and then deciding that anything which does not set it upfront is IMS, is
really future-proof.

>  		*group_id = ++dev->group_id;
> +		dev->platform_msi_type = IMS;

Oh, a new type name 'IMS'. Well suited to the naming scheme. 

> +		domain = dev_get_ims_domain(dev);

No. This is completely inconsistent again and a blatant violation of
layering.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-24 18:12                 ` Jason Gunthorpe
@ 2020-04-26  5:18                   ` Tian, Kevin
  2020-04-26 19:13                     ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Tian, Kevin @ 2020-04-26  5:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz, bhelgaas, rafael,
	gregkh, tglx, hpa, alex.williamson, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

> From: Jason Gunthorpe <jgg@mellanox.com>
> Sent: Saturday, April 25, 2020 2:12 AM
> 
> > > > idxd is just the first device that supports Scalable IOV. We have a
> > > > lot more coming later, in different types. Then putting such
> > > > emulation in user space means that Qemu needs to support all those
> > > > vendor specific interfaces for every new device which supports
> > >
> > > It would be very sad to see an endless amount of device emulation code
> > > crammed into the kernel. Userspace is where device emulation is
> > > supposed to live. For security
> >
> > I think providing a unified abstraction to userspace is also important,
> > which is what VFIO provides today. The merit of using one set of VFIO
> > APIs to manage all kinds of mediated devices and VF devices is a major
> > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV
> > or equivalent device is just overkill and doesn't scale. Also, the
> > emulation code in the idxd driver is actually small, putting aside the PCI
> > config space part, for which I already explained that most logic could be
> > shared between mdev device drivers.
> 
> If it was just config space you might have an argument, VFIO already
> does some config space mangling, but emulating BAR space is out of
> scope of VFIO, IMHO.

out of scope of vfio-pci, but in scope of vfio-mdev. btw I feel that most
of your objections are actually related to the general idea of vfio-mdev.
Scalable IOV just uses PASID to harden DMA isolation in mediated
pass-through usage which vfio-mdev enables. Then are you just opposing
the whole vfio-mdev? If not, I'm curious about the criteria in your mind 
about when using vfio-mdev is good...

> 
> I also think it is disingenuous to pretend this is similar to
> SR-IOV. SR-IOV is self-contained and the BAR does not require
> emulation. What you have here sounds like it is just an ordinary

Technically Scalable IOV is definitely different from SR-IOV. It's
simpler in hardware. And we're not emulating SR-IOV. The point is
just that, usage-wise, we want to present a consistent user
experience, just like passing through a PCI endpoint (PF or VF) device
through the vfio eco-system, including various userspace VMMs (Qemu,
firecracker, rust-vmm, etc.), middleware (Libvirt), and higher level
management stacks.

> multi-queue device with the ability to PASID-tag queues for IOMMU
> handling. This is absolutely not SR-IOV - it is much closer to VDPA,
> which isn't using mdev.
> 
> Further, I disagree with your assessment that this doesn't scale. You
> already said you plan a normal user interface for idxd, so instead of
> having a single sane user interface (ala VDPA) idxd now needs *two*. If
> this is the general pattern of things to come, it is a bad path.
> 
> The only thing we get out of this is someone doesn't have to write an
> idxd emulation driver in qemu, instead they have to write it in the
> kernel. I don't see how that is a win for the ecosystem.
> 

No. The clear win is on leveraging classic VFIO iommu and its eco-system
as explained above.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 01/15] drivers/base: Introduce platform_msi_ops
  2020-04-21 23:33 ` [PATCH RFC 01/15] drivers/base: Introduce platform_msi_ops Dave Jiang
@ 2020-04-26  7:01   ` Greg KH
  2020-04-27 21:38     ` Dave Jiang
  0 siblings, 1 reply; 89+ messages in thread
From: Greg KH @ 2020-04-26  7:01 UTC (permalink / raw)
  To: Dave Jiang
  Cc: vkoul, megha.dey, maz, bhelgaas, rafael, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav, dmaengine,
	linux-kernel, x86, linux-pci, kvm

On Tue, Apr 21, 2020 at 04:33:53PM -0700, Dave Jiang wrote:
> From: Megha Dey <megha.dey@linux.intel.com>
> 
> This is a preparatory patch to introduce Interrupt Message Store (IMS).
> 
> Until now, platform-msi.c provided a generic way to handle non-PCI MSI
> interrupts. Platform-msi uses its parent chip's mask/unmask routines
> and only provides a way to write the message in the generating device.
> 
> Newly creeping non-PCI compliant MSI-like interrupts (Intel's IMS for
> instance) might need to provide a device specific mask and unmask callback
> as well, apart from the write function.
> 
> Hence, introduce a new structure platform_msi_ops, which would provide
> device specific write function as well as other device specific callbacks
> (mask/unmask).
> 
> Signed-off-by: Megha Dey <megha.dey@linux.intel.com>

As this is not following the Intel-specific rules for sending me new
code, I am just deleting it all from my inbox.

Please follow the rules you all have been given, they are specific and
there for a reason.  And in looking at this code, those rules are not
going away any time soon.

greg k-h

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-26  5:18                   ` Tian, Kevin
@ 2020-04-26 19:13                     ` Jason Gunthorpe
  2020-04-27  3:43                       ` Alex Williamson
  2020-04-27 12:13                       ` Tian, Kevin
  0 siblings, 2 replies; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-26 19:13 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz, bhelgaas, rafael,
	gregkh, tglx, hpa, alex.williamson, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

On Sun, Apr 26, 2020 at 05:18:59AM +0000, Tian, Kevin wrote:

> > > I think providing a unified abstraction to userspace is also important,
> > > which is what VFIO provides today. The merit of using one set of VFIO
> > > APIs to manage all kinds of mediated devices and VF devices is a major
> > > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV
> > > or equivalent device is just overkill and doesn't scale. Also, the
> > > emulation code in the idxd driver is actually small, putting aside the PCI
> > > config space part, for which I already explained that most logic could be
> > > shared between mdev device drivers.
> > 
> > If it was just config space you might have an argument, VFIO already
> > does some config space mangling, but emulating BAR space is out of
> > scope of VFIO, IMHO.
> 
> out of scope of vfio-pci, but in scope of vfio-mdev. btw I feel that most
> of your objections are actually related to the general idea of
> vfio-mdev.

There have been several abusive proposals of vfio-mdev, everything
from a way to create device drivers to this kind of generic emulation
framework.

> Scalable IOV just uses PASID to harden DMA isolation in mediated
> pass-through usage which vfio-mdev enables. Then are you just opposing
> the whole vfio-mdev? If not, I'm curious about the criteria in your mind 
> about when using vfio-mdev is good...

It is appropriate when non-PCI standard techniques are needed to do
raw device assignment, just like VFIO.

Basically if vfio-pci is already doing it then it seems reasonable
that vfio-mdev should do the same. This mission creep where vfio-mdev
gains functionality far beyond VFIO is the problem.

> Technically Scalable IOV is definitely different from SR-IOV. It's
> simpler in hardware. And we're not emulating SR-IOV. The point is
> just that, usage-wise, we want to present a consistent user
> experience, just like passing through a PCI endpoint (PF or VF) device
> through the vfio eco-system, including various userspace VMMs (Qemu,
> firecracker, rust-vmm, etc.), middleware (Libvirt), and higher level
> management stacks.

Yes, I understand your desire, but at the same time we have not been
doing device emulation in the kernel. You should at least be
forthright about that major change in the cover letters/etc.
 
> > The only thing we get out of this is someone doesn't have to write an
> > idxd emulation driver in qemu, instead they have to write it in the
> > kernel. I don't see how that is a win for the ecosystem.
> 
> No. The clear win is on leveraging classic VFIO iommu and its eco-system
> as explained above.

vdpa had no problem implementing iommu support without VFIO. This was
their original argument too, it turned out to be erroneous.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-26 19:13                     ` Jason Gunthorpe
@ 2020-04-27  3:43                       ` Alex Williamson
  2020-04-27 11:58                         ` Jason Gunthorpe
  2020-04-27 12:13                       ` Tian, Kevin
  1 sibling, 1 reply; 89+ messages in thread
From: Alex Williamson @ 2020-04-27  3:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz,
	bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

On Sun, 26 Apr 2020 16:13:57 -0300
Jason Gunthorpe <jgg@mellanox.com> wrote:

> On Sun, Apr 26, 2020 at 05:18:59AM +0000, Tian, Kevin wrote:
> 
> > > > I think providing a unified abstraction to userspace is also important,
> > > > which is what VFIO provides today. The merit of using one set of VFIO
> > > > APIs to manage all kinds of mediated devices and VF devices is a major
> > > > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV
> > > > or equivalent device is just overkill and doesn't scale. Also, the
> > > > emulation code in the idxd driver is actually small, putting aside the PCI
> > > > config space part, for which I already explained that most logic could be
> > > > shared between mdev device drivers.
> > > 
> > > If it was just config space you might have an argument, VFIO already
> > > does some config space mangling, but emulating BAR space is out of
> > > scope of VFIO, IMHO.  
> > 
> > out of scope of vfio-pci, but in scope of vfio-mdev. btw I feel that most
> > of your objections are actually related to the general idea of
> > vfio-mdev.  
> 
> There have been several abusive proposals of vfio-mdev, everything
> from a way to create device drivers to this kind of generic emulation
> framework.
> 
> > Scalable IOV just uses PASID to harden DMA isolation in mediated
> > pass-through usage which vfio-mdev enables. Then are you just opposing
> > the whole vfio-mdev? If not, I'm curious about the criteria in your mind 
> > about when using vfio-mdev is good...  
> 
> It is appropriate when non-PCI standard techniques are needed to do
> raw device assignment, just like VFIO.
> 
> Basically if vfio-pci is already doing it then it seems reasonable
> that vfio-mdev should do the same. This mission creep where vfio-mdev
> gains functionality far beyond VFIO is the problem.

Ehm, vfio-pci emulates BARs too.  We also emulate FLR, power
management, DisINTx, and VPD.  FLR, PM, and VPD all have device
specific quirks in the host kernel, and I've generally taken the stance
that we should take advantage of those quirks, not duplicate them in
userspace and not invent new access mechanisms/ioctls for each of them.
Emulating DisINTx is convenient since we must have a mechanism to mask
INTx, whether it's at the device or the APIC, so we can pretend the
hardware supports it.  BAR emulation is really too trivial to argue
about, the BARs mean nothing to the physical device mapping, they're
simply scratch registers that we mask out the alignment bits on read.
vfio-pci is a mix of things that we decide are too complicated or
irrelevant to emulate in the kernel and things that take advantage of
shared quirks or are just too darn easy to worry about.  BARs fall into
that latter category, any sort of mapping into VM address spaces is
necessarily done in userspace, but scratch registers that are masked on
read, *shrug*, vfio-pci does that.  Thanks,

Alex
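
To make the "scratch registers" point concrete, BAR emulation in this
style amounts to a few lines; a sketch with hypothetical names (vbar[],
bar_size[] and bar_flags[] are illustrative, not actual vfio-pci
fields):

	static void vbar_write(struct vdev *vdev, int bar, u32 val)
	{
		vdev->vbar[bar] = val;	/* scratch only, nothing reaches HW */
	}

	static u32 vbar_read(struct vdev *vdev, int bar)
	{
		/* Mask out the alignment bits on read; the standard sizing
		 * protocol (write ~0, read back) then yields the BAR size.
		 */
		return (vdev->vbar[bar] & ~(vdev->bar_size[bar] - 1)) |
		       vdev->bar_flags[bar];
	}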
 
> > Technically Scalable IOV is definitely different from SR-IOV. It's
> > simpler in hardware. And we're not emulating SR-IOV. The point is
> > just that, usage-wise, we want to present a consistent user
> > experience, just like passing through a PCI endpoint (PF or VF) device
> > through the vfio eco-system, including various userspace VMMs (Qemu,
> > firecracker, rust-vmm, etc.), middleware (Libvirt), and higher level
> > management stacks.
> 
> Yes, I understand your desire, but at the same time we have not been
> doing device emulation in the kernel. You should at least be
> forthright about that major change in the cover letters/etc.
>  
> > > The only thing we get out of this is someone doesn't have to write an
> > > idxd emulation driver in qemu, instead they have to write it in the
> > > kernel. I don't see how that is a win for the ecosystem.  
> > 
> > No. The clear win is on leveraging classic VFIO iommu and its eco-system
> > as explained above.  
> 
> vdpa had no problem implementing iommu support without VFIO. This was
> their original argument too, it turned out to be erroneous.
> 
> Jason
> 


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-27  3:43                       ` Alex Williamson
@ 2020-04-27 11:58                         ` Jason Gunthorpe
  2020-04-27 13:19                           ` Alex Williamson
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-27 11:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz,
	bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

On Sun, Apr 26, 2020 at 09:43:55PM -0600, Alex Williamson wrote:
> On Sun, 26 Apr 2020 16:13:57 -0300
> Jason Gunthorpe <jgg@mellanox.com> wrote:
> 
> > On Sun, Apr 26, 2020 at 05:18:59AM +0000, Tian, Kevin wrote:
> > 
> > > > > I think providing a unified abstraction to userspace is also important,
> > > > > which is what VFIO provides today. The merit of using one set of VFIO
> > > > > APIs to manage all kinds of mediated devices and VF devices is a major
> > > > > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV
> > > > > or equivalent device is just overkill and doesn't scale. Also, the
> > > > > emulation code in the idxd driver is actually small, putting aside the PCI
> > > > > config space part, for which I already explained that most logic could be
> > > > > shared between mdev device drivers.
> > > > 
> > > > If it was just config space you might have an argument, VFIO already
> > > > does some config space mangling, but emulating BAR space is out of
> > > > scope of VFIO, IMHO.  
> > > 
> > > out of scope of vfio-pci, but in scope of vfio-mdev. btw I feel that most
> > > of your objections are actually related to the general idea of
> > > vfio-mdev.  
> > 
> > There have been several abusive proposals of vfio-mdev, everything
> > from a way to create device drivers to this kind of generic emulation
> > framework.
> > 
> > > Scalable IOV just uses PASID to harden DMA isolation in mediated
> > > pass-through usage which vfio-mdev enables. Then are you just opposing
> > > the whole vfio-mdev? If not, I'm curious about the criteria in your mind 
> > > about when using vfio-mdev is good...  
> > 
> > It is appropriate when non-PCI standard techniques are needed to do
> > raw device assignment, just like VFIO.
> > 
> > Basically if vfio-pci is already doing it then it seems reasonable
> > that vfio-mdev should do the same. This mission creep where vfio-mdev
> > gains functionality far beyond VFIO is the problem.
> 
> Ehm, vfio-pci emulates BARs too.  We also emulate FLR, power
> management, DisINTx, and VPD.  FLR, PM, and VPD all have device
> specific quirks in the host kernel, and I've generally taken the stance
> that we should take advantage of those quirks, not duplicate them in
> userspace and not invent new access mechanisms/ioctls for each of them.
> Emulating DisINTx is convenient since we must have a mechanism to mask
> INTx, whether it's at the device or the APIC, so we can pretend the
> hardware supports it.  BAR emulation is really too trivial to argue
> about, the BARs mean nothing to the physical device mapping, they're
> simply scratch registers that we mask out the alignment bits on read.
> vfio-pci is a mix of things that we decide are too complicated or
> irrelevant to emulate in the kernel and things that take advantage of
> shared quirks or are just too darn easy to worry about.  BARs fall into
> that latter category, any sort of mapping into VM address spaces is
> necessarily done in userspace, but scratch registers that are masked on
> read, *shrug*, vfio-pci does that.  Thanks,

It is not trivial masking. It is a 2000 line patch doing comprehensive
emulation.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-26 19:13                     ` Jason Gunthorpe
  2020-04-27  3:43                       ` Alex Williamson
@ 2020-04-27 12:13                       ` Tian, Kevin
  2020-04-27 12:55                         ` Jason Gunthorpe
  1 sibling, 1 reply; 89+ messages in thread
From: Tian, Kevin @ 2020-04-27 12:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz, bhelgaas, rafael,
	gregkh, tglx, hpa, alex.williamson, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

> From: Jason Gunthorpe <jgg@mellanox.com>
> Sent: Monday, April 27, 2020 3:14 AM
[...]
> > Technically Scalable IOV is definitely different from SR-IOV. It's
> > simpler in hardware. And we're not emulating SR-IOV. The point is
> > just that, usage-wise, we want to present a consistent user
> > experience, just like passing through a PCI endpoint (PF or VF) device
> > through the vfio eco-system, including various userspace VMMs (Qemu,
> > firecracker, rust-vmm, etc.), middleware (Libvirt), and higher level
> > management stacks.
> 
> Yes, I understand your desire, but at the same time we have not been
> doing device emulation in the kernel. You should at least be
> forthright about that major change in the cover letters/etc.

I searched 'emulate' in kernel/Documentation:

Documentation/sound/alsa-configuration.rst (emulate oss on alsa)
Documentation/security/tpm/tpm_vtpm_proxy.rst (emulate virtual TPM)
Documentation/networking/generic-hdlc.txt (emulate eth on HDLC)
Documentation/gpu/todo.rst (generic fbdev emulation)
...

I believe the main reason why such emulations are put in the kernel is
that those emulated device interfaces have their established
eco-systems and value which the kernel shouldn't break. As you
emphasized earlier, they have good reasons for getting into the kernel.

Then back to this context. Almost every newly-born Linux VMM
(firecracker, crosvm, cloud hypervisor, and some proprietary
implementations) supports only two types of devices: virtio and
vfio, because they want to be simple and slim. Virtio provides a
basic set of I/O capabilities required by most VMs, while vfio brings
a unified interface for gaining added value or higher performance
from assigned devices. Even Qemu supports a minimal configuration
('microvm') now, for a similar reason.  So the vfio eco-system is
significant and represents a major trend in the virtualization space.

Then supporting the vfio eco-system is actually the usage GOAL
of this patch series, instead of an optional technique to be opted into.
vfio-pci is there for passing through standalone PCI endpoints
(PF or VF), and vfio-mdev is there for passing through a smaller
portion of device resources while sharing the same VFIO interface
to gain the uniform support in this eco-system.

I believe the above is a good reason for putting emulation in the idxd
driver by using vfio-mdev. Yes, it does imply that there will be
more emulation in the kernel when more Scalable-IOV (or similar)
devices are introduced. But as explained earlier, the PCI config
space emulation can be largely consolidated and reused, and
the remaining device-specific MMIO emulation is relatively
simple because we define the virtual device interface to be the same
as or even simpler than a VF interface. Only a small set of registers
is emulated after the fast-path resources are passed through, and
that small set of course needs to meet the normal quality
requirements for getting into the kernel.

We'll definitely highlight this part in a future cover letter. 😊

> 
> > > The only thing we get out of this is someone doesn't have to write an
> > > idxd emulation driver in qemu, instead they have to write it in the
> > > kernel. I don't see how that is a win for the ecosystem.
> >
> > No. The clear win is on leveraging classic VFIO iommu and its eco-system
> > as explained above.
> 
> vdpa had no problem implementing iommu support without VFIO. This was
> their original argument too, it turned out to be erroneous.
> 

Every wheel can be re-invented... my gut feeling is that vdpa is for
offloading fast-path vhost operations to the underlying accelerators.
It is just a welcome/reasonable extension to the existing virtio/vhost
eco-system. For other types of devices such as idxd, we rely on the vfio
eco-system to keep up with the fast-evolving VMM spectrum.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-27 12:13                       ` Tian, Kevin
@ 2020-04-27 12:55                         ` Jason Gunthorpe
  0 siblings, 0 replies; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-27 12:55 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz, bhelgaas, rafael,
	gregkh, tglx, hpa, alex.williamson, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

On Mon, Apr 27, 2020 at 12:13:33PM +0000, Tian, Kevin wrote:

> Then back to this context. Almost every newly-born Linux VMM
> (firecracker, crosvm, cloud hypervisor, and some proprietary 
> implementations) supports only two types of devices: virtio and 
> vfio, because they want to be simple and slim.

For security. Moving all the sketchy emulation code into the kernel
seems like a worse security posture overall :(

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-27 11:58                         ` Jason Gunthorpe
@ 2020-04-27 13:19                           ` Alex Williamson
  2020-04-27 13:22                             ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Alex Williamson @ 2020-04-27 13:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz,
	bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

On Mon, 27 Apr 2020 08:58:18 -0300
Jason Gunthorpe <jgg@mellanox.com> wrote:

> On Sun, Apr 26, 2020 at 09:43:55PM -0600, Alex Williamson wrote:
> > On Sun, 26 Apr 2020 16:13:57 -0300
> > Jason Gunthorpe <jgg@mellanox.com> wrote:
> >   
> > > On Sun, Apr 26, 2020 at 05:18:59AM +0000, Tian, Kevin wrote:
> > >   
> > > > > > I think providing a unified abstraction to userspace is also important,
> > > > > > which is what VFIO provides today. The merit of using one set of VFIO
> > > > > > APIs to manage all kinds of mediated devices and VF devices is a major
> > > > > > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV
> > > > > > or equivalent device is just overkill and doesn't scale. Also, the
> > > > > > emulation code in the idxd driver is actually small, putting aside the PCI
> > > > > > config space part, for which I already explained that most logic could be
> > > > > > shared between mdev device drivers.
> > > > > 
> > > > > If it was just config space you might have an argument, VFIO already
> > > > > does some config space mangling, but emulating BAR space is out of
> > > > > scope of VFIO, IMHO.    
> > > > 
> > > > out of scope of vfio-pci, but in scope of vfio-mdev. btw I feel that most
> > > > of your objections are actually related to the general idea of
> > > > vfio-mdev.    
> > > 
> > > There have been several abusive proposals of vfio-mdev, everything
> > > from a way to create device drivers to this kind of generic emulation
> > > framework.
> > >   
> > > > Scalable IOV just uses PASID to harden DMA isolation in mediated
> > > > pass-through usage which vfio-mdev enables. Then are you just opposing
> > > > the whole vfio-mdev? If not, I'm curious about the criteria in your mind 
> > > > about when using vfio-mdev is good...    
> > > 
> > > It is appropriate when non-PCI standard techniques are needed to do
> > > raw device assignment, just like VFIO.
> > > 
> > > Basically if vfio-pci is already doing it then it seems reasonable
> > > that vfio-mdev should do the same. This mission creep where vfio-mdev
> > > gains functionality far beyond VFIO is the problem.  
> > 
> > Ehm, vfio-pci emulates BARs too.  We also emulate FLR, power
> > management, DisINTx, and VPD.  FLR, PM, and VPD all have device
> > specific quirks in the host kernel, and I've generally taken the stance
> > that we should take advantage of those quirks, not duplicate them in
> > userspace and not invent new access mechanisms/ioctls for each of them.
> > Emulating DisINTx is convenient since we must have a mechanism to mask
> > INTx, whether it's at the device or the APIC, so we can pretend the
> > hardware supports it.  BAR emulation is really too trivial to argue
> > about, the BARs mean nothing to the physical device mapping, they're
> > simply scratch registers that we mask out the alignment bits on read.
> > vfio-pci is a mix of things that we decide are too complicated or
> > irrelevant to emulate in the kernel and things that take advantage of
> > shared quirks or are just too darn easy to worry about.  BARs fall into
> > that latter category, any sort of mapping into VM address spaces is
> > necessarily done in userspace, but scratch registers that are masked on
> > read, *shrug*, vfio-pci does that.  Thanks,  
> 
> It is not trivial masking. It is a 2000 line patch doing comprehensive
> emulation.

Not sure what you're referring to, I see about 30 lines of code in
vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
config space and maybe a couple hundred lines of code in total handling
config space emulation.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-27 13:19                           ` Alex Williamson
@ 2020-04-27 13:22                             ` Jason Gunthorpe
  2020-04-27 14:18                               ` Alex Williamson
  2020-04-29  9:42                               ` Tian, Kevin
  0 siblings, 2 replies; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-27 13:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz,
	bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:

> > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > emulation.
> 
> Not sure what you're referring to, I see about 30 lines of code in
> vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> config space and maybe a couple hundred lines of code in total handling
> config space emulation.  Thanks,

Look around vidxd_do_command()

If I understand this flow properly..

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-27 13:22                             ` Jason Gunthorpe
@ 2020-04-27 14:18                               ` Alex Williamson
  2020-04-27 14:25                                 ` Jason Gunthorpe
  2020-04-29  9:42                               ` Tian, Kevin
  1 sibling, 1 reply; 89+ messages in thread
From: Alex Williamson @ 2020-04-27 14:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz,
	bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

On Mon, 27 Apr 2020 10:22:18 -0300
Jason Gunthorpe <jgg@mellanox.com> wrote:

> On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
> 
> > > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > > emulation.  
> > 
> > Not sure what you're referring to, I see about 30 lines of code in
> > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> > config space and maybe a couple hundred lines of code in total handling
> > config space emulation.  Thanks,  
> 
> Look around vidxd_do_command()
> 
> If I understand this flow properly..

I've only glanced at it, but that's called in response to a write to
MMIO space on the device, so it's implementing a device specific
register.  Are you asking that PCI config space be done in userspace
or any sort of device emulation?  The assumption with mdev is that we
need emulation in the host kernel because we need a trusted entity to
mediate device access and interact with privileged portion of the
device control.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-27 14:18                               ` Alex Williamson
@ 2020-04-27 14:25                                 ` Jason Gunthorpe
  2020-04-27 15:41                                   ` Alex Williamson
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-27 14:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz,
	bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

On Mon, Apr 27, 2020 at 08:18:41AM -0600, Alex Williamson wrote:
> On Mon, 27 Apr 2020 10:22:18 -0300
> Jason Gunthorpe <jgg@mellanox.com> wrote:
> 
> > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
> > 
> > > > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > > > emulation.  
> > > 
> > > Not sure what you're referring to, I see about 30 lines of code in
> > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> > > config space and maybe a couple hundred lines of code in total handling
> > > config space emulation.  Thanks,  
> > 
> > Look around vidxd_do_command()
> > 
> > If I understand this flow properly..
> 
> I've only glanced at it, but that's called in response to a write to
> MMIO space on the device, so it's implementing a device specific
> register.

It is doing emulation of the secure BAR. The entire 1000 lines of
vidxd_* functions appear to be focused on this task.

> Are you asking that PCI config space be done in userspace
> or any sort of device emulation?  

I'm concerned about doing full emulation of registers on a MMIO BAR
that trigger complex actions in response to MMIO read/write.

Simple masking and simple config space stuff doesn't seem so
problematic.

> The assumption with mdev is that we need emulation in the host
> kernel because we need a trusted entity to mediate device access and
> interact with privileged portion of the device control.  Thanks,

Sure, but there are all kinds of different levels to this - mdev
should not be some open ended device emulation framework, IMHO.

ie other devices need only a small amount of kernel side help and
don't need complex MMIO BAR emulation.

Would you be happy if someone proposed an e1000 NIC emulator using
mdev? Why not move every part of qemu's PCI device emulation into the
kernel?

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-27 14:25                                 ` Jason Gunthorpe
@ 2020-04-27 15:41                                   ` Alex Williamson
  2020-04-27 16:16                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Alex Williamson @ 2020-04-27 15:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz,
	bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

On Mon, 27 Apr 2020 11:25:53 -0300
Jason Gunthorpe <jgg@mellanox.com> wrote:

> On Mon, Apr 27, 2020 at 08:18:41AM -0600, Alex Williamson wrote:
> > On Mon, 27 Apr 2020 10:22:18 -0300
> > Jason Gunthorpe <jgg@mellanox.com> wrote:
> >   
> > > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
> > >   
> > > > > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > > > > emulation.    
> > > > 
> > > > Not sure what you're referring to, I see about 30 lines of code in
> > > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> > > > config space and maybe a couple hundred lines of code in total handling
> > > > config space emulation.  Thanks,    
> > > 
> > > Look around vidxd_do_command()
> > > 
> > > If I understand this flow properly..  
> > 
> > I've only glanced at it, but that's called in response to a write to
> > MMIO space on the device, so it's implementing a device specific
> > register.  
> 
> It is doing emulation of the secure BAR. The entire 1000 lines of
> vidxd_* functions appear to be focused on this task.

Ok, we/I need a terminology clarification, a BAR is a register in
config space for determining the size, type, and setting the location
of an I/O or memory region of a device.  I've been asserting that the
emulation of the BAR itself is trivial, but are you actually focused on
emulation of the region described by the BAR?  This is what mdev is
for, mediating access to a device and filling in gaps such that we can
re-use the vfio device APIs.
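[Editor's illustration: the "trivial" BAR-register emulation described
here is the standard PCI sizing handshake - the guest writes all 1s to
the BAR and reads back a value with the alignment bits masked.  A
minimal sketch, assuming a hypothetical 64KB region; this is not the
idxd code:]

/*
 * The stored BAR value is a scratch register: reads mask out the
 * alignment bits implied by the region size, which is how a guest
 * sizes a BAR (write ~0, read back).  PCI_BASE_ADDRESS_MEM_TYPE_64
 * comes from <uapi/linux/pci_regs.h>.
 */
#define VDEV_BAR0_SIZE	0x10000		/* hypothetical 64KB MMIO region */

static u32 vdev_bar0_read(u32 stored)
{
	return (stored & ~(u32)(VDEV_BAR0_SIZE - 1)) |
	       PCI_BASE_ADDRESS_MEM_TYPE_64;
}

static void vdev_bar0_write(u32 *stored, u32 val)
{
	*stored = val;	/* no side effect: the value means nothing to the HW */
}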

> > Are you asking that PCI config space be done in userspace
> > or any sort of device emulation?    
> 
> I'm concerned about doing full emulation of registers on a MMIO BAR
> that trigger complex actions in response to MMIO read/write.

Maybe what you're recalling me say about mdev is that its Achilles
heel is that we rely on the mediation provider (ie. vendor driver) for
security, we don't necessarily have a piece of known, common hardware
like an IOMMU to protect us when things go wrong.  That's true, but
don't we also trust drivers in the host kernel to correctly manage and
validate their own interactions with hardware, including the APIs
provided through other user interfaces.  Is the assertion then that
device specific, register level API is too difficult to emulate?

> Simple masking and simple config space stuff doesn't seem so
> problematic.
> 
> > The assumption with mdev is that we need emulation in the host
> > kernel because we need a trusted entity to mediate device access and
> > interact with privileged portion of the device control.  Thanks,  
> 
> Sure, but there are all kinds of different levels to this - mdev
> should not be some open ended device emulation framework, IMHO.
> 
> ie other devices need only a small amount of kernel side help and
> don't need complex MMIO BAR emulation.
> 
> Would you be happy if someone proposed an e1000 NIC emulator using
> mdev? Why not move every part of qemu's PCI device emulation into the
> kernel?

Well, in order to mediate a device, we certainly expect there to be a
physical device.  I also expect that there's some performance or at
least compatibility advantage to using the device API directly rather
than masquerading everything behind something like virtio.  So no, I
wouldn't expect someone to create a fully emulated device in mdev, but
also I do expect some degree of device emulation in an mdev driver to
fill the gaps in non-performance path that hardware chose to defer to
software.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-27 15:41                                   ` Alex Williamson
@ 2020-04-27 16:16                                     ` Jason Gunthorpe
  2020-04-27 16:25                                       ` Dave Jiang
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-27 16:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz,
	bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

On Mon, Apr 27, 2020 at 09:41:37AM -0600, Alex Williamson wrote:
> On Mon, 27 Apr 2020 11:25:53 -0300
> Jason Gunthorpe <jgg@mellanox.com> wrote:
> 
> > On Mon, Apr 27, 2020 at 08:18:41AM -0600, Alex Williamson wrote:
> > > On Mon, 27 Apr 2020 10:22:18 -0300
> > > Jason Gunthorpe <jgg@mellanox.com> wrote:
> > >   
> > > > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
> > > >   
> > > > > > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > > > > > emulation.    
> > > > > 
> > > > > Not sure what you're referring to, I see about 30 lines of code in
> > > > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> > > > > config space and maybe a couple hundred lines of code in total handling
> > > > > config space emulation.  Thanks,    
> > > > 
> > > > Look around vidxd_do_command()
> > > > 
> > > > If I understand this flow properly..  
> > > 
> > > I've only glanced at it, but that's called in response to a write to
> > > MMIO space on the device, so it's implementing a device specific
> > > register.  
> > 
> > It is doing emulation of the secure BAR. The entire 1000 lines of
> > vidxd_* functions appear to be focused on this task.
> 
> Ok, we/I need a terminology clarification, a BAR is a register in
> config space for determining the size, type, and setting the location
> of an I/O or memory region of a device.  I've been asserting that the
> emulation of the BAR itself is trivial, but are you actually focused on
> emulation of the region described by the BAR?

Yes, BAR here means the actual MMIO memory window - not the config
space part. Config space emulation is largely trivial.

> > > Are you asking that PCI config space be done in userspace
> > > or any sort of device emulation?    
> > 
> > I'm concerned about doing full emulation of registers on a MMIO BAR
> > that trigger complex actions in response to MMIO read/write.
> 
> Maybe what you're recalling me say about mdev is that its Achilles
> heel is that we rely on the mediation provider (ie. vendor driver) for
> security, we don't necessarily have a piece of known, common hardware
> like an IOMMU to protect us when things go wrong.  That's true, but
> don't we also trust drivers in the host kernel to correctly manage and
> validate their own interactions with hardware, including the APIs
> provided through other user interfaces.  Is the assertion then that
> device specific, register level API is too difficult to emulate?

No, it is a reflection on the standard Linux philosophy that if it can
be done in userspace it should be done in userspace. Ie keep minimal
work in the monolithic kernel.

Also to avoid duplication, ie idxd proposes to have a char dev with a
normal kernel driver interface and then an in-kernel emulated MMIO BAR
version of that same capability for VFIO consumption.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-27 16:16                                     ` Jason Gunthorpe
@ 2020-04-27 16:25                                       ` Dave Jiang
  2020-04-27 21:56                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Dave Jiang @ 2020-04-27 16:25 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Tian, Kevin, Raj, Ashok, vkoul, megha.dey, maz, bhelgaas, rafael,
	gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar,
	Sanjay K, Luck, Tony, Lin, Jing, Williams, Dan J, kwankhede,
	eric.auger, parav, dmaengine, linux-kernel, x86, linux-pci, kvm



On 4/27/2020 9:16 AM, Jason Gunthorpe wrote:
> On Mon, Apr 27, 2020 at 09:41:37AM -0600, Alex Williamson wrote:
>> On Mon, 27 Apr 2020 11:25:53 -0300
>> Jason Gunthorpe <jgg@mellanox.com> wrote:
>>
>>> On Mon, Apr 27, 2020 at 08:18:41AM -0600, Alex Williamson wrote:
>>>> On Mon, 27 Apr 2020 10:22:18 -0300
>>>> Jason Gunthorpe <jgg@mellanox.com> wrote:
>>>>    
>>>>> On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
>>>>>    
>>>>>>> It is not trivial masking. It is a 2000 line patch doing comprehensive
>>>>>>> emulation.
>>>>>>
>>>>>> Not sure what you're referring to, I see about 30 lines of code in
>>>>>> vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
>>>>>> config space and maybe a couple hundred lines of code in total handling
>>>>>> config space emulation.  Thanks,
>>>>>
>>>>> Look around vidxd_do_command()
>>>>>
>>>>> If I understand this flow properly..
>>>>
>>>> I've only glanced at it, but that's called in response to a write to
>>>> MMIO space on the device, so it's implementing a device specific
>>>> register.
>>>
>>> It is doing emulation of the secure BAR. The entire 1000 lines of
>>> vidxd_* functions appear to be focused on this task.
>>
>> Ok, we/I need a terminology clarification, a BAR is a register in
>> config space for determining the size, type, and setting the location
>> of an I/O or memory region of a device.  I've been asserting that the
>> emulation of the BAR itself is trivial, but are you actually focused on
>> emulation of the region described by the BAR?
> 
> Yes, BAR here means the actual MMIO memory window - not the config
> space part. Config space emulation is largely trivial.
> 
>>>> Are you asking that PCI config space be done in userspace
>>>> or any sort of device emulation?
>>>
>>> I'm concerned about doing full emulation of registers on a MMIO BAR
>>> that trigger complex actions in response to MMIO read/write.
>>
>> Maybe what you're recalling me say about mdev is that its Achilles
>> heel is that we rely on the mediation provider (ie. vendor driver) for
>> security, we don't necessarily have a piece of known, common hardware
>> like an IOMMU to protect us when things go wrong.  That's true, but
>> don't we also trust drivers in the host kernel to correctly manage and
>> validate their own interactions with hardware, including the APIs
>> provided through other user interfaces.  Is the assertion then that
>> device specific, register level API is too difficult to emulate?
> 
> No, it is a reflection on the standard Linux philosophy that if it can
> be done in userspace it should be done in userspace. Ie keep minimal
> work in the monolithic kernel.
> 
> Also to avoid duplication, ie idxd proposes to have a char dev with a
> normal kernel driver interface and then an in-kernel emulated MMIO BAR
> version of that same capability for VFIO consumption.

The char dev interface serves user apps on the host (which we will
deprecate and move to the UACCE framework in the near future). The mdev
interface will be servicing guests only. I'm not sure where the
duplication of functionality comes into play.

> 
> Jason
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 01/15] drivers/base: Introduce platform_msi_ops
  2020-04-26  7:01   ` Greg KH
@ 2020-04-27 21:38     ` Dave Jiang
  2020-04-28  7:34       ` Greg KH
  0 siblings, 1 reply; 89+ messages in thread
From: Dave Jiang @ 2020-04-27 21:38 UTC (permalink / raw)
  To: Greg KH
  Cc: vkoul, megha.dey, maz, bhelgaas, rafael, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav, dmaengine,
	linux-kernel, x86, linux-pci, kvm



On 4/26/2020 12:01 AM, Greg KH wrote:
> On Tue, Apr 21, 2020 at 04:33:53PM -0700, Dave Jiang wrote:
>> From: Megha Dey <megha.dey@linux.intel.com>
>>
>> This is a preparatory patch to introduce Interrupt Message Store (IMS).
>>
>> Until now, platform-msi.c provided a generic way to handle non-PCI MSI
>> interrupts. Platform-msi uses its parent chip's mask/unmask routines
>> and only provides a way to write the message in the generating device.
>>
>> Newly emerging non-PCI compliant MSI-like interrupts (Intel's IMS for
>> instance) might need to provide a device specific mask and unmask callback
>> as well, apart from the write function.
>>
>> Hence, introduce a new structure platform_msi_ops, which would provide
>> device specific write function as well as other device specific callbacks
>> (mask/unmask).
>>
>> Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
> 
> As this is not following the Intel-specific rules for sending me new
> code, I am just deleting it all from my inbox.

That is my fault. As the aggregator of the patches, I should've signed 
off Megha's patches.

> 
> Please follow the rules you all have been given, they are specific and
> there for a reason.  And in looking at this code, those rules are not
> going away any time soon.
> 
> greg k-h
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-27 16:25                                       ` Dave Jiang
@ 2020-04-27 21:56                                         ` Jason Gunthorpe
  0 siblings, 0 replies; 89+ messages in thread
From: Jason Gunthorpe @ 2020-04-27 21:56 UTC (permalink / raw)
  To: Dave Jiang
  Cc: Alex Williamson, Tian, Kevin, Raj, Ashok, vkoul, megha.dey, maz,
	bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

On Mon, Apr 27, 2020 at 09:25:58AM -0700, Dave Jiang wrote:
> > Also to avoid duplication, ie idxd proposes to have a char dev with a
> > normal kernel driver interface and then an in-kernel emulated MMIO BAR
> > version of that same capability for VFIO consumption.
> 
> The char dev interface serves user apps on the host (which we will deprecate
> and move to the UACCE framework in the near future).

The point is the char dev or UACCE framework should provide enough
capability to implement the emulation in user space.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 01/15] drivers/base: Introduce platform_msi_ops
  2020-04-27 21:38     ` Dave Jiang
@ 2020-04-28  7:34       ` Greg KH
  0 siblings, 0 replies; 89+ messages in thread
From: Greg KH @ 2020-04-28  7:34 UTC (permalink / raw)
  To: Dave Jiang
  Cc: vkoul, megha.dey, maz, bhelgaas, rafael, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav, dmaengine,
	linux-kernel, x86, linux-pci, kvm

On Mon, Apr 27, 2020 at 02:38:12PM -0700, Dave Jiang wrote:
> 
> 
> On 4/26/2020 12:01 AM, Greg KH wrote:
> > On Tue, Apr 21, 2020 at 04:33:53PM -0700, Dave Jiang wrote:
> > > From: Megha Dey <megha.dey@linux.intel.com>
> > > 
> > > This is a preparatory patch to introduce Interrupt Message Store (IMS).
> > > 
> > > Until now, platform-msi.c provided a generic way to handle non-PCI MSI
> > > interrupts. Platform-msi uses its parent chip's mask/unmask routines
> > > and only provides a way to write the message in the generating device.
> > > 
> > > Newly emerging non-PCI compliant MSI-like interrupts (Intel's IMS for
> > > instance) might need to provide a device specific mask and unmask callback
> > > as well, apart from the write function.
> > > 
> > > Hence, introduce a new structure platform_msi_ops, which would provide
> > > device specific write function as well as other device specific callbacks
> > > (mask/unmask).
> > > 
> > > Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
> > 
> > As this is not following the Intel-specific rules for sending me new
> > code, I am just deleting it all from my inbox.
> 
> That is my fault. As the aggregator of the patches, I should've signed off
> Megha's patches.

That is NOT the Intel-specific rules I am talking about.  Please go work
with the "Linux group" at Intel to find out what I am referring to, they
know what I mean.

The not-signing-off is just a normal kernel community rule, everyone has
to follow that.

greg k-h

^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-27 13:22                             ` Jason Gunthorpe
  2020-04-27 14:18                               ` Alex Williamson
@ 2020-04-29  9:42                               ` Tian, Kevin
  2020-05-08 20:47                                 ` Raj, Ashok
  1 sibling, 1 reply; 89+ messages in thread
From: Tian, Kevin @ 2020-04-29  9:42 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Raj, Ashok, Jiang, Dave, vkoul, megha.dey, maz, bhelgaas, rafael,
	gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar,
	Sanjay K, Luck, Tony, Lin, Jing, Williams, Dan J, kwankhede,
	eric.auger, parav, dmaengine, linux-kernel, x86, linux-pci, kvm

> From: Jason Gunthorpe <jgg@mellanox.com>
> Sent: Monday, April 27, 2020 9:22 PM
> 
> On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
> 
> > > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > > emulation.
> >
> > Not sure what you're referring to, I see about 30 lines of code in
> > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> > config space and maybe a couple hundred lines of code in total handling
> > config space emulation.  Thanks,
> 
> Look around vidxd_do_command()
> 
> If I understand this flow properly..
> 

Hi, Jason,

I guess the 2000 lines mostly refer to the changes in mdev.c and vdev.c. 
We did a break-down among them:

1) ~150 LOC for vdev initialization
2) ~150 LOC for cfg space emulation
3) ~230 LOC for mmio r/w emulation
4) ~500 LOC for controlling the work queue (vidxd_do_command),
triggered by write emulation of the IDXD_CMD_OFFSET register
5) the remaining lines are all about vfio-mdev registration/callbacks,
for reporting mmio/irq resource, eventfd, mmap, etc.

1/2/3) are pure device emulation, which together account for ~500 LOC.

4) needs to be in the kernel regardless of which uAPI is used, because it
talks to the physical work queue (enable, disable, drain, abort, reset, etc.)
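[To make 3) and 4) concrete, the shape of that emulation is roughly the
following - an illustrative sketch only, with a simplified struct and
offset handling; only vidxd_do_command() and IDXD_CMD_OFFSET are names
taken from the series:]

/*
 * Rough shape of the MMIO write emulation: most registers are plain
 * emulated state (item 3), but a write to the command register drives
 * real work queue control on the physical device (item 4).
 */
static void vidxd_bar0_write(struct vdcm_idxd *vidxd, u32 offset,
			     const void *buf, unsigned int size)
{
	if (offset == IDXD_CMD_OFFSET && size == sizeof(u32)) {
		vidxd_do_command(vidxd, *(const u32 *)buf);
		return;
	}

	memcpy(vidxd->bar0 + offset, buf, size);	/* emulated state only */
}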

Then, if we are just talking about the ~500 LOC of emulation code left
in the kernel, is it still a big concern to you? 😊

Thanks
Kevin

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain
  2020-04-23 20:11   ` Jason Gunthorpe
@ 2020-05-01 22:30     ` Dey, Megha
  2020-05-03 22:25       ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Dey, Megha @ 2020-05-01 22:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Dave Jiang
  Cc: vkoul, maz, bhelgaas, rafael, gregkh, tglx, hpa, alex.williamson,
	jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, dmaengine, linux-kernel, x86, linux-pci, kvm

Hi Jason,

On 4/23/2020 1:11 PM, Jason Gunthorpe wrote:
> On Tue, Apr 21, 2020 at 04:34:11PM -0700, Dave Jiang wrote:
>> diff --git a/drivers/base/ims-msi.c b/drivers/base/ims-msi.c
>> new file mode 100644
>> index 000000000000..738f6d153155
>> +++ b/drivers/base/ims-msi.c
>> @@ -0,0 +1,100 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Support for Device Specific IMS interrupts.
>> + *
>> + * Copyright © 2019 Intel Corporation.
>> + *
>> + * Author: Megha Dey <megha.dey@intel.com>
>> + */
>> +
>> +#include <linux/dmar.h>
>> +#include <linux/irq.h>
>> +#include <linux/mdev.h>
>> +#include <linux/pci.h>
>> +
>> +/*
>> + * Determine if a dev is mdev or not. Return NULL if not mdev device.
>> + * Return mdev's parent dev if success.
>> + */
>> +static inline struct device *mdev_to_parent(struct device *dev)
>> +{
>> +	struct device *ret = NULL;
>> +	struct device *(*fn)(struct device *dev);
>> +	struct bus_type *bus = symbol_get(mdev_bus_type);
>> +
>> +	if (bus && dev->bus == bus) {
>> +		fn = symbol_get(mdev_dev_to_parent_dev);
>> +		ret = fn(dev);
>> +		symbol_put(mdev_dev_to_parent_dev);
>> +		symbol_put(mdev_bus_type);
> 
> No, things like this are not OK in the drivers/base
> 
> Whatever this is doing needs to be properly architected in some
> generic way.

Basically what I am trying to do here is to determine if the device is
an mdev device or not. mdev devices have no IRQ domain associated with
them; they use their parent dev's IRQ domain to allocate interrupts.

The issue is that
1. All the vfio-mdev code today can be compiled as a module.
2. None of the mdev macros/functions are used outside of the
drivers/vfio/mdev code path (where they are defined).

Hence, these definitions are not visible outside of drivers/vfio/mdev 
when compiled as a module and thus I have used symbol_get/put.

I will try asking the mdev folks if they have a better solution for
this, or whether some of this code can be made more generic.

> 
>> +static int dev_ims_prepare(struct irq_domain *domain, struct device *dev,
>> +			   int nvec, msi_alloc_info_t *arg)
>> +{
>> +	if (dev_is_mdev(dev))
>> +		dev = mdev_to_parent(dev);
> 
> Like maybe the caller shouldn't be passing in a mdev in the first
> place, or some generic driver layer scheme is needed to go from a
> child device (eg a mdev or one of these new virtual bus things) to the
> struct device that owns the IRQ interface.

In our current use case, IMS interrupts are only used by guests (mdevs),
although they can be used by the host as well. So the 'dev' passed by the
caller of platform_msi_domain_alloc_irqs_group() is effectively an mdev.

I am not sure how we could have generic code to go from the 'child'
mdev device to the parent struct device. Do you have any suggestions on
how we could do this?

> 
>> +	init_irq_alloc_info(arg, NULL);
>> +	arg->dev = dev;
>> +	arg->type = X86_IRQ_ALLOC_TYPE_IMS;
> 
> Also very bewildering to see X86_* in drivers/base

Well, this needs to go for sure. I will replace it with something more 
generic.
> 
>> +struct irq_domain *arch_create_ims_irq_domain(struct irq_domain *parent,
>> +					      const char *name)
>> +{
>> +	struct fwnode_handle *fn;
>> +	struct irq_domain *domain;
>> +
>> +	fn = irq_domain_alloc_named_fwnode(name);
>> +	if (!fn)
>> +		return NULL;
>> +
>> +	domain = msi_create_irq_domain(fn, &ims_ir_domain_info, parent);
>> +	if (!domain)
>> +		return NULL;
>> +
>> +	irq_domain_update_bus_token(domain, DOMAIN_BUS_PLATFORM_MSI);
>> +	irq_domain_free_fwnode(fn);
>> +
>> +	return domain;
>> +}
> 
> I'm still not really clear why all this is called IMS.. This looks
> like the normal boilerplate to setup an IRQ domain? What is actually
> 'ims' in here?

It is just a way to create a new domain specifically for IMS interrupts. 
Although, since there is a platform_msi_create_irq_domain already, which 
does something similar, I will use the same for IMS as well.

Also, since there is quite a stir over the name 'IMS', do you have any
suggestion for a more generic name for this?

> 
>> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
>> index 7d922950caaf..c21f1305a76b 100644
>> +++ b/drivers/vfio/mdev/mdev_private.h
>> @@ -36,7 +36,6 @@ struct mdev_device {
>>   };
>>   
>>   #define to_mdev_device(dev)	container_of(dev, struct mdev_device, dev)
>> -#define dev_is_mdev(d)		((d)->bus == &mdev_bus_type)
>>   
>>   struct mdev_type {
>>   	struct kobject kobj;
>> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
>> index 0ce30ca78db0..fa2344e239ef 100644
>> +++ b/include/linux/mdev.h
>> @@ -144,5 +144,8 @@ void mdev_unregister_driver(struct mdev_driver *drv);
>>   struct device *mdev_parent_dev(struct mdev_device *mdev);
>>   struct device *mdev_dev(struct mdev_device *mdev);
>>   struct mdev_device *mdev_from_dev(struct device *dev);
>> +struct device *mdev_dev_to_parent_dev(struct device *dev);
>> +
>> +#define dev_is_mdev(dev) ((dev)->bus == symbol_get(mdev_bus_type))
> 
> NAK on the symbol_get

As I mentioned earlier, given the way the current mdev code is 
structured, the only way to use dev_is_mdev or other macros/functions in 
the mdev subsystem outside of drivers/vfio/mdev is to use 
symbol_get/put. Obviously this is not a correct thing to do, so I will
have to find another way.

> 
> Jason
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-23 19:49       ` Jason Gunthorpe
@ 2020-05-01 22:31         ` Dey, Megha
  2020-05-03 22:21           ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Dey, Megha @ 2020-05-01 22:31 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Dave Jiang, Vinod Koul, maz, Bjorn Helgaas, Rafael J. Wysocki,
	Greg KH, Thomas Gleixner, H. Peter Anvin, Alex Williamson,
	Jacob jun Pan, Raj, Ashok, Yi L Liu, Baolu Lu, Tian, Kevin,
	Sanjay K Kumar, Luck, Tony, Jing Lin, kwankhede, eric.auger,
	parav, dmaengine, Linux Kernel Mailing List, X86 ML, linux-pci,
	KVM list

Hi Jason,

On 4/23/2020 12:49 PM, Jason Gunthorpe wrote:
> On Thu, Apr 23, 2020 at 12:17:50PM -0700, Dan Williams wrote:
> 
>> Per Megha's follow-up can you send the details about that other device
>> and help clear a path for a device-specific MSI addr/data table
>> format. Ever since HMM I've been sensitive, perhaps overly-sensitive,
>> to claims about future upstream users. The fact that you have an
>> additional use case is golden for pushing this into a common area and
>> validating the scope of the proposed API.
> 
> I think I said it at plumbers, but yes, we are interested in this, and
> would like dynamic MSI-like interrupts available to the driver (what
> Intel calls IMS)
> 

So basically you are looking for a way to dynamically allocate the 
platform-msi interrupts, correct?

Since I don't have access to any of the platform-msi devices, it is hard 
for me to test this code for other drivers except idxd for now.
Once I submit the next round of patches, after addressing all the 
comments, would it be possible for you to test this code for any of your 
devices?

> It is something easy enough to illustrate with any RDMA device really,
> just open a MR against the addr and use RDMA_WRITE to trigger the
> data. It should trigger a Linux IRQ. Nothing else should be needed.
> 
> Jason
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-23 19:18     ` Jason Gunthorpe
@ 2020-05-01 22:31       ` Dey, Megha
  2020-05-03 22:22         ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Dey, Megha @ 2020-05-01 22:31 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Dave Jiang, Vinod Koul, maz, Bjorn Helgaas, Rafael J. Wysocki,
	Greg KH, Thomas Gleixner, H. Peter Anvin, Alex Williamson,
	Jacob jun Pan, Raj, Ashok, Yi L Liu, baolu.lu, Tian, Kevin,
	Sanjay K Kumar, Luck, Tony, Jing Lin, kwankhede, eric.auger,
	parav, dmaengine, Linux Kernel Mailing List, X86 ML, linux-pci,
	KVM list

Hi Jason,

On 4/23/2020 12:18 PM, Jason Gunthorpe wrote:
> On Wed, Apr 22, 2020 at 02:24:11PM -0700, Dan Williams wrote:
>> On Tue, Apr 21, 2020 at 4:55 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
>>>
>>> On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
>>>> The actual code is independent of the stage 2 driver code submission that adds
>>>> support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will
>>>> support dedicated workqueue on a guest with no vIOMMU.
>>>>
>>>> A new device type "mdev" is introduced for the idxd driver. This allows the wq
>>>> to be dedicated to the usage of a VFIO mediated device (mdev). Once the work
>>>> queue (wq) is enabled, an uuid generated by the user can be added to the wq
>>>> through the uuid sysfs attribute for the wq.  After the association, a mdev can
>>>> be created using this UUID. The mdev driver code will associate the uuid and
>>>> setup the mdev on the driver side. When the create operation is successful, the
>>>> uuid can be passed to qemu. When the guest boots up, it should discover a DSA
>>>> device when doing PCI discovery.
>>>
>>> I'm feeling really skeptical that adding all this PCI config space and
>>> MMIO BAR emulation to the kernel just to cram this into a VFIO
>>> interface is a good idea, that kind of stuff is much safer in
>>> userspace.
>>>
>>> Particularly since vfio is not really needed once a driver is using
>>> the PASID stuff. We already have general code for drivers to use to
>>> attach a PASID to a mm_struct - and using vfio while disabling all the
>>> DMA/iommu config really seems like an abuse.
>>>
>>> A /dev/idxd char dev that mmaps a bar page and links it to a PASID
>>> seems a lot simpler and saner kernel wise.
>>>
>>>> The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
>>>> interrupts for the guest. This preserves MSIX for host usages and also allows a
>>>> significantly larger number of interrupt vectors for guest usage.
>>>
>>> I never did get a reply to my earlier remarks on the IMS patches.
>>>
>>> The concept of a device specific addr/data table format for MSI is not
>>> Intel specific. This should be general code. We have a device that can
>>> use this kind of kernel capability today.
>>
>> This has been my concern reviewing the implementation. IMS needs more
>> than one in-tree user to validate degrees of freedom in the api. I had
>> been missing a second "in-tree user" to validate the scope of the
>> flexibility that was needed.
> 
> IMS is too narrowly specified.
> 
> All platforms that support MSI today can support IMS. It is simply a
> way for the platform to give the driver an addr/data pair that triggers
> an interrupt when a posted write is performed to that pair.
> 

Well, yes and no. IMS requires interrupt remapping in addition to the 
dynamic nature of IRQ allocation.

> This is different from the other interrupt setup flows which are
> tightly tied to the PCI layer. Here the driver should simply ask for
> interrupts.
> 
> Ie the entire IMS API to the driver should be something very simple
> like:
> 
>   struct message_irq
>   {
>     uint64_t addr;
>     uint32_t data;
>   };
> 
>   struct message_irq *request_message_irq(
>      struct device *, irq_handler_t handler, unsigned long flags,
>      const char *name, void *dev);
> 
> And the plumbing underneath should setup the irq chips and so forth as
> required.
> 

yes, this seems correct.
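
[To illustrate how a driver might consume the interface sketched above
- request_message_irq() and struct message_irq are the proposed names
from the quote, not an existing kernel API, and the mydev_* helpers are
hypothetical:]

static irqreturn_t mydev_evt_handler(int irq, void *data)
{
	/* a posted write of 'data' to 'addr' landed here as a Linux IRQ */
	return IRQ_HANDLED;
}

static int mydev_setup_irq(struct mydev *md)
{
	struct message_irq *mirq;

	mirq = request_message_irq(&md->pdev->dev, mydev_evt_handler, 0,
				   "mydev-evt", md);
	if (IS_ERR(mirq))
		return PTR_ERR(mirq);

	/* program the addr/data pair into the device-specific storage */
	mydev_write_context(md, mirq->addr, mirq->data);
	return 0;
}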
> Jason
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 07/15] Documentation: Interrupt Message store
  2020-04-23 20:04   ` Jason Gunthorpe
@ 2020-05-01 22:32     ` Dey, Megha
  2020-05-03 22:28       ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Dey, Megha @ 2020-05-01 22:32 UTC (permalink / raw)
  To: Jason Gunthorpe, Dave Jiang
  Cc: vkoul, maz, bhelgaas, rafael, gregkh, tglx, hpa, alex.williamson,
	jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, dmaengine, linux-kernel, x86, linux-pci, kvm

Hi Jason,

On 4/23/2020 1:04 PM, Jason Gunthorpe wrote:
> On Tue, Apr 21, 2020 at 04:34:30PM -0700, Dave Jiang wrote:
> 
>> diff --git a/Documentation/ims-howto.rst b/Documentation/ims-howto.rst
>> new file mode 100644
>> index 000000000000..a18de152b393
>> +++ b/Documentation/ims-howto.rst
>> @@ -0,0 +1,210 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +.. include:: <isonum.txt>
>> +
>> +==========================
>> +The IMS Driver Guide HOWTO
>> +==========================
>> +
>> +:Authors: Megha Dey
>> +
>> +:Copyright: 2020 Intel Corporation
>> +
>> +About this guide
>> +================
>> +
>> +This guide describes the basics of Interrupt Message Store (IMS), the
>> +need to introduce a new interrupt mechanism, implementation details of
>> +IMS in the kernel, driver changes required to support IMS and the general
>> +misconceptions and FAQs associated with IMS.
> 
> I'm not sure why we need to call this IMS in kernel documentat? I know
> Intel is using this term, but this document is really only talking
> about extending the existing platform_msi stuff, which looks pretty
> good actually.

hmmm, so maybe we call it something else or just say dynamic platform-msi?

> 
> A lot of this is good for the cover letter..

Well, I got a lot of comments internally and externally about how the 
cover page needs to have just the basics and all the ugly details can go 
in the Documentation. So well, I am confused here.
> 
>> +Implementation of IMS in the kernel
>> +===================================
>> +
>> +The Linux kernel today already provides a generic mechanism to support
>> +non-PCI compliant MSI interrupts for platform devices (platform-msi.c).
>> +To support IMS interrupts, we create a new IMS IRQ domain and extend the
>> +existing infrastructure. Dynamic allocation of IMS vectors is a requirement
>> +for devices which support Scalable I/O Virtualization. A driver can allocate
>> +and free vectors not just once during probe (as was the case with MSI/MSI-X)
>> +but also in the post probe phase where actual demand is available. Thus, a
>> +new API, platform_msi_domain_alloc_irqs_group is introduced which drivers
>> +using IMS would be able to call multiple times. The vectors allocated each
>> +time this API is called are associated with a group ID. To free the vectors
>> +associated with a particular group, the platform_msi_domain_free_irqs_group
>> +API can be called. The existing drivers using platform-msi infrastructure
>> +will continue to use the existing alloc (platform_msi_domain_alloc_irqs)
>> +and free (platform_msi_domain_free_irqs) APIs and are assigned a default
>> +group ID of 0.
>> +
>> +Thus, platform-msi.c provides the generic methods which can be used by any
>> +non-pci MSI interrupt type while the newly created ims-msi.c provides IMS
>> +specific callbacks that can be used by drivers capable of generating IMS
>> +interrupts.
> 
> How exactly is an IMS interrupt is different from a platform msi?
> 
> It looks like it is just some thin wrapper around msi_domain - what is
> it for?

So I think conceptually, there is no difference between platform-msi and 
IMS. (Just thinking out loud).

From a code standpoint, currently:
1. Allocation of interrupts is static. I don't think
platform_msi_domain_alloc_irqs() can be called multiple times.
2. Only a write_msg callback is present, and drivers use the parent IRQ
chip's mask/unmask functions.
3. IMS needs interrupt remapping support to be enabled (this is
independent of the above 2).

If 1 and 2 are all that you are looking for, then we can split the code
such that we have a generic platform_msi_domain_alloc_irqs_dyn, which
will be used for the dynamic allocation of IRQs, and another
platform_msi_domain_alloc_irqs_ims (or whatever name IMS boils down to)
which will use interrupt remapping support to get the IRQ domain, etc.
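
[The split being described would presumably look something like the
following - both prototypes are proposals from the paragraph above, not
existing kernel symbols:]

/*
 * _dyn is the generic piece: dynamic, multi-call vector allocation for
 * any platform-msi capable device.  _ims layers the additional
 * requirement of an interrupt-remapping capable parent IRQ domain on
 * top of it; the generic code need not know about IMS at all.
 */
int platform_msi_domain_alloc_irqs_dyn(struct device *dev, unsigned int nvec,
				       const struct platform_msi_ops *ops);

int platform_msi_domain_alloc_irqs_ims(struct device *dev, unsigned int nvec,
				       const struct platform_msi_ops *ops);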

> 
>> +FAQs and general misconceptions:
>> +================================
>> +
>> +** There were some concerns raised by Thomas Gleixner and Marc Zyngier
>> +during Linux plumbers conference 2019:
>> +
>> +1. Enumeration of IMS needs to be done by PCI core code and not by
>> +   individual device drivers:
>> +
>> +   Currently, if the kernel needs a generic way to discover IMS capability
>> +   without host driver dependency, the PCIE Designated Vendor specific
>> +
>> +   However, we cannot have a standard way of enumerating the IMS size
>> +   because for context based devices, the interrupt message is part of
>> +   the context itself which is managed entirely by the driver. Since
>> +   context creation is done on demand, there is no way to tell during boot
>> +   time, the maximum number of contexts (and hence the number of interrupt
>> +   messages) that the device can support.
> 
> FWIW, I agree with this
> 
> Like platform-msi, IMS should be controlled entirely by the driver.
yup!

> 
>> +2. Why is Intel designing a new interrupt mechanism rather than extending
>> +   MSI-X to address its limitations? Isn't 2048 device interrupts enough?
>> +
>> +   MSI-X has a rigid definition of one-table and on-device storage and does
>> +   not provide the full flexibility required for future multi-tile
>> +   accelerator designs.
>> +   IMS was envisioned to be used with large number of ADIs in devices where
>> +   each will need unique interrupt resources. For example, a DSA shared
>> +   work queue can support large number of clients where each client can
>> +   have its own interrupt. In future, with user interrupts, we expect the
>> +   demand for messages to increase further.
> 
> Generally agree
> 
ok!

>> +Device Driver Changes:
>> +=====================
>> +
>> +1. platform_msi_domain_alloc_irqs_group (struct device *dev, unsigned int
>> +   nvec, const struct platform_msi_ops *platform_ops, int *group_id)
>> +   to allocate IMS interrupts, where:
>> +
>> +   dev: The device for which to allocate interrupts
>> +   nvec: The number of interrupts to allocate
>> +   platform_ops: Callbacks for platform MSI ops (to be provided by driver)
>> +   group_id: returned by the call, to be used to free IRQs of a certain type
>> +
>> +   eg: static struct platform_msi_ops ims_ops  = {
>> +        .irq_mask               = ims_irq_mask,
>> +        .irq_unmask             = ims_irq_unmask,
>> +        .write_msg              = ims_write_msg,
>> +        };
>> +
>> +        int group;
>> +        platform_msi_domain_alloc_irqs_group (dev, nvec, platform_ops, &group)
>> +
>> +   where, struct platform_msi_ops:
>> +   irq_mask:   mask an interrupt source
>> +   irq_unmask: unmask an interrupt source
>> +   write_msg:  write message content
>> +
>> +   This API can be called multiple times. Every time a new group will be
>> +   associated with the allocated vectors. Group ID starts from 0.
> 
> Need much more closer look, but this seems conceptually fine to me.
> 
> As above the API here is called platform_msi - which seems good to
> me. Again not sure why the word IMS is needed
>

well, in this case, ims_ops, ims_mask etc are just example names.

> Jason
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-23 19:44     ` Jason Gunthorpe
@ 2020-05-01 22:32       ` Dey, Megha
  0 siblings, 0 replies; 89+ messages in thread
From: Dey, Megha @ 2020-05-01 22:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Jiang, vkoul, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm



On 4/23/2020 12:44 PM, Jason Gunthorpe wrote:
>>>> The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
>>>> interrupts for the guest. This preserves MSIX for host usages and also allows a
>>>> significantly larger number of interrupt vectors for guest usage.
>>>
>>> I never did get a reply to my earlier remarks on the IMS patches.
>>>
>>> The concept of a device specific addr/data table format for MSI is not
>>> Intel specific. This should be general code. We have a device that can
>>> use this kind of kernel capability today.
>>
>> I am sorry if I did not address your comments earlier.
> 
> It appears nobody from Intel bothered to answer anyone else on that RFC
> thread:
> 
> https://lore.kernel.org/lkml/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/
> 
> However, it seems kind of moot as I see now that this version of IMS
> bears almost no resemblance to the original RFC.

hmm yeah, we changed most of the code after getting a lot of feedback 
from you and folks at plumbers. But yes, I should have replied to all 
the feedback, lesson learnt :)

> 
> That said, the similarity to platform-msi was striking, does this new
> version harmonize with that?

yes!
> 
>> The present IMS code is quite generic, most of the code is in the drivers/
>> folder. We basically introduce 2 APIs: allocate and free IMS interrupts and
>> an IMS IRQ domain to allocate these interrupts from. These APIs are
>> architecture agnostic.
>>
>> We also introduce a new IMS IRQ domain which is architecture specific. This
>> is because IMS generates interrupts only in the remappable format, hence
>> interrupt remapping should be enabled for IMS. Currently, the interrupt
>> remapping code is only available for Intel and AMD and I don’t see anything
>> for ARM.
> 
> I don't understand these remarks though - IMS is simply the mapping of
> a MemWr addr/data pair to a Linux IRQ number? Why does this intersect
> with remapping?
> 

From your comments so far, I think your requirement is a subset of what
IMS is trying to do.

What you want:
have a dynamic means of allocating platform-msi interrupts

On top of this IMS has a requirement that all of the interrupts should 
be remapped.

So we can have tiered code: a generic dynamic platform-msi
infrastructure, with the IMS-specific bits (Intel specific) added on
top of it.

The generic code will have no reference to IMS.

> AFAIK, any platform that supports MSI today should have the inherent
> HW capability to support IMS.
> 
>> Also, could you give more details on the device that could use IMS? Do you
>> have some driver code already? We could then see if and how the current IMS
>> code could be made more generic.
> 
> We have several devices of interest, our NICs have very flexible PCI,
> so it is no problem to take the MemWR addr/data from someplace other
> than the MSI tables.
> 
> For this we want to have some way to allocate Linux IRQs dynamically
> and get an addr/data pair to trigger them.
> 
> Our NIC devices are also linked to our ARM SOC family, so I'd expect
> our ARM's to also be able to provide these APIs as the platform.

cool, so I hope that you can test out the generic APIs from the ARM
side!
> 
> Jason
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-05-01 22:31         ` Dey, Megha
@ 2020-05-03 22:21           ` Jason Gunthorpe
  2020-05-03 22:32             ` Dey, Megha
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-05-03 22:21 UTC (permalink / raw)
  To: Dey, Megha
  Cc: Dan Williams, Dave Jiang, Vinod Koul, maz, Bjorn Helgaas,
	Rafael J. Wysocki, Greg KH, Thomas Gleixner, H. Peter Anvin,
	Alex Williamson, Jacob jun Pan, Raj, Ashok, Yi L Liu, Baolu Lu,
	Tian, Kevin, Sanjay K Kumar, Luck, Tony, Jing Lin, kwankhede,
	eric.auger, parav, dmaengine, Linux Kernel Mailing List, X86 ML,
	linux-pci, KVM list

On Fri, May 01, 2020 at 03:31:14PM -0700, Dey, Megha wrote:
> Hi Jason,
> 
> On 4/23/2020 12:49 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 23, 2020 at 12:17:50PM -0700, Dan Williams wrote:
> > 
> > > Per Megha's follow-up can you send the details about that other device
> > > and help clear a path for a device-specific MSI addr/data table
> > > format. Ever since HMM I've been sensitive, perhaps overly-sensitive,
> > > to claims about future upstream users. The fact that you have an
> > > additional use case is golden for pushing this into a common area and
> > > validating the scope of the proposed API.
> > 
> > I think I said it at plumbers, but yes, we are interested in this, and
> > would like dynamic MSI-like interrupts available to the driver (what
> > Intel calls IMS)
> > 
> 
> So basically you are looking for a way to dynamically allocate the
> platform-msi interrupts, correct?

The basic high level interface here seems fine: it is basically a way
for a driver to grab a bunch of platform-msi interrupts for its own
use.
 
> Since I don't have access to any of the platform-msi devices, it is hard for
> me to test this code for other drivers except idxd for now.
> Once I submit the next round of patches, after addressing all the comments,
> would it be possible for you to test this code for any of your devices?

Possibly, need to find time

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-05-01 22:31       ` Dey, Megha
@ 2020-05-03 22:22         ` Jason Gunthorpe
  2020-05-03 22:31           ` Dey, Megha
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-05-03 22:22 UTC (permalink / raw)
  To: Dey, Megha
  Cc: Dan Williams, Dave Jiang, Vinod Koul, maz, Bjorn Helgaas,
	Rafael J. Wysocki, Greg KH, Thomas Gleixner, H. Peter Anvin,
	Alex Williamson, Jacob jun Pan, Raj, Ashok, Yi L Liu, baolu.lu,
	Tian, Kevin, Sanjay K Kumar, Luck, Tony, Jing Lin, kwankhede,
	eric.auger, parav, dmaengine, Linux Kernel Mailing List, X86 ML,
	linux-pci, KVM list

On Fri, May 01, 2020 at 03:31:51PM -0700, Dey, Megha wrote:
> > > This has been my concern reviewing the implementation. IMS needs more
> > > than one in-tree user to validate degrees of freedom in the api. I had
> > > been missing a second "in-tree user" to validate the scope of the
> > > flexibility that was needed.
> > 
> > IMS is too narrowly specified.
> > 
> > All platforms that support MSI today can support IMS. It is simply a
> > way for the platform to give the driver an addr/data pair that triggers
> > an interrupt when a posted write is performed to that pair.
> > 
> 
> Well, yes and no. IMS requires interrupt remapping in addition to the
> dynamic nature of IRQ allocation.

You've mentioned remapping a few times, but I really can't understand
why it has anything to do with platform_msi or IMS..

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain
  2020-05-01 22:30     ` Dey, Megha
@ 2020-05-03 22:25       ` Jason Gunthorpe
  2020-05-03 22:40         ` Dey, Megha
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-05-03 22:25 UTC (permalink / raw)
  To: Dey, Megha
  Cc: Dave Jiang, vkoul, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm

On Fri, May 01, 2020 at 03:30:02PM -0700, Dey, Megha wrote:
> Hi Jason,
> 
> On 4/23/2020 1:11 PM, Jason Gunthorpe wrote:
> > On Tue, Apr 21, 2020 at 04:34:11PM -0700, Dave Jiang wrote:
> > > diff --git a/drivers/base/ims-msi.c b/drivers/base/ims-msi.c
> > > new file mode 100644
> > > index 000000000000..738f6d153155
> > > +++ b/drivers/base/ims-msi.c
> > > @@ -0,0 +1,100 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/*
> > > + * Support for Device Specific IMS interrupts.
> > > + *
> > > + * Copyright © 2019 Intel Corporation.
> > > + *
> > > + * Author: Megha Dey <megha.dey@intel.com>
> > > + */
> > > +
> > > +#include <linux/dmar.h>
> > > +#include <linux/irq.h>
> > > +#include <linux/mdev.h>
> > > +#include <linux/pci.h>
> > > +
> > > +/*
> > > + * Determine if a dev is mdev or not. Return NULL if not mdev device.
> > > + * Return mdev's parent dev if success.
> > > + */
> > > +static inline struct device *mdev_to_parent(struct device *dev)
> > > +{
> > > +	struct device *ret = NULL;
> > > +	struct device *(*fn)(struct device *dev);
> > > +	struct bus_type *bus = symbol_get(mdev_bus_type);
> > > +
> > > +	if (bus && dev->bus == bus) {
> > > +		fn = symbol_get(mdev_dev_to_parent_dev);
> > > +		ret = fn(dev);
> > > +		symbol_put(mdev_dev_to_parent_dev);
> > > +		symbol_put(mdev_bus_type);
> > 
> > No, things like this are not OK in the drivers/base
> > 
> > Whatever this is doing needs to be properly architected in some
> > generic way.
> 
> Basically what I am trying to do here is to determine if the device is an
> mdev device or not.

Why? mdev devices are virtual; they don't have HW elements.

The caller should use the concrete pci_device to allocate
platform_msi? What is preventing this?
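
[Presumably something along these lines - a sketch of the suggested
alternative; mdev_parent_dev() and platform_msi_domain_alloc_irqs() are
existing interfaces, while idxd_ims_write_msg is a hypothetical
callback:]

/*
 * The mdev-aware caller resolves the parent physical device itself,
 * so the IRQ infrastructure never has to know what an mdev is.
 */
static int vidxd_setup_ims(struct mdev_device *mdev, unsigned int nvec)
{
	struct device *dev = mdev_parent_dev(mdev);	/* the real PCI dev */

	return platform_msi_domain_alloc_irqs(dev, nvec, idxd_ims_write_msg);
}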

> > > +struct irq_domain *arch_create_ims_irq_domain(struct irq_domain *parent,
> > > +					      const char *name)
> > > +{
> > > +	struct fwnode_handle *fn;
> > > +	struct irq_domain *domain;
> > > +
> > > +	fn = irq_domain_alloc_named_fwnode(name);
> > > +	if (!fn)
> > > +		return NULL;
> > > +
> > > +	domain = msi_create_irq_domain(fn, &ims_ir_domain_info, parent);
> > > +	if (!domain)
> > > +		return NULL;
> > > +
> > > +	irq_domain_update_bus_token(domain, DOMAIN_BUS_PLATFORM_MSI);
> > > +	irq_domain_free_fwnode(fn);
> > > +
> > > +	return domain;
> > > +}
> > 
> > I'm still not really clear why all this is called IMS.. This looks
> > like the normal boilerplate to setup an IRQ domain? What is actually
> > 'ims' in here?
> 
> It is just a way to create a new domain specifically for IMS interrupts.
> Although, since there is a platform_msi_create_irq_domain already, which
> does something similar, I will use the same for IMS as well.

But this is all code already intended to be used by the platform, why
is it in drivers/base?

> Also, since there is quite a stir over the name 'IMS' do you have any
> suggestion for a more generic name for this?

It seems we have a name, this is called platform_msi in Linux?

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 07/15] Documentation: Interrupt Message store
  2020-05-01 22:32     ` Dey, Megha
@ 2020-05-03 22:28       ` Jason Gunthorpe
  2020-05-03 22:41         ` Dey, Megha
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-05-03 22:28 UTC (permalink / raw)
  To: Dey, Megha
  Cc: Dave Jiang, vkoul, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm

On Fri, May 01, 2020 at 03:32:22PM -0700, Dey, Megha wrote:
> Hi Jason,
> 
> On 4/23/2020 1:04 PM, Jason Gunthorpe wrote:
> > On Tue, Apr 21, 2020 at 04:34:30PM -0700, Dave Jiang wrote:
> > 
> > > diff --git a/Documentation/ims-howto.rst b/Documentation/ims-howto.rst
> > > new file mode 100644
> > > index 000000000000..a18de152b393
> > > +++ b/Documentation/ims-howto.rst
> > > @@ -0,0 +1,210 @@
> > > +.. SPDX-License-Identifier: GPL-2.0
> > > +.. include:: <isonum.txt>
> > > +
> > > +==========================
> > > +The IMS Driver Guide HOWTO
> > > +==========================
> > > +
> > > +:Authors: Megha Dey
> > > +
> > > +:Copyright: 2020 Intel Corporation
> > > +
> > > +About this guide
> > > +================
> > > +
> > > +This guide describes the basics of Interrupt Message Store (IMS), the
> > > +need to introduce a new interrupt mechanism, implementation details of
> > > +IMS in the kernel, driver changes required to support IMS and the general
> > > +misconceptions and FAQs associated with IMS.
> > 
> > I'm not sure why we need to call this IMS in kernel documentation? I know
> > Intel is using this term, but this document is really only talking
> > about extending the existing platform_msi stuff, which looks pretty
> > good actually.
> 
> hmmm, so maybe we call it something else or just say dynamic platform-msi?
> 
> > 
> > A lot of this is good for the cover letter..
> 
> Well, I got a lot of comments internally and externally about how the cover
> page needs to have just the basics and all the ugly details can go in the
> Documentation. So well, I am confused here.

Documentation should be documentation for users and developers.

Justification and rationale for why functionality should be merged
belong in the commit message and cover letter, IMHO.

Here too much time is spent belabouring IMS's rationale and not enough
is spent explaining how a driver should consume it or how a platform
should provide it.

And since most of this is tightly related to platform-msi it might make
sense to start by documenting platform msi and then adding a diff on that
to explain what change is being made to accommodate IMS.

Most likely few people are very familiar with platform-msi in the
first place..

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-05-03 22:22         ` Jason Gunthorpe
@ 2020-05-03 22:31           ` Dey, Megha
  2020-05-03 22:36             ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Dey, Megha @ 2020-05-03 22:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Dave Jiang, Vinod Koul, maz, Bjorn Helgaas,
	Rafael J. Wysocki, Greg KH, Thomas Gleixner, H. Peter Anvin,
	Alex Williamson, Jacob jun Pan, Raj, Ashok, Yi L Liu, baolu.lu,
	Tian, Kevin, Sanjay K Kumar, Luck, Tony, Jing Lin, kwankhede,
	eric.auger, parav, dmaengine, Linux Kernel Mailing List, X86 ML,
	linux-pci, KVM list


Hi Jason,

On 5/3/2020 3:22 PM, Jason Gunthorpe wrote:
> On Fri, May 01, 2020 at 03:31:51PM -0700, Dey, Megha wrote:
>>>> This has been my concern reviewing the implementation. IMS needs more
>>>> than one in-tree user to validate degrees of freedom in the api. I had
>>>> been missing a second "in-tree user" to validate the scope of the
>>>> flexibility that was needed.
>>>
>>> IMS is too narrowly specified.
>>>
>>> All platforms that support MSI today can support IMS. It is simply a
>>> way for the platform to give the driver an addr/data pair that triggers
>>> an interrupt when a posted write is performed to that pair.
>>>
>>
>> Well, yes and no. IMS requires interrupt remapping in addition to the
>> dynamic nature of IRQ allocation.
> 
> You've mentioned remapping a few times, but I really can't understand
> why it has anything to do with platform_msi or IMS..

So after some internal discussions, we have concluded that IMS has no 
linkage with Interrupt Remapping; IR is just a platform concept. IMS is 
just a name Intel came up with; all it really means is device-managed 
addr/data writes to generate interrupts. Technically we can call 
something IMS even if the device has its own non-PCI-standard location 
to store interrupts, much like platform-msi indeed. We simply need to 
extend platform-msi to address some of its shortcomings: increase the 
number of interrupts to > 2048, enable dynamic allocation of 
interrupts, and add mask/unmask callbacks in addition to write_msg.
FWIW, even MSI can be considered IMS, with rules on how to manage the 
addr/data writes following the PCI SIG spec.
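
Roughly, the ops structure could grow along these lines (just a sketch 
to illustrate the direction; the mask/unmask callbacks mirror what 
patch 05 of this series adds, though there they currently return a u32 
which we may drop):

struct platform_msi_ops {
	/* existing: program the addr/data pair into device storage */
	void	(*write_msg)(struct msi_desc *desc, struct msi_msg *msg);
	/* proposed: device-specific masking of a single vector */
	void	(*irq_mask)(struct msi_desc *desc);
	void	(*irq_unmask)(struct msi_desc *desc);
};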

I will be sending out an email shortly outlining the new design for IMS 
(a.k.a. platform-msi part 2) and the improvements we want to add to the 
already existing platform-msi infrastructure.

Thank you so much for your comments, they helped us iron out some of 
these details :)

> 
> Jason
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-05-03 22:21           ` Jason Gunthorpe
@ 2020-05-03 22:32             ` Dey, Megha
  0 siblings, 0 replies; 89+ messages in thread
From: Dey, Megha @ 2020-05-03 22:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Dave Jiang, Vinod Koul, maz, Bjorn Helgaas,
	Rafael J. Wysocki, Greg KH, Thomas Gleixner, H. Peter Anvin,
	Alex Williamson, Jacob jun Pan, Raj, Ashok, Yi L Liu, Baolu Lu,
	Tian, Kevin, Sanjay K Kumar, Luck, Tony, Jing Lin, kwankhede,
	eric.auger, parav, dmaengine, Linux Kernel Mailing List, X86 ML,
	linux-pci, KVM list

Hi Jason,

On 5/3/2020 3:21 PM, Jason Gunthorpe wrote:
> On Fri, May 01, 2020 at 03:31:14PM -0700, Dey, Megha wrote:
>> Hi Jason,
>>
>> On 4/23/2020 12:49 PM, Jason Gunthorpe wrote:
>>> On Thu, Apr 23, 2020 at 12:17:50PM -0700, Dan Williams wrote:
>>>
>>>> Per Megha's follow-up can you send the details about that other device
>>>> and help clear a path for a device-specific MSI addr/data table
>>>> format. Ever since HMM I've been sensitive, perhaps overly-sensitive,
>>>> to claims about future upstream users. The fact that you have an
>>>> additional use case is golden for pushing this into a common area and
>>>> validating the scope of the proposed API.
>>>
>>> I think I said it at plumbers, but yes, we are interested in this, and
>>> would like dynamic MSI-like interrupts available to the driver (what
>>> Intel calls IMS)
>>>
>>
>> So basically you are looking for a way to dynamically allocate the
>> platform-msi interrupts, correct?
> 
The basic high level interface here seems fine, which is basically a
way for a driver to grab a bunch of platform-msi interrupts for its
own use

ok!
>   
>> Since I don't have access to any of the platform-msi devices, it is hard for
>> me to test this code for other drivers expect idxd for now.
>> Once I submit the next round of patches, after addressing all the comments,
>> would it be possible for you to test this code for any of your devices?
> 
> Possibly, need to find time
> 
Sure, thanks!

> Jason
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-05-03 22:31           ` Dey, Megha
@ 2020-05-03 22:36             ` Jason Gunthorpe
  2020-05-04  0:20               ` Dey, Megha
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-05-03 22:36 UTC (permalink / raw)
  To: Dey, Megha
  Cc: Dan Williams, Dave Jiang, Vinod Koul, maz, Bjorn Helgaas,
	Rafael J. Wysocki, Greg KH, Thomas Gleixner, H. Peter Anvin,
	Alex Williamson, Jacob jun Pan, Raj, Ashok, Yi L Liu, baolu.lu,
	Tian, Kevin, Sanjay K Kumar, Luck, Tony, Jing Lin, kwankhede,
	eric.auger, parav, dmaengine, Linux Kernel Mailing List, X86 ML,
	linux-pci, KVM list

On Sun, May 03, 2020 at 03:31:39PM -0700, Dey, Megha wrote:
> 
> Hi Jason,
> 
> On 5/3/2020 3:22 PM, Jason Gunthorpe wrote:
> > On Fri, May 01, 2020 at 03:31:51PM -0700, Dey, Megha wrote:
> > > > > This has been my concern reviewing the implementation. IMS needs more
> > > > > than one in-tree user to validate degrees of freedom in the api. I had
> > > > > been missing a second "in-tree user" to validate the scope of the
> > > > > flexibility that was needed.
> > > > 
> > > > IMS is too narrowly specified.
> > > > 
> > > > All platforms that support MSI today can support IMS. It is simply a
> > > > way for the platform to give the driver an addr/data pair that triggers
> > > > an interrupt when a posted write is performed to that pair.
> > > > 
> > > 
> > > Well, yes and no. IMS requires interrupt remapping in addition to the
> > > dynamic nature of IRQ allocation.
> > 
> > You've mentioned remapping a few times, but I really can't understand
> > why it has anything to do with platform_msi or IMS..
> 
> So after some internal discussions, we have concluded that IMS has no
> linkage with Interrupt remapping, IR is just a platform concept. IMS is just
> a name Intel came up with, all it really means is device managed addr/data
> writes to generate interrupts. Technically we can call something IMS even if
> device has its own location to store interrupts in non-pci standard
> mechanism, much like platform-msi indeed. We simply need to extend
> platform-msi to address some of its shortcomings: increase number of
> interrupts to > 2048, enable dynamic allocation of interrupts, add
> mask/unmask callbacks in addition to write_msg etc.

Sounds right to me

Presumably you still need a way for the driver, eg vfio, to ensure an
MSI is remappable, but shouldn't that be exactly the same way as done
in normal PCI MSI today?

> FWIW, even MSI can be IMS with rules on how to manage the addr/data writes
> following pci sig .. its just that.

Yep, IMHO, our whole handling of MSI is very un-general sometimes..

I thought the msi_domain stuff that some platforms are using is a way
to improve on that? You might find that updating x86 to use msi_domain
might be helpful in this project???

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain
  2020-05-03 22:25       ` Jason Gunthorpe
@ 2020-05-03 22:40         ` Dey, Megha
  2020-05-03 22:46           ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Dey, Megha @ 2020-05-03 22:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Jiang, vkoul, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm

Hi Jason,

On 5/3/2020 3:25 PM, Jason Gunthorpe wrote:
> On Fri, May 01, 2020 at 03:30:02PM -0700, Dey, Megha wrote:
>> Hi Jason,
>>
>> On 4/23/2020 1:11 PM, Jason Gunthorpe wrote:
>>> On Tue, Apr 21, 2020 at 04:34:11PM -0700, Dave Jiang wrote:
>>>> diff --git a/drivers/base/ims-msi.c b/drivers/base/ims-msi.c
>>>> new file mode 100644
>>>> index 000000000000..738f6d153155
>>>> +++ b/drivers/base/ims-msi.c
>>>> @@ -0,0 +1,100 @@
>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>> +/*
>>>> + * Support for Device Specific IMS interrupts.
>>>> + *
>>>> + * Copyright © 2019 Intel Corporation.
>>>> + *
>>>> + * Author: Megha Dey <megha.dey@intel.com>
>>>> + */
>>>> +
>>>> +#include <linux/dmar.h>
>>>> +#include <linux/irq.h>
>>>> +#include <linux/mdev.h>
>>>> +#include <linux/pci.h>
>>>> +
>>>> +/*
>>>> + * Determine if a dev is mdev or not. Return NULL if not mdev device.
>>>> + * Return mdev's parent dev if success.
>>>> + */
>>>> +static inline struct device *mdev_to_parent(struct device *dev)
>>>> +{
>>>> +	struct device *ret = NULL;
>>>> +	struct device *(*fn)(struct device *dev);
>>>> +	struct bus_type *bus = symbol_get(mdev_bus_type);
>>>> +
>>>> +	if (bus && dev->bus == bus) {
>>>> +		fn = symbol_get(mdev_dev_to_parent_dev);
>>>> +		ret = fn(dev);
>>>> +		symbol_put(mdev_dev_to_parent_dev);
>>>> +		symbol_put(mdev_bus_type);
>>>
>>> No, things like this are not OK in the drivers/base
>>>
>>> Whatever this is doing needs to be properly architected in some
>>> generic way.
>>
>> Basically what I am trying to do here is to determine if the device is an
>> mdev device or not.
> 
> Why? mdev devices are virtual; they don't have HW elements.

Hmm yeah exactly, since they are virtual, they do not have an associated 
IRQ domain right? So they use the irq domain of the parent device..

> 
> The caller should use the concrete pci_device to allocate
> platform_msi? What is preventing this?

hmmm, do you mean to say all platform-msi adhere to the rules of a PCI 
device? The use case is when we have a device assigned to a guest and we 
want to allocate IMS (platform-msi) interrupts for that guest-assigned 
device. Currently, this is abstracted through a mdev interface.

> 
>>>> +struct irq_domain *arch_create_ims_irq_domain(struct irq_domain *parent,
>>>> +					      const char *name)
>>>> +{
>>>> +	struct fwnode_handle *fn;
>>>> +	struct irq_domain *domain;
>>>> +
>>>> +	fn = irq_domain_alloc_named_fwnode(name);
>>>> +	if (!fn)
>>>> +		return NULL;
>>>> +
>>>> +	domain = msi_create_irq_domain(fn, &ims_ir_domain_info, parent);
>>>> +	if (!domain)
>>>> +		return NULL;
>>>> +
>>>> +	irq_domain_update_bus_token(domain, DOMAIN_BUS_PLATFORM_MSI);
>>>> +	irq_domain_free_fwnode(fn);
>>>> +
>>>> +	return domain;
>>>> +}
>>>
>>> I'm still not really clear why all this is called IMS.. This looks
>>> like the normal boilerplate to setup an IRQ domain? What is actually
>>> 'ims' in here?
>>
>> It is just a way to create a new domain specifically for IMS interrupts.
>> Although, since there is a platform_msi_create_irq_domain already, which
>> does something similar, I will use the same for IMS as well.
> 
> But this is all code already intended to be used by the platform, why
> is it in drivers/base?

yeah this code will not exist in the next version anyways..
> 
>> Also, since there is quite a stir over the name 'IMS' do you have any
>> suggestion for a more generic name for this?
> 
> It seems we have a name, this is called platform_msi in Linux?

yeah, ultimately it looks like IMS boils down to "extended platform_msi" ..
> 
> Jason
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 07/15] Documentation: Interrupt Message store
  2020-05-03 22:28       ` Jason Gunthorpe
@ 2020-05-03 22:41         ` Dey, Megha
  0 siblings, 0 replies; 89+ messages in thread
From: Dey, Megha @ 2020-05-03 22:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Jiang, vkoul, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm



On 5/3/2020 3:28 PM, Jason Gunthorpe wrote:
> On Fri, May 01, 2020 at 03:32:22PM -0700, Dey, Megha wrote:
>> Hi Jason,
>>
>> On 4/23/2020 1:04 PM, Jason Gunthorpe wrote:
>>> On Tue, Apr 21, 2020 at 04:34:30PM -0700, Dave Jiang wrote:
>>>
>>>> diff --git a/Documentation/ims-howto.rst b/Documentation/ims-howto.rst
>>>> new file mode 100644
>>>> index 000000000000..a18de152b393
>>>> +++ b/Documentation/ims-howto.rst
>>>> @@ -0,0 +1,210 @@
>>>> +.. SPDX-License-Identifier: GPL-2.0
>>>> +.. include:: <isonum.txt>
>>>> +
>>>> +==========================
>>>> +The IMS Driver Guide HOWTO
>>>> +==========================
>>>> +
>>>> +:Authors: Megha Dey
>>>> +
>>>> +:Copyright: 2020 Intel Corporation
>>>> +
>>>> +About this guide
>>>> +================
>>>> +
>>>> +This guide describes the basics of Interrupt Message Store (IMS), the
>>>> +need to introduce a new interrupt mechanism, implementation details of
>>>> +IMS in the kernel, driver changes required to support IMS and the general
>>>> +misconceptions and FAQs associated with IMS.
>>>
>>> I'm not sure why we need to call this IMS in kernel documentation? I know
>>> Intel is using this term, but this document is really only talking
>>> about extending the existing platform_msi stuff, which looks pretty
>>> good actually.
>>
>> hmmm, so maybe we call it something else or just say dynamic platform-msi?
>>
>>>
>>> A lot of this is good for the cover letter..
>>
>> Well, I got a lot of comments internally and externally about how the cover
>> page needs to have just the basics and all the ugly details can go in the
>> Documentation. So well, I am confused here.
> 
> Documentation should be documentation for users and developers.
> 
> Justification and rationale for why functionality should be merged
> belong in the commit message and cover letter, IMHO.
> 
> Here too much time is spent belabouring IMS's rationale and not enough
> is spent explaining how a driver should consume it or how a platform
> should provide it.
> 
> And since most of this is tightly related to platform-msi it might make
> sense to start by documenting platform msi and then adding a diff on that
> to explain what change is being made to accommodate IMS.
> 
> Most likely few people are very familiar with platform-msi in the
> first place..

Ok makes sense, will rework this in the next version..
> 
> Jason
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain
  2020-05-03 22:40         ` Dey, Megha
@ 2020-05-03 22:46           ` Jason Gunthorpe
  2020-05-04  0:25             ` Dey, Megha
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-05-03 22:46 UTC (permalink / raw)
  To: Dey, Megha
  Cc: Dave Jiang, vkoul, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm

On Sun, May 03, 2020 at 03:40:44PM -0700, Dey, Megha wrote:
> On 5/3/2020 3:25 PM, Jason Gunthorpe wrote:
> > On Fri, May 01, 2020 at 03:30:02PM -0700, Dey, Megha wrote:
> > > Hi Jason,
> > > 
> > > On 4/23/2020 1:11 PM, Jason Gunthorpe wrote:
> > > > On Tue, Apr 21, 2020 at 04:34:11PM -0700, Dave Jiang wrote:
> > > > > diff --git a/drivers/base/ims-msi.c b/drivers/base/ims-msi.c
> > > > > new file mode 100644
> > > > > index 000000000000..738f6d153155
> > > > > +++ b/drivers/base/ims-msi.c
> > > > > @@ -0,0 +1,100 @@
> > > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > > +/*
> > > > > + * Support for Device Specific IMS interrupts.
> > > > > + *
> > > > > + * Copyright © 2019 Intel Corporation.
> > > > > + *
> > > > > + * Author: Megha Dey <megha.dey@intel.com>
> > > > > + */
> > > > > +
> > > > > +#include <linux/dmar.h>
> > > > > +#include <linux/irq.h>
> > > > > +#include <linux/mdev.h>
> > > > > +#include <linux/pci.h>
> > > > > +
> > > > > +/*
> > > > > + * Determine if a dev is mdev or not. Return NULL if not mdev device.
> > > > > + * Return mdev's parent dev if success.
> > > > > + */
> > > > > +static inline struct device *mdev_to_parent(struct device *dev)
> > > > > +{
> > > > > +	struct device *ret = NULL;
> > > > > +	struct device *(*fn)(struct device *dev);
> > > > > +	struct bus_type *bus = symbol_get(mdev_bus_type);
> > > > > +
> > > > > +	if (bus && dev->bus == bus) {
> > > > > +		fn = symbol_get(mdev_dev_to_parent_dev);
> > > > > +		ret = fn(dev);
> > > > > +		symbol_put(mdev_dev_to_parent_dev);
> > > > > +		symbol_put(mdev_bus_type);
> > > > 
> > > > No, things like this are not OK in the drivers/base
> > > > 
> > > > Whatever this is doing needs to be properly architected in some
> > > > generic way.
> > > 
> > > Basically what I am trying to do here is to determine if the device is an
> > > mdev device or not.
> > 
> > Why? mdev devices are virtual; they don't have HW elements.
> 
> Hmm yeah exactly, since they are virtual, they do not have an associated IRQ
> domain right? So they use the irq domain of the parent device..
> 
> > 
> > The caller should use the concrete pci_device to allocate
> > platform_msi? What is preventing this?
> 
> hmmm do you mean to say all platform-msi adhere to the rules of a PCI
> device? 

I mean where a platform-msi can work should be defined by the arch,
and probably is related to things like having an irq_domain attached

So, like pci, drivers must only try to do platform_msi stuff on
particular devices, e.g. on pci_device and platform_device types.

Even so it may not even work, but I can't think of any reason why it
should be made to work on a virtual device like mdev.

> The use case is when we have a device assigned to a guest and we
> want to allocate IMS (platform-msi) interrupts for that
> guest-assigned device. Currently, this is abstracted through a mdev
> interface.

And the mdev has the pci_device internally, so it should simply pass
that pci_device to the platform_msi machinery.

This is no different from something like pci_iomap() which must be
used with the pci_device.
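
Something like this sketch, i.e. the mdev layer resolves its parent and 
hands the real device to the MSI machinery (idxd_ims_ops here is just a 
placeholder for whatever ops the driver registers, using this series'
ops-based allocation signature):

/* sketch: allocate IMS vectors against the concrete parent device */
static int idxd_mdev_alloc_ims(struct mdev_device *mdev, unsigned int nvec)
{
	struct device *parent = mdev_parent_dev(mdev);

	return platform_msi_domain_alloc_irqs(parent, nvec, &idxd_ims_ops);
}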

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 02/15] drivers/base: Introduce a new platform-msi list
  2020-04-25 21:13   ` Thomas Gleixner
@ 2020-05-04  0:08     ` Dey, Megha
  0 siblings, 0 replies; 89+ messages in thread
From: Dey, Megha @ 2020-05-04  0:08 UTC (permalink / raw)
  To: Thomas Gleixner, Dave Jiang, vkoul, maz, bhelgaas, rafael,
	gregkh, hpa, alex.williamson, jacob.jun.pan, ashok.raj, jgg,
	yi.l.liu, baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck,
	jing.lin, dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

Hi Thomas,

On 4/25/2020 2:13 PM, Thomas Gleixner wrote:
> Dave Jiang <dave.jiang@intel.com> writes:
> 
>> From: Megha Dey <megha.dey@linux.intel.com>
>>
>> This is a preparatory patch to introduce Interrupt Message Store (IMS).
>>
>> The struct device has a linked list ('msi_list') of the MSI (msi/msi-x,
>> platform-msi) descriptors of that device. This list holds only 1 type
>> of descriptor since it is not possible for a device to support more
>> than one of these descriptors concurrently.
>>
>> However, with the introduction of IMS, a device can support IMS as well
>> as MSI-X at the same time. Instead of sharing this list between IMS (a
>> type of platform-msi) and MSI-X descriptors, introduce a new linked list,
>> platform_msi_list, which will hold all the platform-msi descriptors.
>>
>> Thus, msi_list will point to the MSI/MSIX descriptors of a device, while
>> platform_msi_list will point to the platform-msi descriptors of a
>> device.
> 
> Will point?
> 

I meant to say msi_list will be the list head for the MSI/MSI-X 
descriptors whereas platform_msi_list will be the list head for all the 
platform-msi descriptors.

> You're failing to explain that this actually converts the existing
> platform code over to this new list. This also lacks an explanation why
> this is not a functional change.

Hmm yeah makes sense. I will add these details in the next version.

> 
>> Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
> 
> Lacks an SOB from you....

Yeah, will be added in the next version.

> 
>> diff --git a/drivers/base/core.c b/drivers/base/core.c
>> index 139cdf7e7327..5a0116d1a8d0 100644
>> --- a/drivers/base/core.c
>> +++ b/drivers/base/core.c
>> @@ -1984,6 +1984,7 @@ void device_initialize(struct device *dev)
>>   	set_dev_node(dev, -1);
>>   #ifdef CONFIG_GENERIC_MSI_IRQ
>>   	INIT_LIST_HEAD(&dev->msi_list);
>> +	INIT_LIST_HEAD(&dev->platform_msi_list);
> 
>> --- a/drivers/base/platform-msi.c
>> +++ b/drivers/base/platform-msi.c
>> @@ -110,7 +110,8 @@ static void platform_msi_free_descs(struct device *dev, int base, int nvec)
>>   {
>>   	struct msi_desc *desc, *tmp;
>>   
>> -	list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
>> +	list_for_each_entry_safe(desc, tmp, dev_to_platform_msi_list(dev),
>> +				 list) {
>>   		if (desc->platform.msi_index >= base &&
>>   		    desc->platform.msi_index < (base + nvec)) {
>>   			list_del(&desc->list);
>>   	datap = kzalloc(sizeof(*datap), GFP_KERNEL);
>> @@ -255,6 +256,8 @@ int platform_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
>>   	struct platform_msi_priv_data *priv_data;
>>   	int err;
>>   
>> +	dev->platform_msi_type = GEN_PLAT_MSI;
> 
> What the heck is GEN_PLAT_MSI? Can you please use
> 
>     1) A proper name space starting with PLATFORM_MSI_ or such
> 
>     2) A proper suffix which is self explaining.
> 
> instead of coming up with nonsensical garbage which even lacks any
> explanation at the place where it is defined.

So basically, I wanted to differentiate between the existing 
platform-msi interrupts (GEN_PLAT_MSI) and the IMS interrupts.

But sure, I will try to come up with a more sensible name, 
PLATFORM_MSI_STATIC/DYNAMIC perhaps?
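
That is, something along these lines (purely illustrative, names not 
final):

enum platform_msi_type {
	PLATFORM_MSI_NONE,	/* no platform-msi descriptors */
	PLATFORM_MSI_STATIC,	/* existing platform-msi behaviour */
	PLATFORM_MSI_DYNAMIC,	/* dynamically allocated (IMS-style) */
};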

> 
>> diff --git a/include/linux/device.h b/include/linux/device.h
>> index ac8e37cd716a..cbcecb14584e 100644
>> --- a/include/linux/device.h
>> +++ b/include/linux/device.h
>> @@ -567,6 +567,8 @@ struct device {
>>   #endif
>>   #ifdef CONFIG_GENERIC_MSI_IRQ
>>   	struct list_head	msi_list;
>> +	struct list_head	platform_msi_list;
>> +	unsigned int		platform_msi_type;
> 
> You use an enum for the types so why are you not using an enum for the
> struct member which stores it?

Ok, will change this in the next version.

> 
>>   
>> +/**
>> + * list_entry_select - get the correct struct for this entry based on condition
>> + * @condition:	the condition to choose a particular &struct list head pointer
>> + * @ptr_a:      the &struct list_head pointer if @condition is not met.
>> + * @ptr_b:      the &struct list_head pointer if @condition is met.
>> + * @type:       the type of the struct this is embedded in.
>> + * @member:     the name of the list_head within the struct.
>> + */
>> +#define list_entry_select(condition, ptr_a, ptr_b, type, member)\
>> +	(condition) ? list_entry(ptr_a, type, member) :		\
>> +		      list_entry(ptr_b, type, member)
> 
> This is related to $Subject in which way? It's not a entirely new
> process rule that infrastructure changes which touch a completely
> different subsystem have to be separate and explained and justified on
> their own.

True, this should be an independent change; I will add it as a separate 
patch next time.

> 
>>   
>> +enum platform_msi_type {
>> +	NOT_PLAT_MSI = 0,
> 
> NOT_PLAT_MSI? Not used anywhere and of course equally self explaining as
> the other one.

Ya, this seems unnecessary, will remove it.

> 
>> +	GEN_PLAT_MSI = 1,
>> +};
>> +
>>   /* Helpers to hide struct msi_desc implementation details */
>>   #define msi_desc_to_dev(desc)		((desc)->dev)
>>   #define dev_to_msi_list(dev)		(&(dev)->msi_list)
>> @@ -140,6 +145,22 @@ struct msi_desc {
>>   #define for_each_msi_entry_safe(desc, tmp, dev)	\
>>   	list_for_each_entry_safe((desc), (tmp), dev_to_msi_list((dev)), list)
>>   
>> +#define dev_to_platform_msi_list(dev)	(&(dev)->platform_msi_list)
>> +#define first_platform_msi_entry(dev)		\
>> +	list_first_entry(dev_to_platform_msi_list((dev)), struct msi_desc, list)
>> +#define for_each_platform_msi_entry(desc, dev)	\
>> +	list_for_each_entry((desc), dev_to_platform_msi_list((dev)), list)
>> +#define for_each_platform_msi_entry_safe(desc, tmp, dev)	\
>> +	list_for_each_entry_safe((desc), (tmp), dev_to_platform_msi_list((dev)), list)
> 
> New lines to separate macros are bad for readability, right?

Sigh, I was trying to follow the same spacing scheme as is used for the 
msi list above. Will make it more readable next time around.

> 
>> +#define first_msi_entry_common(dev)	\
>> +	list_first_entry_select((dev)->platform_msi_type, dev_to_platform_msi_list((dev)),	\
>> +				dev_to_msi_list((dev)), struct msi_desc, list)
>> +
>> +#define for_each_msi_entry_common(desc, dev)	\
>> +	list_for_each_entry_select((dev)->platform_msi_type, desc, dev_to_platform_msi_list((dev)), \
>> +				   dev_to_msi_list((dev)), list)	\
>> +
>>   #ifdef CONFIG_IRQ_MSI_IOMMU
>>   static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
>>   {
>> diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
>> index eb95f6106a1e..bc5f9e32387f 100644
>> --- a/kernel/irq/msi.c
>> +++ b/kernel/irq/msi.c
>> @@ -320,7 +320,7 @@ int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
>>   	struct msi_desc *desc;
>>   	int ret = 0;
>>   
>> -	for_each_msi_entry(desc, dev) {
>> +	for_each_msi_entry_common(desc, dev) {
> 
> This is absolutely unreadable. What's common here? You hide the decision
> which list to iterate behind a misnomed macro.

Hmm, so this macro is basically to be used by the common code (the 
kernel IRQ subsystem for instance) to know which list needs to be 
traversed, msi_list or platform_msi_list of a device.

Finding suitable names for macros is clearly my Achilles heel.

> 
> And looking at the implementation:
> 
>> +#define for_each_msi_entry_common(desc, dev)	\
>> +	list_for_each_entry_select((dev)->platform_msi_type, desc, dev_to_platform_msi_list((dev)), \
>> +				   dev_to_msi_list((dev)), list)	\
> 
> So you implicitly make the decision based on:
> 
>     (dev)->platform_msi_type != 0
> 
> What? How is that ever supposed to work? The changelog says:
> 
>> However, with the introduction of IMS, a device can support IMS as well
>> as MSI-X at the same time. Instead of sharing this list between IMS (a
>> type of platform-msi) and MSI-X descriptors, introduce a new linked list,
>> platform_msi_list, which will hold all the platform-msi descriptors.
> 
> So you are not serious about storing the decision in the device struct
> and then calling into common code?
> 
> That's insane at best. There is absolutely ZERO explanation how this is
> supposed to work and why this could even be remotely correct and safe.
> 

You are right. I think this code would have problems if there is 
concurrent access to the struct device. I probably need to impose some 
kind of locking mechanism here if a device supports both MSI-X and 
platform-msi.

> Ever heard of the existence of function arguments?
> 
> Sorry, this is just voodoo programming and not going anywhere.

hmm, will try to ensure sane programming in the next attempt.
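
For instance, the iterator could take the descriptor list as an 
explicit argument instead of guessing from state stashed in struct 
device. A sketch (the function name is made up):

static int msi_domain_populate_from_list(struct irq_domain *domain,
					 struct list_head *desc_list)
{
	struct msi_desc *desc;
	int ret = 0;

	/* the caller decides which descriptor list this walks */
	list_for_each_entry(desc, desc_list, list) {
		/* existing per-descriptor setup goes here */
	}

	return ret;
}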
> 
> Thanks,
> 
>          tglx
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 03/15] drivers/base: Allocate/free platform-msi interrupts by group
  2020-04-25 21:23   ` Thomas Gleixner
@ 2020-05-04  0:08     ` Dey, Megha
  0 siblings, 0 replies; 89+ messages in thread
From: Dey, Megha @ 2020-05-04  0:08 UTC (permalink / raw)
  To: Thomas Gleixner, Dave Jiang, vkoul, maz, bhelgaas, rafael,
	gregkh, hpa, alex.williamson, jacob.jun.pan, ashok.raj, jgg,
	yi.l.liu, baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck,
	jing.lin, dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

Hi Thomas,

On 4/25/2020 2:23 PM, Thomas Gleixner wrote:
> Dave Jiang <dave.jiang@intel.com> writes:
>> From: Megha Dey <megha.dey@linux.intel.com>
>> --- a/include/linux/msi.h
>> +++ b/include/linux/msi.h
>> @@ -135,6 +135,12 @@ enum platform_msi_type {
>>   	GEN_PLAT_MSI = 1,
>>   };
>>   
>> +struct platform_msi_group_entry {
>> +	unsigned int group_id;
>> +	struct list_head group_list;
>> +	struct list_head entry_list;
> 
> I surely told you before that struct members want to be written tabular.

yep, you surely did :) I will use tabs henceforth!
> 
>> +};
>> +
>>   /* Helpers to hide struct msi_desc implementation details */
>>   #define msi_desc_to_dev(desc)		((desc)->dev)
>>   #define dev_to_msi_list(dev)		(&(dev)->msi_list)
>> @@ -145,21 +151,31 @@ enum platform_msi_type {
>>   #define for_each_msi_entry_safe(desc, tmp, dev)	\
>>   	list_for_each_entry_safe((desc), (tmp), dev_to_msi_list((dev)), list)
>>   
>> -#define dev_to_platform_msi_list(dev)	(&(dev)->platform_msi_list)
>> -#define first_platform_msi_entry(dev)		\
>> -	list_first_entry(dev_to_platform_msi_list((dev)), struct msi_desc, list)
>> -#define for_each_platform_msi_entry(desc, dev)	\
>> -	list_for_each_entry((desc), dev_to_platform_msi_list((dev)), list)
>> -#define for_each_platform_msi_entry_safe(desc, tmp, dev)	\
>> -	list_for_each_entry_safe((desc), (tmp), dev_to_platform_msi_list((dev)), list)
>> +#define dev_to_platform_msi_group_list(dev)    (&(dev)->platform_msi_list)
>> +
>> +#define first_platform_msi_group_entry(dev)				\
>> +	list_first_entry(dev_to_platform_msi_group_list((dev)),		\
>> +			 struct platform_msi_group_entry, group_list)
>>   
>> -#define first_msi_entry_common(dev)	\
>> -	list_first_entry_select((dev)->platform_msi_type, dev_to_platform_msi_list((dev)),	\
>> +#define platform_msi_current_group_entry_list(dev)			\
>> +	(&((list_last_entry(dev_to_platform_msi_group_list((dev)),	\
>> +			    struct platform_msi_group_entry,		\
>> +			    group_list))->entry_list))
>> +
>> +#define first_msi_entry_current_group(dev)				\
>> +	list_first_entry_select((dev)->platform_msi_type,		\
>> +				platform_msi_current_group_entry_list((dev)),	\
>>   				dev_to_msi_list((dev)), struct msi_desc, list)
>>   
>> -#define for_each_msi_entry_common(desc, dev)	\
>> -	list_for_each_entry_select((dev)->platform_msi_type, desc, dev_to_platform_msi_list((dev)), \
>> -				   dev_to_msi_list((dev)), list)	\
>> +#define for_each_msi_entry_current_group(desc, dev)			\
>> +	list_for_each_entry_select((dev)->platform_msi_type, desc,	\
>> +				   platform_msi_current_group_entry_list((dev)),\
>> +				   dev_to_msi_list((dev)), list)
>> +
>> +#define for_each_platform_msi_entry_in_group(desc, platform_msi_group, group, dev)	\
>> +	list_for_each_entry((platform_msi_group), dev_to_platform_msi_group_list((dev)), group_list)	\
>> +		if (((platform_msi_group)->group_id) == (group))			\
>> +			list_for_each_entry((desc), (&(platform_msi_group)->entry_list), list)
> 
> Yet more unreadable macro maze to obfuscate what the code is actually
> doing.

hmm, I guess I will add some more documentation, either in the commit 
message or somewhere in Documentation, to make the purpose of these 
macros clearer.

> 
>>   /* When an MSI domain is used as an intermediate domain */
>>   int msi_domain_prepare_irqs(struct irq_domain *domain, struct device *dev,
>> diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
>> index bc5f9e32387f..899ade394ec8 100644
>> --- a/kernel/irq/msi.c
>> +++ b/kernel/irq/msi.c
>> @@ -320,7 +320,7 @@ int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
>>   	struct msi_desc *desc;
>>   	int ret = 0;
>>   
>> -	for_each_msi_entry_common(desc, dev) {
>> +	for_each_msi_entry_current_group(desc, dev) {
> 
> How is anyone supposed to figure out what the heck this means without
> going through several layers of macro maze and some magic type/group
> storage in struct device?
> 

Point noted. I think I am better off committing smaller logical changes 
in each patch.

> Again, function arguments exist for a reason.

ok makes sense, I will do this in the next version.

> 
> Thanks,
> 
>          tglx
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain
  2020-04-25 21:38   ` Thomas Gleixner
@ 2020-05-04  0:11     ` Dey, Megha
  0 siblings, 0 replies; 89+ messages in thread
From: Dey, Megha @ 2020-05-04  0:11 UTC (permalink / raw)
  To: Thomas Gleixner, Dave Jiang, vkoul, maz, bhelgaas, rafael,
	gregkh, hpa, alex.williamson, jacob.jun.pan, ashok.raj, jgg,
	yi.l.liu, baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck,
	jing.lin, dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

Hi Thomas,

On 4/25/2020 2:38 PM, Thomas Gleixner wrote:
> Dave Jiang <dave.jiang@intel.com> writes:
>> From: Megha Dey <megha.dey@linux.intel.com>
>>
>> Add support for the creation of a new IMS irq domain. It creates a new
>> irq chip associated with the IMS domain and adds the necessary domain
>> operations to it.
> 
> And how is a X86 specific thingy related to drivers/base?

Well, clearly this file has both arch-independent and arch-dependent 
code, which is incorrect. From various discussions, we have now 
concluded that IMS is not an X86-specific thingy after all. IMS is just 
a name Intel came up with; all it really means is device-managed 
addr/data writes to generate interrupts.

> 
>> diff --git a/drivers/base/ims-msi.c b/drivers/base/ims-msi.c
> 
> This sits in drivers base because IMS is architecture independent, right?

Per my above comment, technically we can call something IMS even if the 
device has its own non-PCI-standard location to store interrupts, much 
like platform-msi indeed. We simply need to extend platform-msi to 
address some of its shortcomings: increase the number of interrupts to 
> 2048, enable dynamic allocation of interrupts, add mask/unmask 
callbacks in addition to write_msg, etc.

I will be sending out an email shortly outlining the new design for IMS 
and the improvements we want to add to the already existing 
platform-msi infrastructure.

> 
>> new file mode 100644
>> index 000000000000..738f6d153155
>> --- /dev/null
>> +++ b/drivers/base/ims-msi.c
>> @@ -0,0 +1,100 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Support for Device Specific IMS interrupts.
>> + *
>> + * Copyright © 2019 Intel Corporation.
>> + *
>> + * Author: Megha Dey <megha.dey@intel.com>
>> + */
>> +
>> +#include <linux/dmar.h>
>> +#include <linux/irq.h>
>> +#include <linux/mdev.h>
>> +#include <linux/pci.h>
>> +
>> +/*
>> + * Determine if a dev is mdev or not. Return NULL if not mdev device.
>> + * Return mdev's parent dev if success.
>> + */
>> +static inline struct device *mdev_to_parent(struct device *dev)
>> +{
>> +	struct device *ret = NULL;
>> +	struct device *(*fn)(struct device *dev);
>> +	struct bus_type *bus = symbol_get(mdev_bus_type);
> 
> symbol_get()?

mdev_bus_type is defined in the drivers/vfio/mdev/ directory. The 
entire vfio-mdev can be compiled as a module, and if so, this symbol is 
not visible outside of that directory and there are some linker errors. 
Currently, these symbols are self-contained and are not used outside of 
the directory where they are defined. I did not know earlier that it is 
not advisable to use symbol_get() for this. I will try to come up with 
a better approach.

> 
>> +
>> +	if (bus && dev->bus == bus) {
>> +		fn = symbol_get(mdev_dev_to_parent_dev);
> 
> What's wrong with simple function calls?

Hmmm, same reason as above..
> 
>> +		ret = fn(dev);
>> +		symbol_put(mdev_dev_to_parent_dev);
>> +		symbol_put(mdev_bus_type);
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static irq_hw_number_t dev_ims_get_hwirq(struct msi_domain_info *info,
>> +					 msi_alloc_info_t *arg)
>> +{
>> +	return arg->ims_hwirq;
>> +}
>> +
>> +static int dev_ims_prepare(struct irq_domain *domain, struct device *dev,
>> +			   int nvec, msi_alloc_info_t *arg)
>> +{
>> +	if (dev_is_mdev(dev))
>> +		dev = mdev_to_parent(dev);
> 
> This makes absolutely no sense. Somewhere you claimed that this is
> solely for mdev. Now this interface takes both a regular device and mdev.
> 
> Lack of explanation seems to be a common scheme here.

IMS can be used for mdev or a regular device. I do not think it is 
claimed anywhere that IMS is solely for mdev. In the current use case 
for DSA, IMS is used only by the guest (mdev) although it can very well 
be used by the host driver as well.

> 
>> +	init_irq_alloc_info(arg, NULL);
>> +	arg->dev = dev;
>> +	arg->type = X86_IRQ_ALLOC_TYPE_IMS;
>> +
>> +	return 0;
>> +}
>> +
>> +static void dev_ims_set_desc(msi_alloc_info_t *arg, struct msi_desc *desc)
>> +{
>> +	arg->ims_hwirq = platform_msi_calc_hwirq(desc);
>> +}
>> +
>> +static struct msi_domain_ops dev_ims_domain_ops = {
>> +	.get_hwirq	= dev_ims_get_hwirq,
>> +	.msi_prepare	= dev_ims_prepare,
>> +	.set_desc	= dev_ims_set_desc,
>> +};
>> +
>> +static struct irq_chip dev_ims_ir_controller = {
>> +	.name			= "IR-DEV-IMS",
>> +	.irq_ack		= irq_chip_ack_parent,
>> +	.irq_retrigger		= irq_chip_retrigger_hierarchy,
>> +	.irq_set_vcpu_affinity	= irq_chip_set_vcpu_affinity_parent,
>> +	.flags			= IRQCHIP_SKIP_SET_WAKE,
>> +	.irq_write_msi_msg	= platform_msi_write_msg,
>> +};
>> +
>> +static struct msi_domain_info ims_ir_domain_info = {
>> +	.flags		= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
>> +	.ops		= &dev_ims_domain_ops,
>> +	.chip		= &dev_ims_ir_controller,
>> +	.handler	= handle_edge_irq,
>> +	.handler_name	= "edge",
>> +};
>> +
>> +struct irq_domain *arch_create_ims_irq_domain(struct irq_domain *parent,
>> +					      const char *name)
> 
> arch_create_ ???? In drivers/base ???

Needs to go away. On second thought, per Jason Gunthorpe's comment, this 
is not even required. We can simply use the existing 
platform_msi_create_irq_domain API itself.
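i.e. roughly (a sketch; "idxd-ims" and ims_domain_info are placeholders 
for whatever the driver actually passes in):

static struct irq_domain *idxd_create_ims_domain(struct irq_domain *parent)
{
	struct fwnode_handle *fn;

	fn = irq_domain_alloc_named_fwnode("idxd-ims");
	if (!fn)
		return NULL;

	/* ims_domain_info: a msi_domain_info with the driver's chip/ops */
	return platform_msi_create_irq_domain(fn, &ims_domain_info, parent);
}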
> 
>> +{
>> +	struct fwnode_handle *fn;
>> +	struct irq_domain *domain;
>> +
>> +	fn = irq_domain_alloc_named_fwnode(name);
>> +	if (!fn)
>> +		return NULL;
>> +
>> +	domain = msi_create_irq_domain(fn, &ims_ir_domain_info, parent);
>> +	if (!domain)
>> +		return NULL;
>> +
>> +	irq_domain_update_bus_token(domain, DOMAIN_BUS_PLATFORM_MSI);
>> +	irq_domain_free_fwnode(fn);
>> +
>> +	return domain;
>> +}
>> diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
>> index 2696aa75983b..59160e8cbfb1 100644
>> --- a/drivers/base/platform-msi.c
>> +++ b/drivers/base/platform-msi.c
>> @@ -31,12 +31,11 @@ struct platform_msi_priv_data {
>>   /* The devid allocator */
>>   static DEFINE_IDA(platform_msi_devid_ida);
>>   
>> -#ifdef GENERIC_MSI_DOMAIN_OPS
>>   /*
>>    * Convert an msi_desc to a globaly unique identifier (per-device
>>    * devid + msi_desc position in the msi_list).
>>    */
>> -static irq_hw_number_t platform_msi_calc_hwirq(struct msi_desc *desc)
>> +irq_hw_number_t platform_msi_calc_hwirq(struct msi_desc *desc)
>>   {
>>   	u32 devid;
>>   
>> @@ -45,6 +44,7 @@ static irq_hw_number_t platform_msi_calc_hwirq(struct msi_desc *desc)
>>   	return (devid << (32 - DEV_ID_SHIFT)) | desc->platform.msi_index;
>>   }
>>   
>> +#ifdef GENERIC_MSI_DOMAIN_OPS
>>   static void platform_msi_set_desc(msi_alloc_info_t *arg, struct msi_desc *desc)
>>   {
>>   	arg->desc = desc;
>> @@ -76,7 +76,7 @@ static void platform_msi_update_dom_ops(struct msi_domain_info *info)
>>   		ops->set_desc = platform_msi_set_desc;
>>   }
>>   
>> -static void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
>> +void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
>>   {
>>   	struct msi_desc *desc = irq_data_get_msi_desc(data);
>>   	struct platform_msi_priv_data *priv_data;
>> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
>> index b558d4cfd082..cecc6a6bdbef 100644
>> --- a/drivers/vfio/mdev/mdev_core.c
>> +++ b/drivers/vfio/mdev/mdev_core.c
>> @@ -33,6 +33,12 @@ struct device *mdev_parent_dev(struct mdev_device *mdev)
>>   }
>>   EXPORT_SYMBOL(mdev_parent_dev);
>>   
>> +struct device *mdev_dev_to_parent_dev(struct device *dev)
>> +{
>> +	return to_mdev_device(dev)->parent->dev;
>> +}
>> +EXPORT_SYMBOL(mdev_dev_to_parent_dev);
> 
> And this needs to be EXPORT_SYMBOL because this is designed to support
> non GPL drivers from the very beginning, right? Ditto for the other
> exports in this file.

Hmm, I followed the same convention as the other exports here. Guess I 
would have to change all other exports to EXPORT_SYMBOL_GPL as well.

> 
>> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
>> index 7d922950caaf..c21f1305a76b 100644
>> --- a/drivers/vfio/mdev/mdev_private.h
>> +++ b/drivers/vfio/mdev/mdev_private.h
>> @@ -36,7 +36,6 @@ struct mdev_device {
>>   };
>>   
>>   #define to_mdev_device(dev)	container_of(dev, struct mdev_device, dev)
>> -#define dev_is_mdev(d)		((d)->bus == &mdev_bus_type)
> 
> Moving stuff around 3 patches later makes tons of sense.

ok will add it earlier then.
>    
> Thanks,
> 
>          tglx
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 05/15] ims-msi: Add mask/unmask routines
  2020-04-25 21:49   ` Thomas Gleixner
@ 2020-05-04  0:16     ` Dey, Megha
  0 siblings, 0 replies; 89+ messages in thread
From: Dey, Megha @ 2020-05-04  0:16 UTC (permalink / raw)
  To: Thomas Gleixner, Dave Jiang, vkoul, maz, bhelgaas, rafael,
	gregkh, hpa, alex.williamson, jacob.jun.pan, ashok.raj, jgg,
	yi.l.liu, baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck,
	jing.lin, dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

Hi Thomas,

On 4/25/2020 2:49 PM, Thomas Gleixner wrote:
> Dave Jiang <dave.jiang@intel.com> writes:
>>   
>> +static u32 __dev_ims_desc_mask_irq(struct msi_desc *desc, u32 flag)
> 
> ...mask_irq()? This is doing both mask and unmask depending on the
> availability of the ops callbacks.

yes, should have called it __dev_ims_desc_mask_unmask_irq perhaps.
> 
>> +{
>> +	u32 mask_bits = desc->platform.masked;
>> +	const struct platform_msi_ops *ops;
>> +
>> +	ops = desc->platform.msi_priv_data->ops;
>> +	if (!ops)
>> +		return 0;
>> +
>> +	if (flag) {
> 
> flag? Darn, this has a clear boolean meaning of mask or unmask and 'u32
> flag' is the most natural and obvious self explaining expression for
> this, right?

will change it to a more meaningful name next time around.
> 
>> +		if (ops->irq_mask)
>> +			mask_bits = ops->irq_mask(desc);
>> +	} else {
>> +		if (ops->irq_unmask)
>> +			mask_bits = ops->irq_unmask(desc);
>> +	}
>> +
>> +	return mask_bits;
> 
> What's mask_bits? This is about _ONE_ IMS interrupt. Can it have
> multiple mask bits and if so then the explanation which I decoded by
> crystal ball probably looks like this:
> 
> Bit  0:  Don't know whether it's masked
> Bit  1:  Perhaps it's masked
> Bit  2:  Probably it's masked
> Bit  3:  Mostly masked
> ...
> Bit 31:  Fully masked
> 
> Or something like that. Makes a lot of sense in a XKCD cartoon at least.
> 

After a close look, we can simply do away with this mask_bits. Looks 
like a crystal ball will not be required next time around after all.

>> +}
>> +
>> +/**
>> + * dev_ims_mask_irq - Generic irq chip callback to mask IMS interrupts
>> + * @data: pointer to irqdata associated to that interrupt
>> + */
>> +static void dev_ims_mask_irq(struct irq_data *data)
>> +{
>> +	struct msi_desc *desc = irq_data_get_msi_desc(data);
>> +
>> +	desc->platform.masked = __dev_ims_desc_mask_irq(desc, 1);
> 
> The purpose of this masked information is?

serves no purpose; I borrowed this concept from the PCI MSI code but it 
is just junk here. Will be removed next time around.
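
i.e. the callback can collapse to something like this (a sketch, 
assuming the ops callbacks are changed to return void and no masked 
state is cached in the descriptor):

static void dev_ims_mask_irq(struct irq_data *data)
{
	struct msi_desc *desc = irq_data_get_msi_desc(data);
	const struct platform_msi_ops *ops = desc->platform.msi_priv_data->ops;

	/* device-specific masking, no mask_bits bookkeeping */
	if (ops && ops->irq_mask)
		ops->irq_mask(desc);
}

with dev_ims_unmask_irq() as the mirror image.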

> 
> Thanks,
> 
>          tglx
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 06/15] ims-msi: Enable IMS interrupts
  2020-04-25 22:13   ` Thomas Gleixner
@ 2020-05-04  0:17     ` Dey, Megha
  0 siblings, 0 replies; 89+ messages in thread
From: Dey, Megha @ 2020-05-04  0:17 UTC (permalink / raw)
  To: Thomas Gleixner, Dave Jiang, vkoul, maz, bhelgaas, rafael,
	gregkh, hpa, alex.williamson, jacob.jun.pan, ashok.raj, jgg,
	yi.l.liu, baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck,
	jing.lin, dan.j.williams, kwankhede, eric.auger, parav
  Cc: dmaengine, linux-kernel, x86, linux-pci, kvm

Hi Thomas,

On 4/25/2020 3:13 PM, Thomas Gleixner wrote:
> Dave Jiang <dave.jiang@intel.com> writes:
>>   
>> +struct irq_domain *dev_get_ims_domain(struct device *dev)
>> +{
>> +	struct irq_alloc_info info;
>> +
>> +	if (dev_is_mdev(dev))
>> +		dev = mdev_to_parent(dev);
>> +
>> +	init_irq_alloc_info(&info, NULL);
>> +	info.type = X86_IRQ_ALLOC_TYPE_IMS;
> 
> So all IMS capable devices run on X86? I thought these things are PCIe
> cards which can be plugged into any platform which supports PCIe.

No, IMS is architecture independent.

And yes, they are PCIe cards which can be plugged into any platform 
which supports PCIe.
> 
>> +	info.dev = dev;
>> +
>> +	return irq_remapping_get_irq_domain(&info);
>> +}
>> +
>>   static struct msi_domain_ops dev_ims_domain_ops = {
>>   	.get_hwirq	= dev_ims_get_hwirq,
>>   	.msi_prepare	= dev_ims_prepare,
>> diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
>> index 6d8840db4a85..204ce8041c17 100644
>> --- a/drivers/base/platform-msi.c
>> +++ b/drivers/base/platform-msi.c
>> @@ -118,6 +118,8 @@ static void platform_msi_free_descs(struct device *dev, int base, int nvec,
>>   			kfree(platform_msi_group);
>>   		}
>>   	}
>> +
>> +	dev->platform_msi_type = 0;
> 
> I can clearly see the advantage of using '0' over 'NOT_PLAT_MSI'
> here. '0' is definitely more intuitive.
> 

Hmm, this will no longer be needed in the next version of patches.
>>   }
>>   
>>   static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
>> @@ -205,18 +207,22 @@ platform_msi_alloc_priv_data(struct device *dev, unsigned int nvec,
>>   	 * accordingly (which would impact the max number of MSI
>>   	 * capable devices).
>>   	 */
>> -	if (!dev->msi_domain || !platform_ops->write_msg || !nvec ||
>> -	    nvec > MAX_DEV_MSIS)
>> +	if (!platform_ops->write_msg || !nvec || nvec > MAX_DEV_MSIS)
>>   		return ERR_PTR(-EINVAL);
>> -	if (dev->msi_domain->bus_token != DOMAIN_BUS_PLATFORM_MSI) {
>> -		dev_err(dev, "Incompatible msi_domain, giving up\n");
>> -		return ERR_PTR(-EINVAL);
>> -	}
>> +	if (dev->platform_msi_type == GEN_PLAT_MSI) {
>> +		if (!dev->msi_domain)
>> +			return ERR_PTR(-EINVAL);
>> +
>> +		if (dev->msi_domain->bus_token != DOMAIN_BUS_PLATFORM_MSI) {
>> +			dev_err(dev, "Incompatible msi_domain, giving up\n");
>> +			return ERR_PTR(-EINVAL);
>> +		}
>>   
>> -	/* Already had a helping of MSI? Greed... */
>> -	if (!list_empty(platform_msi_current_group_entry_list(dev)))
>> -		return ERR_PTR(-EBUSY);
>> +		/* Already had a helping of MSI? Greed... */
>> +		if (!list_empty(platform_msi_current_group_entry_list(dev)))
>> +			return ERR_PTR(-EBUSY);
>> +	}
>>   
>>   	datap = kzalloc(sizeof(*datap), GFP_KERNEL);
>>   	if (!datap)
>> @@ -254,6 +260,7 @@ static void platform_msi_free_priv_data(struct platform_msi_priv_data *data)
>>   int platform_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
>>   				   const struct platform_msi_ops *platform_ops)
>>   {
>> +	dev->platform_msi_type = GEN_PLAT_MSI;
>>   	return platform_msi_domain_alloc_irqs_group(dev, nvec, platform_ops,
>>   									NULL);
>>   }
>> @@ -265,12 +272,18 @@ int platform_msi_domain_alloc_irqs_group(struct device *dev, unsigned int nvec,
>>   {
>>   	struct platform_msi_group_entry *platform_msi_group;
>>   	struct platform_msi_priv_data *priv_data;
>> +	struct irq_domain *domain;
>>   	int err;
>>   
>> -	dev->platform_msi_type = GEN_PLAT_MSI;
> 
> Groan. If you move the type assignment to the caller then do so in a
> separate patch. These all in one combo changes are simply not reviewable
> without getting nuts.

sure, makes sense to add it as a separate patch.
> 
>> -	if (group_id)
>> +	if (!dev->platform_msi_type) {
> 
> That's really consistent. If the caller does not store a type upfront
> then it becomes IMS automagically. Can you pretty please stop to think
> that this IMS stuff is the center of the universe? To be clear, it's
> just another variant of half thought out hardware design fail as all the
> other stuff we already have to support.

well, as we have recently concluded, IMS is merely an extension of and 
improvement over the already existing platform-msi. So no, it is not 
the center of the universe indeed.

> 
> Abusing dev->platform_msi_type to decide about the nature of the call
> and then decide that anything which does not set it upfront is IMS is
> really future proof.

Have to think of something else indeed <scratching my head>

> 
>>   		*group_id = ++dev->group_id;
>> +		dev->platform_msi_type = IMS;
> 
> Oh a new type name 'IMS'. Well suited into the naming scheme.

coming up with a coherent naming scheme in the next version of patches.
> 
>> +		domain = dev_get_ims_domain(dev);
> 
> No. This is completely inconsistent again and a blatant violation of
> layering.

yes, I earlier thought that what differentiates the already existing 
platform-msi from IMS is that IMS has to have IR enabled, and thus we 
need to have some way of finding the IRQ domain corresponding to that 
interrupt remapping unit. Now that this theory has turned out to be 
false, we will not be needing this call after all.

> 
> Thanks,
> 
>          tglx
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-05-03 22:36             ` Jason Gunthorpe
@ 2020-05-04  0:20               ` Dey, Megha
  0 siblings, 0 replies; 89+ messages in thread
From: Dey, Megha @ 2020-05-04  0:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Dave Jiang, Vinod Koul, maz, Bjorn Helgaas,
	Rafael J. Wysocki, Greg KH, Thomas Gleixner, H. Peter Anvin,
	Alex Williamson, Jacob jun Pan, Raj, Ashok, Yi L Liu, baolu.lu,
	Tian, Kevin, Sanjay K Kumar, Luck, Tony, Jing Lin, kwankhede,
	eric.auger, parav, dmaengine, Linux Kernel Mailing List, X86 ML,
	linux-pci, KVM list



On 5/3/2020 3:36 PM, Jason Gunthorpe wrote:
> On Sun, May 03, 2020 at 03:31:39PM -0700, Dey, Megha wrote:
>>
>> Hi Jason,
>>
>> On 5/3/2020 3:22 PM, Jason Gunthorpe wrote:
>>> On Fri, May 01, 2020 at 03:31:51PM -0700, Dey, Megha wrote:
>>>>>> This has been my concern reviewing the implementation. IMS needs more
>>>>>> than one in-tree user to validate degrees of freedom in the api. I had
>>>>>> been missing a second "in-tree user" to validate the scope of the
>>>>>> flexibility that was needed.
>>>>>
>>>>> IMS is too narrowly specified.
>>>>>
>>>>> All platforms that support MSI today can support IMS. It is simply a
>>>>> way for the platform to give the driver an addr/data pair that triggers
>>>>> an interrupt when a posted write is performed to that pair.
>>>>>
>>>>
>>>> Well, yes and no. IMS requires interrupt remapping in addition to the
>>>> dynamic nature of IRQ allocation.
>>>
>>> You've mentioned remapping a few times, but I really can't understand
>>> why it has anything to do with platform_msi or IMS..
>>
>> So after some internal discussions, we have concluded that IMS has no
>> linkage with Interrupt remapping, IR is just a platform concept. IMS is just
>> a name Intel came up with, all it really means is device managed addr/data
>> writes to generate interrupts. Technically we can call something IMS even if
>> device has its own location to store interrupts in non-pci standard
>> mechanism, much like platform-msi indeed. We simply need to extend
>> platform-msi to its address some of its shortcomings: increase number of
>> interrupts to > 2048, enable dynamic allocation of interrupts, add
>> mask/unmask callbacks in addition to write_msg etc.
> 
> Sounds right to me
> 
> Presumably you still need a way for the driver, eg vfio, to ensure an
> MSI is remappable, but shouldn't that be exactly the same way as done
> in normal PCI MSI today?

yes exactly, it should be done in the same way as PCI-MSI; if IR is 
enabled we will have the equivalent of IR_PCI_MSI for platform-msi as 
well.
> 
>> FWIW, even MSI can be IMS with rules on how to manage the addr/data writes
>> following pci sig .. its just that.
> 
> Yep, IMHO, our whole handling of MSI is very un-general sometimes..
> 
> I thought the msi_domain stuff that some platforms are using is a way
> to improve on that? You might find that updating x86 to use msi_domain
> might be helpful in this project???

yes, we need to take a closer look at this.
> 
> Jason
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain
  2020-05-03 22:46           ` Jason Gunthorpe
@ 2020-05-04  0:25             ` Dey, Megha
  2020-05-04 12:14               ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Dey, Megha @ 2020-05-04  0:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Jiang, vkoul, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm



On 5/3/2020 3:46 PM, Jason Gunthorpe wrote:
> On Sun, May 03, 2020 at 03:40:44PM -0700, Dey, Megha wrote:
>> On 5/3/2020 3:25 PM, Jason Gunthorpe wrote:
>>> On Fri, May 01, 2020 at 03:30:02PM -0700, Dey, Megha wrote:
>>>> Hi Jason,
>>>>
>>>> On 4/23/2020 1:11 PM, Jason Gunthorpe wrote:
>>>>> On Tue, Apr 21, 2020 at 04:34:11PM -0700, Dave Jiang wrote:
>>>>>> diff --git a/drivers/base/ims-msi.c b/drivers/base/ims-msi.c
>>>>>> new file mode 100644
>>>>>> index 000000000000..738f6d153155
>>>>>> +++ b/drivers/base/ims-msi.c
>>>>>> @@ -0,0 +1,100 @@
>>>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>>>> +/*
>>>>>> + * Support for Device Specific IMS interrupts.
>>>>>> + *
>>>>>> + * Copyright © 2019 Intel Corporation.
>>>>>> + *
>>>>>> + * Author: Megha Dey <megha.dey@intel.com>
>>>>>> + */
>>>>>> +
>>>>>> +#include <linux/dmar.h>
>>>>>> +#include <linux/irq.h>
>>>>>> +#include <linux/mdev.h>
>>>>>> +#include <linux/pci.h>
>>>>>> +
>>>>>> +/*
>>>>>> + * Determine if a dev is mdev or not. Return NULL if not mdev device.
>>>>>> + * Return mdev's parent dev if success.
>>>>>> + */
>>>>>> +static inline struct device *mdev_to_parent(struct device *dev)
>>>>>> +{
>>>>>> +	struct device *ret = NULL;
>>>>>> +	struct device *(*fn)(struct device *dev);
>>>>>> +	struct bus_type *bus = symbol_get(mdev_bus_type);
>>>>>> +
>>>>>> +	if (bus && dev->bus == bus) {
>>>>>> +		fn = symbol_get(mdev_dev_to_parent_dev);
>>>>>> +		ret = fn(dev);
>>>>>> +		symbol_put(mdev_dev_to_parent_dev);
>>>>>> +		symbol_put(mdev_bus_type);
>>>>>
>>>>> No, things like this are not OK in the drivers/base
>>>>>
>>>>> Whatever this is doing needs to be properly architected in some
>>>>> generic way.
>>>>
>>>> Basically what I am trying to do here is to determine if the device is an
>>>> mdev device or not.
>>>
>>> Why? mdev devices are virtual they don't have HW elements.
>>
>> Hmm yeah, exactly: since they are virtual, they do not have an associated
>> IRQ domain, right? So they use the irq domain of the parent device..
>>
>>>
>>> The caller should use the concrete pci_device to allocate
>>> platform_msi? What is preventing this?
>>
>> hmmm, do you mean to say that all platform-msi devices adhere to the
>> rules of a PCI device?
> 
> I mean where a platform-msi can work should be defined by the arch,
> and probably is related to things like having an irq_domain attached
> 
> So, like pci, drivers must only try to do platform_msi stuff on
> particular devices, e.g. on pci_device and platform_device types.
> 
> Even so it may not even work, but I can't think of any reason why it
> should be made to work on a virtual device like mdev.
> 
>> The use case is when we have a device assigned to a guest and we
>> want to allocate IMS (platform-msi) interrupts for that
>> guest-assigned device. Currently, this is abstracted through an mdev
>> interface.
> 
> And the mdev has the pci_device internally, so it should simply pass
> that pci_device to the platform_msi machinery.

hmm, I am not sure I follow this. mdev has a pci_device internally? Which 
struct are you referring to here?

mdev is merely a micropartitioned PCI device, right, with no real PCI 
resource backing. I am not sure how else we can find the IRQ domain 
associated with an mdev..

> 
> This is no different from something like pci_iomap() which must be
> used with the pci_device.
> 
> Jason
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain
  2020-05-04  0:25             ` Dey, Megha
@ 2020-05-04 12:14               ` Jason Gunthorpe
  2020-05-06 10:27                 ` Tian, Kevin
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-05-04 12:14 UTC (permalink / raw)
  To: Dey, Megha
  Cc: Dave Jiang, vkoul, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm

On Sun, May 03, 2020 at 05:25:28PM -0700, Dey, Megha wrote:
> > > The use case is when we have a device assigned to a guest and we
> > > want to allocate IMS (platform-msi) interrupts for that
> > > guest-assigned device. Currently, this is abstracted through an mdev
> > > interface.
> > 
> > And the mdev has the pci_device internally, so it should simply pass
> > that pci_device to the platform_msi machinery.
> 
> hmm, I am not sure I follow this. mdev has a pci_device internally? Which
> struct are you referring to here?

mdev in general may not, but any ADI trying to use mdev will
necessarily have access to a struct pci_device.

> mdev is merely a micropartitioned PCI device, right, with no real PCI
> resource backing. I am not sure how else we can find the IRQ domain
> associated with an mdev..

ADI always has real PCI resource backing.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain
  2020-05-04 12:14               ` Jason Gunthorpe
@ 2020-05-06 10:27                 ` Tian, Kevin
  0 siblings, 0 replies; 89+ messages in thread
From: Tian, Kevin @ 2020-05-06 10:27 UTC (permalink / raw)
  To: Jason Gunthorpe, Dey, Megha
  Cc: Jiang, Dave, vkoul, maz, bhelgaas, rafael, gregkh, tglx, hpa,
	alex.williamson, Pan, Jacob jun, Raj, Ashok, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams, Dan J,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm

> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Monday, May 4, 2020 8:14 PM
> 
> On Sun, May 03, 2020 at 05:25:28PM -0700, Dey, Megha wrote:
> > > > The use case is when we have a device assigned to a guest and we
> > > > want to allocate IMS (platform-msi) interrupts for that
> > > > guest-assigned device. Currently, this is abstracted through an mdev
> > > > interface.
> > >
> > > And the mdev has the pci_device internally, so it should simply pass
> > > that pci_device to the platform_msi machinery.
> >
> > hmm, I am not sure I follow this. mdev has a pci_device internally? Which
> > struct are you referring to here?
> 
> mdev in general may not, but any ADI trying to use mdev will
> necessarily have access to a struct pci_device.

Agree here. Mdev is just a driver-internal concept. It doesn't make sense to
expose it in drivers/base, just like how we avoided exposing mdev in the
iommu layer.

Megha, every mdev/ADI has a parent device, which is the struct pci_device
that Jason refers to. At the irq domain level, we only need to care about
the PCI device and the related IMS management. It doesn't matter whether
the allocated IMS entry is used for an mdev or by the parent driver itself.
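
In the host driver that would look something like this (just a sketch;
idxd_ims_write_msg is a made-up callback name here, and we assume the
mdev parent is a PCI device):

	/* sketch: the irq layer only ever sees the parent PCI device */
	struct device *parent = mdev_parent_dev(mdev);
	struct pci_dev *pdev = to_pci_dev(parent);
	int rc;

	rc = platform_msi_domain_alloc_irqs(&pdev->dev, nvec,
					    idxd_ims_write_msg);

so nothing mdev-specific needs to leak into drivers/base at all.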

Thanks
Kevin

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-04-29  9:42                               ` Tian, Kevin
@ 2020-05-08 20:47                                 ` Raj, Ashok
  2020-05-08 23:16                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Raj, Ashok @ 2020-05-08 20:47 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Alex Williamson, Jiang, Dave, vkoul, megha.dey,
	maz, bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu,
	Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing,
	Williams, Dan J, kwankhede, eric.auger, parav, dmaengine,
	linux-kernel, x86, linux-pci, kvm, Ashok Raj

Hi Jason

In general, your idea of moving pure emulation code to user space 
is a good strategy.


On Wed, Apr 29, 2020 at 02:42:20AM -0700, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > Sent: Monday, April 27, 2020 9:22 PM
> > 
> > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
> > 
> > > > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > > > emulation.
> > >
> > > Not sure what you're referring to, I see about 30 lines of code in
> > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> > > config space and maybe a couple hundred lines of code in total handling
> > > config space emulation.  Thanks,
> > 
> > Look around vidxd_do_command()
> > 
> > If I understand this flow properly..
> > 
> 
> Hi, Jason,
> 
> I guess the 2000 lines mostly refer to the changes in mdev.c and vdev.c. 
> We did a break-down among them:
> 
> 1) ~150 LOC for vdev initialization
> 2) ~150 LOC for cfg space emulation
> 3) ~230 LOC for mmio r/w emulation
> 4) ~500 LOC for controlling the work queue (vidxd_do_command), 
> triggered by write emulation of IDXD_CMD_OFFSET register
> 5) the remaining lines are all about vfio-mdev registration/callbacks,
> for reporting mmio/irq resource, eventfd, mmap, etc.
> 
> 1/2/3) are pure device emulation, which accounts for ~500 LOC. 
> 
> 4) needs to be in the kernel regardless of which uAPI is used, because it
> talks to the physical work queue (enable, disable, drain, abort, reset, etc.)
> 
> Then, if we are just talking about ~500 LOC of emulation code left in the
> kernel, is it still a big concern to you? 😊

Even when uacce was under development, one of the options
was to use VFIO as the transport; the goal was the same, i.e. to keep
user space to one interface. But the needs of a generic
user space application are significantly different from exporting
a more functional device model to a guest, which isn't a fully emulated
device. That is why VFIO didn't make sense for native use.

And when we move things from VFIO, which is already established
as a general device model and accepted by multiple VMMs, it gives us
instant footing without a whole redesign. When we move things from
VFIO to uacce to bolt on VFIO-like functionality, I suspect we would
just be moving code/functionality from VFIO to uacce. I don't know
what the net gain would be.

IMS is being reworked based on your feedback. And for mdev, the
emulation code is minimal; the rest are control paths that need
kernel code to handle them.

For mdev, would you agree we can keep the current architecture,
and investigate moving some emulation code to user space (say even for
standard vfio_pci) and then expand the scope later?

Cheers
Ashok

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-05-08 20:47                                 ` Raj, Ashok
@ 2020-05-08 23:16                                   ` Jason Gunthorpe
  2020-05-08 23:52                                     ` Dave Jiang
  2020-05-09  0:09                                     ` Raj, Ashok
  0 siblings, 2 replies; 89+ messages in thread
From: Jason Gunthorpe @ 2020-05-08 23:16 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Tian, Kevin, Alex Williamson, Jiang, Dave, vkoul, megha.dey, maz,
	bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm

On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote:

> Even when uacce was under development, one of the options
> was to use VFIO as the transport; the goal was the same, i.e. to keep
> user space to one interface.

I feel a bit out of the loop here; uacce isn't in today's kernel, is
it? I've heard about it for a while, and it sounds very similar to RDMA,
so I hope they took some of my advice...

> But the needs of a generic user space application are significantly
> different from exporting a more functional device model to a guest,
> which isn't a fully emulated device. That is why VFIO didn't make
> sense for native use.

I'm not sure this is true. We've done these kinds of emulated SIOV-like
things already and there is a huge overlap between what a generic
user application needs and what the VMM needs. Actually almost a
perfect subset except for interrupt remapping (which is quite
trivial).

The things vfio focuses on, like groups and managing a real config
space just don't apply here.

> And when we move things from VFIO, which is already established
> as a general device model and accepted by multiple VMMs, it gives us
> instant footing without a whole redesign.

Yes, I understand, but I think you need to get more people to support
this idea. From my standpoint this is taking secure lean VMMs and
putting emulation code back into them, except in a more dangerous
kernel location. This does not seem like a net win to me.

You'd be much better to have some userspace library scheme instead of
being completely tied to a kernel interface for modularity.

> When we move things from VFIO to uacce to bolt on VFIO-like
> functionality, I suspect we would just be moving code/functionality
> from VFIO to uacce. I don't know what the net gain would be.

Most of VFIO functionality is already decomposed inside the kernel,
and you need most of it to do secure user access anyhow.

> For mdev, would you agree we can keep the current architecture,
> and investigate moving some emulation code to user space (say even for
> standard vfio_pci) and then expand the scope later?

I won't hard NAK this, but I think you need more people to support
this general idea of more emulation code in the kernel to go ahead -
particularly since this is one of many future drivers along this
design.

It would be good to hear from the VMM teams that this is what they
want (and why), for instance.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-05-08 23:16                                   ` Jason Gunthorpe
@ 2020-05-08 23:52                                     ` Dave Jiang
  2020-05-09  0:09                                     ` Raj, Ashok
  1 sibling, 0 replies; 89+ messages in thread
From: Dave Jiang @ 2020-05-08 23:52 UTC (permalink / raw)
  To: Jason Gunthorpe, Raj, Ashok
  Cc: Tian, Kevin, Alex Williamson, vkoul, megha.dey, maz, bhelgaas,
	rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L, Lu, Baolu,
	Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams, Dan J,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm



On 5/8/2020 4:16 PM, Jason Gunthorpe wrote:
> On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote:
> 
>> Even when uacce was under development, one of the options
>> was to use VFIO as the transport; the goal was the same, i.e. to keep
>> user space to one interface.
> 
> I feel a bit out of the loop here; uacce isn't in today's kernel, is
> it? I've heard about it for a while, and it sounds very similar to RDMA,
> so I hope they took some of my advice...

It went into the 5.7 kernel: drivers/misc/uacce. It looks like a char
device exported with SVM support.
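
A driver exposes an SVA-capable queue through it roughly like this (a
sketch from memory; the ops struct contents and names are assumed):

	#include <linux/uacce.h>

	static struct uacce_interface uif = {
		.name	= "example",
		.flags	= UACCE_DEV_SVA,	/* bind queues to a process mm */
		.ops	= &example_uacce_ops,	/* get_queue/put_queue/mmap */
	};
	struct uacce_device *uacce;
	int rc;

	uacce = uacce_alloc(parent_dev, &uif);	/* ERR_PTR on failure */
	if (!IS_ERR(uacce))
		rc = uacce_register(uacce);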

> 
>> But the needs of a generic user space application are significantly
>> different from exporting a more functional device model to a guest,
>> which isn't a fully emulated device. That is why VFIO didn't make
>> sense for native use.
> 
> I'm not sure this is true. We've done these kinds of emulated SIOV-like
> things already and there is a huge overlap between what a generic
> user application needs and what the VMM needs. Actually almost a
> perfect subset except for interrupt remapping (which is quite
> trivial).
> 
> The things vfio focuses on, like groups and managing a real config
> space just don't apply here.
> 
>> And when we move things from VFIO, which is already established
>> as a general device model and accepted by multiple VMMs, it gives us
>> instant footing without a whole redesign.
> 
> Yes, I understand, but I think you need to get more people to support
> this idea. From my standpoint this is taking secure lean VMMs and
> putting emulation code back into them, except in a more dangerous
> kernel location. This does not seem like a net win to me.
> 
> You'd be much better to have some userspace library scheme instead of
> being completely tied to a kernel interface for modularity.
> 
>> When we move things from VFIO to uacce to bolt on VFIO-like
>> functionality, I suspect we would just be moving code/functionality
>> from VFIO to uacce. I don't know what the net gain would be.
> 
> Most of VFIO functionality is already decomposed inside the kernel,
> and you need most of it to do secure user access anyhow.
> 
>> For mdev, would you agree we can keep the current architecture,
>> and investigate moving some emulation code to user space (say even for
>> standard vfio_pci) and then expand the scope later?
> 
> I won't hard NAK this, but I think you need more people to support
> this general idea of more emulation code in the kernel to go ahead -
> particularly since this is one of many future drivers along this
> design.
> 
> It would be good to hear from the VMM teams that this is what they
> want (and why), for instance.
> 
> Jason
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-05-08 23:16                                   ` Jason Gunthorpe
  2020-05-08 23:52                                     ` Dave Jiang
@ 2020-05-09  0:09                                     ` Raj, Ashok
  2020-05-09 12:21                                       ` Jason Gunthorpe
  1 sibling, 1 reply; 89+ messages in thread
From: Raj, Ashok @ 2020-05-09  0:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Alex Williamson, Jiang, Dave, vkoul, megha.dey, maz,
	bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm, Ashok Raj, Paolo Bonzini

Hi Jason

On Fri, May 08, 2020 at 08:16:10PM -0300, Jason Gunthorpe wrote:
> On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote:
> 
> > Even when uacce was under development, one of the options
> > was to use VFIO as the transport; the goal was the same, i.e. to keep
> > user space to one interface.
> 
> I feel a bit out of the loop here; uacce isn't in today's kernel, is
> it? I've heard about it for a while, and it sounds very similar to RDMA,
> so I hope they took some of my advice...

I think since 5.7, maybe? drivers/misc/uacce. I don't think this is like
RDMA; it's just a plain accelerator. There is no connection management,
memory registration or other things.. IB was my first job at Intel,
but saying that would be giving my age away :)

> 
> > But the needs of a generic user space application are significantly
> > different from exporting a more functional device model to a guest,
> > which isn't a fully emulated device. That is why VFIO didn't make
> > sense for native use.
> 
> I'm not sure this is true. We've done these kinds of emulated SIOV-like
> things already and there is a huge overlap between what a generic
> user application needs and what the VMM needs. Actually almost a
> perfect subset except for interrupt remapping (which is quite
> trivial).

From a simple user application POV, if we need to do simple compression
or suchlike with a shared WQ, all the application needs to do is call
bind_mm(), which associates the process address space with the
IOMMU to create that association and communication channel.

To support this for a guest user, we need to support the same actions
from a guest OS, i.e. a guest OS bind should be serviced and end up with
the IOMMU plumbing it with the guest cr3, and making sure the guest 2nd
level is plumbed right for the nested walk.
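
For reference, the native bind is roughly the following (a sketch with
error handling trimmed; dev is the physical device):

	/* associate current->mm with the device; the returned handle
	 * carries the PASID that tags the vaddr-based descriptors */
	struct iommu_sva *sva;
	int pasid;

	sva = iommu_sva_bind_device(dev, current->mm, NULL);
	if (IS_ERR(sva))
		return PTR_ERR(sva);
	pasid = iommu_sva_get_pasid(sva);

From the guest driver's point of view the bind looks the same; the
difference is that the host must service it with the nested IOMMU
plumbing described above.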

Now we could certainly go and bolt all these things on again, when VFIO
has already done the plumbing in a generic way.

> 
> The things vfio focuses on, like groups and managing a real config
> space just don't apply here.
> 
> > And when we move things from VFIO, which is already established
> > as a general device model and accepted by multiple VMMs, it gives us
> > instant footing without a whole redesign.
> 
> Yes, I understand, but I think you need to get more people to support
> this idea. From my standpoint this is taking secure lean VMMs and

When we decided on VFIO, it was after following the best practices of the
time, after discussion with Kirti Wankhede and Alex. Kevin had used it for
graphics virtualization. It was even presented at KVM Forum and such,
dating back to 2017. No one had raised alarms until now :-)


> putting emulation code back into them, except in a more dangerous
> kernel location. This does not seem like a net win to me.

It's not a whole lot of emulation, right? mdevs are soft-partitioned. There is
just a single PF, but we can create a separate partition for the guest using
PASID along with the normal BDF (RID). And by exposing a consistent PCI-like
interface to user space, you get everything else for free.

Yes, it's not SRIOV, but by giving that interface to user space via VFIO, we
get all of that functionality without having to reinvent a different way to do it.

vDPA went the other way, IIRC: they went and put an implementation of what
virtio is into hardware. So they sort of fit the model. Here the instance
looks and feels like real hardware for the setup and control aspect.


> 
> You'd be much better to have some userspace library scheme instead of
> being completely tied to a kernel interface for modularity.

Couldn't agree more :-).. all I'm asking is whether we can take a phased
approach to get to that goodness! If we need to move things to user space
for emulation, that's a great goal, but it can be evolutionary.

> 
> > When we move things from VFIO to uacce to bolt on VFIO-like
> > functionality, I suspect we would just be moving code/functionality
> > from VFIO to uacce. I don't know what the net gain would be.
> 
> Most of VFIO functionality is already decomposed inside the kernel,
> and you need most of it to do secure user access anyhow.
> 
> > For mdev, would you agree we can keep the current architecture,
> > and investigate moving some emulation code to user space (say even for
> > standard vfio_pci) and then expand the scope later?
> 
> I won't hard NAK this, but I think you need more people to support
> this general idea of more emulation code in the kernel to go ahead -
> particularly since this is one of many future drivers along this
> design.
> 
> It would be good to hear from the VMM teams that this is what they
> want (and why), for instance.

IIRC Paolo was present, I think, and we can find other VMM folks to chime
in if that helps.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-05-09  0:09                                     ` Raj, Ashok
@ 2020-05-09 12:21                                       ` Jason Gunthorpe
  2020-05-13  2:29                                         ` Jason Wang
  2020-05-13  8:30                                         ` Tian, Kevin
  0 siblings, 2 replies; 89+ messages in thread
From: Jason Gunthorpe @ 2020-05-09 12:21 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Tian, Kevin, Alex Williamson, Jiang, Dave, vkoul, megha.dey, maz,
	bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm, Paolo Bonzini

On Fri, May 08, 2020 at 05:09:09PM -0700, Raj, Ashok wrote:
> Hi Jason
> 
> On Fri, May 08, 2020 at 08:16:10PM -0300, Jason Gunthorpe wrote:
> > On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote:
> > 
> > > Even when uacce was under development, one of the options
> > > was to use VFIO as the transport; the goal was the same, i.e. to keep
> > > user space to one interface.
> > 
> > I feel a bit out of the loop here; uacce isn't in today's kernel, is
> > it? I've heard about it for a while, and it sounds very similar to RDMA,
> > so I hope they took some of my advice...
> 
> I think since 5.7, maybe? drivers/misc/uacce. I don't think this is like
> RDMA; it's just a plain accelerator. There is no connection management,
> memory registration or other things.. IB was my first job at Intel,
> but saying that would be giving my age away :)

rdma was the first thing to do kernel bypass, all this stuff is like
rdma at some level.. I see this looks like the 'warp driver' stuff
redone

Wow, lots wrong here. Oh well.

> > putting emulation code back into them, except in a more dangerous
> > kernel location. This does not seem like a net win to me.
> 
> Its not a whole lot of emulation right? mdev are soft partitioned. There is
> just a single PF, but we can create a separate partition for the guest using
> PASID along with the normal BDF (RID). And exposing a consistent PCI like
> interface to user space you get everything else for free.
> 
> Yes, its not SRIOV, but giving that interface to user space via VFIO, we get 
> all of that functionality without having to reinvent a different way to do it.
> 
> vDPA went the other way, IRC, they went and put a HW implementation of what
> virtio is in hardware. So they sort of fit the model. Here the instance
> looks and feels like real hardware for the setup and control aspect.

VDPA and this are very similar, of course it depends on the exact HW
implementation.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-05-09 12:21                                       ` Jason Gunthorpe
@ 2020-05-13  2:29                                         ` Jason Wang
  2020-05-13  8:30                                         ` Tian, Kevin
  1 sibling, 0 replies; 89+ messages in thread
From: Jason Wang @ 2020-05-13  2:29 UTC (permalink / raw)
  To: Jason Gunthorpe, Raj, Ashok
  Cc: Tian, Kevin, Alex Williamson, Jiang, Dave, vkoul, megha.dey, maz,
	bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm, Paolo Bonzini


On 2020/5/9 8:21 PM, Jason Gunthorpe wrote:
> On Fri, May 08, 2020 at 05:09:09PM -0700, Raj, Ashok wrote:
>> Hi Jason
>>
>> On Fri, May 08, 2020 at 08:16:10PM -0300, Jason Gunthorpe wrote:
>>> On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote:
>>>
>>>> Even when uacce was under development, one of the options
>>>> was to use VFIO as the transport; the goal was the same, i.e. to keep
>>>> user space to one interface.
>>> I feel a bit out of the loop here; uacce isn't in today's kernel, is
>>> it? I've heard about it for a while, and it sounds very similar to RDMA,
>>> so I hope they took some of my advice...
>> I think since 5.7, maybe? drivers/misc/uacce. I don't think this is like
>> RDMA; it's just a plain accelerator. There is no connection management,
>> memory registration or other things.. IB was my first job at Intel,
>> but saying that would be giving my age away :)
> rdma was the first thing to do kernel bypass, all this stuff is like
> rdma at some level.. I see this looks like the 'warp driver' stuff
> redone
>
> Wow, lots wrong here. Oh well.
>
>>> putting emulation code back into them, except in a more dangerous
>>> kernel location. This does not seem like a net win to me.
>> It's not a whole lot of emulation, right? mdevs are soft-partitioned. There is
>> just a single PF, but we can create a separate partition for the guest using
>> PASID along with the normal BDF (RID). And by exposing a consistent PCI-like
>> interface to user space, you get everything else for free.
>>
>> Yes, it's not SRIOV, but by giving that interface to user space via VFIO, we
>> get all of that functionality without having to reinvent a different way to do it.
>>
>> vDPA went the other way, IIRC: they went and put an implementation of what
>> virtio is into hardware. So they sort of fit the model. Here the instance
>> looks and feels like real hardware for the setup and control aspect.
> VDPA and this are very similar, of course it depends on the exact HW
> implementation.
>
> Jason


Actually this is not a must. Technically we can do ring/descriptor
translation in the vDPA driver, as zerocopy AF_XDP did.

Thanks


>


^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-05-09 12:21                                       ` Jason Gunthorpe
  2020-05-13  2:29                                         ` Jason Wang
@ 2020-05-13  8:30                                         ` Tian, Kevin
  2020-05-13 12:40                                           ` Jason Gunthorpe
  1 sibling, 1 reply; 89+ messages in thread
From: Tian, Kevin @ 2020-05-13  8:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Raj, Ashok
  Cc: Alex Williamson, Jiang, Dave, vkoul, megha.dey, maz, bhelgaas,
	rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L, Lu, Baolu,
	Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams, Dan J,
	kwankhede, eric.auger, parav, dmaengine, linux-kernel, x86,
	linux-pci, kvm, Paolo Bonzini

> From: Jason Gunthorpe
> Sent: Saturday, May 9, 2020 8:21 PM
> > > putting emulation code back into them, except in a more dangerous
> > > kernel location. This does not seem like a net win to me.
> >
> > It's not a whole lot of emulation, right? mdevs are soft-partitioned. There is
> > just a single PF, but we can create a separate partition for the guest using
> > PASID along with the normal BDF (RID). And by exposing a consistent PCI-like
> > interface to user space, you get everything else for free.
> >
> > Yes, it's not SRIOV, but by giving that interface to user space via VFIO, we
> > get all of that functionality without having to reinvent a different way to do it.
> >
> > vDPA went the other way, IIRC: they went and put an implementation of what
> > virtio is into hardware. So they sort of fit the model. Here the instance
> > looks and feels like real hardware for the setup and control aspect.
> 
> VDPA and this are very similar, of course it depends on the exact HW
> implementation.
> 

Hi, Jason,

I have more thoughts below; let's see whether they make sense to you.

When talking about virtualization, the target here is an unmodified guest
kernel driver which expects to see the raw controllability of queues
as defined by the device spec. In idxd, such controllability includes
enabling/disabling SVA, dedicated or shared WQ, size, threshold, privilege,
fault mode, max batch size, and many other attributes. Each guest OS
has its own policy for using all or part of the available controllability.

When talking about applications, we care about providing an efficient
programming interface to userspace. For example with uacce, we
allow an application to submit vaddr-based workloads to a reserved
WQ with the kernel bypassed. But it's not necessary to export the raw
controllability of the reserved WQ to userspace, and we still rely on
the kernel driver to configure it, including bind_mm. I'm not sure whether
uacce would like to evolve into a generic queue management system
covering non-SVA and all vendor-specific raw capabilities as
expected by all kinds of guest kernel drivers. That sounds not
worthwhile at this point, given that we already have a highly efficient
SVA interface for user applications.

That is why we start with mdev as an evolutionary approach. Mdev is
introduced to expose the raw controllability of a subdevice (WQ or ADI)
to the guest. It builds a channel between the guest kernel driver and the
host kernel driver, using the device spec as the uAPI by sticking to the
mmio interface, and all virtualization-related setup is consolidated in
vfio. The drawback, as you pointed out, is putting some degree of
emulation code in the kernel. But as explained earlier, it is only a small
portion of code. Moreover, most registers are emulated as simple memory
reads/writes, while the remaining logic mostly belongs to the raw
controllability (e.g. the cmd register) that the host driver grants to the
guest and thus must propagate to the device. For the latter part, I would
call it 'mediation' rather than 'emulation', and it is required under
whatever uAPI would be used.
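
The shape of that split is roughly as below (a sketch; the struct and
register names follow the earlier discussion in this thread rather than
the exact patch code):

	/* sketch: most registers are plain memory emulation; only the
	 * cmd register is mediated down to the physical device */
	static int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos,
				    void *buf, unsigned int size)
	{
		u8 *bar0 = vidxd->bar0;	/* assumed shadow of MMIO space */

		memcpy(bar0 + pos, buf, size);

		if (pos == IDXD_CMD_OFFSET)	/* raw controllability */
			vidxd_do_command(vidxd, *(u32 *)buf);

		return 0;
	}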

If in the future there is such a requirement to delegate raw
WQ controllability to pure userspace applications for DMA engines,
and there is a well-defined uAPI covering a large common set of
controllability across multiple vendors, we will look at that option for
sure.

From the above p.o.v., I feel vdpa is a different story. virtio/vhost has a
well-established ecosystem between guest and host. The user
space VMM already emulates all available controllability as defined
in the virtio spec, and the host kernel already supports the vhost uAPI
for vring setup, iotlb management, etc. Extending that path for data path
offloading sounds like a reasonable choice for vdpa...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.
  2020-05-13  8:30                                         ` Tian, Kevin
@ 2020-05-13 12:40                                           ` Jason Gunthorpe
  0 siblings, 0 replies; 89+ messages in thread
From: Jason Gunthorpe @ 2020-05-13 12:40 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Raj, Ashok, Alex Williamson, Jiang, Dave, vkoul, megha.dey, maz,
	bhelgaas, rafael, gregkh, tglx, hpa, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, Lin, Jing, Williams,
	Dan J, kwankhede, eric.auger, parav, dmaengine, linux-kernel,
	x86, linux-pci, kvm, Paolo Bonzini

On Wed, May 13, 2020 at 08:30:15AM +0000, Tian, Kevin wrote:

> When talking about virtualization, the target here is an unmodified guest
> kernel driver which expects to see the raw controllability of queues
> as defined by the device spec. In idxd, such controllability includes
> enabling/disabling SVA, dedicated or shared WQ, size, threshold, privilege,
> fault mode, max batch size, and many other attributes. Each guest OS
> has its own policy for using all or part of the available controllability.
> 
> When talking about applications, we care about providing an efficient
> programming interface to userspace. For example with uacce, we
> allow an application to submit vaddr-based workloads to a reserved
> WQ with the kernel bypassed. But it's not necessary to export the raw
> controllability of the reserved WQ to userspace, and we still rely on
> the kernel driver to configure it, including bind_mm. I'm not sure whether
> uacce would like to evolve into a generic queue management system
> covering non-SVA and all vendor-specific raw capabilities as
> expected by all kinds of guest kernel drivers. That sounds not
> worthwhile at this point, given that we already have a highly efficient
> SVA interface for user applications.

Like I already said, you should get the people who care about this
stuff to support emulation in the kernel. I think it has not been
explained well in the past.

Most Intel info on SIOV draws a close parallel to SRIOV, and I think
people generally assume that, like SRIOV, SIOV does not include
kernel-side MMIO emulation.

> If in the future there is such a requirement to delegate raw
> WQ controllability to pure userspace applications for DMA engines,
> and there is a well-defined uAPI covering a large common set of
> controllability across multiple vendors, we will look at that option for
> sure.

All this kernel-bypass stuff is 'HW specific' by nature; you should
not expect to have general interfaces.

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

end of thread, other threads:[~2020-05-13 12:40 UTC | newest]

Thread overview: 89+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-21 23:33 [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Dave Jiang
2020-04-21 23:33 ` [PATCH RFC 01/15] drivers/base: Introduce platform_msi_ops Dave Jiang
2020-04-26  7:01   ` Greg KH
2020-04-27 21:38     ` Dave Jiang
2020-04-28  7:34       ` Greg KH
2020-04-21 23:33 ` [PATCH RFC 02/15] drivers/base: Introduce a new platform-msi list Dave Jiang
2020-04-25 21:13   ` Thomas Gleixner
2020-05-04  0:08     ` Dey, Megha
2020-04-21 23:34 ` [PATCH RFC 03/15] drivers/base: Allocate/free platform-msi interrupts by group Dave Jiang
2020-04-25 21:23   ` Thomas Gleixner
2020-05-04  0:08     ` Dey, Megha
2020-04-21 23:34 ` [PATCH RFC 04/15] drivers/base: Add support for a new IMS irq domain Dave Jiang
2020-04-23 20:11   ` Jason Gunthorpe
2020-05-01 22:30     ` Dey, Megha
2020-05-03 22:25       ` Jason Gunthorpe
2020-05-03 22:40         ` Dey, Megha
2020-05-03 22:46           ` Jason Gunthorpe
2020-05-04  0:25             ` Dey, Megha
2020-05-04 12:14               ` Jason Gunthorpe
2020-05-06 10:27                 ` Tian, Kevin
2020-04-25 21:38   ` Thomas Gleixner
2020-05-04  0:11     ` Dey, Megha
2020-04-21 23:34 ` [PATCH RFC 05/15] ims-msi: Add mask/unmask routines Dave Jiang
2020-04-25 21:49   ` Thomas Gleixner
2020-05-04  0:16     ` Dey, Megha
2020-04-21 23:34 ` [PATCH RFC 06/15] ims-msi: Enable IMS interrupts Dave Jiang
2020-04-25 22:13   ` Thomas Gleixner
2020-05-04  0:17     ` Dey, Megha
2020-04-21 23:34 ` [PATCH RFC 07/15] Documentation: Interrupt Message store Dave Jiang
2020-04-23 20:04   ` Jason Gunthorpe
2020-05-01 22:32     ` Dey, Megha
2020-05-03 22:28       ` Jason Gunthorpe
2020-05-03 22:41         ` Dey, Megha
2020-04-21 23:34 ` [PATCH RFC 08/15] vfio/mdev: Add a member for iommu domain in mdev_device Dave Jiang
2020-04-21 23:34 ` [PATCH RFC 09/15] vfio/type1: Save domain when attach domain to mdev Dave Jiang
2020-04-21 23:34 ` [PATCH RFC 10/15] dmaengine: idxd: add config support for readonly devices Dave Jiang
2020-04-21 23:34 ` [PATCH RFC 11/15] dmaengine: idxd: add IMS support in base driver Dave Jiang
2020-04-21 23:35 ` [PATCH RFC 12/15] dmaengine: idxd: add device support functions in prep for mdev Dave Jiang
2020-04-21 23:35 ` [PATCH RFC 13/15] dmaengine: idxd: add support for VFIO mediated device Dave Jiang
2020-04-21 23:35 ` [PATCH RFC 14/15] dmaengine: idxd: add error notification from host driver to " Dave Jiang
2020-04-21 23:35 ` [PATCH RFC 15/15] dmaengine: idxd: add ABI documentation for mediated device support Dave Jiang
2020-04-21 23:54 ` [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver Jason Gunthorpe
2020-04-22  0:53   ` Tian, Kevin
2020-04-22 11:50     ` Jason Gunthorpe
2020-04-22 21:14       ` Raj, Ashok
2020-04-23 19:12         ` Jason Gunthorpe
2020-04-24  3:27           ` Tian, Kevin
2020-04-24 12:44             ` Jason Gunthorpe
2020-04-24 16:25               ` Tian, Kevin
2020-04-24 18:12                 ` Jason Gunthorpe
2020-04-26  5:18                   ` Tian, Kevin
2020-04-26 19:13                     ` Jason Gunthorpe
2020-04-27  3:43                       ` Alex Williamson
2020-04-27 11:58                         ` Jason Gunthorpe
2020-04-27 13:19                           ` Alex Williamson
2020-04-27 13:22                             ` Jason Gunthorpe
2020-04-27 14:18                               ` Alex Williamson
2020-04-27 14:25                                 ` Jason Gunthorpe
2020-04-27 15:41                                   ` Alex Williamson
2020-04-27 16:16                                     ` Jason Gunthorpe
2020-04-27 16:25                                       ` Dave Jiang
2020-04-27 21:56                                         ` Jason Gunthorpe
2020-04-29  9:42                               ` Tian, Kevin
2020-05-08 20:47                                 ` Raj, Ashok
2020-05-08 23:16                                   ` Jason Gunthorpe
2020-05-08 23:52                                     ` Dave Jiang
2020-05-09  0:09                                     ` Raj, Ashok
2020-05-09 12:21                                       ` Jason Gunthorpe
2020-05-13  2:29                                         ` Jason Wang
2020-05-13  8:30                                         ` Tian, Kevin
2020-05-13 12:40                                           ` Jason Gunthorpe
2020-04-27 12:13                       ` Tian, Kevin
2020-04-27 12:55                         ` Jason Gunthorpe
2020-04-22 21:24   ` Dan Williams
2020-04-23 19:17     ` Dan Williams
2020-04-23 19:49       ` Jason Gunthorpe
2020-05-01 22:31         ` Dey, Megha
2020-05-03 22:21           ` Jason Gunthorpe
2020-05-03 22:32             ` Dey, Megha
2020-04-23 19:18     ` Jason Gunthorpe
2020-05-01 22:31       ` Dey, Megha
2020-05-03 22:22         ` Jason Gunthorpe
2020-05-03 22:31           ` Dey, Megha
2020-05-03 22:36             ` Jason Gunthorpe
2020-05-04  0:20               ` Dey, Megha
2020-04-22 23:04   ` Dey, Megha
2020-04-23 19:44     ` Jason Gunthorpe
2020-05-01 22:32       ` Dey, Megha
2020-04-24  6:31   ` Jason Wang
