linux-pci.vger.kernel.org archive mirror
* [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
@ 2020-10-30 18:50 Dave Jiang
  2020-10-30 18:50 ` [PATCH v4 01/17] irqchip: Add IMS (Interrupt Message Store) driver Dave Jiang
                   ` (18 more replies)
  0 siblings, 19 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:50 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: Megha Dey, dmaengine, linux-kernel, linux-pci, kvm

- Would like to acquire Reviewed-by tags from Thomas for MSI and IMS related bits.
- Would like to acquire Reviewed-by tags from Alex and/or Kirti for the VFIO mdev driver bits.
- Would like to acquire a Reviewed-by tag from Bjorn for PCI common bits.
- Would like to acquire 5.11 kernel acceptance through dmaengine (Vinod) with the review tags.

v4:
dev-msi:
- Make interrupt remapping code more readable (Thomas)
- Add flush writes to unmask/write and reset ims slots (Thomas)
- Interrupt Message Storm-> Interrupt Message Store (Thomas)
- Merge in pasid programming code. (Thomas)

mdev:
- Fixed up domain assignment (Thomas)
- Define magic numbers (Thomas)
- Move siov detection code to PCI common (Thomas)
- Remove duplicated MSI entry info (Thomas)
- Convert code to use ims_slot (Thomas)
- Add explanation of pasid programming for IMS entry (Thomas)
- Add interrupt handle release support due to DSA spec 1.1 update.

v3:
Dev-msi:
- No need to add support for 2 different dev-msi irq domains; a common
  one can be used for both cases (with IR enabled/disabled)
- Add arch specific function to specify additions to msi_prepare callback
  instead of making the callback a weak function
- Call platform ops directly instead of a wrapper function
- Make mask/unmask callbacks void functions
- dev->msi_domain should be updated at the device driver level before
  calling dev_msi_alloc_irqs()
- dev_msi_alloc/free_irqs() cannot be used for PCI devices
- Follow the generic layering scheme: infrastructure bits->arch bits->enabling bits

Mdev:
- Remove set kvm group notifier (Yan Zhao)
- Fix VFIO irq trigger removal (Yan Zhao)
- Add mmio read flush to ims mask (Jason)

v2:
IMS (now dev-msi):
- With recommendations from Jason/Thomas/Dan on making IMS more generic:
- Pass a non-pci generic device(struct device) for IMS management instead of mdev
- Remove all references to mdev and symbol_get/put
- Remove all references to IMS in common code and replace with dev-msi
- Remove dynamic allocation of platform-msi interrupts: no groups,no
  new msi list or list helpers
- Create a generic dev-msi domain with and without interrupt remapping enabled.
- Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis

mdev: 
- Removed unrelated bits from SVA enabling that are not necessary for
  the submission. (Kevin)
- Restructured entire mdev driver series to make reviewing easier (Kevin)
- Made rw emulation more robust (Kevin)
- Removed uuid wq type and added single dedicated wq type (Kevin)
- Locking fixes for vdev (Yan Zhao)
- VFIO MSIX trigger fixes (Yan Zhao)

This series matches the functionality of the 5.6 kernel (stage 1) driver, but in the guest.

The code has dependency on Thomas’s MSI restructuring patch series:
https://lore.kernel.org/lkml/20200826111628.794979401@linutronix.de/

The code has dependency on Baolu’s mdev domain patches:
https://lore.kernel.org/lkml/20201030045809.957927-1-baolu.lu@linux.intel.com/

The code has dependency on David Box’s dvsec definition patch:
https://lore.kernel.org/linux-pci/bc5f059c5bae957daebde699945c80808286bf45.camel@linux.intel.com/T/#m1d0dc12e3b2c739e2c37106a45f325bb8f001774

Stage 1 of the driver has been accepted in v5.6 kernel. It supports dedicated workqueue (wq)
without Shared Virtual Memory (SVM) support. 

Stage 2 of the driver supports shared wq and SVM. It should be pending for 5.11 and in
dmaengine/next.

VFIO mediated device framework allows vendor drivers to wrap a portion of
device resources into virtual devices (mdev). Each mdev can be assigned
to different guest using the same set of VFIO uAPIs as assigning a
physical device. Access to the mdev resource is served with mixed
policies. For example, vendor drivers typically mark data-path interface
as pass-through for fast guest operations, and then trap-and-mediate the
control-path interface to avoid undesired interference between mdevs. Some
level of emulation is necessary behind vfio mdev to compose the virtual
device interface.

This series brings mdev to idxd driver to enable Intel Scalable IOV
(SIOV), a hardware-assisted mediated pass-through technology. SIOV makes
each DSA wq independently assignable through PASID-granular resource/DMA
isolation. It helps improve scalability and reduces mediation complexity
against purely software-based mdev implementations. Each assigned wq is
configured by host and exposed to the guest in a read-only configuration
mode, which allows the guest to use the wq w/o additional setup. This
design greatly reduces the emulation bits to focus on handling commands
from guests.

There are two possible avenues to support virtual device composition:
1. VFIO mediated device (mdev) or 2. User space DMA through char device
(or UACCE). Given the small portion of emulation to satisfy our needs
and VFIO mdev having the infrastructure already to support the device
passthrough, we feel that VFIO mdev is the better route. For a more in-depth
explanation, see documentation in Documentation/driver-api/vfio/mdev-idxd.rst.

This series introduces the “1dwq-v1” mdev type. This mdev type allows
allocation of a single dedicated wq from available dedicated wqs. After
a workqueue (wq) is enabled, the user will generate a uuid. On mdev
creation, the mdev driver code will find a dwq depending on the mdev
type. When the create operation is successful, the user generated uuid
can be passed to qemu. When the guest boots up, it should discover a
DSA device when doing PCI discovery.

Example for the “1dwq-v1” type:
1. Enable wq with “mdev” wq type
2. Generate a uuid.
3. The uuid is written to the mdev class sysfs path:
echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-1dwq-v1/create
4. Pass the following parameter to qemu:
"-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"
 
The wq exported through mdev will have the read only config bit set
for configuration. This means that the device does not require the
typical configuration. After enabling the device, the user must set the
WQ type and name. That is all that is necessary to enable the WQ and start
using it. The single wq configuration is not the only way to create the
mdev. Multi-wq support for mdev is left for future work.
 
The mdev utilizes Interrupt Message Store or IMS[3], a device-specific
MSI implementation, instead of MSIX for interrupts for the guest. This
preserves MSIX for host usages and also allows a significantly larger
number of interrupt vectors for guest usage.

The idxd driver implements IMS as on-device memory mapped unified
storage. Each interrupt message is stored as a DWORD size data payload
and a 64-bit address (same as MSI-X). Access to the IMS is through the
host idxd driver.
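
As a rough illustration of the storage format described above, the
following userspace sketch models one IMS entry as four 32-bit words,
matching the field order of struct ims_slot in patch 01. The names
ims_slot_model and ims_slot_store() are hypothetical, and plain memory
stands in for device MMIO; the driver itself writes the slot with
iowrite32() and flushes the write with a read back.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of one IMS entry: 64-bit address, 32-bit data, 32-bit control. */
struct ims_slot_model {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
	uint32_t ctrl;
};

/* Hypothetical helper: store an interrupt message in a slot. The control word is left untouched. */
static void ims_slot_store(struct ims_slot_model *slot, uint64_t addr, uint32_t data)
{
	assert(slot);
	slot->address_lo = (uint32_t)(addr & 0xffffffffu);
	slot->address_hi = (uint32_t)(addr >> 32);
	slot->data = data;
}
```

Each entry in this model occupies 16 bytes, so slot N of an array starts
at byte offset 16 * N from the array base.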

The idxd driver makes use of the generic IMS irq chip and domain which
stores the interrupt messages as an array in device memory. Allocation and
freeing of interrupts happens via the generic msi_domain_alloc/free_irqs()
interface. One only needs to ensure the interrupt domain is stored in
the underlying device struct.
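
The slot bookkeeping behind that interface can be sketched in userspace
as follows. Patch 01 tracks slots with a bitmap, using
find_first_zero_bit(), set_bit() and clear_bit(), and returns -ENOSPC
when the array is exhausted. The version below is only a simplified
model under stated assumptions: the names ims_slot_map, ims_map_alloc()
and ims_map_free() are hypothetical, and the map is limited to 64 slots
so that it fits in a single word.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define IMS_MODEL_MAX_SLOTS 64u

/* One bit per slot; a set bit means the slot is in use. */
struct ims_slot_map {
	uint64_t bits;
	unsigned int max_slots;	/* must not exceed IMS_MODEL_MAX_SLOTS */
};

/* Claim the lowest free slot; returns its index, or -ENOSPC if all are taken. */
static int ims_map_alloc(struct ims_slot_map *map)
{
	unsigned int idx;

	assert(map && map->max_slots <= IMS_MODEL_MAX_SLOTS);
	for (idx = 0; idx < map->max_slots; idx++) {
		if (!(map->bits & (UINT64_C(1) << idx))) {
			map->bits |= UINT64_C(1) << idx;
			return (int)idx;
		}
	}
	return -ENOSPC;
}

/* Release a slot so that a later allocation can reuse it. */
static void ims_map_free(struct ims_slot_map *map, unsigned int idx)
{
	assert(map && map->max_slots <= IMS_MODEL_MAX_SLOTS);
	if (idx < map->max_slots)
		map->bits &= ~(UINT64_C(1) << idx);
}
```

In the real driver the returned index also becomes the hwirq number of
the MSI descriptor, and the slot is reset (zeroed and masked) on both
allocation and free.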

[1]: https://lore.kernel.org/lkml/157965011794.73301.15960052071729101309.stgit@djiang5-desk3.ch.intel.com/
[2]: https://software.intel.com/en-us/articles/intel-sdm
[3]: https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
[4]: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
[5]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
[6]: https://intel.github.io/idxd/
[7]: https://github.com/intel/idxd-driver idxd-stage2.5

---

Dave Jiang (15):
      dmaengine: idxd: add theory of operation documentation for idxd mdev
      dmaengine: idxd: add support for readonly config devices
      dmaengine: idxd: add interrupt handle request support
      PCI: add SIOV and IMS capability detection
      dmaengine: idxd: add IMS support in base driver
      dmaengine: idxd: add device support functions in prep for mdev
      dmaengine: idxd: add basic mdev registration and helper functions
      dmaengine: idxd: add emulation rw routines
      dmaengine: idxd: prep for virtual device commands
      dmaengine: idxd: virtual device commands emulation
      dmaengine: idxd: ims setup for the vdcm
      dmaengine: idxd: add mdev type as a new wq type
      dmaengine: idxd: add dedicated wq mdev type
      dmaengine: idxd: add new wq state for mdev
      dmaengine: idxd: add error notification from host driver to mediated device

Megha Dey (1):
      iommu/vt-d: Add DEV-MSI support

Thomas Gleixner (1):
      irqchip: Add IMS (Interrupt Message Store) driver


 .../ABI/stable/sysfs-driver-dma-idxd          |    6 +
 Documentation/driver-api/vfio/mdev-idxd.rst   |  404 ++++++
 MAINTAINERS                                   |    1 +
 drivers/dma/Kconfig                           |    9 +
 drivers/dma/idxd/Makefile                     |    2 +
 drivers/dma/idxd/cdev.c                       |    6 +-
 drivers/dma/idxd/device.c                     |  294 ++++-
 drivers/dma/idxd/idxd.h                       |   67 +-
 drivers/dma/idxd/init.c                       |   86 ++
 drivers/dma/idxd/irq.c                        |    6 +-
 drivers/dma/idxd/mdev.c                       | 1121 +++++++++++++++++
 drivers/dma/idxd/mdev.h                       |  116 ++
 drivers/dma/idxd/registers.h                  |   38 +-
 drivers/dma/idxd/submit.c                     |   37 +-
 drivers/dma/idxd/sysfs.c                      |   52 +-
 drivers/dma/idxd/vdev.c                       |  976 ++++++++++++++
 drivers/dma/idxd/vdev.h                       |   28 +
 drivers/iommu/intel/iommu.c                   |   31 +-
 drivers/iommu/intel/irq_remapping.c           |   34 +-
 drivers/pci/Kconfig                           |   15 +
 drivers/pci/Makefile                          |    2 +
 drivers/pci/dvsec.c                           |   40 +
 drivers/pci/siov.c                            |   50 +
 include/linux/pci-siov.h                      |   18 +
 include/linux/pci.h                           |    3 +
 include/uapi/linux/idxd.h                     |    2 +
 include/uapi/linux/pci_regs.h                 |    4 +
 kernel/irq/msi.c                              |    2 +
 28 files changed, 3352 insertions(+), 98 deletions(-)
 create mode 100644 Documentation/driver-api/vfio/mdev-idxd.rst
 create mode 100644 drivers/dma/idxd/mdev.c
 create mode 100644 drivers/dma/idxd/mdev.h
 create mode 100644 drivers/dma/idxd/vdev.c
 create mode 100644 drivers/dma/idxd/vdev.h
 create mode 100644 drivers/pci/dvsec.c
 create mode 100644 drivers/pci/siov.c
 create mode 100644 include/linux/pci-siov.h

--



* [PATCH v4 01/17] irqchip: Add IMS (Interrupt Message Store) driver
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
@ 2020-10-30 18:50 ` Dave Jiang
  2020-10-30 22:01   ` Thomas Gleixner
  2020-10-30 18:51 ` [PATCH v4 02/17] iommu/vt-d: Add DEV-MSI support Dave Jiang
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:50 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

From: Thomas Gleixner <tglx@linutronix.de>

Generic IMS (Interrupt Message Store) irq chips and irq domain
implementations for IMS based devices which store the interrupt
messages in an array in device memory.

Allocation and freeing of interrupts happens via the generic
msi_domain_alloc/free_irqs() interface. No special purpose IMS magic
required as long as the interrupt domain is stored in the underlying
device struct.

Provide storage and a setter for an Address Space Identifier. The
identifier is stored in the top level irq_data and it can only be
modified when the interrupt is not active. Add the necessary storage
and helper functions and validate that interrupts which require an
ASID have one assigned.
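
In the diff below the identifier is programmed through the auxiliary
bits of the IMS control word, see ims_ctrl_pasid_aux() and
ims_array_set_auxdata(). The following userspace sketch mirrors that
logic on a plain variable instead of device memory. The bit positions
are taken from irq-ims-msi.h in this patch; the names prefixed with
CTRL_ and ctrl_ are hypothetical stand-ins for the kernel definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define CTRL_IRQMASK		0x00000007u	/* bits 2:0, interrupt control */
#define CTRL_AUXMASK		0xfffffff8u	/* bits 31:3, auxiliary data */
#define CTRL_PASID_ENABLE	(1u << 3)
#define CTRL_PASID_SHIFT	12

/* Build the auxiliary value for a PASID, mirroring ims_ctrl_pasid_aux(). */
static uint32_t ctrl_pasid_aux(uint32_t pasid, int enable)
{
	uint32_t auxval = pasid << CTRL_PASID_SHIFT;

	return enable ? (auxval | CTRL_PASID_ENABLE) : auxval;
}

/* Replace the auxiliary bits of a control word, keeping bits 2:0 intact. */
static int ctrl_set_aux(uint32_t *ctrl, uint64_t auxval)
{
	assert(ctrl);
	/* Reject values that would touch the interrupt control bits. */
	if (auxval & ~(uint64_t)CTRL_AUXMASK)
		return -EINVAL;
	*ctrl = (*ctrl & CTRL_IRQMASK) | (uint32_t)auxval;
	return 0;
}
```

For example, enabling PASID 5 on a masked slot (control word 0x1) gives
an auxiliary value of 0x5008 and a resulting control word of 0x5009; the
mask bit is preserved.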

[Megha : Fixed compile time errors
         Added necessary dependencies to IMS_MSI_ARRAY config
         Fixed polarity of IMS_VECTOR_CTRL
         Added reads after writes to flush writes to device
         Tested the IMS infrastructure with the IDXD driver]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Megha Dey <megha.dey@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/irqchip/Kconfig             |   14 ++
 drivers/irqchip/Makefile            |    1 
 drivers/irqchip/irq-ims-msi.c       |  204 +++++++++++++++++++++++++++++++++++
 include/linux/interrupt.h           |    2 
 include/linux/irq.h                 |    4 +
 include/linux/irqchip/irq-ims-msi.h |   68 ++++++++++++
 kernel/irq/manage.c                 |   32 +++++
 7 files changed, 325 insertions(+)
 create mode 100644 drivers/irqchip/irq-ims-msi.c
 create mode 100644 include/linux/irqchip/irq-ims-msi.h

diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
index c6098eee0c7c..862ea81a69a0 100644
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -597,4 +597,18 @@ config MST_IRQ
 	help
 	  Support MStar Interrupt Controller.
 
+config IMS_MSI
+	depends on PCI
+	select DEVICE_MSI
+	bool
+
+config IMS_MSI_ARRAY
+	bool "IMS Interrupt Message Store MSI controller for device memory storage arrays"
+	depends on PCI
+	select IMS_MSI
+	select GENERIC_MSI_IRQ_DOMAIN
+	help
+	  Support for IMS Interrupt Message Store MSI controller
+	  with IMS slot storage in a slot array in device memory
+
 endmenu
diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile
index 94c2885882ee..a7d54605060a 100644
--- a/drivers/irqchip/Makefile
+++ b/drivers/irqchip/Makefile
@@ -114,3 +114,4 @@ obj-$(CONFIG_LOONGSON_PCH_PIC)		+= irq-loongson-pch-pic.o
 obj-$(CONFIG_LOONGSON_PCH_MSI)		+= irq-loongson-pch-msi.o
 obj-$(CONFIG_MST_IRQ)			+= irq-mst-intc.o
 obj-$(CONFIG_SL28CPLD_INTC)		+= irq-sl28cpld.o
+obj-$(CONFIG_IMS_MSI)			+= irq-ims-msi.o
diff --git a/drivers/irqchip/irq-ims-msi.c b/drivers/irqchip/irq-ims-msi.c
new file mode 100644
index 000000000000..d54a54f5fdcc
--- /dev/null
+++ b/drivers/irqchip/irq-ims-msi.c
@@ -0,0 +1,204 @@
+// SPDX-License-Identifier: GPL-2.0
+// (C) Copyright 2020 Thomas Gleixner <tglx@linutronix.de>
+/*
+ * Shared interrupt chips and irq domains for IMS devices
+ */
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/msi.h>
+#include <linux/irq.h>
+#include <linux/irqdomain.h>
+
+#include <linux/irqchip/irq-ims-msi.h>
+
+#ifdef CONFIG_IMS_MSI_ARRAY
+
+struct ims_array_data {
+	struct ims_array_info	info;
+	unsigned long		map[0];
+};
+
+static inline void iowrite32_and_flush(u32 value, void __iomem *addr)
+{
+	iowrite32(value, addr);
+	ioread32(addr);
+}
+
+static void ims_array_mask_irq(struct irq_data *data)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct ims_slot __iomem *slot = desc->device_msi.priv_iomem;
+	u32 __iomem *ctrl = &slot->ctrl;
+
+	iowrite32_and_flush(ioread32(ctrl) | IMS_CTRL_VECTOR_MASKBIT, ctrl);
+}
+
+static void ims_array_unmask_irq(struct irq_data *data)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct ims_slot __iomem *slot = desc->device_msi.priv_iomem;
+	u32 __iomem *ctrl = &slot->ctrl;
+
+	iowrite32_and_flush(ioread32(ctrl) & ~IMS_CTRL_VECTOR_MASKBIT, ctrl);
+}
+
+static void ims_array_write_msi_msg(struct irq_data *data, struct msi_msg *msg)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct ims_slot __iomem *slot = desc->device_msi.priv_iomem;
+
+	iowrite32(msg->address_lo, &slot->address_lo);
+	iowrite32(msg->address_hi, &slot->address_hi);
+	iowrite32_and_flush(msg->data, &slot->data);
+}
+
+static int ims_array_set_auxdata(struct irq_data *data, unsigned int which,
+				 u64 auxval)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct ims_slot __iomem *slot = desc->device_msi.priv_iomem;
+	u32 val, __iomem *ctrl = &slot->ctrl;
+
+	if (which != IMS_AUXDATA_CONTROL_WORD)
+		return -EINVAL;
+	if (auxval & ~(u64)IMS_CONTROL_WORD_AUXMASK)
+		return -EINVAL;
+
+	val = ioread32(ctrl) & IMS_CONTROL_WORD_IRQMASK;
+	iowrite32_and_flush(val | (u32) auxval, ctrl);
+	return 0;
+}
+
+static const struct irq_chip ims_array_msi_controller = {
+	.name			= "IMS",
+	.irq_mask		= ims_array_mask_irq,
+	.irq_unmask		= ims_array_unmask_irq,
+	.irq_write_msi_msg	= ims_array_write_msi_msg,
+	.irq_set_auxdata	= ims_array_set_auxdata,
+	.irq_retrigger		= irq_chip_retrigger_hierarchy,
+	.flags			= IRQCHIP_SKIP_SET_WAKE,
+};
+
+static void ims_array_reset_slot(struct ims_slot __iomem *slot)
+{
+	iowrite32(0, &slot->address_lo);
+	iowrite32(0, &slot->address_hi);
+	iowrite32(0, &slot->data);
+	iowrite32_and_flush(IMS_CTRL_VECTOR_MASKBIT, &slot->ctrl);
+}
+
+static void ims_array_free_msi_store(struct irq_domain *domain,
+				     struct device *dev)
+{
+	struct msi_domain_info *info = domain->host_data;
+	struct ims_array_data *ims = info->data;
+	struct msi_desc *entry;
+
+	for_each_msi_entry(entry, dev) {
+		if (entry->device_msi.priv_iomem) {
+			clear_bit(entry->device_msi.hwirq, ims->map);
+			ims_array_reset_slot(entry->device_msi.priv_iomem);
+			entry->device_msi.priv_iomem = NULL;
+			entry->device_msi.hwirq = 0;
+		}
+	}
+}
+
+static int ims_array_alloc_msi_store(struct irq_domain *domain,
+				     struct device *dev, int nvec)
+{
+	struct msi_domain_info *info = domain->host_data;
+	struct ims_array_data *ims = info->data;
+	struct msi_desc *entry;
+
+	for_each_msi_entry(entry, dev) {
+		unsigned int idx;
+
+		idx = find_first_zero_bit(ims->map, ims->info.max_slots);
+		if (idx >= ims->info.max_slots)
+			goto fail;
+		set_bit(idx, ims->map);
+		entry->device_msi.priv_iomem = &ims->info.slots[idx];
+		ims_array_reset_slot(entry->device_msi.priv_iomem);
+		entry->device_msi.hwirq = idx;
+	}
+	return 0;
+
+fail:
+	ims_array_free_msi_store(domain, dev);
+	return -ENOSPC;
+}
+
+struct ims_array_domain_template {
+	struct msi_domain_ops	ops;
+	struct msi_domain_info	info;
+};
+
+static const struct ims_array_domain_template ims_array_domain_template = {
+	.ops = {
+		.msi_alloc_store	= ims_array_alloc_msi_store,
+		.msi_free_store		= ims_array_free_msi_store,
+	},
+	.info = {
+		.flags		= MSI_FLAG_USE_DEF_DOM_OPS |
+				  MSI_FLAG_USE_DEF_CHIP_OPS,
+		.handler	= handle_edge_irq,
+		.handler_name	= "edge",
+	},
+};
+
+struct irq_domain *
+pci_ims_array_create_msi_irq_domain(struct pci_dev *pdev,
+				    struct ims_array_info *ims_info)
+{
+	struct ims_array_domain_template *info;
+	struct ims_array_data *data;
+	struct irq_domain *domain;
+	struct irq_chip *chip;
+	unsigned int size;
+
+	/* Allocate new domain storage */
+	info = kmemdup(&ims_array_domain_template,
+		       sizeof(ims_array_domain_template), GFP_KERNEL);
+	if (!info)
+		return NULL;
+	/* Link the ops */
+	info->info.ops = &info->ops;
+
+	/* Allocate ims_info along with the bitmap */
+	size = sizeof(*data);
+	size += BITS_TO_LONGS(ims_info->max_slots) * sizeof(unsigned long);
+	data = kzalloc(size, GFP_KERNEL);
+	if (!data)
+		goto err_info;
+
+	data->info = *ims_info;
+	info->info.data = data;
+
+	/*
+	 * Allocate an interrupt chip because the core needs to be able to
+	 * update it with default callbacks.
+	 */
+	chip = kmemdup(&ims_array_msi_controller,
+		       sizeof(ims_array_msi_controller), GFP_KERNEL);
+	if (!chip)
+		goto err_data;
+	info->info.chip = chip;
+
+	domain = pci_subdevice_msi_create_irq_domain(pdev, &info->info);
+	if (!domain)
+		goto err_chip;
+
+	return domain;
+
+err_chip:
+	kfree(chip);
+err_data:
+	kfree(data);
+err_info:
+	kfree(info);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(pci_ims_array_create_msi_irq_domain);
+
+#endif /* CONFIG_IMS_MSI_ARRAY */
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index ee8299eb1f52..43a8d1e9647e 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -487,6 +487,8 @@ extern int irq_get_irqchip_state(unsigned int irq, enum irqchip_irq_state which,
 extern int irq_set_irqchip_state(unsigned int irq, enum irqchip_irq_state which,
 				 bool state);
 
+int irq_set_auxdata(unsigned int irq, unsigned int which, u64 val);
+
 #ifdef CONFIG_IRQ_FORCED_THREADING
 # ifdef CONFIG_PREEMPT_RT
 #  define force_irqthreads	(true)
diff --git a/include/linux/irq.h b/include/linux/irq.h
index c54365309e97..fd162aea0c3f 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -491,6 +491,8 @@ static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
  *				irq_request_resources
  * @irq_compose_msi_msg:	optional to compose message content for MSI
  * @irq_write_msi_msg:	optional to write message content for MSI
+ * @irq_set_auxdata:	Optional function to update auxiliary data e.g. in
+ *			shared registers
  * @irq_get_irqchip_state:	return the internal state of an interrupt
  * @irq_set_irqchip_state:	set the internal state of a interrupt
  * @irq_set_vcpu_affinity:	optional to target a vCPU in a virtual machine
@@ -538,6 +540,8 @@ struct irq_chip {
 	void		(*irq_compose_msi_msg)(struct irq_data *data, struct msi_msg *msg);
 	void		(*irq_write_msi_msg)(struct irq_data *data, struct msi_msg *msg);
 
+	int		(*irq_set_auxdata)(struct irq_data *data, unsigned int which, u64 auxval);
+
 	int		(*irq_get_irqchip_state)(struct irq_data *data, enum irqchip_irq_state which, bool *state);
 	int		(*irq_set_irqchip_state)(struct irq_data *data, enum irqchip_irq_state which, bool state);
 
diff --git a/include/linux/irqchip/irq-ims-msi.h b/include/linux/irqchip/irq-ims-msi.h
new file mode 100644
index 000000000000..a9e43e1f7890
--- /dev/null
+++ b/include/linux/irqchip/irq-ims-msi.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* (C) Copyright 2020 Thomas Gleixner <tglx@linutronix.de> */
+
+#ifndef _LINUX_IRQCHIP_IRQ_IMS_MSI_H
+#define _LINUX_IRQCHIP_IRQ_IMS_MSI_H
+
+#include <linux/types.h>
+#include <linux/bits.h>
+
+/**
+ * struct ims_slot - The hardware layout of an IMS based MSI message
+ * @address_lo:	Lower 32bit address
+ * @address_hi:	Upper 32bit address
+ * @data:	Message data
+ * @ctrl:	Control word
+ *
+ * This structure is used by both the device memory array and the queue
+ * memory variants of IMS.
+ */
+struct ims_slot {
+	u32	address_lo;
+	u32	address_hi;
+	u32	data;
+	u32	ctrl;
+} __packed;
+
+/*
+ * The IMS control word utilizes bit 0-2 for interrupt control. The remaining
+ * bits can contain auxiliary data.
+ */
+#define IMS_CONTROL_WORD_IRQMASK	GENMASK(2, 0)
+#define IMS_CONTROL_WORD_AUXMASK	GENMASK(31, 3)
+
+/* Bit to mask the interrupt in ims_slot::ctrl */
+#define IMS_CTRL_VECTOR_MASKBIT		BIT(0)
+
+/* Auxiliary control word data related defines */
+enum {
+	IMS_AUXDATA_CONTROL_WORD,
+};
+
+#define IMS_CTRL_PASID_ENABLE		BIT(3)
+#define IMS_CTRL_PASID_SHIFT		12
+
+static inline u32 ims_ctrl_pasid_aux(unsigned int pasid, bool enable)
+{
+	u32 auxval = pasid << IMS_CTRL_PASID_SHIFT;
+
+	return enable ? auxval | IMS_CTRL_PASID_ENABLE : auxval;
+}
+
+/**
+ * struct ims_array_info - Information to create an IMS array domain
+ * @slots:	Pointer to the start of the array
+ * @max_slots:	Maximum number of slots in the array
+ */
+struct ims_array_info {
+	struct ims_slot		__iomem *slots;
+	unsigned int		max_slots;
+};
+
+struct pci_dev;
+struct irq_domain;
+
+struct irq_domain *pci_ims_array_create_msi_irq_domain(struct pci_dev *pdev,
+						       struct ims_array_info *ims_info);
+
+#endif
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dc65d90108db..d7bf2ae67170 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -2752,3 +2752,35 @@ int irq_set_irqchip_state(unsigned int irq, enum irqchip_irq_state which,
 	return err;
 }
 EXPORT_SYMBOL_GPL(irq_set_irqchip_state);
+
+/**
+ * irq_set_auxdata - Set auxiliary data
+ * @irq:	Interrupt to update
+ * @which:	Selector which data to update
+ * @val:	Auxiliary data value
+ *
+ * Function to update auxiliary data for an interrupt, e.g. to update data
+ * which is stored in a shared register or data storage (e.g. IMS).
+ */
+int irq_set_auxdata(unsigned int irq, unsigned int which, u64 val)
+{
+	struct irq_desc *desc;
+	struct irq_data *data;
+	unsigned long flags;
+	int res = -ENODEV;
+
+	desc = irq_get_desc_buslock(irq, &flags, 0);
+	if (!desc)
+		return -EINVAL;
+
+	for (data = &desc->irq_data; data; data = irqd_get_parent_data(data)) {
+		if (data->chip->irq_set_auxdata) {
+			res = data->chip->irq_set_auxdata(data, which, val);
+			break;
+		}
+	}
+
+	irq_put_desc_busunlock(desc, flags);
+	return res;
+}
+EXPORT_SYMBOL_GPL(irq_set_auxdata);




* [PATCH v4 02/17] iommu/vt-d: Add DEV-MSI support
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
  2020-10-30 18:50 ` [PATCH v4 01/17] irqchip: Add IMS (Interrupt Message Store) driver Dave Jiang
@ 2020-10-30 18:51 ` Dave Jiang
  2020-10-30 20:31   ` Thomas Gleixner
  2020-10-30 18:51 ` [PATCH v4 03/17] dmaengine: idxd: add theory of operation documentation for idxd mdev Dave Jiang
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:51 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

From: Megha Dey <megha.dey@intel.com>

Add required support in the interrupt remapping driver for devices
which generate dev-msi interrupts and use the intel remapping
domain as the parent domain.
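
For reference, the message that irte_prepare_msg() composes in the diff
below is an MSI message in remappable format: the address carries the
index of the interrupt remapping table entry and the data carries the
subhandle. The userspace sketch below shows the bit packing. The
constant values are assumptions based on the x86 MSI_ADDR_* definitions
as I understand them (base 0xfee00000, format bit 4, SHV bit 3, index
bit 15 in address bit 2, index bits 14:0 in address bits 19:5) and
should be checked against the kernel headers; the names msi_msg_model
and ir_prepare_msg() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct msi_msg_model {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/* Assumed values, modeled on the x86 MSI_ADDR_* macros. */
#define IR_ADDR_BASE_LO	0xfee00000u
#define IR_ADDR_FORMAT	(1u << 4)	/* remappable format */
#define IR_ADDR_SHV	(1u << 3)	/* subhandle valid */

/* Compose a remappable-format message for a given IRTE index and subhandle. */
static void ir_prepare_msg(struct msi_msg_model *msg, uint32_t index, uint32_t subhandle)
{
	assert(msg);
	msg->address_hi = 0;
	msg->data = subhandle;
	msg->address_lo = IR_ADDR_BASE_LO | IR_ADDR_FORMAT | IR_ADDR_SHV |
			  ((index & 0x8000u) >> 13) |	/* index bit 15 -> address bit 2 */
			  ((index & 0x7fffu) << 5);	/* index bits 14:0 -> address bits 19:5 */
}
```

Because the device never sees a destination CPU or vector in this
format, the same message layout works for HPET, PCI MSI/MSI-X and
dev-msi; only the source-id setup in the IRTE differs between the cases.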

Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Megha Dey <megha.dey@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/iommu/intel/irq_remapping.c |   34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 0cfce1d3b7bb..0e8d106d34c0 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1260,6 +1260,16 @@ static struct irq_chip intel_ir_chip = {
 	.irq_set_vcpu_affinity	= intel_ir_set_vcpu_affinity,
 };
 
+static void irte_prepare_msg(struct msi_msg *msg, int index, int subhandle)
+{
+	msg->address_hi = MSI_ADDR_BASE_HI;
+	msg->data = subhandle;
+	msg->address_lo = MSI_ADDR_BASE_LO | MSI_ADDR_IR_EXT_INT |
+			  MSI_ADDR_IR_SHV |
+			  MSI_ADDR_IR_INDEX1(index) |
+			  MSI_ADDR_IR_INDEX2(index);
+}
+
 static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
 					     struct irq_cfg *irq_cfg,
 					     struct irq_alloc_info *info,
@@ -1301,19 +1311,18 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
 		break;
 
 	case X86_IRQ_ALLOC_TYPE_HPET:
+		set_hpet_sid(irte, info->devid);
+		irte_prepare_msg(msg, index, sub_handle);
+		break;
+
 	case X86_IRQ_ALLOC_TYPE_PCI_MSI:
 	case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
-		if (info->type == X86_IRQ_ALLOC_TYPE_HPET)
-			set_hpet_sid(irte, info->devid);
-		else
-			set_msi_sid(irte, msi_desc_to_pci_dev(info->desc));
-
-		msg->address_hi = MSI_ADDR_BASE_HI;
-		msg->data = sub_handle;
-		msg->address_lo = MSI_ADDR_BASE_LO | MSI_ADDR_IR_EXT_INT |
-				  MSI_ADDR_IR_SHV |
-				  MSI_ADDR_IR_INDEX1(index) |
-				  MSI_ADDR_IR_INDEX2(index);
+		set_msi_sid(irte, msi_desc_to_pci_dev(info->desc));
+		irte_prepare_msg(msg, index, sub_handle);
+		break;
+
+	case X86_IRQ_ALLOC_TYPE_DEV_MSI:
+		irte_prepare_msg(msg, index, sub_handle);
 		break;
 
 	default:
@@ -1358,7 +1367,8 @@ static int intel_irq_remapping_alloc(struct irq_domain *domain,
 	if (!info || !iommu)
 		return -EINVAL;
 	if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI &&
-	    info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX)
+	    info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX &&
+	    info->type != X86_IRQ_ALLOC_TYPE_DEV_MSI)
 		return -EINVAL;
 
 	/*




* [PATCH v4 03/17] dmaengine: idxd: add theory of operation documentation for idxd mdev
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
  2020-10-30 18:50 ` [PATCH v4 01/17] irqchip: Add IMS (Interrupt Message Store) driver Dave Jiang
  2020-10-30 18:51 ` [PATCH v4 02/17] iommu/vt-d: Add DEV-MSI support Dave Jiang
@ 2020-10-30 18:51 ` Dave Jiang
  2020-10-30 18:51 ` [PATCH v4 04/17] dmaengine: idxd: add support for readonly config devices Dave Jiang
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:51 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

Add idxd vfio mediated device theory of operation documentation.
Provide description on mdev design, usage, and why vfio mdev was chosen.

Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 Documentation/driver-api/vfio/mdev-idxd.rst |  404 +++++++++++++++++++++++++++
 MAINTAINERS                                 |    1 
 2 files changed, 405 insertions(+)
 create mode 100644 Documentation/driver-api/vfio/mdev-idxd.rst

diff --git a/Documentation/driver-api/vfio/mdev-idxd.rst b/Documentation/driver-api/vfio/mdev-idxd.rst
new file mode 100644
index 000000000000..c75b7d88ef6b
--- /dev/null
+++ b/Documentation/driver-api/vfio/mdev-idxd.rst
@@ -0,0 +1,404 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+IDXD Overview
+=============
+IDXD (Intel Data Accelerator Driver) is the driver for the Intel Data
+Streaming Accelerator (DSA).  Intel DSA is a high performance data copy
+and transformation accelerator. In addition to data move operations,
+the device also supports data fill, CRC generation, Data Integrity Field
+(DIF), and memory compare and delta generation. Intel DSA supports
+a variety of PCI-SIG defined capabilities such as Address Translation
+Services (ATS), Process address Space ID (PASID), Page Request Interface
+(PRI), Message Signalled Interrupts Extended (MSI-X), and Advanced Error
+Reporting (AER). Some of those capabilities enable the device to support
+Shared Virtual Memory (SVM), also known as Shared Virtual Addressing
+(SVA). Intel DSA also supports Intel Scalable I/O Virtualization (SIOV)
+to improve scalability of device assignment.
+
+
+The Intel DSA device contains the following basic components:
+* Work queue (WQ)
+
+  A WQ is on-device storage used to queue descriptors to the
+  device. Requests are added to a WQ by using new CPU instructions
+  (MOVDIR64B and ENQCMD(S)) to write the memory mapped “portal”
+  associated with each WQ.
+
+* Engine
+
+  Operation unit that pulls descriptors from WQs and processes them.
+
+* Group
+
+  Abstract container to associate one or more engines with one or more WQs.
+
+
+Two types of WQs are supported:
+* Dedicated WQ (DWQ)
+
+  A single client should own this WQ exclusively and submit work
+  to it. The MOVDIR64B instruction is used to submit descriptors to
+  this type of WQ. The instruction is a posted write, therefore the
+  submitter must ensure that it does not exceed the WQ length. The
+  use of PASID is optional with DWQ. Multiple clients can submit to
+  a DWQ, but synchronization is required because a submission to a
+  full WQ is silently dropped.
+
+* Shared WQ (SWQ)
+
+  Multiple clients can submit work to this WQ. The submitter must use
+  ENQCMDS (from supervisor mode) or ENQCMD (from user mode). These
+  instructions will indicate via EFLAGS.ZF bit whether a submission
+  succeeds. The use of PASID is mandatory to identify the address space
+  of each client.
+
+
+For more information about the new instructions, see [1][2].
+
+The IDXD driver is broken down into the following usages:
+* In kernel interface through dmaengine subsystem API.
+* Userspace DMA support through character device. mmap(2) is utilized
+  to map directly to mmio address (or portals) for descriptor submission.
+* VFIO Mediated device (mdev) supporting device passthrough usages.
+
+
+=================================
+Assignable Device Interface (ADI)
+=================================
+The term ADI is used to represent the minimal unit of assignment for
+an Intel Scalable IOV device. Each ADI instance refers to the set of device
+backend resources that are allocated, configured and organized as an
+isolated unit.
+
+Intel DSA defines each WQ as an ADI. The MMIO registers of each work queue
+are partitioned into two categories:
+* MMIO registers accessed for data-path operations. 
+* MMIO registers accessed for control-path operations.
+
+Data-path MMIO registers of each WQ are contained within
+one or more system page size aligned regions and can be mapped in the
+CPU page table for direct access from the guest. Control-path MMIO
+registers of all WQs are located together but segregated from data-path
+MMIO regions. Therefore, guest updates to control-path registers must
+be intercepted and then go through the host driver to be reflected in
+the device.
+
+Data-path MMIO registers of DSA WQ are portals for submitting descriptors
+to the device. There are four portals per WQ, each being 64 bytes
+in size and located on a separate 4KB page in BAR2. Each portal has
+different implications regarding interrupt message type (MSI vs. IMS)
+and occupancy control (limited vs. unlimited). It is not necessary to
+map all portals to the guest.
+
+Control-path MMIO registers of DSA WQ include global configurations
+(shared by all WQs) and WQ-specific configurations. The owner
+(e.g. the guest) of the WQ is expected to only change WQ-specific
+configurations. Intel DSA spec introduces a “Configuration Support”
+capability which, if cleared, indicates that some fields of WQ
+configuration registers are read-only and the WQ configuration is
+pre-configured by the host. 
+
+
+Interrupt Message Store (IMS)
+=============================
+The ADI utilizes Interrupt Message Store (IMS), a device-specific MSI
+implementation, instead of MSI-X for interrupts for the guest. This
+preserves MSI-X for host usages and also allows a significantly larger
+number of interrupt vectors to serve a large number of guests.
+
+The Intel DSA device implements IMS as on-device memory mapped unified
+storage. Each interrupt message is stored as a DWORD size data payload
+and a 64-bit address (same as MSI-X). Access to the IMS is through the
+host idxd driver.
+
+The idxd driver makes use of the generic IMS irq chip and domain which
+stores the interrupt messages in an array in device memory. Allocation and
+freeing of interrupts happens via the generic msi_domain_alloc/free_irqs()
+interface. The driver only needs to ensure the interrupt domain is stored in
+the underlying device struct.
+
+
+ADI Isolation
+=============
+Operations or functioning of one ADI must not affect the functioning
+of another ADI or the physical device. Upstream memory requests from
+different ADIs are distinguished using a Process Address Space Identifier
+(PASID). With the support of PASID-granular address translation in Intel
+VT-d, the address space targeted by a request from ADI can be a Host
+Virtual Address (HVA), Host I/O Virtual Address (HIOVA), Guest Physical
+Address (GPA), Guest Virtual Address (GVA), Guest I/O Virtual Address
+(GIOVA), etc. The PASID identity for an ADI is expected to be accessed
+or modified by privileged software through the host driver.
+
+=========================
+Virtual DSA (vDSA) Device
+=========================
+The DSA WQ itself is not a PCI device and thus must be composed into a
+virtual DSA device for the guest.
+
+The composition logic needs to handle four main requirements:
+* Emulate PCI config space.
+* Map data-path portals for direct access from the guest.
+* Emulate control-path MMIO registers and selectively forward WQ
+  configuration requests through host driver to the device.
+* Forward and emulate WQ interrupts to the guest.
+
+The composition logic tells the guest aspects of WQ which are configurable
+through a combination of capability fields, e.g.:
+* Configuration Support (if cleared, most aspects are not modifiable).
+* WQ Mode Support (if cleared, cannot change between dedicated and
+  shared mode).
+* Dedicated Mode Support.
+* Shared Mode Support.
+* ...
+
+The virtual capability fields are set according to the vDSA
+type. Following are examples of vDSA types and related WQ configurability:
+* Type ‘1DWQ_v1’
+   * One DSA gen 1 WQ dedicated to this guest
+   * Guest cannot share the WQ between its clients (no guest SVA)
+   * Guest cannot change any WQ configuration
+* Type ‘1SWQ_v1’
+   * One DSA gen 1 WQ shared between multiple VMs
+   * Guest can further share the WQ between its clients (guest SVA is required)
+   * Guest cannot change any WQ configuration
+* Type ‘1WQfull_v1’
+   * One DSA gen 1 WQ dedicated to this guest
+   * Guest is allowed to do limited WQ configurations (through WQCFG
+     register), including WQ mode (dedicated/shared), privilege,
+     threshold, PASID enable, PASID value, etc.
+
+In addition, the composition logic needs to serve administrative commands
+(through the virtual CMD register) via the host driver, including:
+* Drain/abort all descriptors submitted by this guest.
+* Drain/abort descriptors associated with a PASID.
+* Enable/disable/reset the WQ (when it’s not shared by multiple VMs).
+* Request interrupt handle.
+
+With this design, vDSA emulation is **greatly simplified**. Most
+registers are emulated in simple READ-ONLY flavor, and handling limited
+configurability is required only for a few registers.
+
+===========================
+VFIO mdev vs. userspace DMA
+===========================
+There are two avenues to support vDSA composition.
+1. VFIO mediated device (mdev)
+2. Userspace DMA through char device
+
+VFIO mdev provides a generic subdevice passthrough framework. Unified
+uAPIs are used for both device and subdevice passthrough, thus any
+userspace VMM which already supports VFIO device passthrough would
+naturally support mdev/subdevice passthrough. The implication of VFIO
+mdev is putting emulation of device interface in the kernel (part of
+host driver) which must be carefully scrutinized. Fortunately, vDSA
+composition includes only a small portion of emulation code, due to the
+fact that most registers are simply READ-ONLY to the guest. The majority
+of the logic for handling limited configurability and administrative
+commands must sit in the kernel anyway, regardless of which kernel uAPI
+is pursued. In this regard, VFIO mdev is a nice fit for vDSA composition.
+
+IDXD driver provides a char device interface for applications to
+map the WQ portal and directly submit descriptors to do DMA. This
+interface provides only data-path access to userspace and relies on
+the host driver to handle control-path configurations. Expanding such
+interface to support subdevice passthrough allows moving the emulation
+code to userspace. However, considerable work is required to grow it from
+an application-oriented interface into a passthrough-oriented interface:
+new uAPIs to handle guest WQ configurability and administrative commands,
+and new uAPIs to handle passthrough specific requirements (e.g. DMA map,
+guest SVA, live migration, posted interrupt, etc.). And once it is done,
+every userspace VMM has to explicitly bind to IDXD specific uAPI, even
+though the real user is in the guest (instead of the VMM itself) in the
+passthrough scenario.
+
+Although some generalization might be possible to reduce the work of
+handling passthrough, we feel the difference between userspace DMA
+and subdevice passthrough is distinct in IDXD. Therefore, we choose to
+build vDSA composition on top of VFIO mdev framework and leave userspace
+DMA intact after discussion at LPC 2020.
+
+=============================
+Host Registration and Release
+=============================
+
+Intel DSA reports support for Intel Scalable IOV via a PCI Express
+Designated Vendor Specific Extended Capability (DVSEC). In addition,
+PASID-granular address translation capability is required in the
+IOMMU. During host initialization, the IDXD driver should check the
+presence of both capabilities before calling mdev_register_device()
+to register with the VFIO mdev framework and provide a set of ops
+(struct mdev_parent_ops). The IOMMU capability is indicated by the
+IOMMU_DEV_FEAT_AUX feature flag with iommu_dev_has_feature() and enabled
+with iommu_dev_enable_feature().
+
+On release, iommu_dev_disable_feature() is called after
+mdev_unregister_device() to disable the IOMMU_DEV_FEAT_AUX flag that
+the driver enabled during host initialization.
+
+The mdev_parent_ops data structure is filled out by the driver to provide
+a number of ops called by VFIO mdev framework::
+
+        struct mdev_parent_ops {
+                .supported_type_groups
+                .create
+                .remove
+                .open
+                .release
+                .read
+                .write
+                .mmap
+                .ioctl
+        };
+
+supported_type_groups
+---------------------
+At the moment only one vDSA type is supported.
+
+“1DWQ_v1”:
+  Single dedicated WQ (DSA 1.0) with read-only configuration exposed to
+  the guest. On the guest kernel, a vDSA device shows up with a single
+  WQ that is pre-configured by the host. The configuration for the WQ
+  is entirely read-only and cannot be reconfigured. There is no support
+  of guest SVA on this WQ.
+
+  The interrupt vector 0 is emulated by the driver to support the admin
+  command completion and error reporting. A second interrupt vector is
+  bound to the IMS and used for I/O operation.
+
+
+create
+------
+API function to create the mdev. mdev_set_iommu_device() is called to
+associate the mdev device to the parent PCI device. This function is
+where the driver sets up and initializes the resources to support a single
+mdev device. This is triggered through sysfs to initiate the creation.
+
+remove
+------
+API function that mirrors the create() function and releases all the
+resources backing the mdev.  This is also triggered through sysfs.
+
+open
+----
+API function that is called down from VFIO userspace to indicate to the
+driver that the upper layers are ready to claim and utilize the mdev. IMS
+entries are allocated and set up here.
+
+release
+-------
+The mirror function of open(), called when VFIO userspace releases the mdev.
+
+read / write
+------------
+This is where the Intel IDXD driver provides read/write emulation of
+PCI config space and MMIO registers. These paths are the “slow” path
+of the mediated device and emulation is used rather than direct access
+to the hardware resources. Typically configuration and administrative
+commands go through this path. This allows the mdev to show up as a
+virtual PCI device on the guest kernel.
+
+The emulation of PCI config space is nothing special; it is simply
+copied from kvmgt. In the future this part might be consolidated to
+reduce duplication.
+
+MMIO reads are emulated as simple memory copies. There is no side-effect
+to be emulated upon guest read.
+
+Emulation of MMIO writes is required only for a few registers, due to the
+read-only configuration of the ‘1DWQ_v1’ type. Most of the composition
+logic is hooked in the CMD register for performing administrative commands
+such as WQ drain, abort, enable, disable and reset operations. The rest of
+the emulation is about handling errors (GENCTRL/SWERROR) and interrupts
+(INTCAUSE/MSIXPERM) on the vDSA device. Future mdev types might allow
+limited WQ configurability, which then requires additional emulation of
+the WQCFG register.
+
+mmap
+----
+This is the function that provides the setup to expose a portion of the
+hardware, also known as portals, for direct access for “fast” path
+operations through the mmap() syscall. A limited region of the hardware
+is mapped to the guest for direct I/O submission.
+
+There are four portals per WQ: unlimited MSI-X, limited MSI-X, unlimited
+IMS, limited IMS.  Descriptors submitted to limited portals are subject
+to threshold configuration limitations for shared WQs. The MSI-X portals
+are used for host submissions, and the IMS portals are mapped to the VM for
+guest submission.
+
+ioctl
+-----
+This API function does several things:
+* Provides general device information to VFIO userspace.
+* Provides device region information (PCI, mmio, etc).
+* Provides interrupt information.
+* Sets up interrupts for the mediated device.
+* Resets the mdev device.
+
+For the Intel idxd driver, Interrupt Message Store (IMS) vectors are being
+used for mdev interrupts rather than MSI-X vectors. IMS provides additional
+interrupt vectors outside of the PCI MSI-X specification in order to support
+significantly more vectors. The emulated interrupt (0) is connected through
+kernel eventfd. When interrupt 0 needs to be asserted, the driver will
+signal the eventfd to trigger the MSI-X vector 0 interrupt on the guest.
+The IMS interrupts are set up via eventfd as well. However, they utilize
+the irq bypass manager to inject the interrupt directly into the guest.
+
+To allocate IMS, we utilize the IMS array APIs. On host init, we need
+to create the MSI domain::
+
+        struct ims_array_info ims_info;
+        struct device *dev = &pci_dev->dev;
+
+
+        /* assign the device IMS size */
+        ims_info.max_slots = max_ims_size;
+        /* assign the MMIO base address for the IMS table */
+        ims_info.slots = mmio_base + ims_offset;
+        /* assign the MSI domain to the device */
+        dev->msi_domain = pci_ims_array_create_msi_irq_domain(pci_dev, &ims_info);
+
+When we are ready to allocate the interrupts::
+
+        struct device *dev = mdev_dev(mdev);
+
+        irq_domain = pci_dev->dev.msi_domain;
+        /* the irqs are allocated against device of mdev */
+        rc = msi_domain_alloc_irqs(irq_domain, dev, num_vecs);
+
+
+        /* we can retrieve the slot index from msi_entry */
+        for_each_msi_entry(entry, dev) {
+                slot_index = entry->device_msi.hwirq;
+                irq = entry->irq;
+        }
+
+        request_irq(irq, interrupt_handler_function, 0, "ims", context);
+
+
+The DSA device is structured such that MSI-X table entry 0 is used for
+admin command completion, error reporting, and other misc commands. The
+remaining MSI-X table entries are used for WQ completion. For VM support,
+the virtual device also presents a similar layout. Therefore, vector 0
+is emulated by the software. Additional vector(s) are associated with IMS.
+
+The index (slot) for the per device IMS entry is managed by the MSI
+core. The index is the “interrupt handle” that the guest kernel
+needs to program into a DMA descriptor. That interrupt handle tells the
+hardware which IMS vector to trigger the interrupt on for the host.
+
+The virtual device presents an admin command called “request interrupt
+handle” that is not supported by the physical device. On probe of
+the DSA device on the guest kernel, the guest driver will issue the
+“request interrupt handle” command in order to get the interrupt
+handle for descriptor programming. The host driver will return the
+assigned slot for the IMS entry table to the guest.
+
+References
+==========
+[1] https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
+[2] https://software.intel.com/en-us/articles/intel-sdm
+[3] https://software.intel.com/sites/default/files/managed/cc/0e/intel-scalable-io-virtualization-technical-specification.pdf
+[4] https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
diff --git a/MAINTAINERS b/MAINTAINERS
index e73636b75f29..af04e674853c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8888,6 +8888,7 @@ INTEL IADX DRIVER
 M:	Dave Jiang <dave.jiang@intel.com>
 L:	dmaengine@vger.kernel.org
 S:	Supported
+F:	Documentation/driver-api/vfio/mdev-idxd.rst
 F:	drivers/dma/idxd/*
 F:	include/uapi/linux/idxd.h
 




* [PATCH v4 04/17] dmaengine: idxd: add support for readonly config devices
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (2 preceding siblings ...)
  2020-10-30 18:51 ` [PATCH v4 03/17] dmaengine: idxd: add theory of operation documentation for idxd mdev Dave Jiang
@ 2020-10-30 18:51 ` Dave Jiang
  2020-10-30 18:51 ` [PATCH v4 05/17] dmaengine: idxd: add interrupt handle request support Dave Jiang
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:51 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

The VFIO mediated device for the idxd driver will provide a virtual DSA
device by backing it with a workqueue. The virtual device will be limited
with the wq configuration registers set to read-only. Add support and
helper functions for the handling of a DSA device with the configuration
registers marked as read-only.
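
As an illustration of the read-back described above, the following
userspace sketch mimics how a group's WQ bitmap could be decoded into
per-WQ group assignments. It is not the driver code: the helper name
decode_group_wqs(), the GRP_WQ_WORDS constant, and the use of -1 for
"unassigned" are assumptions made for the example. Only the idea of a
256-bit bitmap walked 64 bits at a time, with bits beyond the WQ count
ignored, follows the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRP_WQ_WORDS 4   /* 256-bit WQ bitmap read as four 64-bit words */

/*
 * Record 'group_id' in wq_group[] for every WQ whose bit is set in the
 * bitmap. Bits at or beyond max_wqs are ignored. Returns the number of
 * WQs assigned to the group.
 */
static int decode_group_wqs(const uint64_t words[GRP_WQ_WORDS], int group_id,
			    int max_wqs, int *wq_group)
{
	int assigned = 0;

	assert(words != NULL && wq_group != NULL);

	for (int i = 0; i < GRP_WQ_WORDS; i++) {
		for (int j = 0; j < 64; j++) {
			int id = i * 64 + j;

			/* No need to look beyond the last WQ of the device */
			if (id >= max_wqs)
				return assigned;

			/* 64-bit constant so that bit 63 is tested correctly */
			if (words[i] & (UINT64_C(1) << j)) {
				wq_group[id] = group_id;
				assigned++;
			}
		}
	}

	return assigned;
}
```

For example, a first word of 0x5 with max_wqs set to 8 assigns WQ 0 and
WQ 2 to the group and leaves the other entries untouched.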

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/device.c |  116 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/dma/idxd/idxd.h   |    1 
 drivers/dma/idxd/init.c   |    8 +++
 drivers/dma/idxd/sysfs.c  |   20 +++++---
 4 files changed, 137 insertions(+), 8 deletions(-)

diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index d6f551dcbcb6..7003884cd8ad 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -778,3 +778,119 @@ int idxd_device_config(struct idxd_device *idxd)
 
 	return 0;
 }
+
+static int idxd_wq_load_config(struct idxd_wq *wq)
+{
+	struct idxd_device *idxd = wq->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	int wqcfg_offset;
+	int i;
+
+	wqcfg_offset = WQCFG_OFFSET(idxd, wq->id, 0);
+	memcpy_fromio(wq->wqcfg, idxd->reg_base + wqcfg_offset, idxd->wqcfg_size);
+
+	wq->size = wq->wqcfg->wq_size;
+	wq->threshold = wq->wqcfg->wq_thresh;
+	if (wq->wqcfg->priv)
+		wq->type = IDXD_WQT_KERNEL;
+
+	/* The driver does not support shared WQ mode in read-only config yet */
+	if (wq->wqcfg->mode == 0 || wq->wqcfg->pasid_en)
+		return -EOPNOTSUPP;
+
+	set_bit(WQ_FLAG_DEDICATED, &wq->flags);
+
+	wq->priority = wq->wqcfg->priority;
+
+	for (i = 0; i < WQCFG_STRIDES(idxd); i++) {
+		wqcfg_offset = WQCFG_OFFSET(idxd, wq->id, i);
+		dev_dbg(dev, "WQ[%d][%d][%#x]: %#x\n", wq->id, i, wqcfg_offset, wq->wqcfg->bits[i]);
+	}
+
+	return 0;
+}
+
+static void idxd_group_load_config(struct idxd_group *group)
+{
+	struct idxd_device *idxd = group->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	int i, j, grpcfg_offset;
+
+	/*
+	 * Load WQS bit fields
+	 * Iterate through all 256 bits 64 bits at a time
+	 */
+	for (i = 0; i < GRPWQCFG_STRIDES; i++) {
+		struct idxd_wq *wq;
+
+		grpcfg_offset = GRPWQCFG_OFFSET(idxd, group->id, i);
+		group->grpcfg.wqs[i] = ioread64(idxd->reg_base + grpcfg_offset);
+		dev_dbg(dev, "GRPCFG wq[%d:%d: %#x]: %#llx\n",
+			group->id, i, grpcfg_offset, group->grpcfg.wqs[i]);
+
+		if (i * 64 >= idxd->max_wqs)
+			break;
+
+		/* Iterate through all 64 bits and check for wq set */
+		for (j = 0; j < 64; j++) {
+			int id = i * 64 + j;
+
+			/* No need to check beyond max wqs */
+			if (id >= idxd->max_wqs)
+				break;
+
+			/* Set group assignment for wq if wq bit is set */
+			if (group->grpcfg.wqs[i] & BIT(j)) {
+				wq = &idxd->wqs[id];
+				wq->group = group;
+			}
+		}
+	}
+
+	grpcfg_offset = GRPENGCFG_OFFSET(idxd, group->id);
+	group->grpcfg.engines = ioread64(idxd->reg_base + grpcfg_offset);
+	dev_dbg(dev, "GRPCFG engs[%d: %#x]: %#llx\n", group->id,
+		grpcfg_offset, group->grpcfg.engines);
+
+	/* Iterate through all 64 bits to check engines set */
+	for (i = 0; i < 64; i++) {
+		if (i >= idxd->max_engines)
+			break;
+
+		if (group->grpcfg.engines & BIT(i)) {
+			struct idxd_engine *engine = &idxd->engines[i];
+
+			engine->group = group;
+		}
+	}
+
+	grpcfg_offset = GRPFLGCFG_OFFSET(idxd, group->id);
+	group->grpcfg.flags.bits = ioread32(idxd->reg_base + grpcfg_offset);
+	dev_dbg(dev, "GRPFLAGS flags[%d: %#x]: %#x\n",
+		group->id, grpcfg_offset, group->grpcfg.flags.bits);
+}
+
+int idxd_device_load_config(struct idxd_device *idxd)
+{
+	union gencfg_reg reg;
+	int i, rc;
+
+	reg.bits = ioread32(idxd->reg_base + IDXD_GENCFG_OFFSET);
+	idxd->token_limit = reg.token_limit;
+
+	for (i = 0; i < idxd->max_groups; i++) {
+		struct idxd_group *group = &idxd->groups[i];
+
+		idxd_group_load_config(group);
+	}
+
+	for (i = 0; i < idxd->max_wqs; i++) {
+		struct idxd_wq *wq = &idxd->wqs[i];
+
+		rc = idxd_wq_load_config(wq);
+		if (rc < 0)
+			return rc;
+	}
+
+	return 0;
+}
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 7e54209c433a..1afc34be4ed0 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -317,6 +317,7 @@ void idxd_device_cleanup(struct idxd_device *idxd);
 int idxd_device_config(struct idxd_device *idxd);
 void idxd_device_wqs_clear_state(struct idxd_device *idxd);
 void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid);
+int idxd_device_load_config(struct idxd_device *idxd);
 
 /* work queue control */
 int idxd_wq_alloc_resources(struct idxd_wq *wq);
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 45b0eac640c3..98b1091181bb 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -349,6 +349,14 @@ static int idxd_probe(struct idxd_device *idxd)
 	if (rc)
 		goto err_setup;
 
+	/* If the configs are readonly, then load them from device */
+	if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags)) {
+		dev_dbg(dev, "Loading RO device config\n");
+		rc = idxd_device_load_config(idxd);
+		if (rc < 0)
+			goto err_setup;
+	}
+
 	rc = idxd_setup_interrupts(idxd);
 	if (rc)
 		goto err_setup;
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index 6d292eb79bf3..304eb2cf532e 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -102,7 +102,7 @@ static int idxd_config_bus_match(struct device *dev,
 
 static int idxd_config_bus_probe(struct device *dev)
 {
-	int rc;
+	int rc = 0;
 	unsigned long flags;
 
 	dev_dbg(dev, "%s called\n", __func__);
@@ -120,7 +120,8 @@ static int idxd_config_bus_probe(struct device *dev)
 
 		/* Perform IDXD configuration and enabling */
 		spin_lock_irqsave(&idxd->dev_lock, flags);
-		rc = idxd_device_config(idxd);
+		if (test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+			rc = idxd_device_config(idxd);
 		spin_unlock_irqrestore(&idxd->dev_lock, flags);
 		if (rc < 0) {
 			module_put(THIS_MODULE);
@@ -207,7 +208,8 @@ static int idxd_config_bus_probe(struct device *dev)
 		}
 
 		spin_lock_irqsave(&idxd->dev_lock, flags);
-		rc = idxd_device_config(idxd);
+		if (test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+			rc = idxd_device_config(idxd);
 		spin_unlock_irqrestore(&idxd->dev_lock, flags);
 		if (rc < 0) {
 			mutex_unlock(&wq->wq_lock);
@@ -328,12 +330,14 @@ static int idxd_config_bus_remove(struct device *dev)
 
 		idxd_unregister_dma_device(idxd);
 		rc = idxd_device_disable(idxd);
-		for (i = 0; i < idxd->max_wqs; i++) {
-			struct idxd_wq *wq = &idxd->wqs[i];
+		if (test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags)) {
+			for (i = 0; i < idxd->max_wqs; i++) {
+				struct idxd_wq *wq = &idxd->wqs[i];
 
-			mutex_lock(&wq->wq_lock);
-			idxd_wq_disable_cleanup(wq);
-			mutex_unlock(&wq->wq_lock);
+				mutex_lock(&wq->wq_lock);
+				idxd_wq_disable_cleanup(wq);
+				mutex_unlock(&wq->wq_lock);
+			}
 		}
 		module_put(THIS_MODULE);
 		if (rc < 0)




* [PATCH v4 05/17] dmaengine: idxd: add interrupt handle request support
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (3 preceding siblings ...)
  2020-10-30 18:51 ` [PATCH v4 04/17] dmaengine: idxd: add support for readonly config devices Dave Jiang
@ 2020-10-30 18:51 ` Dave Jiang
  2020-10-30 18:51 ` [PATCH v4 06/17] PCI: add SIOV and IMS capability detection Dave Jiang
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:51 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

Add support for requesting an interrupt handle from the device. The interrupt
handle is put in the interrupt handle field of a descriptor so that the device
can determine which interrupt vector to use, be it MSI-X or IMS. On the host
device, the interrupt handle is indexed to the MSI-X table. This allows a
descriptor to program the interrupt handle 1:1 with the MSI-X index without
getting it from the request interrupt handle device command. For a guest
device, the index can be any index that the host assigned for the IMS
table, and therefore it must be requested from the virtual device during
MSI-X setup by the driver running on the guest.

On the actual hardware, MSI-X vector 0 is the misc interrupt and handles
events such as administrative command completion, error reporting, and
performance monitor overflow. MSI-X vectors 1...N
are used for descriptor completion interrupts. On the guest kernel,
the MSI-X interrupts are backed by the mediated device through emulation
or IMS vectors. Vector 0 is handled through emulation by the host VDCM.
It only requires the host driver to send the signal to QEMU. Vector 1
(and more may be supported later) is backed by IMS.
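
To make the encoding described above easier to follow, here is a small
userspace sketch of how the command operand could be built and how the
handle could be pulled out of the returned status. It is only a sketch:
the EX_ constants are placeholder values chosen for the example (the real
definitions live in registers.h), and the ex_ helper names are
hypothetical. What follows the patch is the 16-bit index, the optional IMS
selector, and the offset of one between an MSI-X vector and its
int_handles slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values for illustration; the real ones live in registers.h. */
#define EX_INT_HANDLE_IMS   (UINT32_C(1) << 16)  /* assumed IMS selector bit */
#define EX_CMDSTS_RES_SHIFT 8                    /* assumed result field shift */
#define EX_MASK16           UINT32_C(0xffff)

enum ex_irq_type { EX_IRQ_MSIX = 0, EX_IRQ_IMS };

/* Build the command operand: 16-bit index, plus the IMS flag if requested. */
static uint32_t ex_build_operand(int idx, enum ex_irq_type type)
{
	uint32_t operand = (uint32_t)idx & EX_MASK16;

	if (type == EX_IRQ_IMS)
		operand |= EX_INT_HANDLE_IMS;
	return operand;
}

/* Extract the 16-bit interrupt handle from the command status word. */
static int ex_handle_from_status(uint32_t status)
{
	return (int)((status >> EX_CMDSTS_RES_SHIFT) & EX_MASK16);
}

/*
 * Host side: MSI-X vector 0 is the misc interrupt and has no handle, so the
 * handle for vector i (i >= 1) is stored at slot i - 1.
 */
static size_t ex_handle_slot(unsigned int msix_vector)
{
	assert(msix_vector >= 1);
	return msix_vector - 1;
}
```

With these placeholder values, requesting index 3 for IMS yields operand
0x10003, and a status word of 0x2a00 decodes to handle 0x2a.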

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/device.c    |   58 ++++++++++++++++++++++++++++++++++++++++++
 drivers/dma/idxd/idxd.h      |   13 +++++++++
 drivers/dma/idxd/init.c      |   48 +++++++++++++++++++++++++++++++++++
 drivers/dma/idxd/registers.h |    9 ++++++-
 drivers/dma/idxd/submit.c    |   29 ++++++++++++++++-----
 5 files changed, 149 insertions(+), 8 deletions(-)

diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 7003884cd8ad..a9ae970db0a4 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -532,6 +532,64 @@ void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid)
 	dev_dbg(dev, "pasid %d drained\n", pasid);
 }
 
+int idxd_device_request_int_handle(struct idxd_device *idxd, int idx, int *handle,
+				   enum idxd_interrupt_type irq_type)
+{
+	struct device *dev = &idxd->pdev->dev;
+	u32 operand, status;
+
+	if (!(idxd->hw.cmd_cap & BIT(IDXD_CMD_REQUEST_INT_HANDLE)))
+		return -EOPNOTSUPP;
+
+	dev_dbg(dev, "get int handle, idx %d\n", idx);
+
+	operand = idx & GENMASK(15, 0);
+	if (irq_type == IDXD_IRQ_IMS)
+		operand |= CMD_INT_HANDLE_IMS;
+
+	dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_REQUEST_INT_HANDLE, operand);
+
+	idxd_cmd_exec(idxd, IDXD_CMD_REQUEST_INT_HANDLE, operand, &status);
+
+	if ((status & IDXD_CMDSTS_ERR_MASK) != IDXD_CMDSTS_SUCCESS) {
+		dev_dbg(dev, "request int handle failed: %#x\n", status);
+		return -ENXIO;
+	}
+
+	*handle = (status >> IDXD_CMDSTS_RES_SHIFT) & GENMASK(15, 0);
+
+	dev_dbg(dev, "int handle acquired: %u\n", *handle);
+	return 0;
+}
+
+int idxd_device_release_int_handle(struct idxd_device *idxd, int handle,
+				   enum idxd_interrupt_type irq_type)
+{
+	struct device *dev = &idxd->pdev->dev;
+	u32 operand, status;
+
+	if (!(idxd->hw.cmd_cap & BIT(IDXD_CMD_RELEASE_INT_HANDLE)))
+		return -EOPNOTSUPP;
+
+	dev_dbg(dev, "release int handle, handle %d\n", handle);
+
+	operand = handle & GENMASK(15, 0);
+	if (irq_type == IDXD_IRQ_IMS)
+		operand |= CMD_INT_HANDLE_IMS;
+
+	dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_RELEASE_INT_HANDLE, operand);
+
+	idxd_cmd_exec(idxd, IDXD_CMD_RELEASE_INT_HANDLE, operand, &status);
+
+	if ((status & IDXD_CMDSTS_ERR_MASK) != IDXD_CMDSTS_SUCCESS) {
+		dev_dbg(dev, "release int handle failed: %#x\n", status);
+		return -ENXIO;
+	}
+
+	dev_dbg(dev, "int handle released.\n");
+	return 0;
+}
+
 /* Device configuration bits */
 static void idxd_group_config_write(struct idxd_group *group)
 {
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 1afc34be4ed0..a506a16c83ee 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -140,6 +140,7 @@ struct idxd_hw {
 	union group_cap_reg group_cap;
 	union engine_cap_reg engine_cap;
 	struct opcap opcap;
+	u32 cmd_cap;
 };
 
 enum idxd_device_state {
@@ -205,6 +206,8 @@ struct idxd_device {
 	struct dma_device dma_dev;
 	struct workqueue_struct *wq;
 	struct work_struct work;
+
+	int *int_handles;
 };
 
 /* IDXD software descriptor */
@@ -218,6 +221,7 @@ struct idxd_desc {
 	struct list_head list;
 	int id;
 	int cpu;
+	unsigned int vector;
 	struct idxd_wq *wq;
 };
 
@@ -253,6 +257,11 @@ enum idxd_portal_prot {
 	IDXD_PORTAL_LIMITED,
 };
 
+enum idxd_interrupt_type {
+	IDXD_IRQ_MSIX = 0,
+	IDXD_IRQ_IMS,
+};
+
 static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot)
 {
 	return prot * 0x1000;
@@ -318,6 +327,10 @@ int idxd_device_config(struct idxd_device *idxd);
 void idxd_device_wqs_clear_state(struct idxd_device *idxd);
 void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid);
 int idxd_device_load_config(struct idxd_device *idxd);
+int idxd_device_request_int_handle(struct idxd_device *idxd, int idx, int *handle,
+				   enum idxd_interrupt_type irq_type);
+int idxd_device_release_int_handle(struct idxd_device *idxd, int handle,
+				   enum idxd_interrupt_type irq_type);
 
 /* work queue control */
 int idxd_wq_alloc_resources(struct idxd_wq *wq);
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 98b1091181bb..c136216e19e8 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -133,6 +133,22 @@ static int idxd_setup_interrupts(struct idxd_device *idxd)
 		}
 		dev_dbg(dev, "Allocated idxd-msix %d for vector %d\n",
 			i, msix->vector);
+
+		if (idxd->hw.cmd_cap & BIT(IDXD_CMD_REQUEST_INT_HANDLE)) {
+			/*
+			 * The MSIX vector enumeration starts at 1 with vector 0 being the
+			 * misc interrupt that handles non I/O completion events. The
+			 * interrupt handles are for IMS enumeration on guest. The misc
+			 * interrupt vector does not require a handle and therefore we start
+			 * the int_handles at index 0. Since 'i' starts at 1, the first
+			 * int_handles index will be 0.
+			 */
+			rc = idxd_device_request_int_handle(idxd, i, &idxd->int_handles[i - 1],
+							    IDXD_IRQ_MSIX);
+			if (rc < 0)
+				goto err_no_irq;
+			dev_dbg(dev, "int handle requested: %u\n", idxd->int_handles[i - 1]);
+		}
 	}
 
 	idxd_unmask_error_interrupts(idxd);
@@ -160,6 +176,13 @@ static int idxd_setup_internals(struct idxd_device *idxd)
 	int i;
 
 	init_waitqueue_head(&idxd->cmd_waitq);
+
+	if (idxd->hw.cmd_cap & BIT(IDXD_CMD_REQUEST_INT_HANDLE)) {
+		idxd->int_handles = devm_kcalloc(dev, idxd->max_wqs, sizeof(int), GFP_KERNEL);
+		if (!idxd->int_handles)
+			return -ENOMEM;
+	}
+
 	idxd->groups = devm_kcalloc(dev, idxd->max_groups,
 				    sizeof(struct idxd_group), GFP_KERNEL);
 	if (!idxd->groups)
@@ -233,6 +256,12 @@ static void idxd_read_caps(struct idxd_device *idxd)
 	/* reading generic capabilities */
 	idxd->hw.gen_cap.bits = ioread64(idxd->reg_base + IDXD_GENCAP_OFFSET);
 	dev_dbg(dev, "gen_cap: %#llx\n", idxd->hw.gen_cap.bits);
+
+	if (idxd->hw.gen_cap.cmd_cap) {
+		idxd->hw.cmd_cap = ioread32(idxd->reg_base + IDXD_CMDCAP_OFFSET);
+		dev_dbg(dev, "cmd_cap: %#x\n", idxd->hw.cmd_cap);
+	}
+
 	idxd->max_xfer_bytes = 1ULL << idxd->hw.gen_cap.max_xfer_shift;
 	dev_dbg(dev, "max xfer size: %llu bytes\n", idxd->max_xfer_bytes);
 	idxd->max_batch_size = 1U << idxd->hw.gen_cap.max_batch_shift;
@@ -471,6 +500,24 @@ static void idxd_flush_work_list(struct idxd_irq_entry *ie)
 	}
 }
 
+static void idxd_release_int_handles(struct idxd_device *idxd)
+{
+	struct device *dev = &idxd->pdev->dev;
+	int i, rc;
+
+	for (i = 0; i < idxd->num_wq_irqs; i++) {
+		if (idxd->hw.cmd_cap & BIT(IDXD_CMD_RELEASE_INT_HANDLE)) {
+			rc = idxd_device_release_int_handle(idxd, idxd->int_handles[i],
+							    IDXD_IRQ_MSIX);
+			if (rc < 0)
+				dev_warn(dev, "int handle %d release failed\n",
+					 idxd->int_handles[i]);
+			else
+				dev_dbg(dev, "int handle released: %d\n", idxd->int_handles[i]);
+		}
+	}
+}
+
 static void idxd_shutdown(struct pci_dev *pdev)
 {
 	struct idxd_device *idxd = pci_get_drvdata(pdev);
@@ -495,6 +542,7 @@ static void idxd_shutdown(struct pci_dev *pdev)
 		idxd_flush_work_list(irq_entry);
 	}
 
+	idxd_release_int_handles(idxd);
 	destroy_workqueue(idxd->wq);
 }
 
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index d29a58ee2651..d02fd59a8e39 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -23,8 +23,8 @@ union gen_cap_reg {
 		u64 overlap_copy:1;
 		u64 cache_control_mem:1;
 		u64 cache_control_cache:1;
+		u64 cmd_cap:1;
 		u64 rsvd:3;
-		u64 int_handle_req:1;
 		u64 dest_readback:1;
 		u64 drain_readback:1;
 		u64 rsvd2:6;
@@ -179,8 +179,11 @@ enum idxd_cmd {
 	IDXD_CMD_DRAIN_PASID,
 	IDXD_CMD_ABORT_PASID,
 	IDXD_CMD_REQUEST_INT_HANDLE,
+	IDXD_CMD_RELEASE_INT_HANDLE,
 };
 
+#define CMD_INT_HANDLE_IMS		0x10000
+
 #define IDXD_CMDSTS_OFFSET		0xa8
 union cmdsts_reg {
 	struct {
@@ -192,6 +195,8 @@ union cmdsts_reg {
 	u32 bits;
 } __packed;
 #define IDXD_CMDSTS_ACTIVE		0x80000000
+#define IDXD_CMDSTS_ERR_MASK		0xff
+#define IDXD_CMDSTS_RES_SHIFT		8
 
 enum idxd_cmdsts_err {
 	IDXD_CMDSTS_SUCCESS = 0,
@@ -227,6 +232,8 @@ enum idxd_cmdsts_err {
 	IDXD_CMDSTS_ERR_NO_HANDLE,
 };
 
+#define IDXD_CMDCAP_OFFSET		0xb0
+
 #define IDXD_SWERR_OFFSET		0xc0
 #define IDXD_SWERR_VALID		0x00000001
 #define IDXD_SWERR_OVERFLOW		0x00000002
diff --git a/drivers/dma/idxd/submit.c b/drivers/dma/idxd/submit.c
index efca5d8468a6..cdea5d37ef24 100644
--- a/drivers/dma/idxd/submit.c
+++ b/drivers/dma/idxd/submit.c
@@ -22,11 +22,17 @@ static struct idxd_desc *__get_desc(struct idxd_wq *wq, int idx, int cpu)
 		desc->hw->pasid = idxd->pasid;
 
 	/*
-	 * Descriptor completion vectors are 1-8 for MSIX. We will round
-	 * robin through the 8 vectors.
+	 * Descriptor completion vectors are 1...N for MSIX. We will round
+	 * robin through the N vectors.
 	 */
 	wq->vec_ptr = (wq->vec_ptr % idxd->num_wq_irqs) + 1;
-	desc->hw->int_handle = wq->vec_ptr;
+	if (!idxd->int_handles) {
+		desc->hw->int_handle = wq->vec_ptr;
+	} else {
+		desc->vector = wq->vec_ptr;
+		desc->hw->int_handle = idxd->int_handles[desc->vector];
+	}
+
 	return desc;
 }
 
@@ -79,7 +85,6 @@ void idxd_free_desc(struct idxd_wq *wq, struct idxd_desc *desc)
 int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc)
 {
 	struct idxd_device *idxd = wq->idxd;
-	int vec = desc->hw->int_handle;
 	void __iomem *portal;
 	int rc;
 
@@ -112,9 +117,19 @@ int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc)
 	 * Pending the descriptor to the lockless list for the irq_entry
 	 * that we designated the descriptor to.
 	 */
-	if (desc->hw->flags & IDXD_OP_FLAG_RCI)
-		llist_add(&desc->llnode,
-			  &idxd->irq_entries[vec].pending_llist);
+	if (desc->hw->flags & IDXD_OP_FLAG_RCI) {
+		int vec;
+
+		/*
+		 * If the driver is on host kernel, it would be the value
+		 * assigned to interrupt handle, which is index for MSIX
+		 * vector. If it's guest then can't use the int_handle since
+		 * that is the index to IMS for the entire device. The guest
+		 * device local index will be used.
+		 */
+		vec = !idxd->int_handles ? desc->hw->int_handle : desc->vector;
+		llist_add(&desc->llnode, &idxd->irq_entries[vec].pending_llist);
+	}
 
 	return 0;
 }
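
The vector-to-handle mapping used in this patch can be illustrated in
isolation. The following is a minimal userspace sketch, not driver code:
struct demo_idxd, next_vector() and int_handle_for_vector() are
hypothetical names, and the struct holds only the two fields the mapping
needs. It assumes handles are stored at int_handles[vector - 1], which is
how idxd_setup_interrupts() fills the array above. Note that the submit.c
hunk in this patch indexes by desc->vector; patch 07 of this series
changes that to desc->vector - 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the few idxd_device fields used here. */
struct demo_idxd {
	int num_wq_irqs;        /* N completion vectors, numbered 1..N */
	const int *int_handles; /* NULL if the device has no handle command */
};

/* Advance the per-WQ vector pointer round-robin through 1..N. */
int next_vector(int vec_ptr, int num_wq_irqs)
{
	assert(num_wq_irqs > 0);
	return (vec_ptr % num_wq_irqs) + 1;
}

/* Value to place in the descriptor's int_handle field. */
int int_handle_for_vector(const struct demo_idxd *idxd, int vector)
{
	assert(vector >= 1 && vector <= idxd->num_wq_irqs);

	/* Without requested handles, the MSI-X vector index is the handle. */
	if (!idxd->int_handles)
		return vector;

	/* Vector 0 is the misc interrupt, so vector k uses slot k - 1. */
	return idxd->int_handles[vector - 1];
}
```

Keeping the local vector separate from the handle matters because the
pending list is selected by the local vector, while the value programmed
into the descriptor may be a device-wide handle.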




* [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (4 preceding siblings ...)
  2020-10-30 18:51 ` [PATCH v4 05/17] dmaengine: idxd: add interrupt handle request support Dave Jiang
@ 2020-10-30 18:51 ` Dave Jiang
  2020-10-30 19:51   ` Bjorn Helgaas
  2020-10-30 18:51 ` [PATCH v4 07/17] dmaengine: idxd: add IMS support in base driver Dave Jiang
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:51 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

Intel Scalable I/O Virtualization (SIOV) enables sharing of I/O devices
across isolated domains through PASID based sub-device partitioning.
Interrupt Message Storage (IMS) enables devices to store the interrupt
messages in a device-specific optimized manner without the scalability
restrictions of the PCIe defined MSI-X capability. IMS is one of the
features supported under SIOV.

Move the SIOV detection code from the Intel iommu driver to common PCI
code. Making the detection code common allows supported accelerator drivers
to query the PCI core for SIOV and IMS capabilities. The new code also adds
a helper that searches the PCI DVSEC capabilities for the SIOV capability.
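
The detection logic can be sketched outside the kernel against a mock
config space. This is an illustration only: find_dvsec(),
siov_supported(), ims_supported() and build_mock_cfg() are hypothetical
userspace names, the config space is a plain byte array, and a miss is
reported as -1 rather than -ENOTSUPP. The numeric constants are the ones
used in this patch; the extended capability header layout (ID in bits
15:0, next pointer in bits 31:20) follows the PCIe specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CFG_SIZE		4096
#define EXT_CAP_START		0x100
#define EXT_CAP_ID_DVSEC	0x23
#define DVSEC_HEADER1		0x4	/* vendor ID in bits 15:0 */
#define DVSEC_HEADER2		0x8	/* DVSEC ID in bits 15:0 */
#define VENDOR_ID_INTEL		0x8086
#define DVSEC_ID_INTEL_SIOV	0x5
#define SIOV_CAP		0x14
#define SIOV_CAP_IMS		0x1

static uint16_t cfg_read16(const uint8_t *cfg, int off)
{
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

static uint32_t cfg_read32(const uint8_t *cfg, int off)
{
	return (uint32_t)cfg_read16(cfg, off) |
	       ((uint32_t)cfg_read16(cfg, off + 2) << 16);
}

static void cfg_write32(uint8_t *cfg, int off, uint32_t val)
{
	for (int i = 0; i < 4; i++)
		cfg[off + i] = (uint8_t)(val >> (8 * i));
}

/* Walk the extended capability list; return the DVSEC offset or -1. */
int find_dvsec(const uint8_t *cfg, uint16_t vendor, uint16_t id)
{
	int pos = EXT_CAP_START;
	int guard = (CFG_SIZE - EXT_CAP_START) / 8; /* bound malformed lists */

	assert(cfg != NULL);
	/* Stay far enough from the end that the SIOV cap dword is readable. */
	while (pos >= EXT_CAP_START && pos + SIOV_CAP + 4 <= CFG_SIZE && guard--) {
		uint32_t hdr = cfg_read32(cfg, pos);

		if (hdr == 0)
			break;
		if ((hdr & 0xffff) == EXT_CAP_ID_DVSEC &&
		    cfg_read16(cfg, pos + DVSEC_HEADER1) == vendor &&
		    cfg_read16(cfg, pos + DVSEC_HEADER2) == id)
			return pos;
		pos = (hdr >> 20) & 0xffc; /* next capability, dword aligned */
	}
	return -1;
}

bool siov_supported(const uint8_t *cfg)
{
	return find_dvsec(cfg, VENDOR_ID_INTEL, DVSEC_ID_INTEL_SIOV) >= 0;
}

bool ims_supported(const uint8_t *cfg)
{
	int pos = find_dvsec(cfg, VENDOR_ID_INTEL, DVSEC_ID_INTEL_SIOV);

	if (pos < 0)
		return false;
	return cfg_read32(cfg, pos + SIOV_CAP) & SIOV_CAP_IMS;
}

/* Mock device: an unrelated capability at 0x100, SIOV DVSEC at 0x200. */
void build_mock_cfg(uint8_t *cfg, bool with_ims)
{
	memset(cfg, 0, CFG_SIZE);
	cfg_write32(cfg, 0x100, 0x0001 | (1u << 16) | (0x200u << 20));
	cfg_write32(cfg, 0x200, EXT_CAP_ID_DVSEC | (1u << 16));
	cfg_write32(cfg, 0x200 + DVSEC_HEADER1, VENDOR_ID_INTEL);
	cfg_write32(cfg, 0x200 + DVSEC_HEADER2, DVSEC_ID_INTEL_SIOV);
	cfg_write32(cfg, 0x200 + SIOV_CAP, with_ims ? SIOV_CAP_IMS : 0);
}
```

A device can therefore report SIOV without IMS: the DVSEC is present but
the IMS bit in the SIOV capability dword is clear.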

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Baolu Lu <baolu.lu@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/intel/iommu.c   |   31 ++-----------------------
 drivers/pci/Kconfig           |   15 ++++++++++++
 drivers/pci/Makefile          |    2 ++
 drivers/pci/dvsec.c           |   40 +++++++++++++++++++++++++++++++++
 drivers/pci/siov.c            |   50 +++++++++++++++++++++++++++++++++++++++++
 include/linux/pci-siov.h      |   18 +++++++++++++++
 include/linux/pci.h           |    3 ++
 include/uapi/linux/pci_regs.h |    4 +++
 8 files changed, 134 insertions(+), 29 deletions(-)
 create mode 100644 drivers/pci/dvsec.c
 create mode 100644 drivers/pci/siov.c
 create mode 100644 include/linux/pci-siov.h

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 3e77a88b236c..d9335f590b42 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -36,6 +36,7 @@
 #include <linux/tboot.h>
 #include <linux/dmi.h>
 #include <linux/pci-ats.h>
+#include <linux/pci-siov.h>
 #include <linux/memblock.h>
 #include <linux/dma-map-ops.h>
 #include <linux/dma-direct.h>
@@ -5883,34 +5884,6 @@ static int intel_iommu_disable_auxd(struct device *dev)
 	return 0;
 }
 
-/*
- * A PCI express designated vendor specific extended capability is defined
- * in the section 3.7 of Intel scalable I/O virtualization technical spec
- * for system software and tools to detect endpoint devices supporting the
- * Intel scalable IO virtualization without host driver dependency.
- *
- * Returns the address of the matching extended capability structure within
- * the device's PCI configuration space or 0 if the device does not support
- * it.
- */
-static int siov_find_pci_dvsec(struct pci_dev *pdev)
-{
-	int pos;
-	u16 vendor, id;
-
-	pos = pci_find_next_ext_capability(pdev, 0, 0x23);
-	while (pos) {
-		pci_read_config_word(pdev, pos + 4, &vendor);
-		pci_read_config_word(pdev, pos + 8, &id);
-		if (vendor == PCI_VENDOR_ID_INTEL && id == 5)
-			return pos;
-
-		pos = pci_find_next_ext_capability(pdev, pos, 0x23);
-	}
-
-	return 0;
-}
-
 static bool
 intel_iommu_dev_has_feat(struct device *dev, enum iommu_dev_features feat)
 {
@@ -5925,7 +5898,7 @@ intel_iommu_dev_has_feat(struct device *dev, enum iommu_dev_features feat)
 		if (ret < 0)
 			return false;
 
-		return !!siov_find_pci_dvsec(to_pci_dev(dev));
+		return pci_siov_supported(to_pci_dev(dev));
 	}
 
 	if (feat == IOMMU_DEV_FEAT_SVA) {
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 0c473d75e625..cf7f4d17d8cc 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -161,6 +161,21 @@ config PCI_PASID
 
 	  If unsure, say N.
 
+config PCI_DVSEC
+	bool
+
+config PCI_SIOV
+	select PCI_PASID
+	select PCI_DVSEC
+	bool "PCI SIOV support"
+	help
+	  Scalable I/O Virtualization enables sharing of I/O devices across isolated
+	  domains through PASID based sub-device partitioning. One of the sub features
+	  supported by SIOV is Interrupt Message Storage (IMS). Select this option if
+	  you want to compile the support into your kernel.
+
+	  If unsure, say N.
+
 config PCI_P2PDMA
 	bool "PCI peer-to-peer transfer support"
 	depends on ZONE_DEVICE
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 522d2b974e91..653a1d69b0fc 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -20,6 +20,8 @@ obj-$(CONFIG_PCI_QUIRKS)	+= quirks.o
 obj-$(CONFIG_HOTPLUG_PCI)	+= hotplug/
 obj-$(CONFIG_PCI_MSI)		+= msi.o
 obj-$(CONFIG_PCI_ATS)		+= ats.o
+obj-$(CONFIG_PCI_DVSEC)		+= dvsec.o
+obj-$(CONFIG_PCI_SIOV)		+= siov.o
 obj-$(CONFIG_PCI_IOV)		+= iov.o
 obj-$(CONFIG_PCI_BRIDGE_EMUL)	+= pci-bridge-emul.o
 obj-$(CONFIG_PCI_LABEL)		+= pci-label.o
diff --git a/drivers/pci/dvsec.c b/drivers/pci/dvsec.c
new file mode 100644
index 000000000000..e49b079f0717
--- /dev/null
+++ b/drivers/pci/dvsec.c
@@ -0,0 +1,40 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI DVSEC helper functions
+ * Copyright (C) 2020 Intel Corp.
+ */
+
+#include <linux/export.h>
+#include <linux/pci.h>
+#include <uapi/linux/pci_regs.h>
+#include "pci.h"
+
+/**
+ * pci_find_dvsec - return position of DVSEC with provided vendor and dvsec id
+ * @dev: the PCI device
+ * @vendor: Vendor for the DVSEC
+ * @id: the DVSEC cap id
+ *
+ * Return the offset of DVSEC on success or -ENOTSUPP if not found
+ */
+int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id)
+{
+	u16 dev_vendor, dev_id;
+	int pos;
+
+	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DVSEC);
+	if (!pos)
+		return -ENOTSUPP;
+
+	while (pos) {
+		pci_read_config_word(dev, pos + PCI_DVSEC_HEADER1, &dev_vendor);
+		pci_read_config_word(dev, pos + PCI_DVSEC_HEADER2, &dev_id);
+		if (dev_vendor == vendor && dev_id == id)
+			return pos;
+
+		pos = pci_find_next_ext_capability(dev, pos, PCI_EXT_CAP_ID_DVSEC);
+	}
+
+	return -ENOTSUPP;
+}
+EXPORT_SYMBOL_GPL(pci_find_dvsec);
diff --git a/drivers/pci/siov.c b/drivers/pci/siov.c
new file mode 100644
index 000000000000..6147e6ae5832
--- /dev/null
+++ b/drivers/pci/siov.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Intel Scalable I/O Virtualization support
+ * Copyright (C) 2020 Intel Corp.
+ */
+
+#include <linux/export.h>
+#include <linux/pci.h>
+#include <linux/pci-siov.h>
+#include <uapi/linux/pci_regs.h>
+#include "pci.h"
+
+/*
+ * A PCI express designated vendor specific extended capability is defined
+ * in the section 3.7 of Intel scalable I/O virtualization technical spec
+ * for system software and tools to detect endpoint devices supporting the
+ * Intel scalable IO virtualization without host driver dependency.
+ */
+
+/**
+ * pci_siov_supported - check if the device can use SIOV
+ * @dev: the PCI device
+ *
+ * Returns true if the device supports SIOV, false otherwise.
+ */
+bool pci_siov_supported(struct pci_dev *dev)
+{
+	return pci_find_dvsec(dev, PCI_VENDOR_ID_INTEL, PCI_DVSEC_ID_INTEL_SIOV) >= 0;
+}
+EXPORT_SYMBOL_GPL(pci_siov_supported);
+
+/**
+ * pci_ims_supported - check if the device can use IMS
+ * @dev: the PCI device
+ *
+ * Returns true if the device supports IMS, false otherwise.
+ */
+bool pci_ims_supported(struct pci_dev *dev)
+{
+	int pos;
+	u32 caps;
+
+	pos = pci_find_dvsec(dev, PCI_VENDOR_ID_INTEL, PCI_DVSEC_ID_INTEL_SIOV);
+	if (pos < 0)
+		return false;
+
+	pci_read_config_dword(dev, pos + PCI_DVSEC_INTEL_SIOV_CAP, &caps);
+	return !!(caps & PCI_DVSEC_INTEL_SIOV_CAP_IMS);
+}
+EXPORT_SYMBOL_GPL(pci_ims_supported);
diff --git a/include/linux/pci-siov.h b/include/linux/pci-siov.h
new file mode 100644
index 000000000000..a8a4eb5f4634
--- /dev/null
+++ b/include/linux/pci-siov.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_PCI_SIOV_H
+#define LINUX_PCI_SIOV_H
+
+#include <linux/pci.h>
+
+#ifdef CONFIG_PCI_SIOV
+/* Scalable I/O Virtualization */
+bool pci_siov_supported(struct pci_dev *dev);
+bool pci_ims_supported(struct pci_dev *dev);
+#else /* CONFIG_PCI_SIOV */
+static inline bool pci_siov_supported(struct pci_dev *d)
+{ return false; }
+static inline bool pci_ims_supported(struct pci_dev *d)
+{ return false; }
+#endif /* CONFIG_PCI_SIOV */
+
+#endif /* LINUX_PCI_SIOV_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 22207a79762c..4710f09b43b1 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1070,6 +1070,7 @@ int pci_find_next_ext_capability(struct pci_dev *dev, int pos, int cap);
 int pci_find_ht_capability(struct pci_dev *dev, int ht_cap);
 int pci_find_next_ht_capability(struct pci_dev *dev, int pos, int ht_cap);
 struct pci_bus *pci_find_next_bus(const struct pci_bus *from);
+int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id);
 
 u64 pci_get_dsn(struct pci_dev *dev);
 
@@ -1726,6 +1727,8 @@ static inline int pci_find_next_capability(struct pci_dev *dev, u8 post,
 { return 0; }
 static inline int pci_find_ext_capability(struct pci_dev *dev, int cap)
 { return 0; }
+static inline int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id)
+{ return -ENOTSUPP; }
 
 static inline u64 pci_get_dsn(struct pci_dev *dev)
 { return 0; }
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 8f8bd2318c6c..3532528441ef 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1071,6 +1071,10 @@
 #define PCI_DVSEC_HEADER1		0x4 /* Designated Vendor-Specific Header1 */
 #define PCI_DVSEC_HEADER2		0x8 /* Designated Vendor-Specific Header2 */
 
+#define PCI_DVSEC_ID_INTEL_SIOV		0x5
+#define PCI_DVSEC_INTEL_SIOV_CAP	0x14
+#define PCI_DVSEC_INTEL_SIOV_CAP_IMS	0x1
+
 /* Data Link Feature */
 #define PCI_DLF_CAP		0x04	/* Capabilities Register */
 #define  PCI_DLF_EXCHANGE_ENABLE	0x80000000  /* Data Link Feature Exchange Enable */




* [PATCH v4 07/17] dmaengine: idxd: add IMS support in base driver
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (5 preceding siblings ...)
  2020-10-30 18:51 ` [PATCH v4 06/17] PCI: add SIOV and IMS capability detection Dave Jiang
@ 2020-10-30 18:51 ` Dave Jiang
  2020-10-30 18:51 ` [PATCH v4 08/17] dmaengine: idxd: add device support functions in prep for mdev Dave Jiang
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:51 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

In preparation for VFIO mediated device support in the idxd driver, add
enabling for Interrupt Message Store (IMS) interrupts to the idxd driver.
With IMS support, the idxd driver can dynamically allocate interrupts on a
per-mdev basis, based on how many IMS vectors are mapped to the mdev
device. This commit only provides the support functions in the base driver
and not the VFIO mdev code that uses them.

The commit has some portal related changes. A "portal" is a special
location within the MMIO BAR2 of the DSA device where descriptors are
submitted via the CPU instructions MOVDIR64B or ENQCMD(S). The offset of
the portal address determines whether the submitted descriptor is for
MSI-X or IMS notification.
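
The portal offset arithmetic introduced by this patch can be checked with
a small standalone sketch. The function names below are hypothetical
userspace stand-ins for idxd_get_wq_portal_offset() and
idxd_get_wq_portal_full_offset(). The sketch assumes 4 KiB pages
(PAGE_SHIFT of 12) and assumes the unlimited portal is enum value 0, which
is not visible in this diff.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT	12	/* assumption: 4 KiB pages */

/* Assumed ordering; only the LIMITED value is visible in this diff. */
enum portal_prot { PORTAL_UNLIMITED = 0, PORTAL_LIMITED };
enum irq_type { IRQ_MSIX = 0, IRQ_IMS };

/* Offset of one portal page within a WQ's four-page portal region. */
int wq_portal_offset(enum portal_prot prot, enum irq_type irq)
{
	return prot * 0x1000 + irq * 0x2000;
}

/* Offset of that portal page from the start of the portal BAR. */
int wq_portal_full_offset(int wq_id, enum portal_prot prot, enum irq_type irq)
{
	assert(wq_id >= 0);
	return ((wq_id * 4) << DEMO_PAGE_SHIFT) + wq_portal_offset(prot, irq);
}
```

Under these assumptions each WQ owns four consecutive pages: unlimited
MSI-X, limited MSI-X, unlimited IMS and limited IMS, in that order.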

See Intel SIOV spec for more details:
https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 Documentation/ABI/stable/sysfs-driver-dma-idxd |    6 ++++++
 drivers/dma/idxd/cdev.c                        |    4 ++--
 drivers/dma/idxd/idxd.h                        |   13 +++++++++----
 drivers/dma/idxd/init.c                        |   19 +++++++++++++++++++
 drivers/dma/idxd/submit.c                      |   10 ++++++++--
 drivers/dma/idxd/sysfs.c                       |    9 +++++++++
 6 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-driver-dma-idxd b/Documentation/ABI/stable/sysfs-driver-dma-idxd
index 5ea81ffd3c1a..ed5aeecf7015 100644
--- a/Documentation/ABI/stable/sysfs-driver-dma-idxd
+++ b/Documentation/ABI/stable/sysfs-driver-dma-idxd
@@ -129,6 +129,12 @@ KernelVersion:	5.10.0
 Contact:	dmaengine@vger.kernel.org
 Description:	The last executed device administrative command's status/error.
 
+What:		/sys/bus/dsa/devices/dsa<m>/ims_size
+Date:		Oct 15, 2020
+KernelVersion:	5.11.0
+Contact:	dmaengine@vger.kernel.org
+Description:	The total number of vectors available for Interrupt Message Store.
+
 What:		/sys/bus/dsa/devices/wq<m>.<n>/block_on_fault
 Date:		Oct 27, 2020
 KernelVersion:	5.11.0
diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
index 010b820d8f74..b774bf336347 100644
--- a/drivers/dma/idxd/cdev.c
+++ b/drivers/dma/idxd/cdev.c
@@ -204,8 +204,8 @@ static int idxd_cdev_mmap(struct file *filp, struct vm_area_struct *vma)
 		return rc;
 
 	vma->vm_flags |= VM_DONTCOPY;
-	pfn = (base + idxd_get_wq_portal_full_offset(wq->id,
-				IDXD_PORTAL_LIMITED)) >> PAGE_SHIFT;
+	pfn = (base + idxd_get_wq_portal_full_offset(wq->id, IDXD_PORTAL_LIMITED,
+						     IDXD_IRQ_MSIX)) >> PAGE_SHIFT;
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_private_data = ctx;
 
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index a506a16c83ee..549426bfb443 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -154,6 +154,7 @@ enum idxd_device_flag {
 	IDXD_FLAG_CONFIGURABLE = 0,
 	IDXD_FLAG_CMD_RUNNING,
 	IDXD_FLAG_PASID_ENABLED,
+	IDXD_FLAG_SIOV_SUPPORTED,
 };
 
 struct idxd_device {
@@ -181,6 +182,7 @@ struct idxd_device {
 
 	int num_groups;
 
+	u32 ims_offset;
 	u32 msix_perm_offset;
 	u32 wqcfg_offset;
 	u32 grpcfg_offset;
@@ -188,6 +190,7 @@ struct idxd_device {
 
 	u64 max_xfer_bytes;
 	u32 max_batch_size;
+	int ims_size;
 	int max_groups;
 	int max_engines;
 	int max_tokens;
@@ -262,15 +265,17 @@ enum idxd_interrupt_type {
 	IDXD_IRQ_IMS,
 };
 
-static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot)
+static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot,
+					    enum idxd_interrupt_type irq_type)
 {
-	return prot * 0x1000;
+	return prot * 0x1000 + irq_type * 0x2000;
 }
 
 static inline int idxd_get_wq_portal_full_offset(int wq_id,
-						 enum idxd_portal_prot prot)
+						 enum idxd_portal_prot prot,
+						 enum idxd_interrupt_type irq_type)
 {
-	return ((wq_id * 4) << PAGE_SHIFT) + idxd_get_wq_portal_offset(prot);
+	return ((wq_id * 4) << PAGE_SHIFT) + idxd_get_wq_portal_offset(prot, irq_type);
 }
 
 static inline void idxd_set_type(struct idxd_device *idxd)
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index c136216e19e8..4a21c2a17a62 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -16,6 +16,7 @@
 #include <linux/idr.h>
 #include <linux/intel-svm.h>
 #include <linux/iommu.h>
+#include <linux/pci-siov.h>
 #include <uapi/linux/idxd.h>
 #include <linux/dmaengine.h>
 #include "../dmaengine.h"
@@ -244,10 +245,27 @@ static void idxd_read_table_offsets(struct idxd_device *idxd)
 	dev_dbg(dev, "IDXD Work Queue Config Offset: %#x\n", idxd->wqcfg_offset);
 	idxd->msix_perm_offset = offsets.msix_perm * IDXD_TABLE_MULT;
 	dev_dbg(dev, "IDXD MSIX Permission Offset: %#x\n", idxd->msix_perm_offset);
+	idxd->ims_offset = offsets.ims * IDXD_TABLE_MULT;
+	dev_dbg(dev, "IDXD IMS Offset: %#x\n", idxd->ims_offset);
 	idxd->perfmon_offset = offsets.perfmon * IDXD_TABLE_MULT;
 	dev_dbg(dev, "IDXD Perfmon Offset: %#x\n", idxd->perfmon_offset);
 }
 
+static void idxd_check_siov(struct idxd_device *idxd)
+{
+	struct pci_dev *pdev = idxd->pdev;
+
+	if (pci_ims_supported(idxd->pdev) && idxd->hw.gen_cap.max_ims_mult) {
+		idxd->ims_size = idxd->hw.gen_cap.max_ims_mult * 256ULL;
+		dev_dbg(&pdev->dev, "IMS size: %u\n", idxd->ims_size);
+		set_bit(IDXD_FLAG_SIOV_SUPPORTED, &idxd->flags);
+		dev_dbg(&pdev->dev, "IMS supported for device\n");
+		return;
+	}
+
+	dev_dbg(&pdev->dev, "SIOV unsupported for device\n");
+}
+
 static void idxd_read_caps(struct idxd_device *idxd)
 {
 	struct device *dev = &idxd->pdev->dev;
@@ -266,6 +284,7 @@ static void idxd_read_caps(struct idxd_device *idxd)
 	dev_dbg(dev, "max xfer size: %llu bytes\n", idxd->max_xfer_bytes);
 	idxd->max_batch_size = 1U << idxd->hw.gen_cap.max_batch_shift;
 	dev_dbg(dev, "max batch size: %u\n", idxd->max_batch_size);
+	idxd_check_siov(idxd);
 	if (idxd->hw.gen_cap.config_en)
 		set_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags);
 
diff --git a/drivers/dma/idxd/submit.c b/drivers/dma/idxd/submit.c
index cdea5d37ef24..f76d154d1dbd 100644
--- a/drivers/dma/idxd/submit.c
+++ b/drivers/dma/idxd/submit.c
@@ -30,7 +30,13 @@ static struct idxd_desc *__get_desc(struct idxd_wq *wq, int idx, int cpu)
 		desc->hw->int_handle = wq->vec_ptr;
 	} else {
 		desc->vector = wq->vec_ptr;
-		desc->hw->int_handle = idxd->int_handles[desc->vector];
+		/*
+		 * int_handles are only for descriptor completion. However for device
+		 * MSIX enumeration, vec 0 is used for misc interrupts. Therefore even
+		 * though we are rotating through 1...N for descriptor interrupts, we
+		 * need to acquire the int_handles from 0..N-1.
+		 */
+		desc->hw->int_handle = idxd->int_handles[desc->vector - 1];
 	}
 
 	return desc;
@@ -91,7 +97,7 @@ int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc)
 	if (idxd->state != IDXD_DEV_ENABLED)
 		return -EIO;
 
-	portal = wq->portal + idxd_get_wq_portal_offset(IDXD_PORTAL_LIMITED);
+	portal = wq->portal + idxd_get_wq_portal_offset(IDXD_PORTAL_LIMITED, IDXD_IRQ_MSIX);
 
 	/*
 	 * The wmb() flushes writes to coherent DMA data before
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index 304eb2cf532e..17f13ebae028 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -1353,6 +1353,14 @@ static ssize_t numa_node_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(numa_node);
 
+static ssize_t ims_size_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct idxd_device *idxd = container_of(dev, struct idxd_device, conf_dev);
+
+	return sprintf(buf, "%u\n", idxd->ims_size);
+}
+static DEVICE_ATTR_RO(ims_size);
+
 static ssize_t max_batch_size_show(struct device *dev,
 				   struct device_attribute *attr, char *buf)
 {
@@ -1548,6 +1556,7 @@ static struct attribute *idxd_device_attributes[] = {
 	&dev_attr_max_work_queues_size.attr,
 	&dev_attr_max_engines.attr,
 	&dev_attr_numa_node.attr,
+	&dev_attr_ims_size.attr,
 	&dev_attr_max_batch_size.attr,
 	&dev_attr_max_transfer_size.attr,
 	&dev_attr_op_cap.attr,




* [PATCH v4 08/17] dmaengine: idxd: add device support functions in prep for mdev
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (6 preceding siblings ...)
  2020-10-30 18:51 ` [PATCH v4 07/17] dmaengine: idxd: add IMS support in base driver Dave Jiang
@ 2020-10-30 18:51 ` Dave Jiang
  2020-10-30 18:51 ` [PATCH v4 09/17] dmaengine: idxd: add basic mdev registration and helper functions Dave Jiang
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:51 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

Add device support helper functions in preparation for adding VFIO
mdev support.
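
One of the helpers, idxd_wq_abort(), packs the WQ id into the command
operand as a bit within a group of 16 plus a group index. A minimal
userspace sketch of that encoding follows; wq_cmd_operand() is a
hypothetical name, and the expression mirrors the one in the diff below.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Encode a WQ id as a command operand: bits 15:0 carry a one-hot mask
 * selecting the WQ within its group of 16, and the bits from 16 upward
 * carry the group index (wq_id / 16).
 */
uint32_t wq_cmd_operand(unsigned int wq_id)
{
	/* Keep the group index within the upper 16 bits of the operand. */
	assert(wq_id / 16 <= 0xffff);
	return (UINT32_C(1) << (wq_id % 16)) | ((uint32_t)(wq_id / 16) << 16);
}
```

For example, WQ 17 is bit 1 of group 1, which gives an operand of 0x10002.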

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/device.c    |   61 ++++++++++++++++++++++++++++++++++++++++++
 drivers/dma/idxd/idxd.h      |    4 +++
 drivers/dma/idxd/registers.h |    3 +-
 3 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index a9ae970db0a4..8aff07b1acb4 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -287,6 +287,30 @@ void idxd_wq_unmap_portal(struct idxd_wq *wq)
 	devm_iounmap(dev, wq->portal);
 }
 
+int idxd_wq_abort(struct idxd_wq *wq)
+{
+	struct idxd_device *idxd = wq->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	u32 operand, status;
+
+	dev_dbg(dev, "Abort WQ %d\n", wq->id);
+	if (wq->state != IDXD_WQ_ENABLED) {
+		dev_dbg(dev, "WQ %d not active\n", wq->id);
+		return -ENXIO;
+	}
+
+	operand = BIT(wq->id % 16) | ((wq->id / 16) << 16);
+	dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_ABORT_WQ, operand);
+	idxd_cmd_exec(idxd, IDXD_CMD_ABORT_WQ, operand, &status);
+	if (status != IDXD_CMDSTS_SUCCESS) {
+		dev_dbg(dev, "WQ abort failed: %#x\n", status);
+		return -ENXIO;
+	}
+
+	dev_dbg(dev, "WQ %d aborted\n", wq->id);
+	return 0;
+}
+
 int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid)
 {
 	struct idxd_device *idxd = wq->idxd;
@@ -366,6 +390,32 @@ void idxd_wq_disable_cleanup(struct idxd_wq *wq)
 	}
 }
 
+void idxd_wq_setup_pasid(struct idxd_wq *wq, int pasid)
+{
+	struct idxd_device *idxd = wq->idxd;
+	int offset;
+
+	lockdep_assert_held(&idxd->dev_lock);
+
+	/* PASID fields are 8 bytes into the WQCFG register */
+	offset = WQCFG_OFFSET(idxd, wq->id, WQCFG_PASID_IDX);
+	wq->wqcfg->pasid = pasid;
+	iowrite32(wq->wqcfg->bits[WQCFG_PASID_IDX], idxd->reg_base + offset);
+}
+
+void idxd_wq_setup_priv(struct idxd_wq *wq, int priv)
+{
+	struct idxd_device *idxd = wq->idxd;
+	int offset;
+
+	lockdep_assert_held(&idxd->dev_lock);
+
+	/* priv field is 8 bytes into the WQCFG register */
+	offset = WQCFG_OFFSET(idxd, wq->id, WQCFG_PRIV_IDX);
+	wq->wqcfg->priv = !!priv;
+	iowrite32(wq->wqcfg->bits[WQCFG_PRIV_IDX], idxd->reg_base + offset);
+}
+
 /* Device control bits */
 static inline bool idxd_is_enabled(struct idxd_device *idxd)
 {
@@ -532,6 +582,17 @@ void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid)
 	dev_dbg(dev, "pasid %d drained\n", pasid);
 }
 
+void idxd_device_abort_pasid(struct idxd_device *idxd, int pasid)
+{
+	struct device *dev = &idxd->pdev->dev;
+	u32 operand;
+
+	operand = pasid;
+	dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_ABORT_PASID, operand);
+	idxd_cmd_exec(idxd, IDXD_CMD_ABORT_PASID, operand, NULL);
+	dev_dbg(dev, "pasid %d aborted\n", pasid);
+}
+
 int idxd_device_request_int_handle(struct idxd_device *idxd, int idx, int *handle,
 				   enum idxd_interrupt_type irq_type)
 {
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 549426bfb443..eb8552d32a0a 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -331,6 +331,7 @@ void idxd_device_cleanup(struct idxd_device *idxd);
 int idxd_device_config(struct idxd_device *idxd);
 void idxd_device_wqs_clear_state(struct idxd_device *idxd);
 void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid);
+void idxd_device_abort_pasid(struct idxd_device *idxd, int pasid);
 int idxd_device_load_config(struct idxd_device *idxd);
 int idxd_device_request_int_handle(struct idxd_device *idxd, int idx, int *handle,
 				   enum idxd_interrupt_type irq_type);
@@ -348,6 +349,9 @@ void idxd_wq_unmap_portal(struct idxd_wq *wq);
 void idxd_wq_disable_cleanup(struct idxd_wq *wq);
 int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid);
 int idxd_wq_disable_pasid(struct idxd_wq *wq);
+int idxd_wq_abort(struct idxd_wq *wq);
+void idxd_wq_setup_pasid(struct idxd_wq *wq, int pasid);
+void idxd_wq_setup_priv(struct idxd_wq *wq, int priv);
 
 /* submission */
 int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc);
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index d02fd59a8e39..acc071df48eb 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -345,7 +345,8 @@ union wqcfg {
 	u32 bits[8];
 } __packed;
 
-#define WQCFG_PASID_IDX                2
+#define WQCFG_PASID_IDX		2
+#define WQCFG_PRIV_IDX		2
 
 /*
  * This macro calculates the offset into the WQCFG register




* [PATCH v4 09/17] dmaengine: idxd: add basic mdev registration and helper functions
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (7 preceding siblings ...)
  2020-10-30 18:51 ` [PATCH v4 08/17] dmaengine: idxd: add device support functions in prep for mdev Dave Jiang
@ 2020-10-30 18:51 ` Dave Jiang
  2020-10-30 18:51 ` [PATCH v4 10/17] dmaengine: idxd: add emulation rw routines Dave Jiang
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:51 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

Create a mediated device through the VFIO mediated device framework. The
mdev framework allows the driver to create a mediated device backed by a
portion of the device's resources. The driver will emulate the slow path,
such as the PCI config space, MMIO BAR, and the command registers. The
descriptor submission portal(s) will be mmapped to the guest so that the
guest kernel or apps can submit descriptors directly. The mediated
device support code in idxd will be referred to as the Virtual
Device Composition Module (vdcm). Add basic plumbing to fill out the
mdev_parent_ops struct that VFIO mdev requires to support a mediated
device.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
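A note on the emulated slow path described in the commit message: the
read/write handlers in this patch take a VFIO file offset, split it into a
region index and an offset within that region, and then break the transfer
into naturally aligned 4-, 2- or 1-byte accesses. The user-space sketch
below mirrors that logic for illustration only. REGION_SHIFT is assumed to
equal VFIO_PCI_OFFSET_SHIFT (40) from vfio_pci_private.h, and every helper
name here is hypothetical, not part of the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: same value as VFIO_PCI_OFFSET_SHIFT in vfio_pci_private.h. */
#define REGION_SHIFT 40
#define REGION_MASK ((UINT64_C(1) << REGION_SHIFT) - 1)

/* Upper bits of the file offset select the VFIO region (config, BAR0, ...). */
static unsigned int region_index(uint64_t ppos)
{
	return (unsigned int)(ppos >> REGION_SHIFT);
}

/* Lower bits are the byte offset inside the selected region. */
static uint64_t region_pos(uint64_t ppos)
{
	return ppos & REGION_MASK;
}

/* Largest naturally aligned access (4, 2 or 1 bytes) for the next chunk. */
static size_t next_access_size(uint64_t ppos, size_t count)
{
	assert(count > 0);
	if (count >= 4 && !(ppos % 4))
		return 4;
	if (count >= 2 && !(ppos % 2))
		return 2;
	return 1;
}

/* Number of emulated accesses a transfer of count bytes is split into. */
static unsigned int count_accesses(uint64_t ppos, size_t count)
{
	unsigned int n = 0;

	while (count) {
		size_t sz = next_access_size(ppos, count);

		ppos += sz;
		count -= sz;
		n++;
	}
	return n;
}
```

For example, a 7-byte transfer starting at offset 1 would be emulated as a
1-byte, a 2-byte and a 4-byte access, in that order.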
 drivers/dma/Kconfig       |    7 
 drivers/dma/idxd/Makefile |    2 
 drivers/dma/idxd/idxd.h   |   14 +
 drivers/dma/idxd/init.c   |   11 +
 drivers/dma/idxd/mdev.c   |  968 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/dma/idxd/mdev.h   |  115 +++++
 drivers/dma/idxd/vdev.c   |   75 +++
 drivers/dma/idxd/vdev.h   |   19 +
 8 files changed, 1211 insertions(+)
 create mode 100644 drivers/dma/idxd/mdev.c
 create mode 100644 drivers/dma/idxd/mdev.h
 create mode 100644 drivers/dma/idxd/vdev.c
 create mode 100644 drivers/dma/idxd/vdev.h
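
A note on the mmapped submission portals: the mmap handler in this patch
derives a wq index and a requested portal type from the page offset inside
the region, and grants the unlimited portal only for a dedicated wq. The
sketch below restates that decision in user-space C. It assumes 4 KiB pages
(SKETCH_PAGE_SHIFT); the type and function names are hypothetical and only
mirror the logic as posted.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

enum portal_prot {
	PORTAL_UNLIMITED = 0,
	PORTAL_LIMITED,
};

struct portal_req {
	unsigned int wq_idx;   /* which wq the guest is addressing */
	enum portal_prot virt; /* portal type the guest asked for */
};

/* Each wq owns a group of four pages; page 1 of a group is the limited portal. */
static struct portal_req decode_portal(uint64_t offset)
{
	struct portal_req req;

	req.wq_idx = (unsigned int)(offset >> (SKETCH_PAGE_SHIFT + 2));
	req.virt = ((offset >> SKETCH_PAGE_SHIFT) & 0x3) == 1 ?
		   PORTAL_LIMITED : PORTAL_UNLIMITED;
	return req;
}

/* The host backs an unlimited request only when the wq is dedicated. */
static enum portal_prot host_portal(enum portal_prot virt, bool wq_is_dedicated)
{
	if (virt == PORTAL_UNLIMITED && wq_is_dedicated)
		return PORTAL_UNLIMITED;
	return PORTAL_LIMITED;
}
```

With a single exported wq, any offset that decodes to a wq index of 1 or
more is rejected by the handler, so only the first four-page group is valid.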

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 6a908785a5f7..c5970e4a3a2c 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -306,6 +306,13 @@ config INTEL_IDXD_SVM
 	depends on PCI_PASID
 	depends on PCI_IOV
 
+config INTEL_IDXD_MDEV
+	bool "IDXD VFIO Mediated Device Support"
+	depends on INTEL_IDXD
+	depends on VFIO_MDEV
+	depends on VFIO_MDEV_DEVICE
+	select PCI_SIOV
+
 config INTEL_IOATDMA
 	tristate "Intel I/OAT DMA support"
 	depends on PCI && X86_64
diff --git a/drivers/dma/idxd/Makefile b/drivers/dma/idxd/Makefile
index 8978b898d777..30cad704a95a 100644
--- a/drivers/dma/idxd/Makefile
+++ b/drivers/dma/idxd/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_INTEL_IDXD) += idxd.o
 idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o cdev.o
+
+idxd-$(CONFIG_INTEL_IDXD_MDEV) += mdev.o vdev.o
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index eb8552d32a0a..ab28a1bffb7c 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -8,6 +8,7 @@
 #include <linux/percpu-rwsem.h>
 #include <linux/wait.h>
 #include <linux/cdev.h>
+#include <linux/mdev.h>
 #include "registers.h"
 
 #define IDXD_DRIVER_VERSION	"1.00"
@@ -123,6 +124,7 @@ struct idxd_wq {
 	char name[WQ_NAME_SIZE + 1];
 	u64 max_xfer_bytes;
 	u32 max_batch_size;
+	struct list_head vdcm_list;
 };
 
 struct idxd_engine {
@@ -155,6 +157,7 @@ enum idxd_device_flag {
 	IDXD_FLAG_CMD_RUNNING,
 	IDXD_FLAG_PASID_ENABLED,
 	IDXD_FLAG_SIOV_SUPPORTED,
+	IDXD_FLAG_MDEV_ENABLED,
 };
 
 struct idxd_device {
@@ -250,11 +253,17 @@ static inline bool device_pasid_enabled(struct idxd_device *idxd)
 	return test_bit(IDXD_FLAG_PASID_ENABLED, &idxd->flags);
 }
 
+
 static inline bool device_swq_supported(struct idxd_device *idxd)
 {
 	return (support_enqcmd && device_pasid_enabled(idxd));
 }
 
+static inline bool device_mdev_enabled(struct idxd_device *idxd)
+{
+	return test_bit(IDXD_FLAG_MDEV_ENABLED, &idxd->flags);
+}
+
 enum idxd_portal_prot {
 	IDXD_PORTAL_UNLIMITED = 0,
 	IDXD_PORTAL_LIMITED,
@@ -375,4 +384,9 @@ int idxd_cdev_get_major(struct idxd_device *idxd);
 int idxd_wq_add_cdev(struct idxd_wq *wq);
 void idxd_wq_del_cdev(struct idxd_wq *wq);
 
+/* mdev */
+int idxd_mdev_host_init(struct idxd_device *idxd);
+void idxd_mdev_host_release(struct idxd_device *idxd);
+int idxd_mdev_get_pasid(struct mdev_device *mdev);
+
 #endif
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 4a21c2a17a62..ab91293aedb9 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -218,6 +218,7 @@ static int idxd_setup_internals(struct idxd_device *idxd)
 		wq->wqcfg = devm_kzalloc(dev, idxd->wqcfg_size, GFP_KERNEL);
 		if (!wq->wqcfg)
 			return -ENOMEM;
+		INIT_LIST_HEAD(&wq->vdcm_list);
 	}
 
 	for (i = 0; i < idxd->max_engines; i++) {
@@ -479,6 +480,14 @@ static int idxd_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		return -ENODEV;
 	}
 
+	if (IS_ENABLED(CONFIG_INTEL_IDXD_MDEV)) {
+		rc = idxd_mdev_host_init(idxd);
+		if (rc < 0)
+			dev_warn(dev, "VFIO mdev not setup: %d\n", rc);
+		else
+			set_bit(IDXD_FLAG_MDEV_ENABLED, &idxd->flags);
+	}
+
 	rc = idxd_setup_sysfs(idxd);
 	if (rc) {
 		dev_err(dev, "IDXD sysfs setup failed\n");
@@ -572,6 +581,8 @@ static void idxd_remove(struct pci_dev *pdev)
 	dev_dbg(&pdev->dev, "%s called\n", __func__);
 	idxd_cleanup_sysfs(idxd);
 	idxd_shutdown(pdev);
+	if (IS_ENABLED(CONFIG_INTEL_IDXD_MDEV) && device_mdev_enabled(idxd))
+		idxd_mdev_host_release(idxd);
 	if (device_pasid_enabled(idxd))
 		idxd_disable_system_pasid(idxd);
 	mutex_lock(&idxd_idr_lock);
diff --git a/drivers/dma/idxd/mdev.c b/drivers/dma/idxd/mdev.c
new file mode 100644
index 000000000000..3b6febe22a0e
--- /dev/null
+++ b/drivers/dma/idxd/mdev.c
@@ -0,0 +1,968 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/sched/task.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/mm.h>
+#include <linux/mmu_context.h>
+#include <linux/vfio.h>
+#include <linux/mdev.h>
+#include <linux/msi.h>
+#include <linux/intel-iommu.h>
+#include <linux/intel-svm.h>
+#include <linux/kvm_host.h>
+#include <linux/eventfd.h>
+#include <linux/circ_buf.h>
+#include <linux/irqchip/irq-ims-msi.h>
+#include <uapi/linux/idxd.h>
+#include "registers.h"
+#include "idxd.h"
+#include "../../vfio/pci/vfio_pci_private.h"
+#include "mdev.h"
+#include "vdev.h"
+
+static u64 idxd_pci_config[] = {
+	0x001000000b258086ULL,
+	0x0080000008800000ULL,
+	0x000000000000000cULL,
+	0x000000000000000cULL,
+	0x0000000000000000ULL,
+	0x2010808600000000ULL,
+	0x0000004000000000ULL,
+	0x000000ff00000000ULL,
+	0x0000060000015011ULL, /* MSI-X capability, hardcoded 2 entries, Encoded as N-1 */
+	0x0000070000000000ULL,
+	0x0000000000920010ULL, /* PCIe capability */
+	0x0000000000000000ULL,
+	0x0000000000000000ULL,
+	0x0000000000000000ULL,
+	0x0000000000000000ULL,
+	0x0000000000000000ULL,
+	0x0000000000000000ULL,
+	0x0000000000000000ULL,
+};
+
+static int idxd_vdcm_set_irqs(struct vdcm_idxd *vidxd, uint32_t flags, unsigned int index,
+			      unsigned int start, unsigned int count, void *data);
+
+int idxd_mdev_get_pasid(struct mdev_device *mdev)
+{
+	struct iommu_domain *domain;
+	struct device *dev = mdev_dev(mdev);
+
+	domain = iommu_get_domain_for_dev(dev);
+	if (!domain)
+		return -ENODEV;
+
+	return iommu_aux_get_pasid(domain, dev->parent);
+}
+
+static inline void reset_vconfig(struct vdcm_idxd *vidxd)
+{
+	memset(vidxd->cfg, 0, VIDXD_MAX_CFG_SPACE_SZ);
+	memcpy(vidxd->cfg, idxd_pci_config, sizeof(idxd_pci_config));
+}
+
+static inline void reset_vmmio(struct vdcm_idxd *vidxd)
+{
+	memset(&vidxd->bar0, 0, VIDXD_MAX_MMIO_SPACE_SZ);
+}
+
+static void idxd_vdcm_init(struct vdcm_idxd *vidxd)
+{
+	struct idxd_wq *wq = vidxd->wq;
+
+	reset_vconfig(vidxd);
+	reset_vmmio(vidxd);
+
+	vidxd->bar_size[0] = VIDXD_BAR0_SIZE;
+	vidxd->bar_size[1] = VIDXD_BAR2_SIZE;
+
+	vidxd_mmio_init(vidxd);
+
+	if (wq_dedicated(wq) && wq->state == IDXD_WQ_ENABLED)
+		idxd_wq_disable(wq);
+}
+
+static void idxd_vdcm_release(struct mdev_device *mdev)
+{
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	struct device *dev = mdev_dev(mdev);
+
+	dev_dbg(dev, "vdcm_idxd_release %d\n", vidxd->type->type);
+	mutex_lock(&vidxd->dev_lock);
+	if (!vidxd->refcount)
+		goto out;
+
+	idxd_vdcm_set_irqs(vidxd, VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
+			   VFIO_PCI_MSIX_IRQ_INDEX, 0, 0, NULL);
+
+	vidxd_free_ims_entries(vidxd);
+
+	/* Re-initialize the VIDXD to a pristine state for re-use */
+	idxd_vdcm_init(vidxd);
+	vidxd->refcount--;
+
+ out:
+	mutex_unlock(&vidxd->dev_lock);
+}
+
+static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev_device *mdev,
+					   struct vdcm_idxd_type *type)
+{
+	struct vdcm_idxd *vidxd;
+	struct idxd_wq *wq = NULL;
+
+	/* PLACEHOLDER, wq matching comes later */
+
+	if (!wq)
+		return ERR_PTR(-ENODEV);
+
+	vidxd = kzalloc(sizeof(*vidxd), GFP_KERNEL);
+	if (!vidxd)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&vidxd->dev_lock);
+	vidxd->idxd = idxd;
+	vidxd->vdev.mdev = mdev;
+	vidxd->wq = wq;
+	mdev_set_drvdata(mdev, vidxd);
+	vidxd->type = type;
+	vidxd->num_wqs = VIDXD_MAX_WQS;
+
+	idxd_vdcm_init(vidxd);
+	mutex_lock(&wq->wq_lock);
+	idxd_wq_get(wq);
+	mutex_unlock(&wq->wq_lock);
+
+	return vidxd;
+}
+
+static struct vdcm_idxd_type idxd_mdev_types[IDXD_MDEV_TYPES];
+
+static struct vdcm_idxd_type *idxd_vdcm_find_vidxd_type(struct device *dev,
+							const char *name)
+{
+	int i;
+	char dev_name[IDXD_MDEV_NAME_LEN];
+
+	for (i = 0; i < IDXD_MDEV_TYPES; i++) {
+		snprintf(dev_name, IDXD_MDEV_NAME_LEN, "idxd-%s",
+			 idxd_mdev_types[i].name);
+
+		if (!strncmp(name, dev_name, IDXD_MDEV_NAME_LEN))
+			return &idxd_mdev_types[i];
+	}
+
+	return NULL;
+}
+
+static int idxd_vdcm_create(struct kobject *kobj, struct mdev_device *mdev)
+{
+	struct vdcm_idxd *vidxd;
+	struct vdcm_idxd_type *type;
+	struct device *dev, *parent;
+	struct idxd_device *idxd;
+	struct idxd_wq *wq;
+
+	parent = mdev_parent_dev(mdev);
+	idxd = dev_get_drvdata(parent);
+	dev = mdev_dev(mdev);
+
+	type = idxd_vdcm_find_vidxd_type(dev, kobject_name(kobj));
+	if (!type) {
+		dev_err(dev, "failed to find type %s to create\n",
+			kobject_name(kobj));
+		return -EINVAL;
+	}
+
+	vidxd = vdcm_vidxd_create(idxd, mdev, type);
+	if (IS_ERR(vidxd)) {
+		dev_err(dev, "failed to create vidxd: %ld\n", PTR_ERR(vidxd));
+		return PTR_ERR(vidxd);
+	}
+
+	wq = vidxd->wq;
+	mutex_lock(&wq->wq_lock);
+	list_add(&vidxd->list, &wq->vdcm_list);
+	mutex_unlock(&wq->wq_lock);
+	dev_dbg(dev, "mdev creation success: %s\n", dev_name(mdev_dev(mdev)));
+
+	return 0;
+}
+
+static int idxd_vdcm_remove(struct mdev_device *mdev)
+{
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	struct idxd_device *idxd = vidxd->idxd;
+	struct device *dev = &idxd->pdev->dev;
+	struct idxd_wq *wq = vidxd->wq;
+
+	dev_dbg(dev, "%s: removing for wq %d\n", __func__, vidxd->wq->id);
+
+	mutex_lock(&wq->wq_lock);
+	list_del(&vidxd->list);
+	idxd_wq_put(wq);
+	mutex_unlock(&wq->wq_lock);
+
+	kfree(vidxd);
+	return 0;
+}
+
+static int idxd_vdcm_open(struct mdev_device *mdev)
+{
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	int rc = 0;
+	struct vdcm_idxd_type *type = vidxd->type;
+	struct device *dev = mdev_dev(mdev);
+
+	dev_dbg(dev, "%s: type: %d\n", __func__, type->type);
+
+	mutex_lock(&vidxd->dev_lock);
+	if (vidxd->refcount)
+		goto out;
+
+	/* allocate and setup IMS entries */
+	rc = vidxd_setup_ims_entries(vidxd);
+	if (rc < 0)
+		goto out;
+
+	vidxd->refcount++;
+	mutex_unlock(&vidxd->dev_lock);
+
+	return rc;
+
+ out:
+	mutex_unlock(&vidxd->dev_lock);
+	return rc;
+}
+
+static ssize_t idxd_vdcm_rw(struct mdev_device *mdev, char *buf, size_t count, loff_t *ppos,
+			    enum idxd_vdcm_rw mode)
+{
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	struct device *dev = mdev_dev(mdev);
+	int rc = -EINVAL;
+
+	if (index >= VFIO_PCI_NUM_REGIONS) {
+		dev_err(dev, "invalid index: %u\n", index);
+		return -EINVAL;
+	}
+
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		if (mode == IDXD_VDCM_WRITE)
+			rc = vidxd_cfg_write(vidxd, pos, buf, count);
+		else
+			rc = vidxd_cfg_read(vidxd, pos, buf, count);
+		break;
+	case VFIO_PCI_BAR0_REGION_INDEX:
+	case VFIO_PCI_BAR1_REGION_INDEX:
+		if (mode == IDXD_VDCM_WRITE)
+			rc = vidxd_mmio_write(vidxd, vidxd->bar_val[0] + pos, buf, count);
+		else
+			rc = vidxd_mmio_read(vidxd, vidxd->bar_val[0] + pos, buf, count);
+		break;
+	case VFIO_PCI_BAR2_REGION_INDEX:
+	case VFIO_PCI_BAR3_REGION_INDEX:
+	case VFIO_PCI_BAR4_REGION_INDEX:
+	case VFIO_PCI_BAR5_REGION_INDEX:
+	case VFIO_PCI_VGA_REGION_INDEX:
+	case VFIO_PCI_ROM_REGION_INDEX:
+	default:
+		dev_err(dev, "unsupported region: %u\n", index);
+	}
+
+	return rc == 0 ? count : rc;
+}
+
+static ssize_t idxd_vdcm_read(struct mdev_device *mdev, char __user *buf, size_t count,
+			      loff_t *ppos)
+{
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	unsigned int done = 0;
+	int rc;
+
+	mutex_lock(&vidxd->dev_lock);
+	while (count) {
+		size_t filled;
+
+		if (count >= 4 && !(*ppos % 4)) {
+			u32 val;
+
+			rc = idxd_vdcm_rw(mdev, (char *)&val, sizeof(val),
+					  ppos, IDXD_VDCM_READ);
+			if (rc <= 0)
+				goto read_err;
+
+			if (copy_to_user(buf, &val, sizeof(val)))
+				goto read_err;
+
+			filled = 4;
+		} else if (count >= 2 && !(*ppos % 2)) {
+			u16 val;
+
+			rc = idxd_vdcm_rw(mdev, (char *)&val, sizeof(val),
+					  ppos, IDXD_VDCM_READ);
+			if (rc <= 0)
+				goto read_err;
+
+			if (copy_to_user(buf, &val, sizeof(val)))
+				goto read_err;
+
+			filled = 2;
+		} else {
+			u8 val;
+
+			rc = idxd_vdcm_rw(mdev, &val, sizeof(val), ppos,
+					  IDXD_VDCM_READ);
+			if (rc <= 0)
+				goto read_err;
+
+			if (copy_to_user(buf, &val, sizeof(val)))
+				goto read_err;
+
+			filled = 1;
+		}
+
+		count -= filled;
+		done += filled;
+		*ppos += filled;
+		buf += filled;
+	}
+
+	mutex_unlock(&vidxd->dev_lock);
+	return done;
+
+ read_err:
+	mutex_unlock(&vidxd->dev_lock);
+	return -EFAULT;
+}
+
+static ssize_t idxd_vdcm_write(struct mdev_device *mdev, const char __user *buf, size_t count,
+			       loff_t *ppos)
+{
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	unsigned int done = 0;
+	int rc;
+
+	mutex_lock(&vidxd->dev_lock);
+	while (count) {
+		size_t filled;
+
+		if (count >= 4 && !(*ppos % 4)) {
+			u32 val;
+
+			if (copy_from_user(&val, buf, sizeof(val)))
+				goto write_err;
+
+			rc = idxd_vdcm_rw(mdev, (char *)&val, sizeof(val),
+					  ppos, IDXD_VDCM_WRITE);
+			if (rc <= 0)
+				goto write_err;
+
+			filled = 4;
+		} else if (count >= 2 && !(*ppos % 2)) {
+			u16 val;
+
+			if (copy_from_user(&val, buf, sizeof(val)))
+				goto write_err;
+
+			rc = idxd_vdcm_rw(mdev, (char *)&val,
+					  sizeof(val), ppos, IDXD_VDCM_WRITE);
+			if (rc <= 0)
+				goto write_err;
+
+			filled = 2;
+		} else {
+			u8 val;
+
+			if (copy_from_user(&val, buf, sizeof(val)))
+				goto write_err;
+
+			rc = idxd_vdcm_rw(mdev, &val, sizeof(val),
+					  ppos, IDXD_VDCM_WRITE);
+			if (rc <= 0)
+				goto write_err;
+
+			filled = 1;
+		}
+
+		count -= filled;
+		done += filled;
+		*ppos += filled;
+		buf += filled;
+	}
+
+	mutex_unlock(&vidxd->dev_lock);
+	return done;
+
+write_err:
+	mutex_unlock(&vidxd->dev_lock);
+	return -EFAULT;
+}
+
+static int check_vma(struct idxd_wq *wq, struct vm_area_struct *vma)
+{
+	if (vma->vm_end < vma->vm_start)
+		return -EINVAL;
+	if (!(vma->vm_flags & VM_SHARED))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int idxd_vdcm_mmap(struct mdev_device *mdev, struct vm_area_struct *vma)
+{
+	unsigned int wq_idx, rc;
+	unsigned long req_size, pgoff = 0, offset;
+	pgprot_t pg_prot;
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	struct idxd_wq *wq = vidxd->wq;
+	struct idxd_device *idxd = vidxd->idxd;
+	enum idxd_portal_prot virt_portal, phys_portal;
+	phys_addr_t base = pci_resource_start(idxd->pdev, IDXD_WQ_BAR);
+	struct device *dev = mdev_dev(mdev);
+
+	rc = check_vma(wq, vma);
+	if (rc)
+		return rc;
+
+	pg_prot = vma->vm_page_prot;
+	req_size = vma->vm_end - vma->vm_start;
+	vma->vm_flags |= VM_DONTCOPY;
+
+	offset = (vma->vm_pgoff << PAGE_SHIFT) &
+		 ((1ULL << VFIO_PCI_OFFSET_SHIFT) - 1);
+
+	wq_idx = offset >> (PAGE_SHIFT + 2);
+	if (wq_idx >= 1) {
+		dev_err(dev, "mapping invalid wq %d off %lx\n",
+			wq_idx, offset);
+		return -EINVAL;
+	}
+
+	/*
+	 * Check and see if the guest wants to map to the limited or unlimited portal.
+	 * The driver will allow mapping to the unlimited portal only if the wq is a
+	 * dedicated wq. Otherwise, it goes to limited.
+	 */
+	virt_portal = ((offset >> PAGE_SHIFT) & 0x3) == 1;
+	phys_portal = IDXD_PORTAL_LIMITED;
+	if (virt_portal == IDXD_PORTAL_UNLIMITED && wq_dedicated(wq))
+		phys_portal = IDXD_PORTAL_UNLIMITED;
+
+	/* We always map IMS portals to the guest */
+	pgoff = (base + idxd_get_wq_portal_full_offset(wq->id, phys_portal,
+						       IDXD_IRQ_IMS)) >> PAGE_SHIFT;
+
+	dev_dbg(dev, "mmap %lx %lx %lx %lx\n", vma->vm_start, pgoff, req_size,
+		pgprot_val(pg_prot));
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_private_data = mdev;
+	vma->vm_pgoff = pgoff;
+
+	return remap_pfn_range(vma, vma->vm_start, pgoff, req_size, pg_prot);
+}
+
+static int idxd_vdcm_get_irq_count(struct vdcm_idxd *vidxd, int type)
+{
+	/*
+	 * Even though the number of MSIX vectors supported is not tied to number of
+	 * wqs being exported, the current design is to allow 1 vector per WQ for guest.
+	 * So here we end up with num of wqs plus 1 that handles the misc interrupts.
+	 */
+	if (type == VFIO_PCI_MSI_IRQ_INDEX || type == VFIO_PCI_MSIX_IRQ_INDEX)
+		return VIDXD_MAX_MSIX_VECS;
+
+	return 0;
+}
+
+static irqreturn_t idxd_guest_wq_completion(int irq, void *data)
+{
+	struct ims_irq_entry *irq_entry = data;
+
+	/*
+	 * WQ irq_entry 0 is actually MSIX vector 1 for guest. MSIX vector 0
+	 * is emulated.
+	 */
+	vidxd_send_interrupt(irq_entry->vidxd, irq_entry->id + 1);
+	return IRQ_HANDLED;
+}
+
+static int msix_trigger_unregister(struct vdcm_idxd *vidxd, int index)
+{
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	struct ims_irq_entry *irq_entry;
+	int rc;
+
+	if (!vidxd->vdev.msix_trigger[index])
+		return 0;
+
+	dev_dbg(dev, "disable MSIX trigger %d\n", index);
+	if (index) {
+		u32 auxval;
+
+		irq_entry = &vidxd->irq_entries[index - 1];
+		if (irq_entry->irq_set) {
+			free_irq(irq_entry->entry->irq, irq_entry);
+			irq_entry->irq_set = false;
+		}
+
+		auxval = ims_ctrl_pasid_aux(0, false);
+		rc = irq_set_auxdata(irq_entry->entry->irq, IMS_AUXDATA_CONTROL_WORD, auxval);
+		if (rc)
+			return rc;
+	}
+	eventfd_ctx_put(vidxd->vdev.msix_trigger[index]);
+	vidxd->vdev.msix_trigger[index] = NULL;
+
+	return 0;
+}
+
+static int msix_trigger_register(struct vdcm_idxd *vidxd, u32 fd, int index)
+{
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	struct ims_irq_entry *irq_entry;
+	struct eventfd_ctx *trigger;
+	int rc;
+
+	rc = msix_trigger_unregister(vidxd, index);
+	if (rc < 0)
+		return rc;
+
+	dev_dbg(dev, "enable MSIX trigger %d\n", index);
+	trigger = eventfd_ctx_fdget(fd);
+	if (IS_ERR(trigger)) {
+		dev_warn(dev, "eventfd_ctx_fdget failed %d\n", index);
+		return PTR_ERR(trigger);
+	}
+
+	/*
+	 * The MSIX vector 0 is emulated by the mdev. Starting with vector 1
+	 * the interrupt is backed by IMS and needs to be set up, but we
+	 * will be setting up entry 0 of the IMS vectors. So here we pass
+	 * in index - 1 to the host setup and irq_entries.
+	 */
+	if (index) {
+		int pasid;
+		u32 auxval;
+
+		irq_entry = &vidxd->irq_entries[index - 1];
+		pasid = idxd_mdev_get_pasid(mdev);
+		if (pasid < 0)
+			return pasid;
+
+		/*
+		 * Program and enable the pasid field in the IMS entry. The programmed pasid and
+		 * enabled field are checked against the pasid and enable field for the work queue
+		 * configuration and the pasid for the descriptor. A mismatch will result in a
+		 * blocked IMS interrupt.
+		 */
+		auxval = ims_ctrl_pasid_aux(pasid, true);
+		rc = irq_set_auxdata(irq_entry->entry->irq, IMS_AUXDATA_CONTROL_WORD, auxval);
+		if (rc < 0)
+			return rc;
+
+		rc = request_irq(irq_entry->entry->irq, idxd_guest_wq_completion, 0, "idxd-ims",
+				 irq_entry);
+		if (rc) {
+			dev_warn(dev, "failed to request ims irq\n");
+			eventfd_ctx_put(trigger);
+			auxval = ims_ctrl_pasid_aux(0, false);
+			irq_set_auxdata(irq_entry->entry->irq, IMS_AUXDATA_CONTROL_WORD, auxval);
+			return rc;
+		}
+		irq_entry->irq_set = true;
+	}
+
+	vidxd->vdev.msix_trigger[index] = trigger;
+	return 0;
+}
+
+static int vdcm_idxd_set_msix_trigger(struct vdcm_idxd *vidxd,
+				      unsigned int index, unsigned int start,
+				      unsigned int count, uint32_t flags,
+				      void *data)
+{
+	int i, rc = 0;
+
+	if (count > VIDXD_MAX_MSIX_ENTRIES - 1)
+		count = VIDXD_MAX_MSIX_ENTRIES - 1;
+
+	/*
+	 * The MSIX vector 0 is emulated by the mdev. Starting with vector 1
+	 * the interrupt is backed by IMS and needs to be set up, but we
+	 * will be setting up entry 0 of the IMS vectors. So here we pass
+	 * in i - 1 to the host setup and irq_entries.
+	 */
+	if (count == 0 && (flags & VFIO_IRQ_SET_DATA_NONE)) {
+		/* Disable all MSIX entries */
+		for (i = 0; i < VIDXD_MAX_MSIX_ENTRIES; i++) {
+			rc = msix_trigger_unregister(vidxd, i);
+			if (rc < 0)
+				return rc;
+		}
+		return 0;
+	}
+
+	for (i = 0; i < count; i++) {
+		if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+			u32 fd = *(u32 *)(data + i * sizeof(u32));
+
+			rc = msix_trigger_register(vidxd, fd, i);
+			if (rc < 0)
+				return rc;
+		} else if (flags & VFIO_IRQ_SET_DATA_NONE) {
+			rc = msix_trigger_unregister(vidxd, i);
+			if (rc < 0)
+				return rc;
+		}
+	}
+	return rc;
+}
+
+static int idxd_vdcm_set_irqs(struct vdcm_idxd *vidxd, uint32_t flags,
+			      unsigned int index, unsigned int start,
+			      unsigned int count, void *data)
+{
+	int (*func)(struct vdcm_idxd *vidxd, unsigned int index,
+		    unsigned int start, unsigned int count, uint32_t flags,
+		    void *data) = NULL;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+
+	switch (index) {
+	case VFIO_PCI_INTX_IRQ_INDEX:
+		dev_warn(dev, "intx interrupts not supported.\n");
+		break;
+	case VFIO_PCI_MSI_IRQ_INDEX:
+		dev_dbg(dev, "msi interrupt.\n");
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			func = vdcm_idxd_set_msix_trigger;
+			break;
+		}
+		break;
+	case VFIO_PCI_MSIX_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			func = vdcm_idxd_set_msix_trigger;
+			break;
+		}
+		break;
+	default:
+		return -ENOTTY;
+	}
+
+	if (!func)
+		return -ENOTTY;
+
+	return func(vidxd, index, start, count, flags, data);
+}
+
+static void vidxd_vdcm_reset(struct vdcm_idxd *vidxd)
+{
+	vidxd_reset(vidxd);
+}
+
+static long idxd_vdcm_ioctl(struct mdev_device *mdev, unsigned int cmd,
+			    unsigned long arg)
+{
+	struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+	unsigned long minsz;
+	int rc = -EINVAL;
+	struct device *dev = mdev_dev(mdev);
+
+	dev_dbg(dev, "vidxd %p ioctl, cmd: %d\n", vidxd, cmd);
+
+	mutex_lock(&vidxd->dev_lock);
+	if (cmd == VFIO_DEVICE_GET_INFO) {
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz)) {
+			rc = -EFAULT;
+			goto out;
+		}
+
+		if (info.argsz < minsz) {
+			rc = -EINVAL;
+			goto out;
+		}
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+		info.flags |= VFIO_DEVICE_FLAGS_RESET;
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		if (copy_to_user((void __user *)arg, &info, minsz))
+			rc = -EFAULT;
+		else
+			rc = 0;
+		goto out;
+	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+		struct vfio_region_info info;
+		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+		struct vfio_region_info_cap_sparse_mmap *sparse = NULL;
+		size_t size;
+		int nr_areas = 1;
+		int cap_type_id = 0;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz)) {
+			rc = -EFAULT;
+			goto out;
+		}
+
+		if (info.argsz < minsz) {
+			rc = -EINVAL;
+			goto out;
+		}
+
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = VIDXD_MAX_CFG_SPACE_SZ;
+			info.flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
+			break;
+		case VFIO_PCI_BAR0_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vidxd->bar_size[info.index];
+			if (!info.size) {
+				info.flags = 0;
+				break;
+			}
+
+			info.flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
+			break;
+		case VFIO_PCI_BAR1_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = 0;
+			info.flags = 0;
+			break;
+		case VFIO_PCI_BAR2_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.flags = VFIO_REGION_INFO_FLAG_CAPS | VFIO_REGION_INFO_FLAG_MMAP |
+				     VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
+			info.size = vidxd->bar_size[1];
+
+			/*
+			 * Every WQ has two areas for unlimited and limited
+			 * MSI-X portals. IMS portals are not reported
+			 */
+			nr_areas = 2;
+
+			size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
+			sparse = kzalloc(size, GFP_KERNEL);
+			if (!sparse) {
+				rc = -ENOMEM;
+				goto out;
+			}
+
+			sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+			sparse->header.version = 1;
+			sparse->nr_areas = nr_areas;
+			cap_type_id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+
+			sparse->areas[0].offset = 0;
+			sparse->areas[0].size = PAGE_SIZE;
+
+			sparse->areas[1].offset = PAGE_SIZE;
+			sparse->areas[1].size = PAGE_SIZE;
+			break;
+
+		case VFIO_PCI_BAR3_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = 0;
+			info.flags = 0;
+			dev_dbg(dev, "get region info bar:%d\n", info.index);
+			break;
+
+		case VFIO_PCI_ROM_REGION_INDEX:
+		case VFIO_PCI_VGA_REGION_INDEX:
+			dev_dbg(dev, "get region info index:%d\n", info.index);
+			break;
+		default: {
+			if (info.index >= VFIO_PCI_NUM_REGIONS)
+				rc = -EINVAL;
+			else
+				rc = 0;
+			goto out;
+		} /* default */
+		} /* info.index switch */
+
+		if ((info.flags & VFIO_REGION_INFO_FLAG_CAPS) && sparse) {
+			if (cap_type_id == VFIO_REGION_INFO_CAP_SPARSE_MMAP) {
+				rc = vfio_info_add_capability(&caps, &sparse->header,
+							      sizeof(*sparse) + (sparse->nr_areas *
+							      sizeof(*sparse->areas)));
+				kfree(sparse);
+				if (rc)
+					goto out;
+			}
+		}
+
+		if (caps.size) {
+			if (info.argsz < sizeof(info) + caps.size) {
+				info.argsz = sizeof(info) + caps.size;
+				info.cap_offset = 0;
+			} else {
+				vfio_info_cap_shift(&caps, sizeof(info));
+				if (copy_to_user((void __user *)arg + sizeof(info),
+						 caps.buf, caps.size)) {
+					kfree(caps.buf);
+					rc = -EFAULT;
+					goto out;
+				}
+				info.cap_offset = sizeof(info);
+			}
+
+			kfree(caps.buf);
+		}
+		if (copy_to_user((void __user *)arg, &info, minsz))
+			rc = -EFAULT;
+		else
+			rc = 0;
+		goto out;
+	} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz)) {
+			rc = -EFAULT;
+			goto out;
+		}
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS) {
+			rc = -EINVAL;
+			goto out;
+		}
+
+		switch (info.index) {
+		case VFIO_PCI_MSI_IRQ_INDEX:
+		case VFIO_PCI_MSIX_IRQ_INDEX:
+		default:
+			rc = -EINVAL;
+			goto out;
+		} /* switch(info.index) */
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD | VFIO_IRQ_INFO_NORESIZE;
+		info.count = idxd_vdcm_get_irq_count(vidxd, info.index);
+
+		if (copy_to_user((void __user *)arg, &info, minsz))
+			rc = -EFAULT;
+		else
+			rc = 0;
+		goto out;
+	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
+		struct vfio_irq_set hdr;
+		u8 *data = NULL;
+		size_t data_size = 0;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz)) {
+			rc = -EFAULT;
+			goto out;
+		}
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			int max = idxd_vdcm_get_irq_count(vidxd, hdr.index);
+
+			rc = vfio_set_irqs_validate_and_prepare(&hdr, max, VFIO_PCI_NUM_IRQS,
+								&data_size);
+			if (rc) {
+				dev_err(dev, "intel:vfio_set_irqs_validate_and_prepare failed\n");
+				rc = -EINVAL;
+				goto out;
+			}
+			if (data_size) {
+				data = memdup_user((void __user *)(arg + minsz), data_size);
+				if (IS_ERR(data)) {
+					rc = PTR_ERR(data);
+					goto out;
+				}
+			}
+		}
+
+		if (!data) {
+			rc = -EINVAL;
+			goto out;
+		}
+
+		rc = idxd_vdcm_set_irqs(vidxd, hdr.flags, hdr.index, hdr.start, hdr.count, data);
+		kfree(data);
+		goto out;
+	} else if (cmd == VFIO_DEVICE_RESET) {
+		vidxd_vdcm_reset(vidxd);
+	}
+
+ out:
+	mutex_unlock(&vidxd->dev_lock);
+	return rc;
+}
+
+static const struct mdev_parent_ops idxd_vdcm_ops = {
+	.create			= idxd_vdcm_create,
+	.remove			= idxd_vdcm_remove,
+	.open			= idxd_vdcm_open,
+	.release		= idxd_vdcm_release,
+	.read			= idxd_vdcm_read,
+	.write			= idxd_vdcm_write,
+	.mmap			= idxd_vdcm_mmap,
+	.ioctl			= idxd_vdcm_ioctl,
+};
+
+int idxd_mdev_host_init(struct idxd_device *idxd)
+{
+	struct device *dev = &idxd->pdev->dev;
+	int rc;
+
+	if (!test_bit(IDXD_FLAG_SIOV_SUPPORTED, &idxd->flags))
+		return -EOPNOTSUPP;
+
+	if (iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX)) {
+		rc = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_AUX);
+		if (rc < 0) {
+			dev_warn(dev, "Failed to enable aux-domain: %d\n", rc);
+			return rc;
+		}
+	} else {
+		dev_warn(dev, "No aux-domain feature.\n");
+		return -EOPNOTSUPP;
+	}
+
+	return mdev_register_device(dev, &idxd_vdcm_ops);
+}
+
+void idxd_mdev_host_release(struct idxd_device *idxd)
+{
+	struct device *dev = &idxd->pdev->dev;
+	int rc;
+
+	mdev_unregister_device(dev);
+	if (iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX)) {
+		rc = iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
+		if (rc < 0)
+			dev_warn(dev, "Failed to disable aux-domain: %d\n",
+				 rc);
+	}
+}
diff --git a/drivers/dma/idxd/mdev.h b/drivers/dma/idxd/mdev.h
new file mode 100644
index 000000000000..b474f2303ba0
--- /dev/null
+++ b/drivers/dma/idxd/mdev.h
@@ -0,0 +1,115 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+
+#ifndef _IDXD_MDEV_H_
+#define _IDXD_MDEV_H_
+
+/* two 64-bit BARs implemented */
+#define VIDXD_MAX_BARS 2
+#define VIDXD_MAX_CFG_SPACE_SZ 4096
+#define VIDXD_MAX_MMIO_SPACE_SZ 8192
+#define VIDXD_MSIX_TBL_SZ_OFFSET 0x42
+#define VIDXD_CAP_CTRL_SZ 0x100
+#define VIDXD_GRP_CTRL_SZ 0x100
+#define VIDXD_WQ_CTRL_SZ 0x100
+#define VIDXD_WQ_OCPY_INT_SZ 0x20
+#define VIDXD_MSIX_TBL_SZ 0x90
+#define VIDXD_MSIX_PERM_TBL_SZ 0x48
+
+#define VIDXD_MSIX_TABLE_OFFSET 0x600
+#define VIDXD_MSIX_PERM_OFFSET 0x300
+#define VIDXD_GRPCFG_OFFSET 0x400
+#define VIDXD_WQCFG_OFFSET 0x500
+#define VIDXD_IMS_OFFSET 0x1000
+
+#define VIDXD_BAR0_SIZE  0x2000
+#define VIDXD_BAR2_SIZE  0x20000
+#define VIDXD_MAX_MSIX_ENTRIES  (VIDXD_MSIX_TBL_SZ / 0x10)
+#define VIDXD_MAX_WQS	1
+#define VIDXD_MAX_MSIX_VECS	2
+
+#define	VIDXD_ATS_OFFSET 0x100
+#define	VIDXD_PRS_OFFSET 0x110
+#define VIDXD_PASID_OFFSET 0x120
+#define VIDXD_MSIX_PBA_OFFSET 0x700
+
+struct ims_irq_entry {
+	struct vdcm_idxd *vidxd;
+	struct msi_desc *entry;
+	bool irq_set;
+	int id;
+};
+
+struct idxd_vdev {
+	struct mdev_device *mdev;
+	struct eventfd_ctx *msix_trigger[VIDXD_MAX_MSIX_ENTRIES];
+};
+
+struct vdcm_idxd {
+	struct idxd_device *idxd;
+	struct idxd_wq *wq;
+	struct idxd_vdev vdev;
+	struct vdcm_idxd_type *type;
+	int num_wqs;
+	struct ims_irq_entry irq_entries[VIDXD_MAX_MSIX_ENTRIES];
+
+	/* For VM use case */
+	u64 bar_val[VIDXD_MAX_BARS];
+	u64 bar_size[VIDXD_MAX_BARS];
+	u8 cfg[VIDXD_MAX_CFG_SPACE_SZ];
+	u8 bar0[VIDXD_MAX_MMIO_SPACE_SZ];
+	struct list_head list;
+	struct mutex dev_lock; /* lock for vidxd resources */
+
+	int refcount;
+};
+
+static inline struct vdcm_idxd *to_vidxd(struct idxd_vdev *vdev)
+{
+	return container_of(vdev, struct vdcm_idxd, vdev);
+}
+
+#define IDXD_MDEV_NAME_LEN 16
+#define IDXD_MDEV_DESCRIPTION_LEN 64
+
+enum idxd_mdev_type {
+	IDXD_MDEV_TYPE_1_DWQ = 0,
+};
+
+#define IDXD_MDEV_TYPES 1
+
+struct vdcm_idxd_type {
+	char name[IDXD_MDEV_NAME_LEN];
+	char description[IDXD_MDEV_DESCRIPTION_LEN];
+	enum idxd_mdev_type type;
+	unsigned int avail_instance;
+};
+
+enum idxd_vdcm_rw {
+	IDXD_VDCM_READ = 0,
+	IDXD_VDCM_WRITE,
+};
+
+static inline u64 get_reg_val(void *buf, int size)
+{
+	u64 val = 0;
+
+	switch (size) {
+	case 8:
+		val = *(u64 *)buf;
+		break;
+	case 4:
+		val = *(u32 *)buf;
+		break;
+	case 2:
+		val = *(u16 *)buf;
+		break;
+	case 1:
+		val = *(u8 *)buf;
+		break;
+	}
+
+	return val;
+}
+
+#endif
diff --git a/drivers/dma/idxd/vdev.c b/drivers/dma/idxd/vdev.c
new file mode 100644
index 000000000000..6cc097edc6e9
--- /dev/null
+++ b/drivers/dma/idxd/vdev.c
@@ -0,0 +1,75 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/sched/task.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/mm.h>
+#include <linux/mmu_context.h>
+#include <linux/vfio.h>
+#include <linux/mdev.h>
+#include <linux/msi.h>
+#include <linux/intel-iommu.h>
+#include <linux/intel-svm.h>
+#include <linux/kvm_host.h>
+#include <linux/eventfd.h>
+#include <uapi/linux/idxd.h>
+#include "registers.h"
+#include "idxd.h"
+#include "../../vfio/pci/vfio_pci_private.h"
+#include "mdev.h"
+#include "vdev.h"
+
+int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx)
+{
+	/* PLACE HOLDER */
+	return 0;
+}
+
+int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
+{
+	/* PLACEHOLDER */
+	return 0;
+}
+
+int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
+{
+	/* PLACEHOLDER */
+	return 0;
+}
+
+int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count)
+{
+	/* PLACEHOLDER */
+	return 0;
+}
+
+int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int size)
+{
+	/* PLACEHOLDER */
+	return 0;
+}
+
+void vidxd_mmio_init(struct vdcm_idxd *vidxd)
+{
+	/* PLACEHOLDER */
+}
+
+void vidxd_reset(struct vdcm_idxd *vidxd)
+{
+	/* PLACEHOLDER */
+}
+
+int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd)
+{
+	/* PLACEHOLDER */
+	return 0;
+}
+
+void vidxd_free_ims_entries(struct vdcm_idxd *vidxd)
+{
+	/* PLACEHOLDER */
+}
diff --git a/drivers/dma/idxd/vdev.h b/drivers/dma/idxd/vdev.h
new file mode 100644
index 000000000000..baa30d98f9cb
--- /dev/null
+++ b/drivers/dma/idxd/vdev.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+
+#ifndef _IDXD_VDEV_H_
+#define _IDXD_VDEV_H_
+
+#include "mdev.h"
+
+int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size);
+int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size);
+int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count);
+int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int size);
+void vidxd_mmio_init(struct vdcm_idxd *vidxd);
+void vidxd_reset(struct vdcm_idxd *vidxd);
+int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx);
+int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd);
+void vidxd_free_ims_entries(struct vdcm_idxd *vidxd);
+
+#endif




* [PATCH v4 10/17] dmaengine: idxd: add emulation rw routines
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (8 preceding siblings ...)
  2020-10-30 18:51 ` [PATCH v4 09/17] dmaengine: idxd: add basic mdev registration and helper functions Dave Jiang
@ 2020-10-30 18:51 ` Dave Jiang
  2020-10-30 18:52 ` [PATCH v4 11/17] dmaengine: idxd: prep for virtual device commands Dave Jiang
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:51 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

Add emulation routines for PCI config read/write and MMIO read/write, and
an interrupt handling routine for the emulated device. The rw routines are
called when PCI config reads/writes or BAR0 MMIO reads/writes are issued
by the guest kernel through KVM/QEMU.
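
For readers following the config write path, here is a minimal user-space
sketch of the per-byte writable-bit mask idea used by the config space
emulation in this patch. It is not the driver code: the names cfg_write,
cfg_rw_mask and the CFG_* constants are made up for illustration, only a
64-byte header is modelled, and unlisted bytes are assumed to be read-only
(the patch instead copies bytes beyond its bitmap as is).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced layout: standard 64-byte header only. */
#define CFG_SZ		64u
#define CFG_COMMAND	0x04u
#define CFG_STATUS	0x06u

/* Per-byte writable-bit masks; bytes not listed default to 0 (read-only). */
static const uint8_t cfg_rw_mask[CFG_SZ] = {
	[CFG_COMMAND]		= 0xff,
	[CFG_COMMAND + 1]	= 0x07,
	[CFG_STATUS + 1]	= 0xf9,	/* the RW1C byte */
};

/* Apply a guest write of len bytes at off to the shadow config space. */
static int cfg_write(uint8_t *cfg, size_t off, const uint8_t *src, size_t len)
{
	size_t i;

	assert(cfg && src);

	/* Reject anything that would run past the shadow buffer. */
	if (off >= CFG_SZ || len > CFG_SZ - off)
		return -1;

	for (i = 0; i < len; i++) {
		uint8_t mask = cfg_rw_mask[off + i];
		uint8_t old = cfg[off + i];
		uint8_t val = src[i] & mask;

		/* RW1C: writing 1 clears the bit, writing 0 leaves it alone. */
		if (off + i == CFG_STATUS + 1)
			val = (uint8_t)(~val & old) & mask;

		/* Keep the read-only bits, take the writable bits from val. */
		cfg[off + i] = (uint8_t)((old & ~mask) | val);
	}

	return 0;
}
```

Keeping the policy in a table means the write loop stays the same for every
register; only the RW1C byte needs a special case.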

Because only read-only configuration is supported, most of the MMIO
emulation is a simple memory copy, except for cases such as handling device
commands and interrupts.
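
As a rough illustration of "mostly memory copy, with a few special cases",
the following user-space sketch models a shadow BAR0 where reads are plain
copies, one register is write-1-to-clear, one is plain read/write, and
everything else is read-only. The register offsets, the mask, and the
function names are hypothetical and do not come from registers.h. Like the
patch, it takes the W1C bits from the first byte of the buffer, which
assumes a little-endian guest value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BAR0_SZ		0x2000u
#define REG_CTRL	0x80u	/* hypothetical plain RW register */
#define REG_INTCAUSE	0x98u	/* hypothetical W1C register */
#define INTCAUSE_MASK	0x1fu	/* W1C bits 4:0 */

/* Accept only naturally aligned 1/2/4-byte accesses inside the BAR. */
static int mmio_access_ok(uint32_t off, uint32_t size)
{
	if (size == 0 || size > 4 || (size & (size - 1)))
		return 0;
	if (off & (size - 1))
		return 0;
	return off < BAR0_SZ && size <= BAR0_SZ - off;
}

static int mmio_write(uint8_t *bar0, uint32_t off, const void *buf, uint32_t size)
{
	uint8_t first;

	assert(bar0 && buf);
	if (!mmio_access_ok(off, size))
		return -1;

	switch (off) {
	case REG_INTCAUSE:
		/* W1C: each 1 written clears the corresponding cause bit. */
		memcpy(&first, buf, 1);
		bar0[off] &= (uint8_t)~(first & INTCAUSE_MASK);
		break;
	case REG_CTRL:
		memcpy(bar0 + off, buf, size);
		break;
	default:
		/* Everything else is read-only: silently drop the write. */
		break;
	}

	return 0;
}

/* Reads are a straight copy out of the shadow register file. */
static int mmio_read(const uint8_t *bar0, uint32_t off, void *buf, uint32_t size)
{
	assert(bar0 && buf);
	if (!mmio_access_ok(off, size))
		return -1;

	memcpy(buf, bar0 + off, size);
	return 0;
}
```

Note that this sketch validates the offset and size on the read path as
well; whether the real read handler needs that depends on what the caller
already guarantees.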

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/registers.h |   10 +
 drivers/dma/idxd/vdev.c      |  427 +++++++++++++++++++++++++++++++++++++++++-
 drivers/dma/idxd/vdev.h      |    8 +
 include/uapi/linux/idxd.h    |    2 
 4 files changed, 439 insertions(+), 8 deletions(-)

diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index acc071df48eb..5a76fd0ab6ad 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -194,7 +194,8 @@ union cmdsts_reg {
 	};
 	u32 bits;
 } __packed;
-#define IDXD_CMDSTS_ACTIVE		0x80000000
+#define IDXD_CMDS_ACTIVE_BIT		31
+#define IDXD_CMDSTS_ACTIVE		BIT(IDXD_CMDS_ACTIVE_BIT)
 #define IDXD_CMDSTS_ERR_MASK		0xff
 #define IDXD_CMDSTS_RES_SHIFT		8
 
@@ -277,6 +278,11 @@ union msix_perm {
 	u32 bits;
 } __packed;
 
+#define IDXD_MSIX_PERM_MASK	0xfffff00c
+#define IDXD_MSIX_PERM_IGNORE	0x3
+#define MSIX_ENTRY_MASK_INT	0x1
+#define MSIX_ENTRY_CTRL_BYTE	12
+
 union group_flags {
 	struct {
 		u32 tc_a:3;
@@ -347,6 +353,8 @@ union wqcfg {
 
 #define WQCFG_PASID_IDX		2
 #define WQCFG_PRIV_IDX		2
+#define WQCFG_MODE_DEDICATED	1
+#define WQCFG_MODE_SHARED	0
 
 /*
  * This macro calculates the offset into the WQCFG register
diff --git a/drivers/dma/idxd/vdev.c b/drivers/dma/idxd/vdev.c
index 6cc097edc6e9..b38bb676e604 100644
--- a/drivers/dma/idxd/vdev.c
+++ b/drivers/dma/idxd/vdev.c
@@ -25,35 +25,443 @@
 
 int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx)
 {
-	/* PLACE HOLDER */
+	int rc = -1;
+	struct device *dev = &vidxd->idxd->pdev->dev;
+
+	dev_dbg(dev, "%s interrupt %d\n", __func__, msix_idx);
+
+	if (!vidxd->vdev.msix_trigger[msix_idx]) {
+		dev_warn(dev, "%s: intr evtfd not found %d\n", __func__, msix_idx);
+		return -EINVAL;
+	}
+
+	rc = eventfd_signal(vidxd->vdev.msix_trigger[msix_idx], 1);
+	if (rc != 1)
+		dev_err(dev, "eventfd signal failed (%d)\n", rc);
+	else
+		dev_dbg(dev, "vidxd interrupt triggered wq(%d) %d\n", vidxd->wq->id, msix_idx);
+
+	return rc;
+}
+
+static void vidxd_report_error(struct vdcm_idxd *vidxd, unsigned int error)
+{
+	u8 *bar0 = vidxd->bar0;
+	union sw_err_reg *swerr = (union sw_err_reg *)(bar0 + IDXD_SWERR_OFFSET);
+	union genctrl_reg *genctrl;
+	bool send = false;
+
+	if (!swerr->valid) {
+		memset(swerr, 0, sizeof(*swerr));
+		swerr->valid = 1;
+		swerr->error = error;
+		send = true;
+	} else if (swerr->valid && !swerr->overflow) {
+		swerr->overflow = 1;
+	}
+
+	genctrl = (union genctrl_reg *)(bar0 + IDXD_GENCTRL_OFFSET);
+	if (send && genctrl->softerr_int_en) {
+		u32 *intcause = (u32 *)(bar0 + IDXD_INTCAUSE_OFFSET);
+
+		*intcause |= IDXD_INTC_ERR;
+		vidxd_send_interrupt(vidxd, 0);
+	}
+}
+
+int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
+{
+	u32 offset = pos & (vidxd->bar_size[0] - 1);
+	u8 *bar0 = vidxd->bar0;
+	struct device *dev = mdev_dev(vidxd->vdev.mdev);
+
+	dev_dbg(dev, "vidxd mmio W %d %x %x: %llx\n", vidxd->wq->id, size,
+		offset, get_reg_val(buf, size));
+
+	if (((size & (size - 1)) != 0) || (offset & (size - 1)) != 0)
+		return -EINVAL;
+
+	/* If we don't limit this, we can potentially write out of bounds */
+	if (size > sizeof(u32))
+		return -EINVAL;
+
+	switch (offset) {
+	case IDXD_GENCFG_OFFSET ... IDXD_GENCFG_OFFSET + 3:
+		/* Write only when device is disabled. */
+		if (vidxd_state(vidxd) == IDXD_DEVICE_STATE_DISABLED)
+			memcpy(bar0 + offset, buf, size);
+		break;
+
+	case IDXD_GENCTRL_OFFSET:
+		memcpy(bar0 + offset, buf, size);
+		break;
+
+	case IDXD_INTCAUSE_OFFSET:
+		bar0[offset] &= ~(get_reg_val(buf, 1) & GENMASK(4, 0));
+		break;
+
+	case IDXD_CMD_OFFSET: {
+		u32 *cmdsts = (u32 *)(bar0 + IDXD_CMDSTS_OFFSET);
+		u32 val = get_reg_val(buf, size);
+
+		if (size != sizeof(u32))
+			return -EINVAL;
+
+		/* Check and set command in progress */
+		if (test_and_set_bit(IDXD_CMDS_ACTIVE_BIT, (unsigned long *)cmdsts) == 0)
+			vidxd_do_command(vidxd, val);
+		else
+			vidxd_report_error(vidxd, DSA_ERR_CMD_REG);
+		break;
+	}
+
+	case IDXD_SWERR_OFFSET:
+		/* W1C */
+		bar0[offset] &= ~(get_reg_val(buf, 1) & GENMASK(1, 0));
+		break;
+
+	case VIDXD_WQCFG_OFFSET ... VIDXD_WQCFG_OFFSET + VIDXD_WQ_CTRL_SZ - 1:
+	case VIDXD_GRPCFG_OFFSET ...  VIDXD_GRPCFG_OFFSET + VIDXD_GRP_CTRL_SZ - 1:
+		/* Nothing is written. Should be all RO */
+		break;
+
+	case VIDXD_MSIX_TABLE_OFFSET ...  VIDXD_MSIX_TABLE_OFFSET + VIDXD_MSIX_TBL_SZ - 1: {
+		int index = (offset - VIDXD_MSIX_TABLE_OFFSET) / 0x10;
+		u8 *msix_entry = &bar0[VIDXD_MSIX_TABLE_OFFSET + index * 0x10];
+		u64 *pba = (u64 *)(bar0 + VIDXD_MSIX_PBA_OFFSET);
+		u8 ctrl;
+
+		ctrl = msix_entry[MSIX_ENTRY_CTRL_BYTE];
+		memcpy(bar0 + offset, buf, size);
+		/* Handle unmask (mask bit cleared): deliver any pending interrupt */
+		if (!(msix_entry[MSIX_ENTRY_CTRL_BYTE] & MSIX_ENTRY_MASK_INT) &&
+		    ctrl & MSIX_ENTRY_MASK_INT)
+			if (test_and_clear_bit(index, (unsigned long *)pba))
+				vidxd_send_interrupt(vidxd, index);
+		break;
+	}
+
+	case VIDXD_MSIX_PERM_OFFSET ...  VIDXD_MSIX_PERM_OFFSET + VIDXD_MSIX_PERM_TBL_SZ - 1:
+		memcpy(bar0 + offset, buf, size);
+		break;
+	} /* offset */
+
 	return 0;
 }
 
 int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
 {
-	/* PLACEHOLDER */
+	u32 offset = pos & (vidxd->bar_size[0] - 1);
+	struct device *dev = mdev_dev(vidxd->vdev.mdev);
+
+	memcpy(buf, vidxd->bar0 + offset, size);
+
+	dev_dbg(dev, "vidxd mmio R %d %x %x: %llx\n",
+		vidxd->wq->id, size, offset, get_reg_val(buf, size));
 	return 0;
 }
 
-int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
+int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count)
 {
-	/* PLACEHOLDER */
+	u32 offset = pos & 0xfff;
+	struct device *dev = mdev_dev(vidxd->vdev.mdev);
+
+	memcpy(buf, &vidxd->cfg[offset], count);
+
+	dev_dbg(dev, "vidxd pci R %d %x %x: %llx\n",
+		vidxd->wq->id, count, offset, get_reg_val(buf, count));
+
 	return 0;
 }
 
-int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count)
+/*
+ * Much of the emulation code has been borrowed from Intel i915 cfg space
+ * emulation code.
+ * drivers/gpu/drm/i915/gvt/cfg_space.c:
+ */
+
+/*
+ * Per-byte bitmap of writable bits (RW or RW1C, which cannot co-exist in
+ * one byte) for the standard PCI configuration header (not the full 256
+ * bytes).
+ */
+static const u8 pci_cfg_space_rw_bmp[PCI_INTERRUPT_LINE + 4] = {
+	[PCI_COMMAND]		= 0xff, 0x07,
+	[PCI_STATUS]		= 0x00, 0xf9, /* the only one RW1C byte */
+	[PCI_CACHE_LINE_SIZE]	= 0xff,
+	[PCI_BASE_ADDRESS_0 ... PCI_CARDBUS_CIS - 1] = 0xff,
+	[PCI_ROM_ADDRESS]	= 0x01, 0xf8, 0xff, 0xff,
+	[PCI_INTERRUPT_LINE]	= 0xff,
+};
+
+static void _pci_cfg_mem_write(struct vdcm_idxd *vidxd, unsigned int off, u8 *src,
+			       unsigned int bytes)
 {
-	/* PLACEHOLDER */
+	u8 *cfg_base = vidxd->cfg;
+	u8 mask, new, old;
+	int i = 0;
+
+	for (; i < bytes && (off + i < sizeof(pci_cfg_space_rw_bmp)); i++) {
+		mask = pci_cfg_space_rw_bmp[off + i];
+		old = cfg_base[off + i];
+		new = src[i] & mask;
+
+		/*
+		 * The PCI_STATUS high byte has RW1C bits. Emulate
+		 * write-1-to-clear for these bits here; writing 0
+		 * to an RW1C bit has no effect.
+		 */
+		if (off + i == PCI_STATUS + 1)
+			new = (~new & old) & mask;
+
+		cfg_base[off + i] = (old & ~mask) | new;
+	}
+
+	/* For other configuration space directly copy as it is. */
+	if (i < bytes)
+		memcpy(cfg_base + off + i, src + i, bytes - i);
+}
+
+static inline void _write_pci_bar(struct vdcm_idxd *vidxd, u32 offset, u32 val, bool low)
+{
+	u32 *pval;
+
+	/* BAR offset should be 32-bit aligned */
+	offset = rounddown(offset, 4);
+	pval = (u32 *)(vidxd->cfg + offset);
+
+	if (low) {
+		/*
+		 * only update bit 31 - bit 4,
+		 * leave the bit 3 - bit 0 unchanged.
+		 */
+		*pval = (val & GENMASK(31, 4)) | (*pval & GENMASK(3, 0));
+	} else {
+		*pval = val;
+	}
+}
+
+static int _pci_cfg_bar_write(struct vdcm_idxd *vidxd, unsigned int offset, void *p_data,
+			      unsigned int bytes)
+{
+	u32 new = *(u32 *)(p_data);
+	bool lo = IS_ALIGNED(offset, 8);
+	u64 size;
+	unsigned int bar_id;
+
+	/*
+	 * Power-up software can determine how much address
+	 * space the device requires by writing a value of
+	 * all 1's to the register and then reading the value
+	 * back. The device will return 0's in all don't-care
+	 * address bits.
+	 */
+	if (new == 0xffffffff) {
+		switch (offset) {
+		case PCI_BASE_ADDRESS_0:
+		case PCI_BASE_ADDRESS_1:
+		case PCI_BASE_ADDRESS_2:
+		case PCI_BASE_ADDRESS_3:
+			bar_id = (offset - PCI_BASE_ADDRESS_0) / 8;
+			size = vidxd->bar_size[bar_id];
+			_write_pci_bar(vidxd, offset, size >> (lo ? 0 : 32), lo);
+			break;
+		default:
+			/* Unimplemented BARs */
+			_write_pci_bar(vidxd, offset, 0x0, false);
+		}
+	} else {
+		switch (offset) {
+		case PCI_BASE_ADDRESS_0:
+		case PCI_BASE_ADDRESS_1:
+		case PCI_BASE_ADDRESS_2:
+		case PCI_BASE_ADDRESS_3:
+			_write_pci_bar(vidxd, offset, new, lo);
+			break;
+		default:
+			break;
+		}
+	}
 	return 0;
 }
 
 int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int size)
 {
-	/* PLACEHOLDER */
+	struct device *dev = &vidxd->idxd->pdev->dev;
+
+	if (size > 4)
+		return -EINVAL;
+
+	if (pos + size > VIDXD_MAX_CFG_SPACE_SZ)
+		return -EINVAL;
+
+	dev_dbg(dev, "vidxd pci W %d %x %x: %llx\n", vidxd->wq->id, size, pos,
+		get_reg_val(buf, size));
+
+	/* First check if it's PCI_COMMAND */
+	if (IS_ALIGNED(pos, 2) && pos == PCI_COMMAND) {
+		bool new_bme;
+		bool bme;
+
+		if (size > 2)
+			return -EINVAL;
+
+		new_bme = !!(get_reg_val(buf, 2) & PCI_COMMAND_MASTER);
+		bme = !!(vidxd->cfg[pos] & PCI_COMMAND_MASTER);
+		_pci_cfg_mem_write(vidxd, pos, buf, size);
+
+		/* Flag error if turning off BME while device is enabled */
+		if ((bme && !new_bme) && vidxd_state(vidxd) == IDXD_DEVICE_STATE_ENABLED)
+			vidxd_report_error(vidxd, DSA_ERR_PCI_CFG);
+		return 0;
+	}
+
+	switch (pos) {
+	case PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_5:
+		if (!IS_ALIGNED(pos, 4))
+			return -EINVAL;
+		return _pci_cfg_bar_write(vidxd, pos, buf, size);
+
+	default:
+		_pci_cfg_mem_write(vidxd, pos, buf, size);
+	}
 	return 0;
 }
 
+static void vidxd_mmio_init_grpcap(struct vdcm_idxd *vidxd)
+{
+	u8 *bar0 = vidxd->bar0;
+	union group_cap_reg *grp_cap = (union group_cap_reg *)(bar0 + IDXD_GRPCAP_OFFSET);
+
+	/* single group for current implementation */
+	grp_cap->token_en = 0;
+	grp_cap->token_limit = 0;
+	grp_cap->num_groups = 1;
+}
+
+static void vidxd_mmio_init_grpcfg(struct vdcm_idxd *vidxd)
+{
+	u8 *bar0 = vidxd->bar0;
+	struct grpcfg *grpcfg = (struct grpcfg *)(bar0 + VIDXD_GRPCFG_OFFSET);
+	struct idxd_wq *wq = vidxd->wq;
+	struct idxd_group *group = wq->group;
+	int i;
+
+	/*
+	 * At this point, we are only exporting a single workqueue for
+	 * each mdev. So we need to just fake it as first workqueue
+	 * and also mark the available engines in this group.
+	 */
+
+	/* Set single workqueue and the first one */
+	grpcfg->wqs[0] = BIT(0);
+	grpcfg->engines = 0;
+	for (i = 0; i < group->num_engines; i++)
+		grpcfg->engines |= BIT(i);
+	grpcfg->flags.bits = group->grpcfg.flags.bits;
+}
+
+static void vidxd_mmio_init_wqcap(struct vdcm_idxd *vidxd)
+{
+	u8 *bar0 = vidxd->bar0;
+	struct idxd_wq *wq = vidxd->wq;
+	union wq_cap_reg *wq_cap = (union wq_cap_reg *)(bar0 + IDXD_WQCAP_OFFSET);
+
+	wq_cap->occupancy_int = 0;
+	wq_cap->occupancy = 0;
+	wq_cap->priority = 0;
+	wq_cap->total_wq_size = wq->size;
+	wq_cap->num_wqs = VIDXD_MAX_WQS;
+	if (wq_dedicated(wq))
+		wq_cap->dedicated_mode = 1;
+}
+
+static void vidxd_mmio_init_wqcfg(struct vdcm_idxd *vidxd)
+{
+	struct idxd_device *idxd = vidxd->idxd;
+	struct idxd_wq *wq = vidxd->wq;
+	u8 *bar0 = vidxd->bar0;
+	union wqcfg *wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+
+	wqcfg->wq_size = wq->size;
+	wqcfg->wq_thresh = wq->threshold;
+
+	if (wq_dedicated(wq))
+		wqcfg->mode = WQCFG_MODE_DEDICATED;
+
+	if (idxd->hw.gen_cap.block_on_fault &&
+	    test_bit(WQ_FLAG_BLOCK_ON_FAULT, &wq->flags))
+		wqcfg->bof = 1;
+
+	wqcfg->priority = wq->priority;
+	wqcfg->max_xfer_shift = idxd->hw.gen_cap.max_xfer_shift;
+	wqcfg->max_batch_shift = idxd->hw.gen_cap.max_batch_shift;
+	/* make mode change read-only */
+	wqcfg->mode_support = 0;
+}
+
+static void vidxd_mmio_init_engcap(struct vdcm_idxd *vidxd)
+{
+	u8 *bar0 = vidxd->bar0;
+	union engine_cap_reg *engcap = (union engine_cap_reg *)(bar0 + IDXD_ENGCAP_OFFSET);
+	struct idxd_wq *wq = vidxd->wq;
+	struct idxd_group *group = wq->group;
+
+	engcap->num_engines = group->num_engines;
+}
+
+static void vidxd_mmio_init_gencap(struct vdcm_idxd *vidxd)
+{
+	struct idxd_device *idxd = vidxd->idxd;
+	u8 *bar0 = vidxd->bar0;
+	union gen_cap_reg *gencap = (union gen_cap_reg *)(bar0 + IDXD_GENCAP_OFFSET);
+
+	gencap->bits = idxd->hw.gen_cap.bits;
+	gencap->config_en = 0;
+	gencap->max_ims_mult = 0;
+	gencap->cmd_cap = 1;
+}
+
+static void vidxd_mmio_init_cmdcap(struct vdcm_idxd *vidxd)
+{
+	struct idxd_device *idxd = vidxd->idxd;
+	u8 *bar0 = vidxd->bar0;
+	u32 *cmdcap = (u32 *)(bar0 + IDXD_CMDCAP_OFFSET);
+
+	if (idxd->hw.cmd_cap)
+		*cmdcap = idxd->hw.cmd_cap;
+	else
+		*cmdcap = 0x1ffe;
+
+	*cmdcap |= BIT(IDXD_CMD_REQUEST_INT_HANDLE) | BIT(IDXD_CMD_RELEASE_INT_HANDLE);
+}
+
 void vidxd_mmio_init(struct vdcm_idxd *vidxd)
+{
+	struct idxd_device *idxd = vidxd->idxd;
+	u8 *bar0 = vidxd->bar0;
+	union offsets_reg *offsets;
+
+	/* Copy up to where table offset is */
+	memcpy_fromio(vidxd->bar0, idxd->reg_base, IDXD_TABLE_OFFSET);
+
+	vidxd_mmio_init_gencap(vidxd);
+	vidxd_mmio_init_cmdcap(vidxd);
+	vidxd_mmio_init_wqcap(vidxd);
+	vidxd_mmio_init_wqcfg(vidxd);
+	vidxd_mmio_init_grpcap(vidxd);
+	vidxd_mmio_init_grpcfg(vidxd);
+	vidxd_mmio_init_engcap(vidxd);
+
+	offsets = (union offsets_reg *)(bar0 + IDXD_TABLE_OFFSET);
+	offsets->grpcfg = VIDXD_GRPCFG_OFFSET / 0x100;
+	offsets->wqcfg = VIDXD_WQCFG_OFFSET / 0x100;
+	offsets->msix_perm = VIDXD_MSIX_PERM_OFFSET / 0x100;
+
+	memset(bar0 + VIDXD_MSIX_PERM_OFFSET, 0, VIDXD_MSIX_PERM_TBL_SZ);
+}
+
+static void idxd_complete_command(struct vdcm_idxd *vidxd, enum idxd_cmdsts_err val)
 {
 	/* PLACEHOLDER */
 }
@@ -63,6 +471,11 @@ void vidxd_reset(struct vdcm_idxd *vidxd)
 	/* PLACEHOLDER */
 }
 
+void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val)
+{
+	/* PLACEHOLDER */
+}
+
 int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd)
 {
 	/* PLACEHOLDER */
diff --git a/drivers/dma/idxd/vdev.h b/drivers/dma/idxd/vdev.h
index baa30d98f9cb..d23e63eb7f43 100644
--- a/drivers/dma/idxd/vdev.h
+++ b/drivers/dma/idxd/vdev.h
@@ -6,6 +6,13 @@
 
 #include "mdev.h"
 
+static inline u8 vidxd_state(struct vdcm_idxd *vidxd)
+{
+	union gensts_reg *gensts = (union gensts_reg *)(vidxd->bar0 + IDXD_GENSTATS_OFFSET);
+
+	return gensts->state;
+}
+
 int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size);
 int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size);
 int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count);
@@ -15,5 +22,6 @@ void vidxd_reset(struct vdcm_idxd *vidxd);
 int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx);
 int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd);
 void vidxd_free_ims_entries(struct vdcm_idxd *vidxd);
+void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val);
 
 #endif
diff --git a/include/uapi/linux/idxd.h b/include/uapi/linux/idxd.h
index fdcdfe414223..a0c0475a4626 100644
--- a/include/uapi/linux/idxd.h
+++ b/include/uapi/linux/idxd.h
@@ -78,6 +78,8 @@ enum dsa_completion_status {
 	DSA_COMP_HW_ERR1,
 	DSA_COMP_HW_ERR_DRB,
 	DSA_COMP_TRANSLATION_FAIL,
+	DSA_ERR_PCI_CFG = 0x51,
+	DSA_ERR_CMD_REG,
 };
 
 #define DSA_COMP_STATUS_MASK		0x7f




* [PATCH v4 11/17] dmaengine: idxd: prep for virtual device commands
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (9 preceding siblings ...)
  2020-10-30 18:51 ` [PATCH v4 10/17] dmaengine: idxd: add emulation rw routines Dave Jiang
@ 2020-10-30 18:52 ` Dave Jiang
  2020-10-30 18:52 ` [PATCH v4 12/17] dmaengine: idxd: virtual device commands emulation Dave Jiang
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:52 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

Update some of the device command helpers so that they can be used by the
virtual device commands emulated by the VDCM. Expose the raw command status
through an optional output parameter so the virtual commands can use it.
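
The calling convention this patch introduces (an optional status pointer,
with NULL meaning "caller does not care") can be sketched in user space as
below. The struct fake_dev stand-in, the cmd_exec helper, and the status
values are hypothetical and only serve to make the example self-contained;
they are not the driver's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes, for illustration only. */
#define CMDSTS_SUCCESS		0x00u
#define CMDSTS_ERR_WQ_ENABLED	0x22u

/* Stand-in for the hardware: reports whatever status it was told to. */
struct fake_dev {
	uint32_t next_status;
};

static void cmd_exec(struct fake_dev *dev, uint32_t *status)
{
	*status = dev->next_status;
}

/*
 * Returns 0 or a negative error as before, and additionally hands the raw
 * status back when the caller passes a non-NULL pointer.
 */
static int wq_enable(struct fake_dev *dev, uint32_t *status)
{
	uint32_t stat;

	assert(dev);
	cmd_exec(dev, &stat);

	/* Existing callers pass NULL and see no behaviour change. */
	if (status)
		*status = stat;

	if (stat != CMDSTS_SUCCESS && stat != CMDSTS_ERR_WQ_ENABLED)
		return -1;

	return 0;
}
```

Using a local variable for the hardware status and copying it out only on
request keeps the error check independent of whether the caller asked for
the raw value.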

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/cdev.c   |    2 +
 drivers/dma/idxd/device.c |   69 +++++++++++++++++++++++++++++----------------
 drivers/dma/idxd/idxd.h   |    8 +++--
 drivers/dma/idxd/irq.c    |    2 +
 drivers/dma/idxd/mdev.c   |    2 +
 drivers/dma/idxd/sysfs.c  |    8 +++--
 6 files changed, 56 insertions(+), 35 deletions(-)

diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
index b774bf336347..1f504d1f0c42 100644
--- a/drivers/dma/idxd/cdev.c
+++ b/drivers/dma/idxd/cdev.c
@@ -159,7 +159,7 @@ static int idxd_cdev_release(struct inode *node, struct file *filep)
 			if (rc < 0)
 				dev_err(dev, "wq disable pasid failed.\n");
 		} else {
-			idxd_wq_drain(wq);
+			idxd_wq_drain(wq, NULL);
 		}
 	}
 
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 8aff07b1acb4..52fc8e64c5fc 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -197,22 +197,25 @@ void idxd_wq_free_resources(struct idxd_wq *wq)
 	sbitmap_queue_free(&wq->sbq);
 }
 
-int idxd_wq_enable(struct idxd_wq *wq)
+int idxd_wq_enable(struct idxd_wq *wq, u32 *status)
 {
 	struct idxd_device *idxd = wq->idxd;
 	struct device *dev = &idxd->pdev->dev;
-	u32 status;
+	u32 stat;
 
 	if (wq->state == IDXD_WQ_ENABLED) {
 		dev_dbg(dev, "WQ %d already enabled\n", wq->id);
 		return -ENXIO;
 	}
 
-	idxd_cmd_exec(idxd, IDXD_CMD_ENABLE_WQ, wq->id, &status);
+	idxd_cmd_exec(idxd, IDXD_CMD_ENABLE_WQ, wq->id, &stat);
 
-	if (status != IDXD_CMDSTS_SUCCESS &&
-	    status != IDXD_CMDSTS_ERR_WQ_ENABLED) {
-		dev_dbg(dev, "WQ enable failed: %#x\n", status);
+	if (status)
+		*status = stat;
+
+	if (stat != IDXD_CMDSTS_SUCCESS &&
+	    stat != IDXD_CMDSTS_ERR_WQ_ENABLED) {
+		dev_dbg(dev, "WQ enable failed: %#x\n", stat);
 		return -ENXIO;
 	}
 
@@ -221,11 +224,11 @@ int idxd_wq_enable(struct idxd_wq *wq)
 	return 0;
 }
 
-int idxd_wq_disable(struct idxd_wq *wq)
+int idxd_wq_disable(struct idxd_wq *wq, u32 *status)
 {
 	struct idxd_device *idxd = wq->idxd;
 	struct device *dev = &idxd->pdev->dev;
-	u32 status, operand;
+	u32 stat, operand;
 
 	dev_dbg(dev, "Disabling WQ %d\n", wq->id);
 
@@ -235,10 +238,13 @@ int idxd_wq_disable(struct idxd_wq *wq)
 	}
 
 	operand = BIT(wq->id % 16) | ((wq->id / 16) << 16);
-	idxd_cmd_exec(idxd, IDXD_CMD_DISABLE_WQ, operand, &status);
+	idxd_cmd_exec(idxd, IDXD_CMD_DISABLE_WQ, operand, &stat);
+
+	if (status)
+		*status = stat;
 
-	if (status != IDXD_CMDSTS_SUCCESS) {
-		dev_dbg(dev, "WQ disable failed: %#x\n", status);
+	if (stat != IDXD_CMDSTS_SUCCESS) {
+		dev_dbg(dev, "WQ disable failed: %#x\n", stat);
 		return -ENXIO;
 	}
 
@@ -247,20 +253,31 @@ int idxd_wq_disable(struct idxd_wq *wq)
 	return 0;
 }
 
-void idxd_wq_drain(struct idxd_wq *wq)
+int idxd_wq_drain(struct idxd_wq *wq, u32 *status)
 {
 	struct idxd_device *idxd = wq->idxd;
 	struct device *dev = &idxd->pdev->dev;
-	u32 operand;
+	u32 operand, stat;
 
 	if (wq->state != IDXD_WQ_ENABLED) {
 		dev_dbg(dev, "WQ %d in wrong state: %d\n", wq->id, wq->state);
-		return;
+		return 0;
 	}
 
 	dev_dbg(dev, "Draining WQ %d\n", wq->id);
 	operand = BIT(wq->id % 16) | ((wq->id / 16) << 16);
-	idxd_cmd_exec(idxd, IDXD_CMD_DRAIN_WQ, operand, NULL);
+	idxd_cmd_exec(idxd, IDXD_CMD_DRAIN_WQ, operand, &stat);
+
+	if (status)
+		*status = stat;
+
+	if (stat != IDXD_CMDSTS_SUCCESS) {
+		dev_dbg(dev, "WQ drain failed: %#x\n", stat);
+		return -ENXIO;
+	}
+
+	dev_dbg(dev, "WQ %d drained\n", wq->id);
+	return 0;
 }
 
 int idxd_wq_map_portal(struct idxd_wq *wq)
@@ -287,11 +304,11 @@ void idxd_wq_unmap_portal(struct idxd_wq *wq)
 	devm_iounmap(dev, wq->portal);
 }
 
-int idxd_wq_abort(struct idxd_wq *wq)
+int idxd_wq_abort(struct idxd_wq *wq, u32 *status)
 {
 	struct idxd_device *idxd = wq->idxd;
 	struct device *dev = &idxd->pdev->dev;
-	u32 operand, status;
+	u32 operand, stat;
 
 	dev_dbg(dev, "Abort WQ %d\n", wq->id);
 	if (wq->state != IDXD_WQ_ENABLED) {
@@ -301,9 +318,13 @@ int idxd_wq_abort(struct idxd_wq *wq)
 
 	operand = BIT(wq->id % 16) | ((wq->id / 16) << 16);
 	dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_ABORT_WQ, operand);
-	idxd_cmd_exec(idxd, IDXD_CMD_ABORT_WQ, operand, &status);
-	if (status != IDXD_CMDSTS_SUCCESS) {
-		dev_dbg(dev, "WQ abort failed: %#x\n", status);
+	idxd_cmd_exec(idxd, IDXD_CMD_ABORT_WQ, operand, &stat);
+
+	if (status)
+		*status = stat;
+
+	if (stat != IDXD_CMDSTS_SUCCESS) {
+		dev_dbg(dev, "WQ abort failed: %#x\n", stat);
 		return -ENXIO;
 	}
 
@@ -319,7 +340,7 @@ int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid)
 	unsigned int offset;
 	unsigned long flags;
 
-	rc = idxd_wq_disable(wq);
+	rc = idxd_wq_disable(wq, NULL);
 	if (rc < 0)
 		return rc;
 
@@ -331,7 +352,7 @@ int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid)
 	iowrite32(wqcfg.bits[WQCFG_PASID_IDX], idxd->reg_base + offset);
 	spin_unlock_irqrestore(&idxd->dev_lock, flags);
 
-	rc = idxd_wq_enable(wq);
+	rc = idxd_wq_enable(wq, NULL);
 	if (rc < 0)
 		return rc;
 
@@ -346,7 +367,7 @@ int idxd_wq_disable_pasid(struct idxd_wq *wq)
 	unsigned int offset;
 	unsigned long flags;
 
-	rc = idxd_wq_disable(wq);
+	rc = idxd_wq_disable(wq, NULL);
 	if (rc < 0)
 		return rc;
 
@@ -358,7 +379,7 @@ int idxd_wq_disable_pasid(struct idxd_wq *wq)
 	iowrite32(wqcfg.bits[WQCFG_PASID_IDX], idxd->reg_base + offset);
 	spin_unlock_irqrestore(&idxd->dev_lock, flags);
 
-	rc = idxd_wq_enable(wq);
+	rc = idxd_wq_enable(wq, NULL);
 	if (rc < 0)
 		return rc;
 
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index ab28a1bffb7c..e616d18b53c0 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -350,15 +350,15 @@ int idxd_device_release_int_handle(struct idxd_device *idxd, int handle,
 /* work queue control */
 int idxd_wq_alloc_resources(struct idxd_wq *wq);
 void idxd_wq_free_resources(struct idxd_wq *wq);
-int idxd_wq_enable(struct idxd_wq *wq);
-int idxd_wq_disable(struct idxd_wq *wq);
-void idxd_wq_drain(struct idxd_wq *wq);
+int idxd_wq_enable(struct idxd_wq *wq, u32 *status);
+int idxd_wq_disable(struct idxd_wq *wq, u32 *status);
+int idxd_wq_drain(struct idxd_wq *wq, u32 *status);
 int idxd_wq_map_portal(struct idxd_wq *wq);
 void idxd_wq_unmap_portal(struct idxd_wq *wq);
 void idxd_wq_disable_cleanup(struct idxd_wq *wq);
 int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid);
 int idxd_wq_disable_pasid(struct idxd_wq *wq);
-int idxd_wq_abort(struct idxd_wq *wq);
+int idxd_wq_abort(struct idxd_wq *wq, u32 *status);
 void idxd_wq_setup_pasid(struct idxd_wq *wq, int pasid);
 void idxd_wq_setup_priv(struct idxd_wq *wq, int priv);
 
diff --git a/drivers/dma/idxd/irq.c b/drivers/dma/idxd/irq.c
index 593a2f6ed16c..a94fce00767b 100644
--- a/drivers/dma/idxd/irq.c
+++ b/drivers/dma/idxd/irq.c
@@ -48,7 +48,7 @@ static void idxd_device_reinit(struct work_struct *work)
 		struct idxd_wq *wq = &idxd->wqs[i];
 
 		if (wq->state == IDXD_WQ_ENABLED) {
-			rc = idxd_wq_enable(wq);
+			rc = idxd_wq_enable(wq, NULL);
 			if (rc < 0) {
 				dev_warn(dev, "Unable to re-enable wq %s\n",
 					 dev_name(&wq->conf_dev));
diff --git a/drivers/dma/idxd/mdev.c b/drivers/dma/idxd/mdev.c
index 3b6febe22a0e..91270121dfbc 100644
--- a/drivers/dma/idxd/mdev.c
+++ b/drivers/dma/idxd/mdev.c
@@ -85,7 +85,7 @@ static void idxd_vdcm_init(struct vdcm_idxd *vidxd)
 	vidxd_mmio_init(vidxd);
 
 	if (wq_dedicated(wq) && wq->state == IDXD_WQ_ENABLED)
-		idxd_wq_disable(wq);
+		idxd_wq_disable(wq, NULL);
 }
 
 static void idxd_vdcm_release(struct mdev_device *mdev)
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index 17f13ebae028..fe5f95509c5c 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -218,7 +218,7 @@ static int idxd_config_bus_probe(struct device *dev)
 			return rc;
 		}
 
-		rc = idxd_wq_enable(wq);
+		rc = idxd_wq_enable(wq, NULL);
 		if (rc < 0) {
 			mutex_unlock(&wq->wq_lock);
 			dev_warn(dev, "WQ %d enabling failed: %d\n",
@@ -229,7 +229,7 @@ static int idxd_config_bus_probe(struct device *dev)
 		rc = idxd_wq_map_portal(wq);
 		if (rc < 0) {
 			dev_warn(dev, "wq portal mapping failed: %d\n", rc);
-			rc = idxd_wq_disable(wq);
+			rc = idxd_wq_disable(wq, NULL);
 			if (rc < 0)
 				dev_warn(dev, "IDXD wq disable failed\n");
 			mutex_unlock(&wq->wq_lock);
@@ -287,8 +287,8 @@ static void disable_wq(struct idxd_wq *wq)
 
 	idxd_wq_unmap_portal(wq);
 
-	idxd_wq_drain(wq);
-	rc = idxd_wq_disable(wq);
+	idxd_wq_drain(wq, NULL);
+	rc = idxd_wq_disable(wq, NULL);
 
 	idxd_wq_free_resources(wq);
 	wq->client_count = 0;




* [PATCH v4 12/17] dmaengine: idxd: virtual device commands emulation
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (10 preceding siblings ...)
  2020-10-30 18:52 ` [PATCH v4 11/17] dmaengine: idxd: prep for virtual device commands Dave Jiang
@ 2020-10-30 18:52 ` Dave Jiang
  2020-10-30 18:52 ` [PATCH v4 13/17] dmaengine: idxd: ims setup for the vdcm Dave Jiang
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:52 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

Add all the helper functions that support the emulation of the commands
submitted to the device command register.
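
The overall command flow (mark the command as active, run it, then write
the status and optionally raise a completion interrupt) can be sketched as
follows. This is a single-threaded user-space model with hypothetical names
(struct vdev, submit_command, complete_command) and hypothetical bit values.
It assumes the command value is latched before completion and that every
command succeeds immediately; the patch itself uses an atomic
test_and_set_bit() on the status register and reports an error to the guest
when a command is already active.

```c
#include <assert.h>
#include <stdint.h>

#define CMD_INT_REQ	0x80000000u	/* command bit 31: request interrupt */
#define CMDSTS_ACTIVE	0x80000000u	/* status bit 31: command in progress */
#define INTC_CMD	0x02u		/* interrupt cause: command completed */

/* Hypothetical shadow state of the virtual device. */
struct vdev {
	uint32_t cmd;
	uint32_t cmdsts;
	uint32_t intcause;
	unsigned int irqs_sent;	/* counts simulated interrupt deliveries */
};

/* Writing the status overwrites the register, which also clears ACTIVE. */
static void complete_command(struct vdev *v, uint32_t status)
{
	v->cmdsts = status;

	if (v->cmd & CMD_INT_REQ) {
		v->intcause |= INTC_CMD;
		v->irqs_sent++;
	}
}

/* Returns -1 if a command is already in flight, 0 otherwise. */
static int submit_command(struct vdev *v, uint32_t cmd)
{
	assert(v);

	if (v->cmdsts & CMDSTS_ACTIVE)
		return -1;

	v->cmdsts |= CMDSTS_ACTIVE;
	v->cmd = cmd;

	/* Pretend the command ran and succeeded right away. */
	complete_command(v, 0);
	return 0;
}
```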

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/registers.h |   16 +-
 drivers/dma/idxd/vdev.c      |  427 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 438 insertions(+), 5 deletions(-)

diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index 5a76fd0ab6ad..17f0d868e5a4 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -119,7 +119,8 @@ union gencfg_reg {
 union genctrl_reg {
 	struct {
 		u32 softerr_int_en:1;
-		u32 rsvd:31;
+		u32 halt_state_int_en:1;
+		u32 rsvd:30;
 	};
 	u32 bits;
 } __packed;
@@ -141,6 +142,8 @@ enum idxd_device_status_state {
 	IDXD_DEVICE_STATE_HALT,
 };
 
+#define IDXD_GENSTATS_MASK		0x03
+
 enum idxd_device_reset_type {
 	IDXD_DEVICE_RESET_SOFTWARE = 0,
 	IDXD_DEVICE_RESET_FLR,
@@ -153,6 +156,7 @@ enum idxd_device_reset_type {
 #define IDXD_INTC_CMD			0x02
 #define IDXD_INTC_OCCUPY			0x04
 #define IDXD_INTC_PERFMON_OVFL		0x08
+#define IDXD_INTC_HALT_STATE		0x10
 
 #define IDXD_CMD_OFFSET			0xa0
 union idxd_command_reg {
@@ -164,6 +168,7 @@ union idxd_command_reg {
 	};
 	u32 bits;
 } __packed;
+#define IDXD_CMD_INT_MASK		0x80000000
 
 enum idxd_cmd {
 	IDXD_CMD_ENABLE_DEVICE = 1,
@@ -227,10 +232,11 @@ enum idxd_cmdsts_err {
 	/* disable device errors */
 	IDXD_CMDSTS_ERR_DIS_DEV_EN = 0x31,
 	/* disable WQ, drain WQ, abort WQ, reset WQ */
-	IDXD_CMDSTS_ERR_DEV_NOT_EN,
+	IDXD_CMDSTS_ERR_WQ_NOT_EN,
 	/* request interrupt handle */
 	IDXD_CMDSTS_ERR_INVAL_INT_IDX = 0x41,
 	IDXD_CMDSTS_ERR_NO_HANDLE,
+	IDXD_CMDSTS_ERR_INVAL_INT_IDX_RELEASE,
 };
 
 #define IDXD_CMDCAP_OFFSET		0xb0
@@ -351,6 +357,12 @@ union wqcfg {
 	u32 bits[8];
 } __packed;
 
+enum idxd_wq_hw_state {
+	IDXD_WQ_DEV_DISABLED = 0,
+	IDXD_WQ_DEV_ENABLED,
+	IDXD_WQ_DEV_BUSY,
+};
+
 #define WQCFG_PASID_IDX		2
 #define WQCFG_PRIV_IDX		2
 #define WQCFG_MODE_DEDICATED	1
diff --git a/drivers/dma/idxd/vdev.c b/drivers/dma/idxd/vdev.c
index b38bb676e604..6e7f98d0e52f 100644
--- a/drivers/dma/idxd/vdev.c
+++ b/drivers/dma/idxd/vdev.c
@@ -463,17 +463,438 @@ void vidxd_mmio_init(struct vdcm_idxd *vidxd)
 
 static void idxd_complete_command(struct vdcm_idxd *vidxd, enum idxd_cmdsts_err val)
 {
-	/* PLACEHOLDER */
+	u8 *bar0 = vidxd->bar0;
+	u32 *cmd = (u32 *)(bar0 + IDXD_CMD_OFFSET);
+	u32 *cmdsts = (u32 *)(bar0 + IDXD_CMDSTS_OFFSET);
+	u32 *intcause = (u32 *)(bar0 + IDXD_INTCAUSE_OFFSET);
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+
+	*cmdsts = val;
+	dev_dbg(dev, "%s: cmd: %#x  status: %#x\n", __func__, *cmd, val);
+
+	if (*cmd & IDXD_CMD_INT_MASK) {
+		*intcause |= IDXD_INTC_CMD;
+		vidxd_send_interrupt(vidxd, 0);
+	}
+}
+
+static void vidxd_enable(struct vdcm_idxd *vidxd)
+{
+	u8 *bar0 = vidxd->bar0;
+	union gensts_reg *gensts = (union gensts_reg *)(bar0 + IDXD_GENSTATS_OFFSET);
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+
+	dev_dbg(dev, "%s\n", __func__);
+	if (gensts->state == IDXD_DEVICE_STATE_ENABLED)
+		return idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_ENABLED);
+
+	/* Check PCI configuration */
+	if (!(vidxd->cfg[PCI_COMMAND] & PCI_COMMAND_MASTER))
+		return idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_BUSMASTER_EN);
+
+	gensts->state = IDXD_DEVICE_STATE_ENABLED;
+
+	return idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_disable(struct vdcm_idxd *vidxd)
+{
+	struct idxd_wq *wq;
+	union wqcfg *wqcfg;
+	u8 *bar0 = vidxd->bar0;
+	union gensts_reg *gensts = (union gensts_reg *)(bar0 + IDXD_GENSTATS_OFFSET);
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	u32 status;
+
+	dev_dbg(dev, "%s\n", __func__);
+	if (gensts->state == IDXD_DEVICE_STATE_DISABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DIS_DEV_EN);
+		return;
+	}
+
+	wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+	wq = vidxd->wq;
+
+	/* If it is a DWQ, need to disable the DWQ as well */
+	if (wq_dedicated(wq)) {
+		idxd_wq_disable(wq, &status);
+		if (status) {
+			dev_warn(dev, "vidxd disable (wq disable) failed: %#x\n", status);
+			idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DIS_DEV_EN);
+			return;
+		}
+	} else {
+		idxd_wq_drain(wq, &status);
+		if (status)
+			dev_warn(dev, "vidxd disable (wq drain) failed: %#x\n", status);
+	}
+
+	wqcfg->wq_state = 0;
+	gensts->state = IDXD_DEVICE_STATE_DISABLED;
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_drain_all(struct vdcm_idxd *vidxd)
+{
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	struct idxd_wq *wq = vidxd->wq;
+
+	dev_dbg(dev, "%s\n", __func__);
+
+	idxd_wq_drain(wq, NULL);
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_drain(struct vdcm_idxd *vidxd, int val)
+{
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	u8 *bar0 = vidxd->bar0;
+	union wqcfg *wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+	struct idxd_wq *wq = vidxd->wq;
+	u32 status;
+
+	dev_dbg(dev, "%s\n", __func__);
+	if (wqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_NOT_EN);
+		return;
+	}
+
+	idxd_wq_drain(wq, &status);
+	if (status) {
+		dev_dbg(dev, "wq drain failed: %#x\n", status);
+		idxd_complete_command(vidxd, status);
+		return;
+	}
+
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_abort_all(struct vdcm_idxd *vidxd)
+{
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	struct idxd_wq *wq = vidxd->wq;
+
+	dev_dbg(dev, "%s\n", __func__);
+	idxd_wq_abort(wq, NULL);
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_abort(struct vdcm_idxd *vidxd, int val)
+{
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	u8 *bar0 = vidxd->bar0;
+	union wqcfg *wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+	struct idxd_wq *wq = vidxd->wq;
+	u32 status;
+
+	dev_dbg(dev, "%s\n", __func__);
+	if (wqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_NOT_EN);
+		return;
+	}
+
+	idxd_wq_abort(wq, &status);
+	if (status) {
+		dev_dbg(dev, "wq abort failed: %#x\n", status);
+		idxd_complete_command(vidxd, status);
+		return;
+	}
+
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
 }
 
 void vidxd_reset(struct vdcm_idxd *vidxd)
 {
-	/* PLACEHOLDER */
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	u8 *bar0 = vidxd->bar0;
+	union gensts_reg *gensts = (union gensts_reg *)(bar0 + IDXD_GENSTATS_OFFSET);
+	struct idxd_wq *wq;
+
+	dev_dbg(dev, "%s\n", __func__);
+	gensts->state = IDXD_DEVICE_STATE_DRAIN;
+	wq = vidxd->wq;
+
+	if (wq->state == IDXD_WQ_ENABLED) {
+		idxd_wq_abort(wq, NULL);
+		idxd_wq_disable(wq, NULL);
+	}
+
+	vidxd_mmio_init(vidxd);
+	gensts->state = IDXD_DEVICE_STATE_DISABLED;
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_reset(struct vdcm_idxd *vidxd, int wq_id_mask)
+{
+	struct idxd_wq *wq;
+	u8 *bar0 = vidxd->bar0;
+	union wqcfg *wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	u32 status;
+
+	wq = vidxd->wq;
+	dev_dbg(dev, "vidxd reset wq %u:%u\n", 0, wq->id);
+
+	if (wqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_NOT_EN);
+		return;
+	}
+
+	idxd_wq_abort(wq, &status);
+	if (status) {
+		dev_dbg(dev, "vidxd reset wq failed to abort: %#x\n", status);
+		idxd_complete_command(vidxd, status);
+		return;
+	}
+
+	idxd_wq_disable(wq, &status);
+	if (status) {
+		dev_dbg(dev, "vidxd reset wq failed to disable: %#x\n", status);
+		idxd_complete_command(vidxd, status);
+		return;
+	}
+
+	wqcfg->wq_state = IDXD_WQ_DEV_DISABLED;
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_alloc_int_handle(struct vdcm_idxd *vidxd, int operand)
+{
+	bool ims = !!(operand & CMD_INT_HANDLE_IMS);
+	u32 cmdsts;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	int ims_idx, vidx;
+
+	vidx = operand & GENMASK(15, 0);
+
+	dev_dbg(dev, "allocating int handle for %d\n", vidx);
+
+	/* vidx cannot be 0 since that's emulated and does not require IMS handle */
+	if (vidx <= 0 || vidx >= VIDXD_MAX_MSIX_ENTRIES) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_INVAL_INT_IDX);
+		return;
+	}
+
+	if (ims) {
+		dev_warn(dev, "IMS allocation is not implemented yet\n");
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_NO_HANDLE);
+		return;
+	}
+
+	ims_idx = vidxd->irq_entries[vidx - 1].entry->device_msi.hwirq;
+	vidx--; /* MSIX idx 0 is a slow path interrupt */
+	cmdsts = ims_idx << IDXD_CMDSTS_RES_SHIFT;
+	dev_dbg(dev, "int handle %d:%d\n", vidx, ims_idx);
+	idxd_complete_command(vidxd, cmdsts);
+}
+
+static void vidxd_release_int_handle(struct vdcm_idxd *vidxd, int operand)
+{
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	bool ims = !!(operand & CMD_INT_HANDLE_IMS);
+	int handle, i;
+	bool found = false;
+
+	handle = operand & GENMASK(15, 0);
+	dev_dbg(dev, "releasing int handle %d\n", handle);
+
+	if (ims) {
+		dev_warn(dev, "IMS release is not implemented yet\n");
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_INVAL_INT_IDX_RELEASE);
+		return;
+	}
+
+	for (i = 0; i < VIDXD_MAX_MSIX_ENTRIES - 1; i++) {
+		if (vidxd->irq_entries[i].entry->device_msi.hwirq == handle) {
+			found = true;
+			break;
+		}
+	}
+
+	if (!found) {
+		dev_warn(dev, "Freeing unallocated int handle.\n");
+		return idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_INVAL_INT_IDX_RELEASE);
+	}
+
+	dev_dbg(dev, "int handle %d released.\n", handle);
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_enable(struct vdcm_idxd *vidxd, int wq_id)
+{
+	struct idxd_wq *wq;
+	u8 *bar0 = vidxd->bar0;
+	union wq_cap_reg *wqcap;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	struct idxd_device *idxd;
+	union wqcfg *vwqcfg, *wqcfg;
+	unsigned long flags;
+	int wq_pasid;
+	u32 status;
+	int priv;
+
+	if (wq_id >= VIDXD_MAX_WQS) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_INVAL_WQIDX);
+		return;
+	}
+
+	idxd = vidxd->idxd;
+	wq = vidxd->wq;
+
+	dev_dbg(dev, "%s: wq %u:%u\n", __func__, wq_id, wq->id);
+
+	vwqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET + wq_id * 32);
+	wqcap = (union wq_cap_reg *)(bar0 + IDXD_WQCAP_OFFSET);
+	wqcfg = wq->wqcfg;
+
+	if (vidxd_state(vidxd) != IDXD_DEVICE_STATE_ENABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_NOTEN);
+		return;
+	}
+
+	if (vwqcfg->wq_state != IDXD_WQ_DEV_DISABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_ENABLED);
+		return;
+	}
+
+	if (wq_dedicated(wq) && wqcap->dedicated_mode == 0) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_MODE);
+		return;
+	}
+
+	wq_pasid = idxd_mdev_get_pasid(mdev);
+	priv = 1;
+
+	if (wq_pasid >= 0) {
+		/* Clear pasid_en, pasid, and priv values */
+		wqcfg->bits[WQCFG_PASID_IDX] &= ~GENMASK(29, 8);
+		wqcfg->priv = priv;
+		wqcfg->pasid_en = 1;
+		wqcfg->pasid = wq_pasid;
+		dev_dbg(dev, "program pasid %d in wq %d\n", wq_pasid, wq->id);
+		spin_lock_irqsave(&idxd->dev_lock, flags);
+		idxd_wq_setup_pasid(wq, wq_pasid);
+		idxd_wq_setup_priv(wq, priv);
+		spin_unlock_irqrestore(&idxd->dev_lock, flags);
+		idxd_wq_enable(wq, &status);
+		if (status) {
+			dev_err(dev, "vidxd enable wq %d failed\n", wq->id);
+			idxd_complete_command(vidxd, status);
+			return;
+		}
+	} else {
+		dev_err(dev, "idxd pasid setup failed wq %d wq_pasid %d\n", wq->id, wq_pasid);
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_PASID_EN);
+		return;
+	}
+
+	vwqcfg->wq_state = IDXD_WQ_DEV_ENABLED;
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_disable(struct vdcm_idxd *vidxd, int wq_id_mask)
+{
+	struct idxd_wq *wq;
+	union wqcfg *wqcfg;
+	u8 *bar0 = vidxd->bar0;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	u32 status;
+
+	wq = vidxd->wq;
+
+	dev_dbg(dev, "vidxd disable wq %u:%u\n", 0, wq->id);
+
+	wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+	if (wqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_NOT_EN);
+		return;
+	}
+
+	/* If it is a DWQ, need to disable the DWQ as well */
+	if (wq_dedicated(wq)) {
+		idxd_wq_disable(wq, &status);
+		if (status) {
+			dev_warn(dev, "vidxd disable wq failed: %#x\n", status);
+			idxd_complete_command(vidxd, status);
+			return;
+		}
+	} else {
+		idxd_wq_drain(wq, &status);
+		if (status) {
+			dev_warn(dev, "vidxd disable drain wq failed: %#x\n", status);
+			idxd_complete_command(vidxd, status);
+			return;
+		}
+	}
+
+	wqcfg->wq_state = IDXD_WQ_DEV_DISABLED;
+	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
 }
 
 void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val)
 {
-	/* PLACEHOLDER */
+	union idxd_command_reg *reg = (union idxd_command_reg *)(vidxd->bar0 + IDXD_CMD_OFFSET);
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+
+	reg->bits = val;
+
+	dev_dbg(dev, "%s: cmd code: %u reg: %x\n", __func__, reg->cmd, reg->bits);
+
+	switch (reg->cmd) {
+	case IDXD_CMD_ENABLE_DEVICE:
+		vidxd_enable(vidxd);
+		break;
+	case IDXD_CMD_DISABLE_DEVICE:
+		vidxd_disable(vidxd);
+		break;
+	case IDXD_CMD_DRAIN_ALL:
+		vidxd_drain_all(vidxd);
+		break;
+	case IDXD_CMD_ABORT_ALL:
+		vidxd_abort_all(vidxd);
+		break;
+	case IDXD_CMD_RESET_DEVICE:
+		vidxd_reset(vidxd);
+		break;
+	case IDXD_CMD_ENABLE_WQ:
+		vidxd_wq_enable(vidxd, reg->operand);
+		break;
+	case IDXD_CMD_DISABLE_WQ:
+		vidxd_wq_disable(vidxd, reg->operand);
+		break;
+	case IDXD_CMD_DRAIN_WQ:
+		vidxd_wq_drain(vidxd, reg->operand);
+		break;
+	case IDXD_CMD_ABORT_WQ:
+		vidxd_wq_abort(vidxd, reg->operand);
+		break;
+	case IDXD_CMD_RESET_WQ:
+		vidxd_wq_reset(vidxd, reg->operand);
+		break;
+	case IDXD_CMD_REQUEST_INT_HANDLE:
+		vidxd_alloc_int_handle(vidxd, reg->operand);
+		break;
+	case IDXD_CMD_RELEASE_INT_HANDLE:
+		vidxd_release_int_handle(vidxd, reg->operand);
+		break;
+	default:
+		idxd_complete_command(vidxd, IDXD_CMDSTS_INVAL_CMD);
+		break;
+	}
 }
 
 int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd)




* [PATCH v4 13/17] dmaengine: idxd: ims setup for the vdcm
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (11 preceding siblings ...)
  2020-10-30 18:52 ` [PATCH v4 12/17] dmaengine: idxd: virtual device commands emulation Dave Jiang
@ 2020-10-30 18:52 ` Dave Jiang
  2020-10-30 21:26   ` Thomas Gleixner
  2020-10-30 18:52 ` [PATCH v4 14/17] dmaengine: idxd: add mdev type as a new wq type Dave Jiang
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:52 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: Megha Dey, dmaengine, linux-kernel, linux-pci, kvm

Add setup for IMS enabling for the mediated device.

On the actual hardware, MSIX vector 0 is the misc interrupt and
handles events such as administrative command completion, error
reporting, and performance monitor overflow. MSIX vectors 1...N are
used for descriptor completion interrupts. In the guest kernel, the
MSIX interrupts are backed by the mediated device through emulation
or IMS vectors. Vector 0 is handled through emulation by the host
vdcm. Vector 1 (more may be supported later) is backed by IMS.
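
The vector split described above can be sketched as a small mapping
function. This is only an illustration: guest_vec_to_ims_idx() and the
VEC_* markers are hypothetical names, and the "index - 1" convention is
taken from how the diff below indexes irq_entries[].

```c
/* Hypothetical return markers, not part of the driver. */
#define VEC_EMULATED	(-1)	/* vector 0: handled by the host vdcm */
#define VEC_INVALID	(-2)	/* vector outside the guest's MSIX table */

/*
 * Map a guest MSIX vector to the index of its IMS-backed entry.
 * Vector 0 is emulated; vector i (i >= 1) uses IMS entry i - 1.
 */
int guest_vec_to_ims_idx(int vec, int nr_guest_vecs)
{
	if (vec < 0 || vec >= nr_guest_vecs)
		return VEC_INVALID;
	if (vec == 0)
		return VEC_EMULATED;
	return vec - 1;
}
```

With two guest vectors, vector 0 maps to VEC_EMULATED and vector 1 maps to
IMS entry 0.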

IMS can be set up with interrupt handlers via request_irq(), just like
MSIX interrupts, once the relevant IRQ domain is set.

The msi_domain_alloc_irqs()/msi_domain_free_irqs() APIs can then be
used to allocate and free interrupts from that domain.

Register with the irq bypass manager to allow the IMS interrupt to be
injected into the guest, bypassing the host.

Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/Kconfig     |    2 ++
 drivers/dma/idxd/idxd.h |    1 +
 drivers/dma/idxd/mdev.c |   26 +++++++++++++++++++++
 drivers/dma/idxd/mdev.h |    1 +
 drivers/dma/idxd/vdev.c |   57 ++++++++++++++++++++++++++++++++++++++---------
 kernel/irq/msi.c        |    2 ++
 6 files changed, 78 insertions(+), 11 deletions(-)

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index c5970e4a3a2c..b0335a4321f5 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -312,6 +312,8 @@ config INTEL_IDXD_MDEV
 	depends on VFIO_MDEV
 	depends on VFIO_MDEV_DEVICE
 	select PCI_SIOV
+	select IRQ_BYPASS_MANAGER
+	select IMS_MSI_ARRAY
 
 config INTEL_IOATDMA
 	tristate "Intel I/OAT DMA support"
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index e616d18b53c0..72c30826f1bb 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -213,6 +213,7 @@ struct idxd_device {
 	struct workqueue_struct *wq;
 	struct work_struct work;
 
+	struct irq_domain *ims_domain;
 	int *int_handles;
 };
 
diff --git a/drivers/dma/idxd/mdev.c b/drivers/dma/idxd/mdev.c
index 91270121dfbc..ed79c85e692e 100644
--- a/drivers/dma/idxd/mdev.c
+++ b/drivers/dma/idxd/mdev.c
@@ -508,8 +508,12 @@ static int msix_trigger_unregister(struct vdcm_idxd *vidxd, int index)
 
 	dev_dbg(dev, "disable MSIX trigger %d\n", index);
 	if (index) {
+		struct irq_bypass_producer *producer;
 		u32 auxval;
 
+		producer = &vidxd->vdev.producer[index - 1];
+		irq_bypass_unregister_producer(producer);
+
 		irq_entry = &vidxd->irq_entries[index - 1];
 		if (irq_entry->irq_set) {
 			free_irq(irq_entry->entry->irq, irq_entry);
@@ -553,9 +557,11 @@ static int msix_trigger_register(struct vdcm_idxd *vidxd, u32 fd, int index)
 	 * in i - 1 to the host setup and irq_entries.
 	 */
 	if (index) {
+		struct irq_bypass_producer *producer;
 		int pasid;
 		u32 auxval;
 
+		producer = &vidxd->vdev.producer[index - 1];
 		irq_entry = &vidxd->irq_entries[index - 1];
 		pasid = idxd_mdev_get_pasid(mdev);
 		if (pasid < 0)
@@ -581,6 +587,14 @@ static int msix_trigger_register(struct vdcm_idxd *vidxd, u32 fd, int index)
 			irq_set_auxdata(irq_entry->entry->irq, IMS_AUXDATA_CONTROL_WORD, auxval);
 			return rc;
 		}
+
+		producer->token = trigger;
+		producer->irq = irq_entry->entry->irq;
+		rc = irq_bypass_register_producer(producer);
+		if (unlikely(rc))
+			dev_info(dev, "irq bypass producer (token %p) registration failed: %d\n",
+				 producer->token, rc);
+
 		irq_entry->irq_set = true;
 	}
 
@@ -934,6 +948,7 @@ static const struct mdev_parent_ops idxd_vdcm_ops = {
 int idxd_mdev_host_init(struct idxd_device *idxd)
 {
 	struct device *dev = &idxd->pdev->dev;
+	struct ims_array_info ims_info;
 	int rc;
 
 	if (!test_bit(IDXD_FLAG_SIOV_SUPPORTED, &idxd->flags))
@@ -950,6 +965,15 @@ int idxd_mdev_host_init(struct idxd_device *idxd)
 		return -EOPNOTSUPP;
 	}
 
+	ims_info.max_slots = idxd->ims_size;
+	ims_info.slots = idxd->reg_base + idxd->ims_offset;
+	idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
+	if (!idxd->ims_domain) {
+		dev_warn(dev, "Failed to acquire IMS domain\n");
+		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
+		return -ENODEV;
+	}
+
 	return mdev_register_device(dev, &idxd_vdcm_ops);
 }
 
@@ -958,6 +982,8 @@ void idxd_mdev_host_release(struct idxd_device *idxd)
 	struct device *dev = &idxd->pdev->dev;
 	int rc;
 
+	irq_domain_remove(idxd->ims_domain);
+
 	mdev_unregister_device(dev);
 	if (iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX)) {
 		rc = iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
diff --git a/drivers/dma/idxd/mdev.h b/drivers/dma/idxd/mdev.h
index b474f2303ba0..266231987331 100644
--- a/drivers/dma/idxd/mdev.h
+++ b/drivers/dma/idxd/mdev.h
@@ -43,6 +43,7 @@ struct ims_irq_entry {
 struct idxd_vdev {
 	struct mdev_device *mdev;
 	struct eventfd_ctx *msix_trigger[VIDXD_MAX_MSIX_ENTRIES];
+	struct irq_bypass_producer producer[VIDXD_MAX_MSIX_ENTRIES];
 };
 
 struct vdcm_idxd {
diff --git a/drivers/dma/idxd/vdev.c b/drivers/dma/idxd/vdev.c
index 6e7f98d0e52f..d61bc17624b9 100644
--- a/drivers/dma/idxd/vdev.c
+++ b/drivers/dma/idxd/vdev.c
@@ -16,6 +16,7 @@
 #include <linux/intel-svm.h>
 #include <linux/kvm_host.h>
 #include <linux/eventfd.h>
+#include <linux/irqchip/irq-ims-msi.h>
 #include <uapi/linux/idxd.h>
 #include "registers.h"
 #include "idxd.h"
@@ -844,6 +845,51 @@ static void vidxd_wq_disable(struct vdcm_idxd *vidxd, int wq_id_mask)
 	idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
 }
 
+void vidxd_free_ims_entries(struct vdcm_idxd *vidxd)
+{
+	struct irq_domain *irq_domain;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	int i;
+
+	for (i = 0; i < VIDXD_MAX_MSIX_VECS; i++)
+		vidxd->irq_entries[i].entry = NULL;
+
+	irq_domain = dev_get_msi_domain(dev);
+	if (irq_domain)
+		msi_domain_free_irqs(irq_domain, dev);
+	else
+		dev_warn(dev, "No IMS irq domain.\n");
+}
+
+int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd)
+{
+	struct irq_domain *irq_domain;
+	struct idxd_device *idxd = vidxd->idxd;
+	struct mdev_device *mdev = vidxd->vdev.mdev;
+	struct device *dev = mdev_dev(mdev);
+	int vecs = VIDXD_MAX_MSIX_VECS - 1;
+	struct msi_desc *entry;
+	struct ims_irq_entry *irq_entry;
+	int rc, i = 0;
+
+	irq_domain = idxd->ims_domain;
+	dev_set_msi_domain(dev, irq_domain);
+	rc = msi_domain_alloc_irqs(irq_domain, dev, vecs);
+	if (rc < 0)
+		return rc;
+
+	for_each_msi_entry(entry, dev) {
+		irq_entry = &vidxd->irq_entries[i];
+		irq_entry->vidxd = vidxd;
+		irq_entry->entry = entry;
+		irq_entry->id = i;
+		i++;
+	}
+
+	return 0;
+}
+
 void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val)
 {
 	union idxd_command_reg *reg = (union idxd_command_reg *)(vidxd->bar0 + IDXD_CMD_OFFSET);
@@ -896,14 +942,3 @@ void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val)
 		break;
 	}
 }
-
-int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd)
-{
-	/* PLACEHOLDER */
-	return 0;
-}
-
-void vidxd_free_ims_entries(struct vdcm_idxd *vidxd)
-{
-	/* PLACEHOLDER */
-}
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index c7e47c26cd90..89cf60a30803 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -536,6 +536,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 
 	return ops->domain_alloc_irqs(domain, dev, nvec);
 }
+EXPORT_SYMBOL(msi_domain_alloc_irqs);
 
 void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
 {
@@ -572,6 +573,7 @@ void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
 
 	return ops->domain_free_irqs(domain, dev);
 }
+EXPORT_SYMBOL(msi_domain_free_irqs);
 
 /**
  * msi_get_domain_info - Get the MSI interrupt domain info for @domain




* [PATCH v4 14/17] dmaengine: idxd: add mdev type as a new wq type
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (12 preceding siblings ...)
  2020-10-30 18:52 ` [PATCH v4 13/17] dmaengine: idxd: ims setup for the vdcm Dave Jiang
@ 2020-10-30 18:52 ` Dave Jiang
  2020-10-30 18:52 ` [PATCH v4 15/17] dmaengine: idxd: add dedicated wq mdev type Dave Jiang
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:52 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

Add "mdev" wq type and support helpers. The mdev wq type marks the wq
to be utilized as a VFIO mediated device.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/idxd.h  |    2 ++
 drivers/dma/idxd/sysfs.c |   13 +++++++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 72c30826f1bb..4e583fdd15d2 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -72,6 +72,7 @@ enum idxd_wq_type {
 	IDXD_WQT_NONE = 0,
 	IDXD_WQT_KERNEL,
 	IDXD_WQT_USER,
+	IDXD_WQT_MDEV,
 };
 
 struct idxd_cdev {
@@ -321,6 +322,7 @@ void idxd_cleanup_sysfs(struct idxd_device *idxd);
 int idxd_register_driver(void);
 void idxd_unregister_driver(void);
 struct bus_type *idxd_get_bus_type(struct idxd_device *idxd);
+bool is_idxd_wq_mdev(struct idxd_wq *wq);
 
 /* device interrupt control */
 irqreturn_t idxd_irq_handler(int vec, void *data);
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index fe5f95509c5c..5b79d9019f2e 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -14,6 +14,7 @@ static char *idxd_wq_type_names[] = {
 	[IDXD_WQT_NONE]		= "none",
 	[IDXD_WQT_KERNEL]	= "kernel",
 	[IDXD_WQT_USER]		= "user",
+	[IDXD_WQT_MDEV]		= "mdev",
 };
 
 static void idxd_conf_device_release(struct device *dev)
@@ -69,6 +70,11 @@ static inline bool is_idxd_wq_cdev(struct idxd_wq *wq)
 	return wq->type == IDXD_WQT_USER;
 }
 
+inline bool is_idxd_wq_mdev(struct idxd_wq *wq)
+{
+	return wq->type == IDXD_WQT_MDEV;
+}
+
 static int idxd_config_bus_match(struct device *dev,
 				 struct device_driver *drv)
 {
@@ -1094,8 +1100,9 @@ static ssize_t wq_type_show(struct device *dev,
 		return sprintf(buf, "%s\n",
 			       idxd_wq_type_names[IDXD_WQT_KERNEL]);
 	case IDXD_WQT_USER:
-		return sprintf(buf, "%s\n",
-			       idxd_wq_type_names[IDXD_WQT_USER]);
+		return sprintf(buf, "%s\n", idxd_wq_type_names[IDXD_WQT_USER]);
+	case IDXD_WQT_MDEV:
+		return sprintf(buf, "%s\n", idxd_wq_type_names[IDXD_WQT_MDEV]);
 	case IDXD_WQT_NONE:
 	default:
 		return sprintf(buf, "%s\n",
@@ -1122,6 +1129,8 @@ static ssize_t wq_type_store(struct device *dev,
 		wq->type = IDXD_WQT_KERNEL;
 	else if (sysfs_streq(buf, idxd_wq_type_names[IDXD_WQT_USER]))
 		wq->type = IDXD_WQT_USER;
+	else if (sysfs_streq(buf, idxd_wq_type_names[IDXD_WQT_MDEV]))
+		wq->type = IDXD_WQT_MDEV;
 	else
 		return -EINVAL;
 




* [PATCH v4 15/17] dmaengine: idxd: add dedicated wq mdev type
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (13 preceding siblings ...)
  2020-10-30 18:52 ` [PATCH v4 14/17] dmaengine: idxd: add mdev type as a new wq type Dave Jiang
@ 2020-10-30 18:52 ` Dave Jiang
  2020-10-30 18:52 ` [PATCH v4 16/17] dmaengine: idxd: add new wq state for mdev Dave Jiang
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:52 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

Add the support code for the "1dwq" mdev type. This mdev type follows the
standard VFIO mdev flow. The "1dwq" type exports a single dedicated wq
to the mdev. The dwq has a read-only configuration that is set by the
host. The mdev type does not support PASID and SVA and will match
the stage 1 driver in functional support. For backward compatibility, the
mdev will maintain the DSA spec definition of this mdev type once the
commit goes upstream.
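
To make the "one dedicated wq per instance" rule concrete, here is a
minimal userspace sketch of how the number of available "1dwq" instances
could be counted. struct wq_info and count_free_dwqs() are hypothetical
stand-ins for the driver's struct idxd_wq and its helpers; locking is left
out on purpose.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a work queue. */
struct wq_info {
	bool mdev_type;		/* wq type was set to "mdev" */
	bool dedicated;		/* dedicated (not shared) mode */
	int refcount;		/* number of current users */
};

/* Count wqs that could back a new instance: mdev type, dedicated, unused. */
int count_free_dwqs(const struct wq_info *wqs, size_t n)
{
	int count = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (wqs[i].mdev_type && wqs[i].dedicated && wqs[i].refcount == 0)
			count++;
	}
	return count;
}
```

Each created instance takes a reference on one such wq, so the count drops
by one per instance.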

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/mdev.c |  141 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 133 insertions(+), 8 deletions(-)

diff --git a/drivers/dma/idxd/mdev.c b/drivers/dma/idxd/mdev.c
index ed79c85e692e..16b56f8f7fc1 100644
--- a/drivers/dma/idxd/mdev.c
+++ b/drivers/dma/idxd/mdev.c
@@ -111,20 +111,58 @@ static void idxd_vdcm_release(struct mdev_device *mdev)
 	mutex_unlock(&vidxd->dev_lock);
 }
 
+static struct idxd_wq *find_any_dwq(struct idxd_device *idxd)
+{
+	int i;
+	struct idxd_wq *wq;
+	unsigned long flags;
+
+	spin_lock_irqsave(&idxd->dev_lock, flags);
+	for (i = 0; i < idxd->max_wqs; i++) {
+		wq = &idxd->wqs[i];
+
+		if (wq->state != IDXD_WQ_ENABLED)
+			continue;
+
+		if (!wq_dedicated(wq))
+			continue;
+
+		if (idxd_wq_refcount(wq) != 0)
+			continue;
+
+		spin_unlock_irqrestore(&idxd->dev_lock, flags);
+		mutex_lock(&wq->wq_lock);
+		if (idxd_wq_refcount(wq)) {
+			spin_lock_irqsave(&idxd->dev_lock, flags);
+			continue;
+		}
+
+		idxd_wq_get(wq);
+		mutex_unlock(&wq->wq_lock);
+		return wq;
+	}
+
+	spin_unlock_irqrestore(&idxd->dev_lock, flags);
+	return NULL;
+}
+
 static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev_device *mdev,
 					   struct vdcm_idxd_type *type)
 {
 	struct vdcm_idxd *vidxd;
 	struct idxd_wq *wq = NULL;
+	int rc;
 
-	/* PLACEHOLDER, wq matching comes later */
-
+	if (type->type == IDXD_MDEV_TYPE_1_DWQ)
+		wq = find_any_dwq(idxd);
 	if (!wq)
 		return ERR_PTR(-ENODEV);
 
 	vidxd = kzalloc(sizeof(*vidxd), GFP_KERNEL);
-	if (!vidxd)
-		return ERR_PTR(-ENOMEM);
+	if (!vidxd) {
+		rc = -ENOMEM;
+		goto err;
+	}
 
 	mutex_init(&vidxd->dev_lock);
 	vidxd->idxd = idxd;
@@ -135,14 +173,23 @@ static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev
 	vidxd->num_wqs = VIDXD_MAX_WQS;
 
 	idxd_vdcm_init(vidxd);
-	mutex_lock(&wq->wq_lock);
-	idxd_wq_get(wq);
-	mutex_unlock(&wq->wq_lock);
 
 	return vidxd;
+
+ err:
+	mutex_lock(&wq->wq_lock);
+	idxd_wq_put(wq);
+	mutex_unlock(&wq->wq_lock);
+	return ERR_PTR(rc);
 }
 
-static struct vdcm_idxd_type idxd_mdev_types[IDXD_MDEV_TYPES];
+static struct vdcm_idxd_type idxd_mdev_types[IDXD_MDEV_TYPES] = {
+	{
+		.name = "1dwq-v1",
+		.description = "IDXD MDEV with 1 dedicated workqueue",
+		.type = IDXD_MDEV_TYPE_1_DWQ,
+	},
+};
 
 static struct vdcm_idxd_type *idxd_vdcm_find_vidxd_type(struct device *dev,
 							const char *name)
@@ -934,7 +981,85 @@ static long idxd_vdcm_ioctl(struct mdev_device *mdev, unsigned int cmd,
 	return rc;
 }
 
+static ssize_t name_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	struct vdcm_idxd_type *type;
+
+	type = idxd_vdcm_find_vidxd_type(dev, kobject_name(kobj));
+
+	if (type)
+		return sprintf(buf, "%s\n", type->description);
+
+	return -EINVAL;
+}
+static MDEV_TYPE_ATTR_RO(name);
+
+static int find_available_mdev_instances(struct idxd_device *idxd, struct vdcm_idxd_type *type)
+{
+	int count = 0, i;
+	unsigned long flags;
+
+	if (type->type != IDXD_MDEV_TYPE_1_DWQ)
+		return 0;
+
+	spin_lock_irqsave(&idxd->dev_lock, flags);
+	for (i = 0; i < idxd->max_wqs; i++) {
+		struct idxd_wq *wq;
+
+		wq = &idxd->wqs[i];
+		if (!is_idxd_wq_mdev(wq) || !wq_dedicated(wq) || idxd_wq_refcount(wq))
+			continue;
+
+		count++;
+	}
+	spin_unlock_irqrestore(&idxd->dev_lock, flags);
+
+	return count;
+}
+
+static ssize_t available_instances_show(struct kobject *kobj,
+					struct device *dev, char *buf)
+{
+	int count;
+	struct idxd_device *idxd = dev_get_drvdata(dev);
+	struct vdcm_idxd_type *type;
+
+	type = idxd_vdcm_find_vidxd_type(dev, kobject_name(kobj));
+	if (!type)
+		return -EINVAL;
+
+	count = find_available_mdev_instances(idxd, type);
+
+	return sprintf(buf, "%d\n", count);
+}
+static MDEV_TYPE_ATTR_RO(available_instances);
+
+static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
+			       char *buf)
+{
+	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
+}
+static MDEV_TYPE_ATTR_RO(device_api);
+
+static struct attribute *idxd_mdev_types_attrs[] = {
+	&mdev_type_attr_name.attr,
+	&mdev_type_attr_device_api.attr,
+	&mdev_type_attr_available_instances.attr,
+	NULL,
+};
+
+static struct attribute_group idxd_mdev_type_group0 = {
+	.name = "1dwq-v1",
+	.attrs = idxd_mdev_types_attrs,
+};
+
+static struct attribute_group *idxd_mdev_type_groups[] = {
+	&idxd_mdev_type_group0,
+	NULL,
+};
+
 static const struct mdev_parent_ops idxd_vdcm_ops = {
+	.supported_type_groups	= idxd_mdev_type_groups,
 	.create			= idxd_vdcm_create,
 	.remove			= idxd_vdcm_remove,
 	.open			= idxd_vdcm_open,




* [PATCH v4 16/17] dmaengine: idxd: add new wq state for mdev
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (14 preceding siblings ...)
  2020-10-30 18:52 ` [PATCH v4 15/17] dmaengine: idxd: add dedicated wq mdev type Dave Jiang
@ 2020-10-30 18:52 ` Dave Jiang
  2020-10-30 18:52 ` [PATCH v4 17/17] dmaengine: idxd: add error notification from host driver to mediated device Dave Jiang
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:52 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

When a dedicated wq is enabled as an mdev, the wq must be disabled on the
device in order to program the pasid into the wq. Introduce a software-only
wq state, IDXD_WQ_LOCKED, to prevent the user from modifying the
configuration while the mdev wq is in this state. Because the wq is not in
the DISABLED state, any modification to the configuration is rejected.
Because it is not in the ENABLED state either, any action allowed only in
the ENABLED state is rejected as well.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/idxd.h  |    1 +
 drivers/dma/idxd/mdev.c  |    4 +++-
 drivers/dma/idxd/sysfs.c |    2 ++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 4e583fdd15d2..03275ad9e849 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -61,6 +61,7 @@ struct idxd_group {
 enum idxd_wq_state {
 	IDXD_WQ_DISABLED = 0,
 	IDXD_WQ_ENABLED,
+	IDXD_WQ_LOCKED,
 };
 
 enum idxd_wq_flag {
diff --git a/drivers/dma/idxd/mdev.c b/drivers/dma/idxd/mdev.c
index 16b56f8f7fc1..3db7717a10c0 100644
--- a/drivers/dma/idxd/mdev.c
+++ b/drivers/dma/idxd/mdev.c
@@ -84,8 +84,10 @@ static void idxd_vdcm_init(struct vdcm_idxd *vidxd)
 
 	vidxd_mmio_init(vidxd);
 
-	if (wq_dedicated(wq) && wq->state == IDXD_WQ_ENABLED)
+	if (wq_dedicated(wq) && wq->state == IDXD_WQ_ENABLED) {
 		idxd_wq_disable(wq, NULL);
+		wq->state = IDXD_WQ_LOCKED;
+	}
 }
 
 static void idxd_vdcm_release(struct mdev_device *mdev)
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index 5b79d9019f2e..3bbbd413980e 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -821,6 +821,8 @@ static ssize_t wq_state_show(struct device *dev,
 		return sprintf(buf, "disabled\n");
 	case IDXD_WQ_ENABLED:
 		return sprintf(buf, "enabled\n");
+	case IDXD_WQ_LOCKED:
+		return sprintf(buf, "locked\n");
 	}
 
 	return sprintf(buf, "unknown\n");
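
The effect of the LOCKED state described in the commit message above can be
sketched as a small user-space model. This is only an illustration, not the
driver code: struct wq_model and the wq_model_*() helpers are hypothetical
names, and returning -EPERM for a rejected request is an assumption about
how the surrounding sysfs code behaves.

```c
#include <errno.h>
#include <stdbool.h>

/* Software view of the wq state, mirroring enum idxd_wq_state in the patch. */
enum wq_state {
	WQ_DISABLED = 0,
	WQ_ENABLED,
	WQ_LOCKED,	/* software-only state: wq is owned by an mdev */
};

/* Hypothetical stand-in for struct idxd_wq; only the fields needed here. */
struct wq_model {
	enum wq_state state;
	bool dedicated;
	unsigned int size;
};

/* Configuration changes are accepted only while the wq is DISABLED. */
static int wq_model_set_size(struct wq_model *wq, unsigned int size)
{
	if (wq->state != WQ_DISABLED)
		return -EPERM;	/* assumed error code */
	wq->size = size;
	return 0;
}

/* Actions reserved for a running wq are accepted only while ENABLED. */
static int wq_model_submit(const struct wq_model *wq)
{
	return wq->state == WQ_ENABLED ? 0 : -EPERM;
}

/* Mirrors the state transition added to idxd_vdcm_init(). */
static void wq_model_mdev_init(struct wq_model *wq)
{
	/* The real driver also calls idxd_wq_disable() before this point. */
	if (wq->dedicated && wq->state == WQ_ENABLED)
		wq->state = WQ_LOCKED;
}

/* String reported for each state, as in wq_state_show(). */
static const char *wq_model_state_name(enum wq_state state)
{
	switch (state) {
	case WQ_DISABLED:
		return "disabled";
	case WQ_ENABLED:
		return "enabled";
	case WQ_LOCKED:
		return "locked";
	}
	return "unknown";
}
```

In this model a LOCKED wq fails both the "state == DISABLED" check guarding
configuration writes and the "state == ENABLED" check guarding runtime
actions, which is the property the commit message relies on.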



^ permalink raw reply	[flat|nested] 123+ messages in thread

* [PATCH v4 17/17] dmaengine: idxd: add error notification from host driver to mediated device
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (15 preceding siblings ...)
  2020-10-30 18:52 ` [PATCH v4 16/17] dmaengine: idxd: add new wq state for mdev Dave Jiang
@ 2020-10-30 18:52 ` Dave Jiang
  2020-10-30 18:58 ` [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Jason Gunthorpe
  2020-10-30 20:48 ` Thomas Gleixner
  18 siblings, 0 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 18:52 UTC (permalink / raw)
  To: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

When a device error occurs, the mediated device needs to be notified so
that it can notify the guest of the device error. Add support to notify the
specific mdev when an error is wq specific and to broadcast errors to all
mdevs when it is a generic device error.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/idxd.h |   12 ++++++++++++
 drivers/dma/idxd/irq.c  |    4 ++++
 drivers/dma/idxd/vdev.c |   32 ++++++++++++++++++++++++++++++++
 drivers/dma/idxd/vdev.h |    1 +
 4 files changed, 49 insertions(+)

diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 03275ad9e849..2f9e44bcd436 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -393,4 +393,16 @@ int idxd_mdev_host_init(struct idxd_device *idxd);
 void idxd_mdev_host_release(struct idxd_device *idxd);
 int idxd_mdev_get_pasid(struct mdev_device *mdev);
 
+#ifdef CONFIG_INTEL_IDXD_MDEV
+void idxd_vidxd_send_errors(struct idxd_device *idxd);
+void idxd_wq_vidxd_send_errors(struct idxd_wq *wq);
+#else
+static inline void idxd_vidxd_send_errors(struct idxd_device *idxd)
+{
+}
+static inline void idxd_wq_vidxd_send_errors(struct idxd_wq *wq)
+{
+}
+#endif /* CONFIG_INTEL_IDXD_MDEV */
+
 #endif
diff --git a/drivers/dma/idxd/irq.c b/drivers/dma/idxd/irq.c
index a94fce00767b..9219dcf0a34d 100644
--- a/drivers/dma/idxd/irq.c
+++ b/drivers/dma/idxd/irq.c
@@ -137,6 +137,8 @@ irqreturn_t idxd_misc_thread(int vec, void *data)
 
 			if (wq->type == IDXD_WQT_USER)
 				wake_up_interruptible(&wq->idxd_cdev.err_queue);
+			else if (wq->type == IDXD_WQT_MDEV)
+				idxd_wq_vidxd_send_errors(wq);
 		} else {
 			int i;
 
@@ -145,6 +147,8 @@ irqreturn_t idxd_misc_thread(int vec, void *data)
 
 				if (wq->type == IDXD_WQT_USER)
 					wake_up_interruptible(&wq->idxd_cdev.err_queue);
+				else if (wq->type == IDXD_WQT_MDEV)
+					idxd_wq_vidxd_send_errors(wq);
 			}
 		}
 
diff --git a/drivers/dma/idxd/vdev.c b/drivers/dma/idxd/vdev.c
index d61bc17624b9..fd42674490d6 100644
--- a/drivers/dma/idxd/vdev.c
+++ b/drivers/dma/idxd/vdev.c
@@ -942,3 +942,35 @@ void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val)
 		break;
 	}
 }
+
+static void vidxd_send_errors(struct vdcm_idxd *vidxd)
+{
+	struct idxd_device *idxd = vidxd->idxd;
+	u8 *bar0 = vidxd->bar0;
+	union sw_err_reg *swerr = (union sw_err_reg *)(bar0 + IDXD_SWERR_OFFSET);
+	union genctrl_reg *genctrl = (union genctrl_reg *)(bar0 + IDXD_GENCTRL_OFFSET);
+	int i;
+
+	if (swerr->valid) {
+		if (!swerr->overflow)
+			swerr->overflow = 1;
+		return;
+	}
+
+	lockdep_assert_held(&idxd->dev_lock);
+	for (i = 0; i < 4; i++) {
+		/* Index into bits[] only; do not advance swerr itself. */
+		swerr->bits[i] = idxd->sw_err.bits[i];
+	}
+
+	if (genctrl->softerr_int_en)
+		vidxd_send_interrupt(vidxd, 0);
+}
+
+void idxd_wq_vidxd_send_errors(struct idxd_wq *wq)
+{
+	struct vdcm_idxd *vidxd;
+
+	list_for_each_entry(vidxd, &wq->vdcm_list, list)
+		vidxd_send_errors(vidxd);
+}
diff --git a/drivers/dma/idxd/vdev.h b/drivers/dma/idxd/vdev.h
index d23e63eb7f43..98810ae95782 100644
--- a/drivers/dma/idxd/vdev.h
+++ b/drivers/dma/idxd/vdev.h
@@ -23,5 +23,6 @@ int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx);
 int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd);
 void vidxd_free_ims_entries(struct vdcm_idxd *vidxd);
 void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val);
+void idxd_wq_vidxd_send_errors(struct idxd_wq *wq);
 
 #endif
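
The latching behaviour that vidxd_send_errors() gives the virtual SWERR
register can be sketched in user space as follows. This is a simplified
model under stated assumptions, not the driver code: struct swerr_model,
struct vdev_model and the vdev_model_*() helpers are hypothetical, the
valid and overflow flags are assumed to be bits 0 and 1 of the first qword,
and a counter stands in for vidxd_send_interrupt().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SWERR_VALID	(1ULL << 0)	/* assumed bit position */
#define SWERR_OVERFLOW	(1ULL << 1)	/* assumed bit position */

/* Four qwords, like the bits[] view of union sw_err_reg. */
struct swerr_model {
	uint64_t bits[4];
};

/* Hypothetical stand-in for the per-mdev virtual device state. */
struct vdev_model {
	struct swerr_model swerr;	/* virtual SWERR register */
	bool softerr_int_en;		/* virtual GENCTRL interrupt enable */
	unsigned int irqs_sent;		/* stands in for vidxd_send_interrupt() */
};

/* Latch one host error record into a single virtual device. */
static void vdev_model_send_error(struct vdev_model *vdev,
				  const struct swerr_model *host_err)
{
	size_t i;

	/* An unread error is already latched: only flag the overflow. */
	if (vdev->swerr.bits[0] & SWERR_VALID) {
		vdev->swerr.bits[0] |= SWERR_OVERFLOW;
		return;
	}

	/* Copy all four qwords by index; the destination never moves. */
	for (i = 0; i < 4; i++)
		vdev->swerr.bits[i] = host_err->bits[i];

	if (vdev->softerr_int_en)
		vdev->irqs_sent++;
}

/* Deliver a generic device error to every virtual device. */
static void vdev_model_broadcast(struct vdev_model *vdevs, size_t count,
				 const struct swerr_model *host_err)
{
	size_t i;

	for (i = 0; i < count; i++)
		vdev_model_send_error(&vdevs[i], host_err);
}
```

The first error is copied verbatim and raises an interrupt only if the guest
enabled it; a second error arriving before the guest clears the register
leaves the first record intact and sets the overflow flag instead.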



^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (16 preceding siblings ...)
  2020-10-30 18:52 ` [PATCH v4 17/17] dmaengine: idxd: add error notification from host driver to mediated device Dave Jiang
@ 2020-10-30 18:58 ` Jason Gunthorpe
  2020-10-30 19:13   ` Dave Jiang
  2020-10-30 20:48 ` Thomas Gleixner
  18 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-10-30 18:58 UTC (permalink / raw)
  To: Dave Jiang
  Cc: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, Megha Dey, dmaengine,
	linux-kernel, linux-pci, kvm

On Fri, Oct 30, 2020 at 11:50:47AM -0700, Dave Jiang wrote:
>  .../ABI/stable/sysfs-driver-dma-idxd          |    6 +
>  Documentation/driver-api/vfio/mdev-idxd.rst   |  404 ++++++
>  MAINTAINERS                                   |    1 +
>  drivers/dma/Kconfig                           |    9 +
>  drivers/dma/idxd/Makefile                     |    2 +
>  drivers/dma/idxd/cdev.c                       |    6 +-
>  drivers/dma/idxd/device.c                     |  294 ++++-
>  drivers/dma/idxd/idxd.h                       |   67 +-
>  drivers/dma/idxd/init.c                       |   86 ++
>  drivers/dma/idxd/irq.c                        |    6 +-
>  drivers/dma/idxd/mdev.c                       | 1121 +++++++++++++++++
>  drivers/dma/idxd/mdev.h                       |  116 ++

Again, a subsystem driver belongs in the directory hierarchy of the
subsystem, not in other random places. All this mdev stuff belongs
under drivers/vfio

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-10-30 18:58 ` [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Jason Gunthorpe
@ 2020-10-30 19:13   ` Dave Jiang
  2020-10-30 19:17     ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 19:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, Megha Dey, dmaengine,
	linux-kernel, linux-pci, kvm



On 10/30/2020 11:58 AM, Jason Gunthorpe wrote:
> On Fri, Oct 30, 2020 at 11:50:47AM -0700, Dave Jiang wrote:
>>   .../ABI/stable/sysfs-driver-dma-idxd          |    6 +
>>   Documentation/driver-api/vfio/mdev-idxd.rst   |  404 ++++++
>>   MAINTAINERS                                   |    1 +
>>   drivers/dma/Kconfig                           |    9 +
>>   drivers/dma/idxd/Makefile                     |    2 +
>>   drivers/dma/idxd/cdev.c                       |    6 +-
>>   drivers/dma/idxd/device.c                     |  294 ++++-
>>   drivers/dma/idxd/idxd.h                       |   67 +-
>>   drivers/dma/idxd/init.c                       |   86 ++
>>   drivers/dma/idxd/irq.c                        |    6 +-
>>   drivers/dma/idxd/mdev.c                       | 1121 +++++++++++++++++
>>   drivers/dma/idxd/mdev.h                       |  116 ++
> 
> Again, a subsystem driver belongs in the directory hierarchy of the
> subsystem, not in other random places. All this mdev stuff belongs
> under drivers/vfio

Alex seems to have disagreed last time....
https://lore.kernel.org/dmaengine/20200917113016.425dcde7@x1.home/

And I do agree with his perspective. The mdev is an extension of the PF driver. 
It's a bit awkward to be a stand alone mdev driver under vfio/mdev/.

> 
> Jason
> 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-10-30 19:13   ` Dave Jiang
@ 2020-10-30 19:17     ` Jason Gunthorpe
  2020-10-30 19:23       ` Raj, Ashok
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-10-30 19:17 UTC (permalink / raw)
  To: Dave Jiang
  Cc: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, Megha Dey, dmaengine,
	linux-kernel, linux-pci, kvm

On Fri, Oct 30, 2020 at 12:13:48PM -0700, Dave Jiang wrote:
> 
> 
> On 10/30/2020 11:58 AM, Jason Gunthorpe wrote:
> > On Fri, Oct 30, 2020 at 11:50:47AM -0700, Dave Jiang wrote:
> > >   .../ABI/stable/sysfs-driver-dma-idxd          |    6 +
> > >   Documentation/driver-api/vfio/mdev-idxd.rst   |  404 ++++++
> > >   MAINTAINERS                                   |    1 +
> > >   drivers/dma/Kconfig                           |    9 +
> > >   drivers/dma/idxd/Makefile                     |    2 +
> > >   drivers/dma/idxd/cdev.c                       |    6 +-
> > >   drivers/dma/idxd/device.c                     |  294 ++++-
> > >   drivers/dma/idxd/idxd.h                       |   67 +-
> > >   drivers/dma/idxd/init.c                       |   86 ++
> > >   drivers/dma/idxd/irq.c                        |    6 +-
> > >   drivers/dma/idxd/mdev.c                       | 1121 +++++++++++++++++
> > >   drivers/dma/idxd/mdev.h                       |  116 ++
> > 
> > Again, a subsystem driver belongs in the directory hierarchy of the
> > subsystem, not in other random places. All this mdev stuff belongs
> > under drivers/vfio
> 
> Alex seems to have disagreed last time....
> https://lore.kernel.org/dmaengine/20200917113016.425dcde7@x1.home/

Nobody else in the kernel is splitting subsystems up anymore
 
> And I do agree with his perspective. The mdev is an extension of the PF
> driver. It's a bit awkward to be a stand alone mdev driver under vfio/mdev/.

By this logic we'd have gigantic drivers under drivers/ethernet
touching netdev, rdma, scsi, vdpa, etc just because that is where the
PF driver came from.

It is not how the kernel works. Subsystem owners are responsible for
their subsystem, drivers implementing their subsystem are under the
subsystem directory.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-10-30 19:17     ` Jason Gunthorpe
@ 2020-10-30 19:23       ` Raj, Ashok
  2020-10-30 19:30         ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Raj, Ashok @ 2020-10-30 19:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, tglx,
	alex.williamson, jacob.jun.pan, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, Megha Dey, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

On Fri, Oct 30, 2020 at 04:17:06PM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 30, 2020 at 12:13:48PM -0700, Dave Jiang wrote:
> > 
> > 
> > On 10/30/2020 11:58 AM, Jason Gunthorpe wrote:
> > > On Fri, Oct 30, 2020 at 11:50:47AM -0700, Dave Jiang wrote:
> > > >   .../ABI/stable/sysfs-driver-dma-idxd          |    6 +
> > > >   Documentation/driver-api/vfio/mdev-idxd.rst   |  404 ++++++
> > > >   MAINTAINERS                                   |    1 +
> > > >   drivers/dma/Kconfig                           |    9 +
> > > >   drivers/dma/idxd/Makefile                     |    2 +
> > > >   drivers/dma/idxd/cdev.c                       |    6 +-
> > > >   drivers/dma/idxd/device.c                     |  294 ++++-
> > > >   drivers/dma/idxd/idxd.h                       |   67 +-
> > > >   drivers/dma/idxd/init.c                       |   86 ++
> > > >   drivers/dma/idxd/irq.c                        |    6 +-
> > > >   drivers/dma/idxd/mdev.c                       | 1121 +++++++++++++++++
> > > >   drivers/dma/idxd/mdev.h                       |  116 ++
> > > 
> > > Again, a subsystem driver belongs in the directory hierarchy of the
> > > subsystem, not in other random places. All this mdev stuff belongs
> > > under drivers/vfio
> > 
> > Alex seems to have disagreed last time....
> > https://lore.kernel.org/dmaengine/20200917113016.425dcde7@x1.home/
> 
> Nobody else in the kernel is splitting subsystems up anymore
>  
> > And I do agree with his perspective. The mdev is an extension of the PF
> > driver. It's a bit awkward to be a stand alone mdev driver under vfio/mdev/.
> 
> > By this logic we'd have gigantic drivers under drivers/ethernet
> touching netdev, rdma, scsi, vdpa, etc just because that is where the
> PF driver came from.

What makes you think this is providing services like scsi/rdma/vdpa, etc.?

For DSA this plays the exact same role, not a different function
as you highlight above. These mdevs are creating DSA for virtualization
use. They aren't providing a completely different role or subsystem per se.

Cheers,
Ashok



^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-10-30 19:23       ` Raj, Ashok
@ 2020-10-30 19:30         ` Jason Gunthorpe
  2020-10-30 20:43           ` Raj, Ashok
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-10-30 19:30 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, tglx,
	alex.williamson, jacob.jun.pan, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, Megha Dey, dmaengine,
	linux-kernel, linux-pci, kvm

On Fri, Oct 30, 2020 at 12:23:25PM -0700, Raj, Ashok wrote:
> On Fri, Oct 30, 2020 at 04:17:06PM -0300, Jason Gunthorpe wrote:
> > On Fri, Oct 30, 2020 at 12:13:48PM -0700, Dave Jiang wrote:
> > > 
> > > 
> > > On 10/30/2020 11:58 AM, Jason Gunthorpe wrote:
> > > > On Fri, Oct 30, 2020 at 11:50:47AM -0700, Dave Jiang wrote:
> > > > >   .../ABI/stable/sysfs-driver-dma-idxd          |    6 +
> > > > >   Documentation/driver-api/vfio/mdev-idxd.rst   |  404 ++++++
> > > > >   MAINTAINERS                                   |    1 +
> > > > >   drivers/dma/Kconfig                           |    9 +
> > > > >   drivers/dma/idxd/Makefile                     |    2 +
> > > > >   drivers/dma/idxd/cdev.c                       |    6 +-
> > > > >   drivers/dma/idxd/device.c                     |  294 ++++-
> > > > >   drivers/dma/idxd/idxd.h                       |   67 +-
> > > > >   drivers/dma/idxd/init.c                       |   86 ++
> > > > >   drivers/dma/idxd/irq.c                        |    6 +-
> > > > >   drivers/dma/idxd/mdev.c                       | 1121 +++++++++++++++++
> > > > >   drivers/dma/idxd/mdev.h                       |  116 ++
> > > > 
> > > > Again, a subsystem driver belongs in the directory hierarchy of the
> > > > subsystem, not in other random places. All this mdev stuff belongs
> > > > under drivers/vfio
> > > 
> > > Alex seems to have disagreed last time....
> > > https://lore.kernel.org/dmaengine/20200917113016.425dcde7@x1.home/
> > 
> > Nobody else in the kernel is splitting subsystems up anymore
> >  
> > > And I do agree with his perspective. The mdev is an extension of the PF
> > > driver. It's a bit awkward to be a stand alone mdev driver under vfio/mdev/.
> > 
> > > By this logic we'd have gigantic drivers under drivers/ethernet
> > touching netdev, rdma, scsi, vdpa, etc just because that is where the
> > PF driver came from.
> 
> What makes you think this is providing services like scsi/rdma/vdpa, etc.?
> 
> For DSA this plays the exact same role, not a different function
> as you highlight above. These mdevs are creating DSA for virtualization
> use. They aren't providing a completely different role or subsystem per se.

It is a different subsystem, different maintainer, and different
reviewers.

It is a development process problem; it doesn't matter what it is
doing.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-10-30 18:51 ` [PATCH v4 06/17] PCI: add SIOV and IMS capability detection Dave Jiang
@ 2020-10-30 19:51   ` Bjorn Helgaas
  2020-10-30 21:20     ` Dave Jiang
  0 siblings, 1 reply; 123+ messages in thread
From: Bjorn Helgaas @ 2020-10-30 19:51 UTC (permalink / raw)
  To: Dave Jiang
  Cc: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, dmaengine, linux-kernel,
	linux-pci, kvm

On Fri, Oct 30, 2020 at 11:51:32AM -0700, Dave Jiang wrote:
> Intel Scalable I/O Virtualization (SIOV) enables sharing of I/O devices
> across isolated domains through PASID based sub-device partitioning.
> Interrupt Message Storage (IMS) enables devices to store the interrupt
> messages in a device-specific optimized manner without the scalability
> restrictions of the PCIe defined MSI-X capability. IMS is one of the
> features supported under SIOV.
>
> Move SIOV detection code from Intel iommu driver code to common PCI. Making
> the detection code common allows supported accelerator drivers to query the
> PCI core for SIOV and IMS capabilities. The support code will add the
> ability to query the PCI DVSEC capabilities for the SIOV cap.

This patch really does not include anything related to SIOV other than
adding a little code to *find* the capability.  It doesn't add
anything that actually *uses* it.  I think this patch should simply
add pci_find_dvsec(), and it doesn't need any of this SIOV or IMS
description.

> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Baolu Lu <baolu.lu@intel.com>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/intel/iommu.c   |   31 ++-----------------------
>  drivers/pci/Kconfig           |   15 ++++++++++++
>  drivers/pci/Makefile          |    2 ++
>  drivers/pci/dvsec.c           |   40 +++++++++++++++++++++++++++++++++
>  drivers/pci/siov.c            |   50 +++++++++++++++++++++++++++++++++++++++++
>  include/linux/pci-siov.h      |   18 +++++++++++++++
>  include/linux/pci.h           |    3 ++
>  include/uapi/linux/pci_regs.h |    4 +++
>  8 files changed, 134 insertions(+), 29 deletions(-)
>  create mode 100644 drivers/pci/dvsec.c
>  create mode 100644 drivers/pci/siov.c
>  create mode 100644 include/linux/pci-siov.h
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 3e77a88b236c..d9335f590b42 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -36,6 +36,7 @@
>  #include <linux/tboot.h>
>  #include <linux/dmi.h>
>  #include <linux/pci-ats.h>
> +#include <linux/pci-siov.h>
>  #include <linux/memblock.h>
>  #include <linux/dma-map-ops.h>
>  #include <linux/dma-direct.h>
> @@ -5883,34 +5884,6 @@ static int intel_iommu_disable_auxd(struct device *dev)
>  	return 0;
>  }
>  
> -/*
> - * A PCI express designated vendor specific extended capability is defined
> - * in the section 3.7 of Intel scalable I/O virtualization technical spec
> - * for system software and tools to detect endpoint devices supporting the
> - * Intel scalable IO virtualization without host driver dependency.
> - *
> - * Returns the address of the matching extended capability structure within
> - * the device's PCI configuration space or 0 if the device does not support
> - * it.
> - */
> -static int siov_find_pci_dvsec(struct pci_dev *pdev)
> -{
> -	int pos;
> -	u16 vendor, id;
> -
> -	pos = pci_find_next_ext_capability(pdev, 0, 0x23);
> -	while (pos) {
> -		pci_read_config_word(pdev, pos + 4, &vendor);
> -		pci_read_config_word(pdev, pos + 8, &id);
> -		if (vendor == PCI_VENDOR_ID_INTEL && id == 5)
> -			return pos;
> -
> -		pos = pci_find_next_ext_capability(pdev, pos, 0x23);
> -	}
> -
> -	return 0;
> -}
> -
>  static bool
>  intel_iommu_dev_has_feat(struct device *dev, enum iommu_dev_features feat)
>  {
> @@ -5925,7 +5898,7 @@ intel_iommu_dev_has_feat(struct device *dev, enum iommu_dev_features feat)
>  		if (ret < 0)
>  			return false;
>  
> -		return !!siov_find_pci_dvsec(to_pci_dev(dev));
> +		return pci_siov_supported(to_pci_dev(dev));
>  	}
>  
>  	if (feat == IOMMU_DEV_FEAT_SVA) {
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 0c473d75e625..cf7f4d17d8cc 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -161,6 +161,21 @@ config PCI_PASID
>  
>  	  If unsure, say N.
>  
> +config PCI_DVSEC
> +	bool
> +
> +config PCI_SIOV
> +	select PCI_PASID

This patch has nothing to do with PCI_PASID.  If you want to add this
select later in a patch that *does* add something that requires
PCI_PASID, that's OK.

> +	select PCI_DVSEC
> +	bool "PCI SIOV support"
> +	help
> +	  Scalable I/O Virtualization enables sharing of I/O devices across isolated
> +	  domains through PASID based sub-device partitioning. One of the sub features
> +	  supported by SIOV is Interrupt Message Storage (IMS). Select this option if
> +	  you want to compile the support into your kernel.
> +	  If unsure, say N.
> +
>  config PCI_P2PDMA
>  	bool "PCI peer-to-peer transfer support"
>  	depends on ZONE_DEVICE
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 522d2b974e91..653a1d69b0fc 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -20,6 +20,8 @@ obj-$(CONFIG_PCI_QUIRKS)	+= quirks.o
>  obj-$(CONFIG_HOTPLUG_PCI)	+= hotplug/
>  obj-$(CONFIG_PCI_MSI)		+= msi.o
>  obj-$(CONFIG_PCI_ATS)		+= ats.o
> +obj-$(CONFIG_PCI_DVSEC)		+= dvsec.o
> +obj-$(CONFIG_PCI_SIOV)		+= siov.o
>  obj-$(CONFIG_PCI_IOV)		+= iov.o
>  obj-$(CONFIG_PCI_BRIDGE_EMUL)	+= pci-bridge-emul.o
>  obj-$(CONFIG_PCI_LABEL)		+= pci-label.o
> diff --git a/drivers/pci/dvsec.c b/drivers/pci/dvsec.c
> new file mode 100644
> index 000000000000..e49b079f0717
> --- /dev/null
> +++ b/drivers/pci/dvsec.c
> @@ -0,0 +1,40 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCI DVSEC helper functions
> + * Copyright (C) 2020 Intel Corp.
> + */
> +
> +#include <linux/export.h>
> +#include <linux/pci.h>
> +#include <uapi/linux/pci_regs.h>
> +#include "pci.h"
> +
> +/**
> + * pci_find_dvsec - return position of DVSEC with provided vendor and dvsec id
> + * @dev: the PCI device
> + * @vendor: Vendor for the DVSEC
> + * @id: the DVSEC cap id
> + *
> + * Return the offset of DVSEC on success or -ENOTSUPP if not found

s/vendor/Vendor/
s/dvsec/DVSEC/
s/id/ID/ twice above

Please put this function in drivers/pci/pci.c next to
pci_find_ext_capability().  I don't think it's worth making a new file
just for this.

> + */
> +int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id)
> +{
> +	u16 dev_vendor, dev_id;
> +	int pos;
> +
> +	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DVSEC);
> +	if (!pos)
> +		return -ENOTSUPP;
> +
> +	while (pos) {
> +		pci_read_config_word(dev, pos + PCI_DVSEC_HEADER1, &dev_vendor);
> +		pci_read_config_word(dev, pos + PCI_DVSEC_HEADER2, &dev_id);
> +		if (dev_vendor == vendor && dev_id == id)
> +			return pos;
> +
> +		pos = pci_find_next_ext_capability(dev, pos, PCI_EXT_CAP_ID_DVSEC);
> +	}
> +
> +	return -ENOTSUPP;
> +}
> +EXPORT_SYMBOL_GPL(pci_find_dvsec);
> diff --git a/drivers/pci/siov.c b/drivers/pci/siov.c
> new file mode 100644
> index 000000000000..6147e6ae5832
> --- /dev/null
> +++ b/drivers/pci/siov.c
> @@ -0,0 +1,50 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Intel Scalable I/O Virtualization support
> + * Copyright (C) 2020 Intel Corp.
> + */
> +
> +#include <linux/export.h>
> +#include <linux/pci.h>
> +#include <linux/pci-siov.h>
> +#include <uapi/linux/pci_regs.h>
> +#include "pci.h"
> +
> +/*
> + * A PCI express designated vendor specific extended capability is defined
> + * in the section 3.7 of Intel scalable I/O virtualization technical spec
> + * for system software and tools to detect endpoint devices supporting the
> + * Intel scalable IO virtualization without host driver dependency.
> + */
> +
> +/**
> + * pci_siov_supported - check if the device can use SIOV
> + * @dev: the PCI device
> + *
> + * Returns true if the device supports SIOV,  false otherwise.
> + */
> +bool pci_siov_supported(struct pci_dev *dev)
> +{
> +	return pci_find_dvsec(dev, PCI_VENDOR_ID_INTEL, PCI_DVSEC_ID_INTEL_SIOV) < 0 ? false : true;
> +}
> +EXPORT_SYMBOL_GPL(pci_siov_supported);
> +
> +/**
> + * pci_ims_supported - check if the device can use IMS
> + * @dev: the PCI device
> + *
> + * Returns true if the device supports IMS, false otherwise.
> + */
> +bool pci_ims_supported(struct pci_dev *dev)
> +{
> +	int pos;
> +	u32 caps;
> +
> +	pos = pci_find_dvsec(dev, PCI_VENDOR_ID_INTEL, PCI_DVSEC_ID_INTEL_SIOV);
> +	if (pos < 0)
> +		return false;
> +
> +	pci_read_config_dword(dev, pos + PCI_DVSEC_INTEL_SIOV_CAP, &caps);
> +	return (caps & PCI_DVSEC_INTEL_SIOV_CAP_IMS) ? true : false;
> +}
> +EXPORT_SYMBOL_GPL(pci_ims_supported);

I don't really see the point of these *_supported() functions.  If the
caller wants to use them, I would expect it to call
pci_find_dvsec(PCI_DVSEC_ID_INTEL_SIOV) itself anyway.

But there *are* no calls to pci_find_dvsec(PCI_DVSEC_ID_INTEL_SIOV).
So apparently all you care about is whether the capability *exists*,
and you don't need any information at all from the capability
registers except PCI_DVSEC_INTEL_SIOV_CAP_IMS?  That seems a little
weird.

I don't think it's worth adding a whole new file just for this.  The
only value the PCI core is adding here is a way to locate the
PCI_DVSEC_ID_INTEL_SIOV capability.
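
To make the point concrete, here is a rough user-space sketch of what
locating a DVSEC involves and how a caller could then read the SIOV
capability register itself. It walks a byte array standing in for the 4 KB
config space rather than using the kernel accessors. The helper names
(cfg_read16(), cfg_read32(), find_dvsec(), ims_supported()) are
hypothetical, and returning 0 when nothing is found is an assumption chosen
to match pci_find_ext_capability(), not what the patch does.

```c
#include <stdint.h>

#define CFG_SPACE_SIZE		4096
#define EXT_CAP_START		0x100	/* first extended capability */
#define EXT_CAP_ID_DVSEC	0x23
#define DVSEC_HEADER1		0x4	/* vendor ID in bits 15:0 */
#define DVSEC_HEADER2		0x8	/* DVSEC ID in bits 15:0 */

#define VENDOR_ID_INTEL		0x8086
#define DVSEC_ID_INTEL_SIOV	0x0005
#define SIOV_CAP		0x14	/* offsets taken from the patch */
#define SIOV_CAP_IMS		0x00000001

/* Little-endian config space reads from the backing byte array. */
static uint16_t cfg_read16(const uint8_t *cfg, int off)
{
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

static uint32_t cfg_read32(const uint8_t *cfg, int off)
{
	return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/* Return the offset of the matching DVSEC, or 0 if there is none. */
static int find_dvsec(const uint8_t *cfg, uint16_t vendor, uint16_t id)
{
	int pos = EXT_CAP_START;
	int ttl = (CFG_SPACE_SIZE - EXT_CAP_START) / 8;	/* bounds the walk */

	while (pos >= EXT_CAP_START && pos <= CFG_SPACE_SIZE - 12 && ttl-- > 0) {
		uint32_t hdr = cfg_read32(cfg, pos);

		if (hdr == 0)	/* empty capability list */
			break;

		if ((hdr & 0xffff) == EXT_CAP_ID_DVSEC &&
		    cfg_read16(cfg, pos + DVSEC_HEADER1) == vendor &&
		    cfg_read16(cfg, pos + DVSEC_HEADER2) == id)
			return pos;

		pos = (hdr >> 20) & 0xffc;	/* next capability offset */
	}

	return 0;
}

/* What a caller would do on its own once it has the DVSEC offset. */
static int ims_supported(const uint8_t *cfg)
{
	int pos = find_dvsec(cfg, VENDOR_ID_INTEL, DVSEC_ID_INTEL_SIOV);

	if (!pos || pos + SIOV_CAP + 4 > CFG_SPACE_SIZE)
		return 0;

	return (cfg_read32(cfg, pos + SIOV_CAP) & SIOV_CAP_IMS) ? 1 : 0;
}
```

With a generic lookup like this in the PCI core, the SIOV and IMS checks
shrink to a few lines in whichever driver actually needs them.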

> diff --git a/include/linux/pci-siov.h b/include/linux/pci-siov.h
> new file mode 100644
> index 000000000000..a8a4eb5f4634
> --- /dev/null
> +++ b/include/linux/pci-siov.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef LINUX_PCI_SIOV_H
> +#define LINUX_PCI_SIOV_H
> +
> +#include <linux/pci.h>
> +
> +#ifdef CONFIG_PCI_SIOV
> +/* Scalable I/O Virtualization */
> +bool pci_siov_supported(struct pci_dev *dev);
> +bool pci_ims_supported(struct pci_dev *dev);
> +#else /* CONFIG_PCI_SIOV */
> +static inline bool pci_siov_supported(struct pci_dev *d)
> +{ return false; }
> +static inline bool pci_ims_supported(struct pci_dev *d)
> +{ return false; }
> +#endif /* CONFIG_PCI_SIOV */
> +
> +#endif /* LINUX_PCI_SIOV_H */

What's the benefit to putting these declarations in a separate
pci-siov.h as opposed to putting them in pci.h itself?  That's what we
do for things like MSI, IOV, etc.

> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 22207a79762c..4710f09b43b1 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1070,6 +1070,7 @@ int pci_find_next_ext_capability(struct pci_dev *dev, int pos, int cap);
>  int pci_find_ht_capability(struct pci_dev *dev, int ht_cap);
>  int pci_find_next_ht_capability(struct pci_dev *dev, int pos, int ht_cap);
>  struct pci_bus *pci_find_next_bus(const struct pci_bus *from);
> +int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id);
>  
>  u64 pci_get_dsn(struct pci_dev *dev);
>  
> @@ -1726,6 +1727,8 @@ static inline int pci_find_next_capability(struct pci_dev *dev, u8 post,
>  { return 0; }
>  static inline int pci_find_ext_capability(struct pci_dev *dev, int cap)
>  { return 0; }
> +static inline int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id)
> +{ return 0; }
>  
>  static inline u64 pci_get_dsn(struct pci_dev *dev)
>  { return 0; }
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 8f8bd2318c6c..3532528441ef 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -1071,6 +1071,10 @@
>  #define PCI_DVSEC_HEADER1		0x4 /* Designated Vendor-Specific Header1 */
>  #define PCI_DVSEC_HEADER2		0x8 /* Designated Vendor-Specific Header2 */
>  
> +#define PCI_DVSEC_ID_INTEL_SIOV		0x5
> +#define PCI_DVSEC_INTEL_SIOV_CAP	0x14
> +#define PCI_DVSEC_INTEL_SIOV_CAP_IMS	0x1

Convention in this file is to write constants in the register width,
e.g.,

  #define PCI_DVSEC_ID_INTEL_SIOV		0x0005
  #define PCI_DVSEC_INTEL_SIOV_CAP_IMS	0x00000001

You can learn this by looking at the surrounding definitions.

>  /* Data Link Feature */
>  #define PCI_DLF_CAP		0x04	/* Capabilities Register */
>  #define  PCI_DLF_EXCHANGE_ENABLE	0x80000000  /* Data Link Feature Exchange Enable */
> 
> 


* Re: [PATCH v4 02/17] iommu/vt-d: Add DEV-MSI support
  2020-10-30 18:51 ` [PATCH v4 02/17] iommu/vt-d: Add DEV-MSI support Dave Jiang
@ 2020-10-30 20:31   ` Thomas Gleixner
  2020-10-30 20:52     ` Dave Jiang
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2020-10-30 20:31 UTC (permalink / raw)
  To: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

On Fri, Oct 30 2020 at 11:51, Dave Jiang wrote:
> From: Megha Dey <megha.dey@intel.com>

This conflicts with

     git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/apic

Thanks,

        tglx



* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-10-30 19:30         ` Jason Gunthorpe
@ 2020-10-30 20:43           ` Raj, Ashok
  2020-10-30 22:54             ` Jason Gunthorpe
  2020-10-31  2:50             ` Thomas Gleixner
  0 siblings, 2 replies; 123+ messages in thread
From: Raj, Ashok @ 2020-10-30 20:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, tglx,
	alex.williamson, jacob.jun.pan, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, Megha Dey, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

On Fri, Oct 30, 2020 at 04:30:45PM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 30, 2020 at 12:23:25PM -0700, Raj, Ashok wrote:
> > On Fri, Oct 30, 2020 at 04:17:06PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Oct 30, 2020 at 12:13:48PM -0700, Dave Jiang wrote:
> > > > 
> > > > 
> > > > On 10/30/2020 11:58 AM, Jason Gunthorpe wrote:
> > > > > On Fri, Oct 30, 2020 at 11:50:47AM -0700, Dave Jiang wrote:
> > > > > >   .../ABI/stable/sysfs-driver-dma-idxd          |    6 +
> > > > > >   Documentation/driver-api/vfio/mdev-idxd.rst   |  404 ++++++
> > > > > >   MAINTAINERS                                   |    1 +
> > > > > >   drivers/dma/Kconfig                           |    9 +
> > > > > >   drivers/dma/idxd/Makefile                     |    2 +
> > > > > >   drivers/dma/idxd/cdev.c                       |    6 +-
> > > > > >   drivers/dma/idxd/device.c                     |  294 ++++-
> > > > > >   drivers/dma/idxd/idxd.h                       |   67 +-
> > > > > >   drivers/dma/idxd/init.c                       |   86 ++
> > > > > >   drivers/dma/idxd/irq.c                        |    6 +-
> > > > > >   drivers/dma/idxd/mdev.c                       | 1121 +++++++++++++++++
> > > > > >   drivers/dma/idxd/mdev.h                       |  116 ++
> > > > > 
> > > > > > Again, a subsystem driver belongs in the directory hierarchy of the
> > > > > subsystem, not in other random places. All this mdev stuff belongs
> > > > > under drivers/vfio
> > > > 
> > > > Alex seems to have disagreed last time....
> > > > https://lore.kernel.org/dmaengine/20200917113016.425dcde7@x1.home/
> > > 
> > > Nobody else in the kernel is splitting subsystems up anymore
> > >  
> > > > And I do agree with his perspective. The mdev is an extension of the PF
> > > > driver. It's a bit awkward to be a stand alone mdev driver under vfio/mdev/.
> > > 
> > > > By this logic we'd have gigantic drivers under drivers/ethernet
> > > touching netdev, rdma, scsi, vdpa, etc just because that is where the
> > > PF driver came from.
> > 
> > What makes you think this is providing services like scsi/rdma/vdpa etc.. ?
> > 
> > For DSA this plays the exact same role, not a different function
> > as you highlight above. These mdevs are creating DSA for virtualization
> > use. They aren't providing a completely different role or subsystem per se.
> 
> It is a different subsystem, different maintainer, and different
> reviewers.
> 
> It is a development process problem, it doesn't matter what it is
> doing.

So drawing that parallel, do you expect all drivers that call
pci_register_driver() to be located in drivers/pci? Aren't they scattered
all over the place: ata, scsi, platform drivers and such?

As Alex pointed out, i915 and a handful of s390 drivers that are mdev users
are not in drivers/vfio. Are you saying those drivers don't get reviewed?

This is no different than a PF driver offering VF services. It's a logical
extension.

Reviews happen for mdev users today. What you suggest seems like cutting
the feet to fit the shoe. Unless the maintainers are asking for things
to be split just because they call mdev_register_device(), that practice
doesn't exist, and it would be as weird as wanting to move all callers of
pci_register_driver().

Your argument seems interesting, even entertaining :-). But honestly I'm not finding it
practical :-). So every caller of mmu_notifier_register() needs to be in
mm?

What you mention for different functions makes absolute sense, and I'm not
arguing against that. But this ain't that.

And we just follow the asks of the maintainer.

I know you aren't going to give up, but there is little we can do. I want
the maintainers to make that call, and I'm not going to add more noise to this.

Cheers,
Ashok


* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
                   ` (17 preceding siblings ...)
  2020-10-30 18:58 ` [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Jason Gunthorpe
@ 2020-10-30 20:48 ` Thomas Gleixner
  2020-10-30 20:59   ` Dave Jiang
  18 siblings, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2020-10-30 20:48 UTC (permalink / raw)
  To: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: Megha Dey, dmaengine, linux-kernel, linux-pci, kvm

On Fri, Oct 30 2020 at 11:50, Dave Jiang wrote:
> The code has dependency on Thomas’s MSI restructuring patch series:
> https://lore.kernel.org/lkml/20200826111628.794979401@linutronix.de/

which is outdated and no longer applicable.

Thanks,

        tglx


* Re: [PATCH v4 02/17] iommu/vt-d: Add DEV-MSI support
  2020-10-30 20:31   ` Thomas Gleixner
@ 2020-10-30 20:52     ` Dave Jiang
  0 siblings, 0 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 20:52 UTC (permalink / raw)
  To: Thomas Gleixner, vkoul, megha.dey, maz, bhelgaas,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav, rafael, netanelg,
	shahafs, yan.y.zhao, pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm



On 10/30/2020 1:31 PM, Thomas Gleixner wrote:
> On Fri, Oct 30 2020 at 11:51, Dave Jiang wrote:
>> From: Megha Dey <megha.dey@intel.com>
> 
> This conflicts with
> 
>       git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/apic

I'll get that fixed up. Thanks!

> 
> Thanks,
> 
>          tglx
> 


* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-10-30 20:48 ` Thomas Gleixner
@ 2020-10-30 20:59   ` Dave Jiang
  2020-10-30 22:10     ` Thomas Gleixner
  0 siblings, 1 reply; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 20:59 UTC (permalink / raw)
  To: Thomas Gleixner, vkoul, megha.dey, maz, bhelgaas,
	alex.williamson, jacob.jun.pan, ashok.raj, jgg, yi.l.liu,
	baolu.lu, kevin.tian, sanjay.k.kumar, tony.luck, jing.lin,
	dan.j.williams, kwankhede, eric.auger, parav, rafael, netanelg,
	shahafs, yan.y.zhao, pbonzini, samuel.ortiz, mona.hossain
  Cc: Megha Dey, dmaengine, linux-kernel, linux-pci, kvm



On 10/30/2020 1:48 PM, Thomas Gleixner wrote:
> On Fri, Oct 30 2020 at 11:50, Dave Jiang wrote:
>> The code has dependency on Thomas’s MSI restructuring patch series:
>> https://lore.kernel.org/lkml/20200826111628.794979401@linutronix.de/
> 
> which is outdated and no longer applicable.

Yes.... I wasn't sure how to point to these patches from you as a dependency.

irqdomain/msi: Provide msi_alloc/free_store() callbacks
platform-msi: Add device MSI infrastructure
genirq/msi: Provide and use msi_domain_set_default_info_flags()
genirq/proc: Take buslock on affinity write
platform-msi: Provide default irq_chip:: Ack
x86/msi: Rename and rework pci_msi_prepare() to cover non-PCI MSI
x86/irq: Add DEV_MSI allocation type

Do I need to include these patches in my series? Thanks!

> 
> Thanks,
> 
>          tglx
> 


* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-10-30 19:51   ` Bjorn Helgaas
@ 2020-10-30 21:20     ` Dave Jiang
  2020-10-30 21:50       ` Bjorn Helgaas
  2020-10-30 22:45       ` Jason Gunthorpe
  0 siblings, 2 replies; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 21:20 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, dmaengine, linux-kernel,
	linux-pci, kvm



On 10/30/2020 12:51 PM, Bjorn Helgaas wrote:
> On Fri, Oct 30, 2020 at 11:51:32AM -0700, Dave Jiang wrote:
>> Intel Scalable I/O Virtualization (SIOV) enables sharing of I/O devices
>> across isolated domains through PASID based sub-device partitioning.
>> Interrupt Message Storage (IMS) enables devices to store the interrupt
>> messages in a device-specific optimized manner without the scalability
>> restrictions of the PCIe defined MSI-X capability. IMS is one of the
>> features supported under SIOV.
>>
>> Move SIOV detection code from Intel iommu driver code to common PCI. Making
>> the detection code common allows supported accelerator drivers to query the
>> PCI core for SIOV and IMS capabilities. The support code will add the
>> ability to query the PCI DVSEC capabilities for the SIOV cap.
> 
> This patch really does not include anything related to SIOV other than
> adding a little code to *find* the capability.  It doesn't add
> anything that actually *uses* it.  I think this patch should simply
> add pci_find_dvsec(), and it doesn't need any of this SIOV or IMS
> description.
> 

Thanks for the review Bjorn! I'll carve out a patch with just find_dvsec() and 
apply your comments and recommendations.

So the intel-iommu driver checks for the SIOV cap. And the idxd driver checks 
for SIOV and IMS cap. There will be other upcoming drivers that will check for 
such cap too. It is Intel vendor specific right now, but SIOV is public and 
other vendors may implement to the spec. Is there a good place to put the common 
capability check for that?

There are some other fields in the SIOV dvsec cap, but presently they are not 
being utilized. The idxd driver is only interested in making sure that SIOV and 
IMS (sub feature) support are present at this point.

- Dave

>> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Baolu Lu <baolu.lu@intel.com>
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>> Reviewed-by: Ashok Raj <ashok.raj@intel.com>
>> ---
>>   drivers/iommu/intel/iommu.c   |   31 ++-----------------------
>>   drivers/pci/Kconfig           |   15 ++++++++++++
>>   drivers/pci/Makefile          |    2 ++
>>   drivers/pci/dvsec.c           |   40 +++++++++++++++++++++++++++++++++
>>   drivers/pci/siov.c            |   50 +++++++++++++++++++++++++++++++++++++++++
>>   include/linux/pci-siov.h      |   18 +++++++++++++++
>>   include/linux/pci.h           |    3 ++
>>   include/uapi/linux/pci_regs.h |    4 +++
>>   8 files changed, 134 insertions(+), 29 deletions(-)
>>   create mode 100644 drivers/pci/dvsec.c
>>   create mode 100644 drivers/pci/siov.c
>>   create mode 100644 include/linux/pci-siov.h
>>
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index 3e77a88b236c..d9335f590b42 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -36,6 +36,7 @@
>>   #include <linux/tboot.h>
>>   #include <linux/dmi.h>
>>   #include <linux/pci-ats.h>
>> +#include <linux/pci-siov.h>
>>   #include <linux/memblock.h>
>>   #include <linux/dma-map-ops.h>
>>   #include <linux/dma-direct.h>
>> @@ -5883,34 +5884,6 @@ static int intel_iommu_disable_auxd(struct device *dev)
>>   	return 0;
>>   }
>>   
>> -/*
>> - * A PCI express designated vendor specific extended capability is defined
>> - * in the section 3.7 of Intel scalable I/O virtualization technical spec
>> - * for system software and tools to detect endpoint devices supporting the
>> - * Intel scalable IO virtualization without host driver dependency.
>> - *
>> - * Returns the address of the matching extended capability structure within
>> - * the device's PCI configuration space or 0 if the device does not support
>> - * it.
>> - */
>> -static int siov_find_pci_dvsec(struct pci_dev *pdev)
>> -{
>> -	int pos;
>> -	u16 vendor, id;
>> -
>> -	pos = pci_find_next_ext_capability(pdev, 0, 0x23);
>> -	while (pos) {
>> -		pci_read_config_word(pdev, pos + 4, &vendor);
>> -		pci_read_config_word(pdev, pos + 8, &id);
>> -		if (vendor == PCI_VENDOR_ID_INTEL && id == 5)
>> -			return pos;
>> -
>> -		pos = pci_find_next_ext_capability(pdev, pos, 0x23);
>> -	}
>> -
>> -	return 0;
>> -}
>> -
>>   static bool
>>   intel_iommu_dev_has_feat(struct device *dev, enum iommu_dev_features feat)
>>   {
>> @@ -5925,7 +5898,7 @@ intel_iommu_dev_has_feat(struct device *dev, enum iommu_dev_features feat)
>>   		if (ret < 0)
>>   			return false;
>>   
>> -		return !!siov_find_pci_dvsec(to_pci_dev(dev));
>> +		return pci_siov_supported(to_pci_dev(dev));
>>   	}
>>   
>>   	if (feat == IOMMU_DEV_FEAT_SVA) {
>> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
>> index 0c473d75e625..cf7f4d17d8cc 100644
>> --- a/drivers/pci/Kconfig
>> +++ b/drivers/pci/Kconfig
>> @@ -161,6 +161,21 @@ config PCI_PASID
>>   
>>   	  If unsure, say N.
>>   
>> +config PCI_DVSEC
>> +	bool
>> +
>> +config PCI_SIOV
>> +	select PCI_PASID
> 
> This patch has nothing to do with PCI_PASID.  If you want to add this
> select later in a patch that *does* add something that requires
> PCI_PASID, that's OK.
> 
>> +	select PCI_DVSEC
>> +	bool "PCI SIOV support"
>> +	help
>> +	  Scalable I/O Virtualzation enables sharing of I/O devices across isolated
>> +	  domains through PASID based sub-device partitioning. One of the sub features
>> +	  supported by SIOV is Inetrrupt Message Storage (IMS). Select this option if
>> +	  you want to compile the support into your kernel.
>> +	  If unsure, say N.
>> +
>>   config PCI_P2PDMA
>>   	bool "PCI peer-to-peer transfer support"
>>   	depends on ZONE_DEVICE
>> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
>> index 522d2b974e91..653a1d69b0fc 100644
>> --- a/drivers/pci/Makefile
>> +++ b/drivers/pci/Makefile
>> @@ -20,6 +20,8 @@ obj-$(CONFIG_PCI_QUIRKS)	+= quirks.o
>>   obj-$(CONFIG_HOTPLUG_PCI)	+= hotplug/
>>   obj-$(CONFIG_PCI_MSI)		+= msi.o
>>   obj-$(CONFIG_PCI_ATS)		+= ats.o
>> +obj-$(CONFIG_PCI_DVSEC)		+= dvsec.o
>> +obj-$(CONFIG_PCI_SIOV)		+= siov.o
>>   obj-$(CONFIG_PCI_IOV)		+= iov.o
>>   obj-$(CONFIG_PCI_BRIDGE_EMUL)	+= pci-bridge-emul.o
>>   obj-$(CONFIG_PCI_LABEL)		+= pci-label.o
>> diff --git a/drivers/pci/dvsec.c b/drivers/pci/dvsec.c
>> new file mode 100644
>> index 000000000000..e49b079f0717
>> --- /dev/null
>> +++ b/drivers/pci/dvsec.c
>> @@ -0,0 +1,40 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * PCI DVSEC helper functions
>> + * Copyright (C) 2020 Intel Corp.
>> + */
>> +
>> +#include <linux/export.h>
>> +#include <linux/pci.h>
>> +#include <uapi/linux/pci_regs.h>
>> +#include "pci.h"
>> +
>> +/**
>> + * pci_find_dvsec - return position of DVSEC with provided vendor and dvsec id
>> + * @dev: the PCI device
>> + * @vendor: Vendor for the DVSEC
>> + * @id: the DVSEC cap id
>> + *
>> + * Return the offset of DVSEC on success or -ENOTSUPP if not found
> 
> s/vendor/Vendor/
> s/dvsec/DVSEC/
> s/id/ID/ twice above
> 
> Please put this function in drivers/pci/pci.c next to
> pci_find_ext_capability().  I don't think it's worth making a new file
> just for this.
> 
>> + */
>> +int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id)
>> +{
>> +	u16 dev_vendor, dev_id;
>> +	int pos;
>> +
>> +	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DVSEC);
>> +	if (!pos)
>> +		return -ENOTSUPP;
>> +
>> +	while (pos) {
>> +		pci_read_config_word(dev, pos + PCI_DVSEC_HEADER1, &dev_vendor);
>> +		pci_read_config_word(dev, pos + PCI_DVSEC_HEADER2, &dev_id);
>> +		if (dev_vendor == vendor && dev_id == id)
>> +			return pos;
>> +
>> +		pos = pci_find_next_ext_capability(dev, pos, PCI_EXT_CAP_ID_DVSEC);
>> +	}
>> +
>> +	return -ENOTSUPP;
>> +}
>> +EXPORT_SYMBOL_GPL(pci_find_dvsec);
>> diff --git a/drivers/pci/siov.c b/drivers/pci/siov.c
>> new file mode 100644
>> index 000000000000..6147e6ae5832
>> --- /dev/null
>> +++ b/drivers/pci/siov.c
>> @@ -0,0 +1,50 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Intel Scalable I/O Virtualization support
>> + * Copyright (C) 2020 Intel Corp.
>> + */
>> +
>> +#include <linux/export.h>
>> +#include <linux/pci.h>
>> +#include <linux/pci-siov.h>
>> +#include <uapi/linux/pci_regs.h>
>> +#include "pci.h"
>> +
>> +/*
>> + * A PCI express designated vendor specific extended capability is defined
>> + * in the section 3.7 of Intel scalable I/O virtualization technical spec
>> + * for system software and tools to detect endpoint devices supporting the
>> + * Intel scalable IO virtualization without host driver dependency.
>> + */
>> +
>> +/**
>> + * pci_siov_supported - check if the device can use SIOV
>> + * @dev: the PCI device
>> + *
>> + * Returns true if the device supports SIOV,  false otherwise.
>> + */
>> +bool pci_siov_supported(struct pci_dev *dev)
>> +{
>> +	return pci_find_dvsec(dev, PCI_VENDOR_ID_INTEL, PCI_DVSEC_ID_INTEL_SIOV) < 0 ? false : true;
>> +}
>> +EXPORT_SYMBOL_GPL(pci_siov_supported);
>> +
>> +/**
>> + * pci_ims_supported - check if the device can use IMS
>> + * @dev: the PCI device
>> + *
>> + * Returns true if the device supports IMS, false otherwise.
>> + */
>> +bool pci_ims_supported(struct pci_dev *dev)
>> +{
>> +	int pos;
>> +	u32 caps;
>> +
>> +	pos = pci_find_dvsec(dev, PCI_VENDOR_ID_INTEL, PCI_DVSEC_ID_INTEL_SIOV);
>> +	if (pos < 0)
>> +		return false;
>> +
>> +	pci_read_config_dword(dev, pos + PCI_DVSEC_INTEL_SIOV_CAP, &caps);
>> +	return (caps & PCI_DVSEC_INTEL_SIOV_CAP_IMS) ? true : false;
>> +}
>> +EXPORT_SYMBOL_GPL(pci_ims_supported);
> 
> I don't really see the point of these *_supported() functions.  If the
> caller wants to use them, I would expect it to call
> pci_find_dvsec(PCI_DVSEC_ID_INTEL_SIOV) itself anyway.
> 
> But there *are* no calls to pci_find_dvsec(PCI_DVSEC_ID_INTEL_SIOV).
> So apparently all you care about is whether the capability *exists*,
> and you don't need any information at all from the capability
> registers except PCI_DVSEC_INTEL_SIOV_CAP_IMS?  That seems a little
> weird.
> 
> I don't think it's worth adding a whole new file just for this.  The
> only value the PCI core is adding here is a way to locate the
> PCI_DVSEC_ID_INTEL_SIOV capability.
> 
>> diff --git a/include/linux/pci-siov.h b/include/linux/pci-siov.h
>> new file mode 100644
>> index 000000000000..a8a4eb5f4634
>> --- /dev/null
>> +++ b/include/linux/pci-siov.h
>> @@ -0,0 +1,18 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef LINUX_PCI_SIOV_H
>> +#define LINUX_PCI_SIOV_H
>> +
>> +#include <linux/pci.h>
>> +
>> +#ifdef CONFIG_PCI_SIOV
>> +/* Scalable I/O Virtualization */
>> +bool pci_siov_supported(struct pci_dev *dev);
>> +bool pci_ims_supported(struct pci_dev *dev);
>> +#else /* CONFIG_PCI_SIOV */
>> +static inline bool pci_siov_supported(struct pci_dev *d)
>> +{ return false; }
>> +static inline bool pci_ims_supported(struct pci_dev *d)
>> +{ return false; }
>> +#endif /* CONFIG_PCI_SIOV */
>> +
>> +#endif /* LINUX_PCI_SIOV_H */
> 
> What's the benefit to putting these declarations in a separate
> pci-siov.h as opposed to putting them in pci.h itself?  That's what we
> do for things like MSI, IOV, etc.
> 
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 22207a79762c..4710f09b43b1 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -1070,6 +1070,7 @@ int pci_find_next_ext_capability(struct pci_dev *dev, int pos, int cap);
>>   int pci_find_ht_capability(struct pci_dev *dev, int ht_cap);
>>   int pci_find_next_ht_capability(struct pci_dev *dev, int pos, int ht_cap);
>>   struct pci_bus *pci_find_next_bus(const struct pci_bus *from);
>> +int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id);
>>   
>>   u64 pci_get_dsn(struct pci_dev *dev);
>>   
>> @@ -1726,6 +1727,8 @@ static inline int pci_find_next_capability(struct pci_dev *dev, u8 post,
>>   { return 0; }
>>   static inline int pci_find_ext_capability(struct pci_dev *dev, int cap)
>>   { return 0; }
>> +static inline int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id)
>> +{ return 0; }
>>   
>>   static inline u64 pci_get_dsn(struct pci_dev *dev)
>>   { return 0; }
>> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
>> index 8f8bd2318c6c..3532528441ef 100644
>> --- a/include/uapi/linux/pci_regs.h
>> +++ b/include/uapi/linux/pci_regs.h
>> @@ -1071,6 +1071,10 @@
>>   #define PCI_DVSEC_HEADER1		0x4 /* Designated Vendor-Specific Header1 */
>>   #define PCI_DVSEC_HEADER2		0x8 /* Designated Vendor-Specific Header2 */
>>   
>> +#define PCI_DVSEC_ID_INTEL_SIOV		0x5
>> +#define PCI_DVSEC_INTEL_SIOV_CAP	0x14
>> +#define PCI_DVSEC_INTEL_SIOV_CAP_IMS	0x1
> 
> Convention in this file is to write constants in the register width,
> e.g.,
> 
>    #define PCI_DVSEC_ID_INTEL_SIOV		0x0005
>    #define PCI_DVSEC_INTEL_SIOV_CAP_IMS	0x00000001
> 
> You can learn this by looking at the surrounding definitions.
> 
>>   /* Data Link Feature */
>>   #define PCI_DLF_CAP		0x04	/* Capabilities Register */
>>   #define  PCI_DLF_EXCHANGE_ENABLE	0x80000000  /* Data Link Feature Exchange Enable */
>>
>>


* Re: [PATCH v4 13/17] dmaengine: idxd: ims setup for the vdcm
  2020-10-30 18:52 ` [PATCH v4 13/17] dmaengine: idxd: ims setup for the vdcm Dave Jiang
@ 2020-10-30 21:26   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2020-10-30 21:26 UTC (permalink / raw)
  To: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: Megha Dey, dmaengine, linux-kernel, linux-pci, kvm

On Fri, Oct 30 2020 at 11:52, Dave Jiang wrote:
> Add setup for IMS enabling for the mediated device.

....

> Register with the irq bypass manager in order to allow the IMS interrupt be
> injected into the guest and bypass the host.

Why is this part of the patch which adds IMS support? These are two
completely different things.

Again, Documentation/process/submitting-patches.rst is very clear about
this:
        Solve only one problem per patch.

You want me to review the IMS related things. Why are you mixing that
completely unrelated bypass stuff to it?

> +void vidxd_free_ims_entries(struct vdcm_idxd *vidxd)
> +{
> +	struct irq_domain *irq_domain;
> +	struct mdev_device *mdev = vidxd->vdev.mdev;
> +	struct device *dev = mdev_dev(mdev);
> +	int i;
> +
> +	for (i = 0; i < VIDXD_MAX_MSIX_VECS; i++)
> +		vidxd->irq_entries[i].entry = NULL;

See below.

> +	irq_domain = dev_get_msi_domain(dev);
> +	if (irq_domain)
> +		msi_domain_free_irqs(irq_domain, dev);
> +	else
> +		dev_warn(dev, "No IMS irq domain.\n");

How is the code even getting to this point if the domain allocation
failed in the first place?

> +int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd)
> +{
> +	struct irq_domain *irq_domain;
> +	struct idxd_device *idxd = vidxd->idxd;
> +	struct mdev_device *mdev = vidxd->vdev.mdev;
> +	struct device *dev = mdev_dev(mdev);
> +	int vecs = VIDXD_MAX_MSIX_VECS - 1;

Some sensible comment about the -1 is missing here.

> +	struct msi_desc *entry;
> +	struct ims_irq_entry *irq_entry;
> +	int rc, i = 0;
> +
> +	irq_domain = idxd->ims_domain;
> +	dev_set_msi_domain(dev, irq_domain);
> +	rc = msi_domain_alloc_irqs(irq_domain, dev, vecs);
> +	if (rc < 0)
> +		return rc;
> +
> +	for_each_msi_entry(entry, dev) {
> +		irq_entry = &vidxd->irq_entries[i];
> +		irq_entry->vidxd = vidxd;
> +		irq_entry->entry = entry;

What's the business with storing the MSI entry here? Just to do this:

       ims_idx = vidxd->irq_entries[vidx - 1].entry->device_msi.hwirq;

and this:

      if (vidxd->irq_entries[i].entry->device_msi.hwirq == handle) {

What's wrong with storing the hardware interrupt index right here
instead of handing that pointer around? The usage sites have no reason
to know about the entry itself.

> +		irq_entry->id = i;

Again, what is the point of storing the array offset in the array slot?
If it _is_ useful then adding a comment is not too much to ask for.

So the place I found which uses it cannot compute the index obviously,
but this:

        vidxd_send_interrupt(irq_entry->vidxd, irq_entry->id + 1);

is again just voodoo programming. Why can't you just provide a data set
which contains data ready for consumption at the usage site?
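
To illustrate the point, here is a minimal userspace sketch of such a data
set. It is not the idxd code: the field names ims_idx, vector and nr_entries,
both helper functions, and the value of VIDXD_MAX_MSIX_VECS are hypothetical,
and the hwirqs[] array merely stands in for walking the msi_desc list at
setup time. The assumption that guest vector N maps to array slot N - 1 is
taken from the "- 1" / "+ 1" arithmetic quoted above.

```c
#include <stddef.h>

/* Assumption: illustrative value only, not the real idxd constant. */
#define VIDXD_MAX_MSIX_VECS	4
/* Assumption: guest vector 0 has no IMS slot, hence one entry fewer. */
#define VIDXD_IMS_VECS		(VIDXD_MAX_MSIX_VECS - 1)

struct vdcm_idxd;

/* Hypothetical layout: only what the usage sites consume. */
struct ims_irq_entry {
	struct vdcm_idxd *vidxd;
	unsigned int ims_idx;	/* hardware IMS slot, instead of entry->device_msi.hwirq */
	unsigned int vector;	/* guest MSI-X vector to inject, instead of id + 1 */
};

struct vdcm_idxd {
	struct ims_irq_entry irq_entries[VIDXD_IMS_VECS];
	size_t nr_entries;
};

/* Fill the per-vector data once at setup time; hwirqs[] replaces the msi_desc walk. */
static void vidxd_fill_irq_entries(struct vdcm_idxd *vidxd,
				   const unsigned int *hwirqs, size_t nr)
{
	size_t i;

	for (i = 0; i < nr && i < VIDXD_IMS_VECS; i++) {
		vidxd->irq_entries[i].vidxd = vidxd;
		vidxd->irq_entries[i].ims_idx = hwirqs[i];
		vidxd->irq_entries[i].vector = (unsigned int)i + 1;
	}
	vidxd->nr_entries = i;
}

/* Map a hardware interrupt handle to the guest vector, or -1 if unknown. */
static int vidxd_vector_for_handle(const struct vdcm_idxd *vidxd,
				   unsigned int handle)
{
	size_t i;

	for (i = 0; i < vidxd->nr_entries; i++) {
		if (vidxd->irq_entries[i].ims_idx == handle)
			return (int)vidxd->irq_entries[i].vector;
	}
	return -1;
}
```

With a layout like this, neither the interrupt handler nor the handle lookup
needs to dereference a struct msi_desc or redo the index arithmetic.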

> diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
> index c7e47c26cd90..89cf60a30803 100644
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -536,6 +536,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
>  
>  	return ops->domain_alloc_irqs(domain, dev, nvec);
>  }
> +EXPORT_SYMBOL(msi_domain_alloc_irqs);

Sigh... This wants to be a preparatory patch and the export wants to be
EXPORT_SYMBOL_GPL.
  
Thanks,

        tglx


* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-10-30 21:20     ` Dave Jiang
@ 2020-10-30 21:50       ` Bjorn Helgaas
  2020-10-30 22:45       ` Jason Gunthorpe
  1 sibling, 0 replies; 123+ messages in thread
From: Bjorn Helgaas @ 2020-10-30 21:50 UTC (permalink / raw)
  To: Dave Jiang
  Cc: vkoul, megha.dey, maz, bhelgaas, tglx, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, dmaengine, linux-kernel,
	linux-pci, kvm

On Fri, Oct 30, 2020 at 02:20:03PM -0700, Dave Jiang wrote:
> 
> 
> On 10/30/2020 12:51 PM, Bjorn Helgaas wrote:
> > On Fri, Oct 30, 2020 at 11:51:32AM -0700, Dave Jiang wrote:
> > > Intel Scalable I/O Virtualization (SIOV) enables sharing of I/O devices
> > > across isolated domains through PASID based sub-device partitioning.
> > > Interrupt Message Storage (IMS) enables devices to store the interrupt
> > > messages in a device-specific optimized manner without the scalability
> > > restrictions of the PCIe defined MSI-X capability. IMS is one of the
> > > features supported under SIOV.
> > > 
> > > Move SIOV detection code from Intel iommu driver code to common PCI. Making
> > > the detection code common allows supported accelerator drivers to query the
> > > PCI core for SIOV and IMS capabilities. The support code will add the
> > > ability to query the PCI DVSEC capabilities for the SIOV cap.
> > 
> > This patch really does not include anything related to SIOV other than
> > adding a little code to *find* the capability.  It doesn't add
> > anything that actually *uses* it.  I think this patch should simply
> > add pci_find_dvsec(), and it doesn't need any of this SIOV or IMS
> > description.
> 
> Thanks for the review Bjorn! I'll carve out a patch with just find_dvsec()
> and apply your comments and recommendations.
> 
> So the intel-iommu driver checks for the SIOV cap. And the idxd driver
> checks for SIOV and IMS cap. There will be other upcoming drivers that will
> check for such cap too. It is Intel vendor specific right now, but SIOV is
> public and other vendors may implement to the spec. Is there a good place to
> put the common capability check for that?

Let's wait and see what that code looks like and figure it out then.
We can always move it to the PCI core if it turns out to be generic.

Right now the code only finds a capability and checks a bit in it.
None of that is anything the PCI core is interested in.
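
As a rough illustration of how little is involved, the following userspace
sketch walks the extended capability list of a config space image and tests
the IMS bit. It is not kernel code: config space is modelled as a plain byte
array, and the helpers cfg_read16(), cfg_read32(), cfg_write32(),
find_dvsec() and ims_supported() are hypothetical names. Unlike the posted
pci_find_dvsec(), this find_dvsec() returns 0 when nothing matches. The SIOV
DVSEC ID (0x5), capability offset (0x14) and IMS bit (0x1) are the values
from the patch above; the header layout (capability ID in bits 15:0, next
pointer in bits 31:20, list starting at 0x100) follows the PCIe extended
capability format.

```c
#include <stdbool.h>
#include <stdint.h>

#define CFG_SPACE_SIZE			4096
#define EXT_CAP_START			0x100
#define EXT_CAP_ID_DVSEC		0x23
#define DVSEC_HEADER1			0x4	/* vendor ID in bits 15:0 */
#define DVSEC_HEADER2			0x8	/* DVSEC ID in bits 15:0 */
#define VENDOR_ID_INTEL			0x8086
#define DVSEC_ID_INTEL_SIOV		0x0005
#define DVSEC_INTEL_SIOV_CAP		0x14
#define DVSEC_INTEL_SIOV_CAP_IMS	0x00000001

/* Little-endian accessors over a config space image. */
static uint16_t cfg_read16(const uint8_t *cfg, int pos)
{
	return (uint16_t)(cfg[pos] | (cfg[pos + 1] << 8));
}

static uint32_t cfg_read32(const uint8_t *cfg, int pos)
{
	return (uint32_t)cfg[pos] | ((uint32_t)cfg[pos + 1] << 8) |
	       ((uint32_t)cfg[pos + 2] << 16) | ((uint32_t)cfg[pos + 3] << 24);
}

/* Only needed to build an image to experiment with. */
static void cfg_write32(uint8_t *cfg, int pos, uint32_t val)
{
	cfg[pos] = val & 0xff;
	cfg[pos + 1] = (val >> 8) & 0xff;
	cfg[pos + 2] = (val >> 16) & 0xff;
	cfg[pos + 3] = (val >> 24) & 0xff;
}

/* Return the offset of the matching DVSEC, or 0 if there is none. */
static int find_dvsec(const uint8_t *cfg, uint16_t vendor, uint16_t id)
{
	int pos = EXT_CAP_START;
	int ttl = (CFG_SPACE_SIZE - EXT_CAP_START) / 8;	/* guards against loops */

	while (pos >= EXT_CAP_START && pos <= CFG_SPACE_SIZE - 12 && ttl-- > 0) {
		uint32_t hdr = cfg_read32(cfg, pos);

		if (hdr == 0)	/* no (further) extended capabilities */
			break;
		if ((hdr & 0xffff) == EXT_CAP_ID_DVSEC &&
		    cfg_read16(cfg, pos + DVSEC_HEADER1) == vendor &&
		    cfg_read16(cfg, pos + DVSEC_HEADER2) == id)
			return pos;
		pos = (hdr >> 20) & 0xffc;	/* next pointer, dword aligned */
	}
	return 0;
}

/* True if the SIOV DVSEC exists and advertises IMS. */
static bool ims_supported(const uint8_t *cfg)
{
	int pos = find_dvsec(cfg, VENDOR_ID_INTEL, DVSEC_ID_INTEL_SIOV);

	if (!pos || pos > CFG_SPACE_SIZE - DVSEC_INTEL_SIOV_CAP - 4)
		return false;
	return (cfg_read32(cfg, pos + DVSEC_INTEL_SIOV_CAP) &
		DVSEC_INTEL_SIOV_CAP_IMS) != 0;
}
```

In a driver the array reads would be pci_read_config_word() and
pci_read_config_dword() calls, and the walk itself would be the generic
helper; the SIOV-specific part is just the final bit test.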

> There are some other fields in the SIOV dvsec cap, but presently they are
> not being utilized. The idxd driver is only interested in making sure that
> SIOV and IMS (sub feature) support are present at this point.

I'm a little dubious about code that checks whether support is present
but doesn't actually *do* anything with that support, but as long as
it's outside the PCI core, that's up to you :)

Bjorn


* Re: [PATCH v4 01/17] irqchip: Add IMS (Interrupt Message Store) driver
  2020-10-30 18:50 ` [PATCH v4 01/17] irqchip: Add IMS (Interrupt Message Store) driver Dave Jiang
@ 2020-10-30 22:01   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2020-10-30 22:01 UTC (permalink / raw)
  To: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, jgg, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: dmaengine, linux-kernel, linux-pci, kvm

On Fri, Oct 30 2020 at 11:50, Dave Jiang wrote:
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -487,6 +487,8 @@ extern int irq_get_irqchip_state(unsigned int irq, enum irqchip_irq_state which,
>  extern int irq_set_irqchip_state(unsigned int irq, enum irqchip_irq_state which,
>  				 bool state);
>  
> +int irq_set_auxdata(unsigned int irq, unsigned int which, u64 val);
....
> +EXPORT_SYMBOL_GPL(irq_set_auxdata);

Again: Read and follow documentation. This does not belong in this
driver patch and wants to be a standalone preparatory patch.

Also the core change, the irq chip, the iommu support and the device msi
dependency have to be completely separate from this idxd series.

You cannot just dump a pile of patches touching several subsystems at
once plus having dependencies on stuff which is not even agreed on and
merged and then expect that everything just falls into place.

The various subsystems involved are not holding their breath and putting
a lock on development just because you have a series against some random
snapshot.

The dependencies, e.g. the device msi infrastructure, are not going to
make their way magically into the proper maintainer tree either.

If this ever goes into a mergeable state, then the merge logistics for
this whole thing need to be carefully sorted out and it's on you to make
that as simple as possible for every maintainer involved.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-10-30 20:59   ` Dave Jiang
@ 2020-10-30 22:10     ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2020-10-30 22:10 UTC (permalink / raw)
  To: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, alex.williamson,
	jacob.jun.pan, ashok.raj, jgg, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain
  Cc: Megha Dey, dmaengine, linux-kernel, linux-pci, kvm

On Fri, Oct 30 2020 at 13:59, Dave Jiang wrote:
> On 10/30/2020 1:48 PM, Thomas Gleixner wrote:
>> On Fri, Oct 30 2020 at 11:50, Dave Jiang wrote:
>>> The code has dependency on Thomas’s MSI restructuring patch series:
>>> https://lore.kernel.org/lkml/20200826111628.794979401@linutronix.de/
>> 
>> which is outdated and no longer applicable.
>
> Yes.... I wasn't sure how to point to these patches from you as a dependency.
>
> irqdomain/msi: Provide msi_alloc/free_store() callbacks
> platform-msi: Add device MSI infrastructure
> genirq/msi: Provide and use msi_domain_set_default_info_flags()
> genirq/proc: Take buslock on affinity write
> platform-msi: Provide default irq_chip:: Ack
> x86/msi: Rename and rework pci_msi_prepare() to cover non-PCI MSI
> x86/irq: Add DEV_MSI allocation type

How can you point at something which is no longer applicable?

> Do I need to include these patches in my series? Thanks!

No. They are NOT part of this series. Prerequisites are separate
entities and your series can be based on them.

So for one you want to make sure that the prerequisites for your IDXD
stuff are going to be merged into the relevant maintainer trees.

To allow people to work with your stuff you simply provide an
aggregation git tree which contains all the collected prerequisites.
This aggregation tree needs to be rebased when the prerequisites change
during review or are merged into a maintainer tree/branch.

It's not rocket science and a lot of people do exactly this all the time
in order to coordinate changes which have dependencies over multiple
subsystems.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-10-30 21:20     ` Dave Jiang
  2020-10-30 21:50       ` Bjorn Helgaas
@ 2020-10-30 22:45       ` Jason Gunthorpe
  2020-10-30 22:49         ` Dave Jiang
  1 sibling, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-10-30 22:45 UTC (permalink / raw)
  To: Dave Jiang
  Cc: Bjorn Helgaas, vkoul, megha.dey, maz, bhelgaas, tglx,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, samuel.ortiz, mona.hossain, dmaengine,
	linux-kernel, linux-pci, kvm

On Fri, Oct 30, 2020 at 02:20:03PM -0700, Dave Jiang wrote:
> So the intel-iommu driver checks for the SIOV cap. And the idxd driver
> checks for SIOV and IMS cap. There will be other upcoming drivers that will
> check for such cap too. It is Intel vendor specific right now, but SIOV is
> public and other vendors may implement to the spec. Is there a good place to
> put the common capability check for that?

I'm still really unhappy with these SIOV caps. It was explained this
is just a hack to make up for pci_ims_array_create_msi_irq_domain()
succeeding in VM cases when it doesn't actually work.

Someday this is likely to get fixed, so tying platform behavior to PCI
caps is completely wrong.

This needs to be solved in the platform code,
pci_ims_array_create_msi_irq_domain() should not succeed in these
cases.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-10-30 22:45       ` Jason Gunthorpe
@ 2020-10-30 22:49         ` Dave Jiang
  2020-11-02 13:21           ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Dave Jiang @ 2020-10-30 22:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, vkoul, megha.dey, maz, bhelgaas, tglx,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, samuel.ortiz, mona.hossain, dmaengine,
	linux-kernel, linux-pci, kvm



On 10/30/2020 3:45 PM, Jason Gunthorpe wrote:
> On Fri, Oct 30, 2020 at 02:20:03PM -0700, Dave Jiang wrote:
>> So the intel-iommu driver checks for the SIOV cap. And the idxd driver
>> checks for SIOV and IMS cap. There will be other upcoming drivers that will
>> check for such cap too. It is Intel vendor specific right now, but SIOV is
>> public and other vendors may implement to the spec. Is there a good place to
>> put the common capability check for that?
> 
> I'm still really unhappy with these SIOV caps. It was explained this
> is just a hack to make up for pci_ims_array_create_msi_irq_domain()
> succeeding in VM cases when it doesn't actually work.
> 
> Someday this is likely to get fixed, so tying platform behavior to PCI
> caps is completely wrong.
> 
> This needs to be solved in the platform code,
> pci_ims_array_create_msi_irq_domain() should not succeed in these
> cases.

That sounds reasonable. Are you asking that the IMS cap check should gate the 
success/failure of pci_ims_array_create_msi_irq_domain() rather than the driver?

> 
> Jason
> 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-10-30 20:43           ` Raj, Ashok
@ 2020-10-30 22:54             ` Jason Gunthorpe
  2020-10-31  2:50             ` Thomas Gleixner
  1 sibling, 0 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2020-10-30 22:54 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, tglx,
	alex.williamson, jacob.jun.pan, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, Megha Dey, dmaengine,
	linux-kernel, linux-pci, kvm

On Fri, Oct 30, 2020 at 01:43:07PM -0700, Raj, Ashok wrote:
 
> So drawing that parallel, do you expect all drivers that call
> pci_register_driver() to be located in drivers/pci? Aren't they scattered
> all over the place ata,scsi, platform drivers and such?

The subsystem is the thing that calls
device_register. pci_register_driver() doesn't do that.

> As Alex pointed out, i915 and handful of s390 drivers that are mdev users
> are not in drivers/vfio. Are you saying those drivers don't get reviewed?

Past mistakes do not justify continuing to do it wrong.

ARM and PPC went through a huge multi year cleanup moving code out of
arch and into the proper drivers/ directories. We know this is the
correct way to work the development process.

> Your argument seems interesting even entertaining :-). But honestly i'm not finding it
> practical :-). So every caller of mmu_register_notifier() needs to be in
> mm? 

mmu notifiers are not a subsystem, they are core library code.

You seem to completely not understand what a subsystem is. :(

> I know you aren't going to give up, but there is little we can do. I want
> the maintainers to make that call and I'm not add more noise to this.

Well, hopefully Vinod will insist on following kernel norms here.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-10-30 20:43           ` Raj, Ashok
  2020-10-30 22:54             ` Jason Gunthorpe
@ 2020-10-31  2:50             ` Thomas Gleixner
  2020-10-31 23:53               ` Raj, Ashok
  1 sibling, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2020-10-31  2:50 UTC (permalink / raw)
  To: Raj, Ashok, Jason Gunthorpe
  Cc: Dave Jiang, vkoul, megha.dey, maz, bhelgaas, alex.williamson,
	jacob.jun.pan, yi.l.liu, baolu.lu, kevin.tian, sanjay.k.kumar,
	tony.luck, jing.lin, dan.j.williams, kwankhede, eric.auger,
	parav, rafael, netanelg, shahafs, yan.y.zhao, pbonzini,
	samuel.ortiz, mona.hossain, Megha Dey, dmaengine, linux-kernel,
	linux-pci, kvm, Ashok Raj

Ashok,

On Fri, Oct 30 2020 at 13:43, Ashok Raj wrote:
> On Fri, Oct 30, 2020 at 04:30:45PM -0300, Jason Gunthorpe wrote:
>> On Fri, Oct 30, 2020 at 12:23:25PM -0700, Raj, Ashok wrote:
>> It is a different subsystem, different maintainer, and different
>> reviewers.
>> 
>> It is a development process problem, it doesn't matter what it is
>> doing.

< skip a lot of non-sensical arguments>

> I know you aren't going to give up, but there is little we can do. I want
> the maintainers to make that call and I'm not add more noise to this.

Jason is absolutely right.

Just because there is historical precedent which does not care about
the differentiation of subsystems is not an argument at all to make the
same mistakes which have been made years ago.

IDXD is just infrastructure which provides the base for a variety of
different functionalities. Very similar to what multi function devices
provide. In fact IDXD is pretty much a MFD facility.

Sticking all of it into dmaengine is sloppy at best. The dma engine
related part of IDXD is only a part of the overall functionality.

I'm well aware that it is convenient to just throw everything into
drivers/myturf/ but that makes it neither reviewable nor
maintainable.

What's the problem with restructuring your code in a way which makes it
fit into existing subsystems?

The whole thing - as I pointed out to Dave earlier - is based on 'works
for me' wishful thinking with a blissful ignorance of the development
process and the requirement to split a large problem into the proper
bits and pieces aka. engineering 101.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-10-31  2:50             ` Thomas Gleixner
@ 2020-10-31 23:53               ` Raj, Ashok
  2020-11-02 13:20                 ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Raj, Ashok @ 2020-10-31 23:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jason Gunthorpe, Dave Jiang, vkoul, megha.dey, maz, bhelgaas,
	alex.williamson, jacob.jun.pan, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, Megha Dey, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

Hi Thomas,

On Sat, Oct 31, 2020 at 03:50:43AM +0100, Thomas Gleixner wrote:
> Ashok,
> 
> < skip a lot of non-sensical arguments>

Ouch!.. Didn't mean to awaken you like this :-).. apologies.. profusely! 

> 
> Just because there is historical precedent which does not care about
> the differentiation of subsystems is not an argument at all to make the
> same mistakes which have been made years ago.
> 
> IDXD is just infrastructure which provides the base for a variety of
> different functionalities. Very similar to what multi function devices
> provide. In fact IDXD is pretty much a MFD facility.

I'm only asking this to better understand the thought process.
I don't intend to be defensive; my hands are tied, so we will do
whatever you recommend as the best fit.

It is not my intent to dig a deeper hole than I have already dug! :-(

IDXD is just a glorified DMA engine, a data mover. It also does a few other
things, and in that sense it's a multi-function facility. But it doesn't
provide different functional pieces the way a PCIe multi-function device
does, i.e. it doesn't also do storage or networking.

> 
> Sticking all of it into dmaengine is sloppy at best. The dma engine
> related part of IDXD is only a part of the overall functionality.

dmaengine is the basic non-transformational data mover. Doing other operations
or transformations is just the glorified data-mover part, but fundamentally
not different.

> 
> I'm well aware that it is convenient to just throw everything into
> drivers/myturf/ but that makes it neither reviewable nor
> maintainable.

That's true when we add a lot of functionality in one place. IDXD doing
mdev support is not offering new functionality. In SRIOV PF drivers, PF/VF
mailbox support is part of the PF driver today. IDXD mdev is playing precisely
that role.

If we are doing this just to improve review effectiveness, we would now need
some parent driver with these sub-drivers registering against it. That seemed
like a bit of over-engineering when the sub-drivers are really an extension of
the base driver and offer nothing more than sub-device partitions of IDXD for
guest drivers. These look and feel like IDXD, not another device interface.
In that sense, moving PF/VF mailboxes into separate drivers felt a bit odd
to me.

Please don't take it the wrong way. 

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-10-31 23:53               ` Raj, Ashok
@ 2020-11-02 13:20                 ` Jason Gunthorpe
  2020-11-02 16:20                   ` Raj, Ashok
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-02 13:20 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Thomas Gleixner, Dave Jiang, vkoul, megha.dey, maz, bhelgaas,
	alex.williamson, jacob.jun.pan, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, Megha Dey, dmaengine,
	linux-kernel, linux-pci, kvm

On Sat, Oct 31, 2020 at 04:53:59PM -0700, Raj, Ashok wrote:

> If we are doing this just to improve review effectiveness, Now we would need
> some parent driver, and these sub-drivers registering seemed like a bit of
> over-engineering when these sub-drivers actually are an extension of the
> base driver and offer nothing more than extending sub-device partitions
> of IDXD for guest drivers. These look and feel like IDXD, not another device 
> interface. In that sense if we move PF/VF mailboxes as
> separate drivers i thought it feels a bit odd.

You need this split anyhow, putting VFIO calls into the main idxd
module is not OK.

Plugging in a PCI device should not auto-load VFIO modules.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-10-30 22:49         ` Dave Jiang
@ 2020-11-02 13:21           ` Jason Gunthorpe
  2020-11-03  2:49             ` Tian, Kevin
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-02 13:21 UTC (permalink / raw)
  To: Dave Jiang
  Cc: Bjorn Helgaas, vkoul, megha.dey, maz, bhelgaas, tglx,
	alex.williamson, jacob.jun.pan, ashok.raj, yi.l.liu, baolu.lu,
	kevin.tian, sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, samuel.ortiz, mona.hossain, dmaengine,
	linux-kernel, linux-pci, kvm

On Fri, Oct 30, 2020 at 03:49:22PM -0700, Dave Jiang wrote:
> 
> 
> On 10/30/2020 3:45 PM, Jason Gunthorpe wrote:
> > On Fri, Oct 30, 2020 at 02:20:03PM -0700, Dave Jiang wrote:
> > > So the intel-iommu driver checks for the SIOV cap. And the idxd driver
> > > checks for SIOV and IMS cap. There will be other upcoming drivers that will
> > > check for such cap too. It is Intel vendor specific right now, but SIOV is
> > > public and other vendors may implement to the spec. Is there a good place to
> > > put the common capability check for that?
> > 
> > I'm still really unhappy with these SIOV caps. It was explained this
> > is just a hack to make up for pci_ims_array_create_msi_irq_domain()
> > succeeding in VM cases when it doesn't actually work.
> > 
> > Someday this is likely to get fixed, so tying platform behavior to PCI
> > caps is completely wrong.
> > 
> > This needs to be solved in the platform code,
> > pci_ims_array_create_msi_irq_domain() should not succeed in these
> > cases.
> 
> That sounds reasonable. Are you asking that the IMS cap check should gate
> the success/failure of pci_ims_array_create_msi_irq_domain() rather than the
> driver?

There shouldn't be an IMS cap at all

As I understand, the problem here is the only way to establish new
VT-d IRQ routing is by trapping and emulating MSI/MSI-X related
activities and triggering routing of the vectors into the guest.

There is a missing hypercall to allow the guest to do this on its own,
presumably it will someday be fixed so IMS can work in guests.

Until the hypercall is added pci_ims_array_create_msi_irq_domain()
should simply fail in guests. No PCI cap check required.
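
A minimal userspace sketch of that argument, assuming a hypothetical
`struct platform_env` and `model_ims_create_domain()` (neither is a kernel
API, and the real pci_ims_array_create_msi_irq_domain() signature is not
shown in this thread). The point it illustrates: the decision lives in
platform code and never looks at a PCI capability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of platform state; these are not kernel structures. */
struct platform_env {
	bool running_in_guest;     /* true when executing inside a VM */
	bool has_irq_route_hcall;  /* true once the hypervisor offers the
				    * currently missing routing hypercall */
};

/* Hypothetical stand-in for an IMS irq domain. */
struct ims_domain {
	unsigned int nr_slots;
};

/*
 * Model of the platform-side check: refuse to create an IMS irq domain in
 * a guest that cannot establish interrupt routing on its own. The device's
 * PCI capabilities are not consulted at all.
 */
static struct ims_domain *model_ims_create_domain(const struct platform_env *env,
						  struct ims_domain *storage,
						  unsigned int nr_slots)
{
	assert(env != NULL && storage != NULL);

	if (env->running_in_guest && !env->has_irq_route_hcall)
		return NULL;	/* creation fails; the caller must cope */

	storage->nr_slots = nr_slots;
	return storage;
}
```

With this shape a driver only checks the return value of domain creation,
and IMS starts working in guests as soon as the platform side learns to
route the vectors, without any driver or capability change.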

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-11-02 13:20                 ` Jason Gunthorpe
@ 2020-11-02 16:20                   ` Raj, Ashok
  2020-11-02 17:19                     ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Raj, Ashok @ 2020-11-02 16:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Gleixner, Dave Jiang, vkoul, megha.dey, maz, bhelgaas,
	alex.williamson, jacob.jun.pan, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, Megha Dey, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

Hi Jason

On Mon, Nov 02, 2020 at 09:20:36AM -0400, Jason Gunthorpe wrote:

> > of IDXD for guest drivers. These look and feel like IDXD, not another device 
> > interface. In that sense if we move PF/VF mailboxes as
> > separate drivers i thought it feels a bit odd.
> 
> You need this split anyhow, putting VFIO calls into the main idxd
> module is not OK.
> 
> Plugging in a PCI device should not auto-load VFIO modules.

Yes, I agree that would be a good reason to separate them completely and
glue functionality with private APIs between the 2 modules.

- Separate mdev code from base idxd.
- Separate maintainers, so it's easy to review and include. (But remember
  they are heavily inter-dependent. They have to move together)

Almost all SRIOV drivers today are just configured with some form of Kconfig
and those relevant files are compiled into the same module.

I think in *most* applications idxd would be operating in that mode, where
you have the base driver and mdev parts (like VF) compiled in if configured
such.

Creating these private interfaces for intra-module are just 1-1 and not
general purpose and every accelerator needs to create these instances.

I wasn't sure forcibly creating this firewall between the PF/VF interfaces
is actually worth the work every driver is going to require. I can see
where this is required when they offer separate functional interfaces
when we talk about multi-function in a more confined definition today.

idxd mdevs are purely a VF extension. They don't provide any different
function, unlike e.g. an RDMA device that can provide iWarp, ipoib or
even multiplex storage over IB. IDXD is a fixed-function interface.

Sure having separate modules helps with that isolation. But I'm not
convinced if this simplifies, or complicates things more than what is
required for these device types.

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-11-02 16:20                   ` Raj, Ashok
@ 2020-11-02 17:19                     ` Jason Gunthorpe
  2020-11-02 18:18                       ` Dave Jiang
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-02 17:19 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Thomas Gleixner, Dave Jiang, vkoul, megha.dey, maz, bhelgaas,
	alex.williamson, jacob.jun.pan, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, Megha Dey, dmaengine,
	linux-kernel, linux-pci, kvm

On Mon, Nov 02, 2020 at 08:20:43AM -0800, Raj, Ashok wrote:
> Creating these private interfaces for intra-module are just 1-1 and not
> general purpose and every accelerator needs to create these instances.

This is where we are going; auxiliary bus should be merged soon, which
is specifically meant to connect these kinds of devices across subsystems.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-11-02 17:19                     ` Jason Gunthorpe
@ 2020-11-02 18:18                       ` Dave Jiang
  2020-11-02 18:26                         ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Dave Jiang @ 2020-11-02 18:18 UTC (permalink / raw)
  To: Jason Gunthorpe, Raj, Ashok
  Cc: Thomas Gleixner, vkoul, megha.dey, maz, bhelgaas,
	alex.williamson, jacob.jun.pan, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, Megha Dey, dmaengine,
	linux-kernel, linux-pci, kvm



On 11/2/2020 10:19 AM, Jason Gunthorpe wrote:
> On Mon, Nov 02, 2020 at 08:20:43AM -0800, Raj, Ashok wrote:
>> Creating these private interfaces for intra-module are just 1-1 and not
>> general purpose and every accelerator needs to create these instances.
> 
> This is where we are going, auxiliary bus should be merged soon which
> is specifically to connect these kinds of devices across subsystems

I think this resolves the aux device probe/remove issue via a common bus. But it
does not help with the mdev device needing a lot of the device handling calls
from the parent driver, as it shares the same handling as the parent device. My
plan is to export all the needed calls via EXPORT_SYMBOL_NS() so they can be
shared between the modules in their own namespace. Do you have any objection to
that?

> 
> Jason
> 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-11-02 18:18                       ` Dave Jiang
@ 2020-11-02 18:26                         ` Jason Gunthorpe
  2020-11-02 18:38                           ` Dan Williams
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-02 18:26 UTC (permalink / raw)
  To: Dave Jiang
  Cc: Raj, Ashok, Thomas Gleixner, vkoul, megha.dey, maz, bhelgaas,
	alex.williamson, jacob.jun.pan, yi.l.liu, baolu.lu, kevin.tian,
	sanjay.k.kumar, tony.luck, jing.lin, dan.j.williams, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, samuel.ortiz, mona.hossain, Megha Dey, dmaengine,
	linux-kernel, linux-pci, kvm

On Mon, Nov 02, 2020 at 11:18:33AM -0700, Dave Jiang wrote:
> 
> 
> On 11/2/2020 10:19 AM, Jason Gunthorpe wrote:
> > On Mon, Nov 02, 2020 at 08:20:43AM -0800, Raj, Ashok wrote:
> > > Creating these private interfaces for intra-module are just 1-1 and not
> > > general purpose and every accelerator needs to create these instances.
> > 
> > This is where we are going, auxiliary bus should be merged soon which
> > is specifically to connect these kinds of devices across subsystems
> 
> I think this resolves the aux device probe/remove issue via a common bus.
> But it does not help with the mdev device needing a lot of the device
> handling calls from the parent driver as it share the same handling as the
> parent device.

The intention of auxiliary bus is that the two parts will tightly
couple across some exported function interface.

> My plan is to export all the needed call via EXPORT_SYMBOL_NS() so
> the calls can be shared in its own namespace between the modules. Do
> you have any objection with that?

I think you will be the first to use the namespace stuff for this, it
seems like a good idea and others should probably do so as well.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-11-02 18:26                         ` Jason Gunthorpe
@ 2020-11-02 18:38                           ` Dan Williams
  2020-11-02 18:51                             ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Dan Williams @ 2020-11-02 18:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Jiang, Raj, Ashok, Thomas Gleixner, Vinod Koul, Dey, Megha,
	maz, Bjorn Helgaas, Alex Williamson, Jacob jun Pan, Yi L Liu,
	Baolu Lu, Tian, Kevin, Sanjay K Kumar, Luck, Tony, Jing Lin,
	kwankhede, eric.auger, Parav Pandit, Rafael J. Wysocki, netanelg,
	shahafs, yan.y.zhao, Paolo Bonzini, Samuel Ortiz, Mona Hossain,
	Megha Dey, dmaengine, Linux Kernel Mailing List, Linux PCI,
	KVM list

On Mon, Nov 2, 2020 at 10:26 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Nov 02, 2020 at 11:18:33AM -0700, Dave Jiang wrote:
> >
> >
> > On 11/2/2020 10:19 AM, Jason Gunthorpe wrote:
> > > On Mon, Nov 02, 2020 at 08:20:43AM -0800, Raj, Ashok wrote:
> > > > Creating these private interfaces for intra-module are just 1-1 and not
> > > > general purpose and every accelerator needs to create these instances.
> > >
> > > This is where we are going, auxiliary bus should be merged soon which
> > > is specifically to connect these kinds of devices across subsystems
> >
> > I think this resolves the aux device probe/remove issue via a common bus.
> > But it does not help with the mdev device needing a lot of the device
> > handling calls from the parent driver as it share the same handling as the
> > parent device.
>
> The intention of auxiliary bus is that the two parts will tightly
> couple across some exported function interface.
>
> > My plan is to export all the needed call via EXPORT_SYMBOL_NS() so
> > the calls can be shared in its own namespace between the modules. Do
> > you have any objection with that?
>
> I think you will be the first to use the namespace stuff for this, it
> seems like a good idea and others should probably do so as well.

I was thinking either EXPORT_SYMBOL_NS, or auxiliary bus, because you
should be able to export an ops structure with all the necessary
callbacks. Aux bus seems cleaner because the lifetime rules and
ownership concerns are clearer.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-11-02 18:38                           ` Dan Williams
@ 2020-11-02 18:51                             ` Jason Gunthorpe
  2020-11-02 19:26                               ` Dan Williams
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-02 18:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Jiang, Raj, Ashok, Thomas Gleixner, Vinod Koul, Dey, Megha,
	maz, Bjorn Helgaas, Alex Williamson, Jacob jun Pan, Yi L Liu,
	Baolu Lu, Tian, Kevin, Sanjay K Kumar, Luck, Tony, Jing Lin,
	kwankhede, eric.auger, Parav Pandit, Rafael J. Wysocki, netanelg,
	shahafs, yan.y.zhao, Paolo Bonzini, Samuel Ortiz, Mona Hossain,
	Megha Dey, dmaengine, Linux Kernel Mailing List, Linux PCI,
	KVM list

On Mon, Nov 02, 2020 at 10:38:28AM -0800, Dan Williams wrote:

> > I think you will be the first to use the namespace stuff for this, it
> > seems like a good idea and others should probably do so as well.
> 
> I was thinking either EXPORT_SYMBOL_NS, or auxiliary bus, because you
> should be able to export an ops structure with all the necessary
> callbacks. 

'or'? 

Auxiliary bus should not be used with huge arrays of function
pointers... The module providing the device should export a normal
linkable function interface. Putting that in a namespace makes a lot
of sense.
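
A rough sketch of what such a namespaced function interface could look
like. Everything named demo_* and the DEMO_IDXD namespace are hypothetical,
not taken from the idxd driver. The two macros are stubbed so the sketch
compiles in userspace; in a real module they come from the kernel headers,
and the namespace is written here as a bare identifier, which is an
assumption about the kernel version in use (newer kernels expect a string).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace stand-ins so the sketch builds outside the kernel tree.
 * They expand to harmless forward declarations.
 */
#ifndef __KERNEL__
#define EXPORT_SYMBOL_NS_GPL(sym, ns)	struct export_stub_##sym
#define MODULE_IMPORT_NS(ns)		struct import_stub_##ns
#endif

/* ---- provider side: the base driver module (hypothetical) ---- */
struct demo_wq {
	int id;
	int enabled;
};

/* A normal linkable function, exported into its own namespace. */
int demo_wq_enable(struct demo_wq *wq)
{
	assert(wq != NULL);

	if (wq->enabled)
		return -1;	/* already enabled */
	wq->enabled = 1;
	return 0;
}
EXPORT_SYMBOL_NS_GPL(demo_wq_enable, DEMO_IDXD);

/* ---- consumer side: the mdev module (hypothetical) ---- */
MODULE_IMPORT_NS(DEMO_IDXD);

int demo_mdev_open(struct demo_wq *wq)
{
	/* Direct call; no ops table of function pointers in between. */
	return demo_wq_enable(wq);
}
```

The consumer has to import the namespace explicitly, so the coupling
between the two modules stays visible, while the calls themselves remain
ordinary function calls resolved at module load time.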

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver
  2020-11-02 18:51                             ` Jason Gunthorpe
@ 2020-11-02 19:26                               ` Dan Williams
  0 siblings, 0 replies; 123+ messages in thread
From: Dan Williams @ 2020-11-02 19:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Jiang, Raj, Ashok, Thomas Gleixner, Vinod Koul, Dey, Megha,
	maz, Bjorn Helgaas, Alex Williamson, Jacob jun Pan, Yi L Liu,
	Baolu Lu, Tian, Kevin, Sanjay K Kumar, Luck, Tony, Jing Lin,
	kwankhede, eric.auger, Parav Pandit, Rafael J. Wysocki, netanelg,
	shahafs, yan.y.zhao, Paolo Bonzini, Samuel Ortiz, Mona Hossain,
	Megha Dey, dmaengine, Linux Kernel Mailing List, Linux PCI,
	KVM list

On Mon, Nov 2, 2020 at 10:52 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Nov 02, 2020 at 10:38:28AM -0800, Dan Williams wrote:
>
> > > I think you will be the first to use the namespace stuff for this, it
> > > seems like a good idea and others should probably do so as well.
> >
> > I was thinking either EXPORT_SYMBOL_NS, or auxiliary bus, because you
> > should be able to export an ops structure with all the necessary
> > callbacks.
>
> 'or'?
>
> Auxiliary bus should not be used with huge arrays of function
> pointers... The module providing the device should export a normal
> linkable function interface. Putting that in a namespace makes a lot
> of sense.

True, probably needs to be a mixture of both.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-02 13:21           ` Jason Gunthorpe
@ 2020-11-03  2:49             ` Tian, Kevin
  2020-11-03 12:43               ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Tian, Kevin @ 2020-11-03  2:49 UTC (permalink / raw)
  To: Jason Gunthorpe, Jiang, Dave
  Cc: Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, tglx,
	alex.williamson, Pan, Jacob jun, Raj, Ashok, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, Williams, Dan J,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, November 2, 2020 9:22 PM
> 
> On Fri, Oct 30, 2020 at 03:49:22PM -0700, Dave Jiang wrote:
> >
> >
> > On 10/30/2020 3:45 PM, Jason Gunthorpe wrote:
> > > On Fri, Oct 30, 2020 at 02:20:03PM -0700, Dave Jiang wrote:
> > > > So the intel-iommu driver checks for the SIOV cap. And the idxd driver
> > > > checks for SIOV and IMS cap. There will be other upcoming drivers that
> will
> > > > check for such cap too. It is Intel vendor specific right now, but SIOV is
> > > > public and other vendors may implement to the spec. Is there a good
> place to
> > > > put the common capability check for that?
> > >
> > > I'm still really unhappy with these SIOV caps. It was explained this
> > > is just a hack to make up for pci_ims_array_create_msi_irq_domain()
> > > succeeding in VM cases when it doesn't actually work.
> > >
> > > Someday this is likely to get fixed, so tying platform behavior to PCI
> > > caps is completely wrong.
> > >
> > > This needs to be solved in the platform code,
> > > pci_ims_array_create_msi_irq_domain() should not succeed in these
> > > cases.
> >
> > That sounds reasonable. Are you asking that the IMS cap check should gate
> > > the success/failure of pci_ims_array_create_msi_irq_domain() rather than
> > > the driver?
> 
> There shouldn't be an IMS cap at all
> 
> As I understand, the problem here is the only way to establish new
> VT-d IRQ routing is by trapping and emulating MSI/MSI-X related
> activities and triggering routing of the vectors into the guest.
> 
> There is a missing hypercall to allow the guest to do this on its own,
> presumably it will someday be fixed so IMS can work in guests.

A hypercall is VMM-specific, while the IMS cap provides a VMM-agnostic
interface so any guest driver (if following the spec) can seamlessly
work on all hypervisors.

Thanks
Kevin



^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-03  2:49             ` Tian, Kevin
@ 2020-11-03 12:43               ` Jason Gunthorpe
  2020-11-04  3:41                 ` Tian, Kevin
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-03 12:43 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	tglx, alex.williamson, Pan, Jacob jun, Raj, Ashok, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, Williams, Dan J,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:

> > There is a missing hypercall to allow the guest to do this on its own,
> > presumably it will someday be fixed so IMS can work in guests.
> 
> Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> interface so any guest driver (if following the spec) can seamlessly
> work on all hypervisors.

It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM issue
is architecturally wrong.

IMS *can not work* in any hypervisor without some special
hypercall. Just block it in the platform code and forget about the PCI
cap.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-03 12:43               ` Jason Gunthorpe
@ 2020-11-04  3:41                 ` Tian, Kevin
  2020-11-04 12:40                   ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Tian, Kevin @ 2020-11-04  3:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	tglx, alex.williamson, Pan, Jacob jun, Raj, Ashok, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, Williams, Dan J,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 3, 2020 8:44 PM
> 
> On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:
> 
> > > There is a missing hypercall to allow the guest to do this on its own,
> > > presumably it will someday be fixed so IMS can work in guests.
> >
> > Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> > interface so any guest driver (if following the spec) can seamlessly
> > work on all hypervisors.
> 
> It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM issue
> is architecturally wrong.
> 
> IMS *can not work* in any hypervisor without some special
> hypercall. Just block it in the platform code and forget about the PCI
> cap.
> 

It's a per-device thing rather than a platform thing. If the VMM understands
the IMS format of a specific device and virtualizes it for the guest, the
guest can use IMS w/o any hypercall. If the VMM doesn't understand it, it
simply clears the IMS cap bit for this device, which forces the guest to
use the standard PCI MSI/MSI-X interface. On the VMM side the decision is
based on device virtualization knowledge, e.g. in VFIO, rather than on
platform virtualization logic. Your platform argument is based on the
hypercall assumption, which is what we want to avoid.
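
To make the flow described above concrete, here is a minimal sketch in C.
It is only an illustration under assumptions: the capability bit, the enum
and both helper functions are hypothetical names invented for this example
and do not correspond to any real kernel, VFIO or device interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bit; not taken from any real device spec. */
#define EXAMPLE_CAP_IMS (1u << 0)

enum example_irq_mode { EXAMPLE_IRQ_MSIX, EXAMPLE_IRQ_IMS };

/*
 * VMM side (hypothetical): expose the IMS bit to the guest only if the VMM
 * knows how to virtualize this device's IMS format; otherwise clear it.
 */
static uint32_t example_vmm_filter_caps(uint32_t hw_caps, bool vmm_virtualizes_ims)
{
	return vmm_virtualizes_ims ? hw_caps : (hw_caps & ~EXAMPLE_CAP_IMS);
}

/*
 * Guest driver side (hypothetical): follow the capability bit it sees and
 * fall back to standard MSI-X when the bit is clear.
 */
static enum example_irq_mode example_guest_pick_mode(uint32_t visible_caps)
{
	return (visible_caps & EXAMPLE_CAP_IMS) ? EXAMPLE_IRQ_IMS : EXAMPLE_IRQ_MSIX;
}
```

In this sketch the guest never needs to know it is virtualized; it only
reacts to the capability bit that the VMM chose to expose.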

Thanks
Kevin

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-04  3:41                 ` Tian, Kevin
@ 2020-11-04 12:40                   ` Jason Gunthorpe
  2020-11-04 13:34                     ` Tian, Kevin
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-04 12:40 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	tglx, alex.williamson, Pan, Jacob jun, Raj, Ashok, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, Williams, Dan J,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Wed, Nov 04, 2020 at 03:41:33AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, November 3, 2020 8:44 PM
> > 
> > On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:
> > 
> > > > There is a missing hypercall to allow the guest to do this on its own,
> > > > presumably it will someday be fixed so IMS can work in guests.
> > >
> > > Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> > > interface so any guest driver (if following the spec) can seamlessly
> > > work on all hypervisors.
> > 
> > It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM issue
> > is architecturally wrong.
> > 
> > IMS *can not work* in any hypervisor without some special
> > hypercall. Just block it in the platform code and forget about the PCI
> > cap.
> > 
> 
> It's per-device thing instead of platform thing. If the VMM understands
> the IMS format of a specific device and virtualize it to the guest,

Please no! Adding device specific emulation is just going down deeper
into this bad architecture.

Interrupts are a platform issue. Using emulation of MSI to dynamically
insert vectors into a VM was a reasonable but hacky thing. Now it needs
proper platform support.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-04 12:40                   ` Jason Gunthorpe
@ 2020-11-04 13:34                     ` Tian, Kevin
  2020-11-04 13:54                       ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Tian, Kevin @ 2020-11-04 13:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	tglx, alex.williamson, Pan, Jacob jun, Raj, Ashok, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, Williams, Dan J,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, November 4, 2020 8:40 PM
> 
> On Wed, Nov 04, 2020 at 03:41:33AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, November 3, 2020 8:44 PM
> > >
> > > On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:
> > >
> > > > > There is a missing hypercall to allow the guest to do this on its own,
> > > > > presumably it will someday be fixed so IMS can work in guests.
> > > >
> > > > Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> > > > interface so any guest driver (if following the spec) can seamlessly
> > > > work on all hypervisors.
> > >
> > > It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM issue
> > > is architecturally wrong.
> > >
> > > IMS *can not work* in any hypervisor without some special
> > > hypercall. Just block it in the platform code and forget about the PCI
> > > cap.
> > >
> >
> > It's per-device thing instead of platform thing. If the VMM understands
> > the IMS format of a specific device and virtualize it to the guest,
> 
> Please no! Adding device specific emulation is just going down deeper
> into this bad architecture.
> 
> Interrupts is a platform issue. Using emulation of MSI to dynamically

Interrupt controller is a platform issue. Interrupt source is about device.

> insert vectors to a VM was a reasonable, but hacky thing. Now it needs
> proper platform support.
> 

Why is MSI emulation a hacky thing? Isn't it defined by PCISIG? I guess
I must be misunderstanding your real point here...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-04 13:34                     ` Tian, Kevin
@ 2020-11-04 13:54                       ` Jason Gunthorpe
  2020-11-06  9:48                         ` Tian, Kevin
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-04 13:54 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	tglx, alex.williamson, Pan, Jacob jun, Raj, Ashok, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, Williams, Dan J,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Wed, Nov 04, 2020 at 01:34:08PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, November 4, 2020 8:40 PM
> > 
> > On Wed, Nov 04, 2020 at 03:41:33AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Tuesday, November 3, 2020 8:44 PM
> > > >
> > > > On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:
> > > >
> > > > > > There is a missing hypercall to allow the guest to do this on its own,
> > > > > > presumably it will someday be fixed so IMS can work in guests.
> > > > >
> > > > > Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> > > > > interface so any guest driver (if following the spec) can seamlessly
> > > > > work on all hypervisors.
> > > >
> > > > It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM issue
> > > > is architecturally wrong.
> > > >
> > > > IMS *can not work* in any hypervisor without some special
> > > > hypercall. Just block it in the platform code and forget about the PCI
> > > > cap.
> > > >
> > >
> > > It's per-device thing instead of platform thing. If the VMM understands
> > > the IMS format of a specific device and virtualize it to the guest,
> > 
> > Please no! Adding device specific emulation is just going down deeper
> > into this bad architecture.
> > 
> > Interrupts is a platform issue. Using emulation of MSI to dynamically
> 
> Interrupt controller is a platform issue. Interrupt source is about device.

The interrupt controller is responsible for creating an addr/data pair
for an interrupt message. It sets the message format and ensures it
routes to the proper CPU interrupt handler. Everything about the
addr/data pair is owned by the platform interrupt controller.

Devices do not create interrupts. They only trigger the addr/data pair
the platform gives them.
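
The split of responsibilities described here can be sketched in C as
follows. This is a toy model under assumptions, not kernel code: the struct,
the function names and the addr/data encoding are all made up for
illustration. The point it shows is that only the platform side knows how
the pair is formed, while the device side stores it and replays it unchanged.

```c
#include <stdint.h>

/* Hypothetical opaque interrupt message handed to a device. */
struct example_msg {
	uint64_t addr;
	uint32_t data;
};

/*
 * Platform side (hypothetical): the only code that understands the message
 * format. The encoding below is invented purely for this example.
 */
static struct example_msg example_platform_compose(unsigned int cpu,
						   unsigned int vector)
{
	struct example_msg m = {
		.addr = 0xfee00000ull | ((uint64_t)cpu << 12),
		.data = vector,
	};
	return m;
}

/* Device side (hypothetical): one storage slot for an addr/data pair. */
struct example_device {
	struct example_msg slot;
};

/* The device stores whatever it was given without interpreting it. */
static void example_device_program(struct example_device *dev,
				   struct example_msg m)
{
	dev->slot = m;
}

/* Raising the interrupt just replays the stored pair unchanged. */
static struct example_msg example_device_raise(const struct example_device *dev)
{
	return dev->slot;
}
```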

> > insert vectors to a VM was a reasonable, but hacky thing. Now it needs
> > proper platform support.
>
> why is MSI emulation a hacky thing? isn't it defined by PCISIG? I guess
> that I must misunderstand your real point here...

It means the interrupt controller in the VM's platform is a fiction,
the addr/data pairs it creates are not real.

A PCI device assigned to a VM is supposed to be fully contained by the
IOMMU, interrupts included, so there is no reason to do MSI emulation
if the VM's interrupt controller is aware of what addr/data pairs it
can use with the device, e.g. by getting them through a hypercall. This
is much cleaner and supports things like IMS.

Trying to do IMS emulation is nutz, the entire point of IMS is that the
device can do what it likes, and emulating that is not going to be
feasible. For instance, go read the discussion I had with Thomas about how
an object-centric device would manage interrupts.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-04 13:54                       ` Jason Gunthorpe
@ 2020-11-06  9:48                         ` Tian, Kevin
  2020-11-06 13:14                           ` Jason Gunthorpe
  2020-11-07  0:32                           ` Thomas Gleixner
  0 siblings, 2 replies; 123+ messages in thread
From: Tian, Kevin @ 2020-11-06  9:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	tglx, alex.williamson, Pan, Jacob jun, Raj, Ashok, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, Williams, Dan J,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, November 4, 2020 9:54 PM
> 
> On Wed, Nov 04, 2020 at 01:34:08PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, November 4, 2020 8:40 PM
> > >
> > > On Wed, Nov 04, 2020 at 03:41:33AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Tuesday, November 3, 2020 8:44 PM
> > > > >
> > > > > On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:
> > > > >
> > > > > > > There is a missing hypercall to allow the guest to do this on its own,
> > > > > > > presumably it will someday be fixed so IMS can work in guests.
> > > > > >
> > > > > > Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> > > > > > interface so any guest driver (if following the spec) can seamlessly
> > > > > > work on all hypervisors.
> > > > >
> > > > > > It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM
> > > > > > issue is architecturally wrong.
> > > > > >
> > > > > > IMS *can not work* in any hypervisor without some special
> > > > > hypercall. Just block it in the platform code and forget about the PCI
> > > > > cap.
> > > > >
> > > >
> > > > It's per-device thing instead of platform thing. If the VMM understands
> > > > the IMS format of a specific device and virtualize it to the guest,
> > >
> > > Please no! Adding device specific emulation is just going down deeper
> > > into this bad architecture.
> > >
> > > Interrupts is a platform issue. Using emulation of MSI to dynamically
> >
> > Interrupt controller is a platform issue. Interrupt source is about device.
> 
> The interrupt controller is responsible to create an addr/data pair
> for an interrupt message. It sets the message format and ensures it
> routes to the proper CPU interrupt handler. Everything about the
> addr/data pair is owned by the platform interrupt controller.
> 
> Devices do not create interrupts. They only trigger the addr/data pair
> the platform gives them.

I guess that we may just view it from different angles. On an x86 platform,
an MSI/IMS-capable device directly composes interrupt messages, with the
addr/data pair filled in by the OS. If there is no IOMMU remapping enabled in
the middle, the message just hits the CPU. Your description is possibly
from the software side, e.g. describing the hierarchical IRQ domain
concept?

> 
> > > insert vectors to a VM was a reasonable, but hacky thing. Now it needs
> > > proper platform support.
> >
> > why is MSI emulation a hacky thing? isn't it defined by PCISIG? I guess
> > that I must misunderstand your real point here...
> 
> It means the interrupt controller in the VM's platform is a fiction,
> the addr/data pairs it creates are not real.
> 
> A PCI device assigned to a VM is supposed to be fully contained by the
> IOMMU, interrupts included, so there is no reason to do MSI emulation
> if the VM's interrupt controller is aware of what addr/data pairs it
> can use with the device - eg by getting them through a hypercall. This
> is much cleaner and supports things like IMS

I agree with this point, just like how pci-hyperv.c works. In concept, a Linux
guest driver should be able to use IMS when running on Hyper-V. There
is no such thing for KVM, but possibly one day we will need similar stuff.
Before that happens the guest could choose to simply disallow devmsi
by default in the platform code (inventing a hypercall just for 'disable'
doesn't make sense) and ignore the IMS cap. One small open question is whether
this can be done in one central place. The detection of running as a guest
is done in arch-specific code. Do we need to disable devmsi for every arch?

But when talking about virtualization it's not good to assume the guest
behavior. It's perfectly sane to run a guest OS which doesn't implement
any PV stuff (and thus doesn't know it is running in a VM) but does support
IMS. In such a scenario the IMS cap allows the hypervisor to educate the guest
driver to use MSI instead of IMS, as long as the driver follows the device
spec. In this regard I don't think that the IMS cap will be a short-term
thing, although Linux may choose not to use it.

> 
> Trying to do IMS emulation is nutz, the entire point of IMS is the
> device can do what it likes, and emulating that is not going to
> feasible. For instance go read the discussion I had with Thomas how a
> object-centric device would manage interrupts.
> 

Do you mind providing the link? There were lots of discussions between
you and Thomas. I failed to locate the exact mail when searching for the
above keywords.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-06  9:48                         ` Tian, Kevin
@ 2020-11-06 13:14                           ` Jason Gunthorpe
  2020-11-06 16:48                             ` Raj, Ashok
  2020-11-08 21:18                             ` Thomas Gleixner
  2020-11-07  0:32                           ` Thomas Gleixner
  1 sibling, 2 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-06 13:14 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	tglx, alex.williamson, Pan, Jacob jun, Raj, Ashok, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, Williams, Dan J,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Fri, Nov 06, 2020 at 09:48:34AM +0000, Tian, Kevin wrote:
> > The interrupt controller is responsible to create an addr/data pair
> > for an interrupt message. It sets the message format and ensures it
> > routes to the proper CPU interrupt handler. Everything about the
> > addr/data pair is owned by the platform interrupt controller.
> > 
> > Devices do not create interrupts. They only trigger the addr/data pair
> > the platform gives them.
> 
> I guess that we may just view it from different angles. On x86 platform,
> a MSI/IMS capable device directly composes interrupt messages, with 
> addr/data pair filled by OS.

Yes, all platforms work like that. The addr/data pair is *opaque* to
the device. Only the platform interrupt controller component
understands how to form those values.

> If there is no IOMMU remapping enabled in the middle, the message
> just hits the CPU. Your description possibly is from software side,
> e.g. describing the hierarchical IRQ domain concept?

I suppose you could say that. Technically the APIC doesn't form any
addr/data pairs, but the configuration of the APIC, IOMMU and other
platform components define what addr/data pairs are acceptable.

The IRQ domain stuff broadly puts responsibility to form these values
in the IRQ layer which abstracts all the platform details. In Linux
we expect the platform to provide the IRQ Domain that can specify
working addr/data pairs.

> I agree with this point, just as how pci-hyperv.c works. In concept Linux
> guest driver should be able to use IMS when running on Hyper-v. There
> is no such thing for KVM, but possibly one day we will need similar stuff.
> Before that happens the guest could choose to simply disallow devmsi
> by default in the platform code (inventing a hypercall just for 'disable' 
> doesn't make sense) and ignore the IMS cap. One small open is whether
> this can be done in one central-place. The detection of running as guest
> is done in arch-specific code. Do we need disabling devmsi for every arch?
>
> But when talking about virtualization it's not good to assume the guest
> behavior. It's perfectly sane to run a guest OS which doesn't implement 
> any PV stuff (thus don't know running in a VM) but do support IMS. In 
> such scenario the IMS cap allows the hypervisor to educate the guest 
> driver to use MSI instead of IMS, as long as the driver follows the device 
> spec. In this regard I don't think that the IMS cap will be a short-term 
> thing, although Linux may choose to not use it.

The IMS flag belongs in the platform not in the devices.

For instance you could put a "disable IMS" flag in the ACPI tables, in
the config space of the emulated root port, or any other areas that
clearly belong to the platform.

The OS logic would be
 - If no IMS information found then use IMS (Bare metal)
 - If the IMS disable flag is found then
   - If (future) hypercall available and the OS knows how to use it
     then use IMS
   - If no hypercall found, or no OS knowledge, fail IMS
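
As a rough illustration, the decision logic above could be written in C
roughly like this. It is a minimal sketch under assumptions: the struct and
function names are hypothetical, no such platform flag or hypercall exists
today, and where the flag would actually live (ACPI, the emulated root port,
...) is left open.

```c
#include <stdbool.h>

/*
 * Hypothetical summary of what the platform told the OS about IMS.
 * None of these fields correspond to an existing kernel interface.
 */
struct example_ims_platform_info {
	bool disable_flag_found;  /* platform advertised a "disable IMS" flag */
	bool hypercall_available; /* a (future) interrupt-routing hypercall exists */
	bool os_knows_hypercall;  /* this OS knows how to use that hypercall */
};

/* Returns true if the OS may use IMS, false if IMS setup should fail. */
static bool example_ims_allowed(const struct example_ims_platform_info *info)
{
	/* No IMS information found: treat as bare metal and use IMS. */
	if (!info->disable_flag_found)
		return true;

	/* Flag found: IMS is usable only via a hypercall the OS understands. */
	return info->hypercall_available && info->os_knows_hypercall;
}
```

Note that nothing in this check looks at an individual device; the answer is
the same for every device behind the platform.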

Our devices can use IMS even in pure no-emulation
configurations. Saying that we need to insert complicated,
security-sensitive emulation just to get IMS in the guest is absolutely crazy.

> Do you mind providing the link? There were lots of discussions between
> you and Thomas. I failed to locate the exact mail when searching above
> keywords. 

Read through these two threads:

https://lore.kernel.org/linux-hyperv/20200821002949.049867339@linutronix.de/
https://lore.kernel.org/dmaengine/159534734833.28840.10067945890695808535.stgit@djiang5-desk3.ch.intel.com/

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-06 13:14                           ` Jason Gunthorpe
@ 2020-11-06 16:48                             ` Raj, Ashok
  2020-11-06 17:51                               ` Jason Gunthorpe
  2020-11-08 21:18                             ` Thomas Gleixner
  1 sibling, 1 reply; 123+ messages in thread
From: Raj, Ashok @ 2020-11-06 16:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz,
	bhelgaas, tglx, alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, Williams, Dan J,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

Hi Jason

On Fri, Nov 06, 2020 at 09:14:15AM -0400, Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 09:48:34AM +0000, Tian, Kevin wrote:
> > > The interrupt controller is responsible to create an addr/data pair
> > > for an interrupt message. It sets the message format and ensures it
> > > routes to the proper CPU interrupt handler. Everything about the
> > > addr/data pair is owned by the platform interrupt controller.
> > > 
> > > Devices do not create interrupts. They only trigger the addr/data pair
> > > the platform gives them.
> > 
> > I guess that we may just view it from different angles. On x86 platform,
> > a MSI/IMS capable device directly composes interrupt messages, with 
> > addr/data pair filled by OS.
> 
> Yes, all platforms work like that. The addr/data pair is *opaque* to
> the device. Only the platform interrupt controller component
> understands how to form those values.

True, the addr/data pair is opaque. IMS doesn't dictate what the contents
of the addr/data pair are. That is still a platform attribute. IMS simply
controls where the pair is physically stored, which only the device dictates.

> 
> > If there is no IOMMU remapping enabled in the middle, the message
> > just hits the CPU. Your description possibly is from software side,
> > e.g. describing the hierarchical IRQ domain concept?
> 
> I suppose you could say that. Technically the APIC doesn't form any
> addr/data pairs, but the configuration of the APIC, IOMMU and other
> platform components define what addr/data pairs are acceptable.
> 
> The IRQ domain stuff broadly puts responsibility to form these values
> in the IRQ layer which abstracts all the platform details. In Linux
> we expect the platform to provide the IRQ Domain that can specify
> working addr/data pairs.
> 
> > I agree with this point, just as how pci-hyperv.c works. In concept Linux
> > guest driver should be able to use IMS when running on Hyper-v. There
> > is no such thing for KVM, but possibly one day we will need similar stuff.
> > Before that happens the guest could choose to simply disallow devmsi
> > by default in the platform code (inventing a hypercall just for 'disable' 
> > doesn't make sense) and ignore the IMS cap. One small open is whether
> > this can be done in one central-place. The detection of running as guest
> > is done in arch-specific code. Do we need disabling devmsi for every arch?
> >
> > But when talking about virtualization it's not good to assume the guest
> > behavior. It's perfectly sane to run a guest OS which doesn't implement 
> > any PV stuff (thus don't know running in a VM) but do support IMS. In 
> > such scenario the IMS cap allows the hypervisor to educate the guest 
> > driver to use MSI instead of IMS, as long as the driver follows the device 
> > spec. In this regard I don't think that the IMS cap will be a short-term 
> > thing, although Linux may choose to not use it.
> 
> The IMS flag belongs in the platform not in the devices.

This support is mostly a SW thing, right? We don't need to muck with
platform/ACPI for that matter.

> 
> For instance you could put a "disable IMS" flag in the ACPI tables, in
> the config space of the emulated root port, or any other areas that
> clearly belong to the platform.

Maybe there is a different interpretation of IMS that I'm missing. IMS is for
devices that need more interrupt support than the PCIe standards provide, and
how a device has grouped the storage for the addr/data pairs is a device
attribute.

I missed why ACPI tables should carry such information. If the kernel doesn't
want to support those devices, it's within the kernel's control. Which means
the kernel will only use the available MSIx interfaces. This is legacy support.

> 
> The OS logic would be
>  - If no IMS information found then use IMS (Bare metal)
>  - If the IMS disable flag is found then
>    - If (future) hypercall available and the OS knows how to use it
>      then use IMS
>    - If no hypercall found, or no OS knowledge, fail IMS
> 
> Our devices can use IMS even in a pure no-emulation

This is true for IMS as well. But it is probably not implemented in the kernel
as such. From a HW point of view (take idxd for instance) the facility is
available to a native OS as well. The early RFC supported this for native.

Native devices can have both MSIx and IMS capability. But as I understand it,
this isn't how we have partitioned things in SW today. We left IMS only for
mdevs. And I agree this would be very useful.

In cases where we want to support interrupt handles for user space
notification (when the application specifies that in the descriptor), those
could be IMS. The device HW has support for it.

Remember the "Why PASID in IMS entry" discussion?

https://lore.kernel.org/lkml/20201008233210.GH4734@nvidia.com/

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-06 16:48                             ` Raj, Ashok
@ 2020-11-06 17:51                               ` Jason Gunthorpe
  2020-11-06 23:47                                 ` Dan Williams
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-06 17:51 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz,
	bhelgaas, tglx, alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, Williams, Dan J,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Fri, Nov 06, 2020 at 08:48:50AM -0800, Raj, Ashok wrote:
> > The IMS flag belongs in the platform not in the devices.
> 
> This support is mostly a SW thing right? we don't need to muck with
> platform/ACPI for that matter. 

Something needs to tell the guest OS platform what to do, so you need
a place to put it.

Putting it in a per-device PCI cap is horrible and hacky from an
architectural perspective.

> I missed why ACPI tables should carry such information. If kernel doesn't
> want to support those devices its within kernel control. Which means kernel
> will only use the available MSIx interfaces. This is legacy support.

The platform flag tells the guest that it can (or can't) support IMS
*at all*

Primarily a guest would be blocked because the VMM provides no way for
the guest to create addr/data pairs.

Has nothing to do with individual devices.

 
> > The OS logic would be
> >  - If no IMS information found then use IMS (Bare metal)
> >  - If the IMS disable flag is found then
> >    - If (future) hypercall available and the OS knows how to use it
> >      then use IMS
> >    - If no hypercall found, or no OS knowledge, fail IMS
> > 
> > Our devices can use IMS even in a pure no-emulation
> 
> This is true for IMS as well. But probably not implemented in the kernel as
> such. From a HW point of view (take idxd for instance) the facility is
> available to native OS as well. The early RFC supported this for native.

I can't follow what you are trying to say here.

Dave said the IMS cap was to indicate that the VMM supported emulation
of IMS so that the VMM can do the MSI addr/data translation as part of
the emulation.

I'm saying emulation will be too horrible for our devices that don't
require *any* emulation.

It is a bad architecture. The platform needs to handle this globally
for all devices, not special hacky emulation things custom made for
every device out there.

> Native devices can have both MSIx and IMS capability. But as I
> understand this isn't how we have partitioned things in SW today. We
> left IMS only for mdev's. And I agree this would be very useful.

That split is just some decision idxd did, we are thinking about doing
other things in our devices.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-06 17:51                               ` Jason Gunthorpe
@ 2020-11-06 23:47                                 ` Dan Williams
  2020-11-07  0:12                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Dan Williams @ 2020-11-06 23:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Raj, Ashok, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul, Dey,
	Megha, maz, bhelgaas, tglx, alex.williamson, Pan, Jacob jun, Liu,
	Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Fri, Nov 6, 2020 at 9:51 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
[..]
> > This is true for IMS as well. But probably not implemented in the kernel as
> > such. From a HW point of view (take idxd for instance) the facility is
> > available to native OS as well. The early RFC supported this for native.
>
> I can't follow what you are trying to say here.

I'm having a hard time following the technical cruxes of this debate.
I grokked your feedback on the original IMS proposal way back at the
beginning of this effort (pre-COVID even!), so maybe I can mediate
here as well. Although, SIOV is that much harder for me to spell than
IMS, so bear with me.

> Dave said the IMS cap was to indicate that the VMM supported emulation
> of IMS so that the VMM can do the MSI addr/data translation as part of
> the emulation.
>
> I'm saying emulation will be too horrible for our devices that don't
> require *any* emulation.

This part I think I understand, i.e. why spend any logic emulating IMS
as MSI since the IMS capability can be a paravirtualized interface
from guest to VMM with none of the compromises that MSI would enforce.
Did I get that right?

> It is a bad architecture. The platform needs to handle this globally
> for all devices, not special hacky emulations things custom made for
> every device out there.

I confess I don't quite understand the shape of what "platform needs
to handle this globally" means, but I understand the desired end
result of "no emulation added where not needed". However, would this
mean that the bare-metal idxd driver can not be used directly in the
guest without modification? For example, as I understand from talking
to Ashok, idxd has some device events like error notification hard
wired to MSI while data path interrupts are IMS. So even if the IMS
side does not hook up MSI emulation, doesn't idxd still need MSI
emulation to reuse the bare metal driver directly?

> > Native devices can have both MSIx and IMS capability. But as I
> > understand this isn't how we have partitioned things in SW today. We
> > left IMS only for mdev's. And I agree this would be very useful.
>
> That split is just some decision idxd did, we are thinking about doing
> other things in our devices.

Where does the collision happen between what you need for a clean
implementation of an IMS-like capability (/me misses his "dev-msi"
name that got thrown out in the Thomas rewrite), and emulation needed
to not have VF special casing in the idxd driver?

Also feel free to straighten me out (Jason or Ashok) if I've botched
the understanding of this.


* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-06 23:47                                 ` Dan Williams
@ 2020-11-07  0:12                                   ` Jason Gunthorpe
  2020-11-07  1:42                                     ` Dan Williams
                                                       ` (2 more replies)
  0 siblings, 3 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-07  0:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Raj, Ashok, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul, Dey,
	Megha, maz, bhelgaas, tglx, alex.williamson, Pan, Jacob jun, Liu,
	Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Fri, Nov 06, 2020 at 03:47:00PM -0800, Dan Williams wrote:

> Also feel free to straighten me out (Jason or Ashok) if I've botched
> the understanding of this.

It is pretty simple when you get down to it.

We have a new kernel API that Thomas added:

  pci_subdevice_msi_create_irq_domain()

This creates an IRQ domain that hands out addr/data pairs that
trigger interrupts.

On bare metal the addr/data pairs from the IRQ domain are programmed
into the HW in some HW specific way by the device driver that calls
the above function.

On (kvm) virtualization the addr/data pair the IRQ domain hands out
doesn't work. It is some fake thing.

To make this work on normal MSI/MSI-X the VMM implements emulation of
the standard MSI/MSI-X programming and swaps the fake addr/data pair
for a real one obtained from the hypervisor IRQ domain.

To "deal" with this issue the SIOV spec suggests to add a per-device
PCI Capability that says "IMS works". Which means either:
 - This is bare metal, so of course it works
 - The VMM is trapping and emulating whatever the device specific IMS
   programming is.

The idea being that a VMM can never advertise the IMS cap flag to the
guest unless the VMM provides a device specific driver that does device
specific emulation to capture the addr/data pair. Remember IMS doesn't
say how to program the addr/data pair! Every device is unique!

On something like IDXD this emulation is not so hard, on something
like mlx5 this is completely unworkable. Further we never do
emulation on our devices, they always pass native hardware through,
even for SIOV-like cases.

In the end pci_subdevice_msi_create_irq_domain() is a platform
function. Either it should work completely on every device with no
device-specific emulation required in the VMM, or it should not work
at all and return -EOPNOTSUPP.

The only sane way to implement this generically is for the VMM to
provide a hypercall to obtain a real *working* addr/data pair(s) and
then have the platform hand those out from
pci_subdevice_msi_create_irq_domain(). 

All IMS device drivers will work correctly. No VMM device emulation is
ever needed to translate addr/data pairs.
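
A rough sketch of that all-or-nothing policy, as a reading aid only. The
struct, enum and function names below are hypothetical and are not the
kernel's API; the only thing taken from the discussion is the outcome:
native composition on bare metal, a hypercall in a guest, and
-EOPNOTSUPP otherwise.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical description of what the platform code knows about itself. */
struct platform_info {
    bool is_virtualized;     /* running under a hypervisor */
    bool has_msi_hypercall;  /* hypervisor can hand out real addr/data pairs */
};

/* Who composes the addr/data pairs handed out by the IRQ domain. */
enum ims_source {
    IMS_SRC_NATIVE,     /* bare metal: the platform composes the message */
    IMS_SRC_HYPERCALL,  /* guest: the hypervisor composes the message */
};

/*
 * Stand-in for the decision behind the domain creation call: either every
 * IMS driver gets working addr/data pairs, or creation fails outright.
 * Returns 0 and fills *src on success, -EOPNOTSUPP otherwise.
 */
int ims_domain_source(const struct platform_info *p, enum ims_source *src)
{
    if (!p->is_virtualized) {
        *src = IMS_SRC_NATIVE;
        return 0;
    }
    if (p->has_msi_hypercall) {
        *src = IMS_SRC_HYPERCALL;
        return 0;
    }
    /* Guest without a hypercall: no device-specific emulation fallback. */
    return -EOPNOTSUPP;
}
```

Note that the device driver never appears in this decision; that is the
point of treating it as a platform problem.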

Earlier in this thread Kevin said hyper-v is already working this way,
even for MSI/MSI-X. To me this says it is fundamentally a KVM platform
problem and it should not be solved by PCI capability flags.

Jason


* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-06  9:48                         ` Tian, Kevin
  2020-11-06 13:14                           ` Jason Gunthorpe
@ 2020-11-07  0:32                           ` Thomas Gleixner
  2020-11-09  5:25                             ` Tian, Kevin
  1 sibling, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-07  0:32 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	alex.williamson, Pan, Jacob jun, Raj, Ashok, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, Williams, Dan J,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Fri, Nov 06 2020 at 09:48, Kevin Tian wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> On Wed, Nov 04, 2020 at 01:34:08PM +0000, Tian, Kevin wrote:
>> The interrupt controller is responsible to create an addr/data pair
>> for an interrupt message. It sets the message format and ensures it
>> routes to the proper CPU interrupt handler. Everything about the
>> addr/data pair is owned by the platform interrupt controller.
>> 
>> Devices do not create interrupts. They only trigger the addr/data pair
>> the platform gives them.
>
> I guess that we may just view it from different angles. On x86 platform,
> a MSI/IMS capable device directly composes interrupt messages, with 
> addr/data pair filled by OS. If there is no IOMMU remapping enabled in 
> the middle, the message just hits the CPU. Your description possibly
> is from software side, e.g. describing the hierarchical IRQ domain
> concept?

No. The device composes nothing. If the interrupt is raised in the
device then the MSI block sends the message which was composed by the OS
and stored in the device's message store. For PCI/MSI that's the MSI or
MSIX table and for IMS that's either on device memory (as IDXD uses) or
some completely different location which Jason described.

This has absolutely nothing to do with the X86 platform. MSI is an
architecture independent mechanism: Send whatever the OS put into the
storage to raise an interrupt in the CPU. The device does not know
whether that message is going to be intercepted by an interrupt
remapping unit or not.
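
As a toy model of that split (all names are made up for illustration
and have nothing to do with real hardware or kernel structures): the OS
side writes a composed message into a slot, and the device side sends
back exactly what was stored, without interpreting it.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSLOTS 4

/* An opaque message: the device never looks inside addr or data. */
struct msi_message {
    uint64_t addr;
    uint32_t data;
};

/* The message store: MSI-X table, device memory, or anything else. */
struct device_model {
    struct msi_message store[NSLOTS];
};

/* OS side: store a message that the interrupt domains composed. */
bool os_write_slot(struct device_model *dev, unsigned int slot,
                   struct msi_message msg)
{
    if (slot >= NSLOTS)
        return false;
    dev->store[slot] = msg;
    return true;
}

/* Device side: raising an interrupt sends the stored message unmodified. */
bool device_raise(const struct device_model *dev, unsigned int slot,
                  struct msi_message *sent)
{
    if (slot >= NSLOTS)
        return false;
    *sent = dev->store[slot];
    return true;
}
```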

Stop claiming that any of this has anything to do with x86. It has
absolutely nothing to do with x86 and looking at MSI from an x86
perspective instead of looking at it from the architecture agnostic
technical reality of MSI is the reason why we have this discussion at
all.

We had a similar discussion vs. the way how IMS interrupts have to be
dealt with in terms of irq domains. Can you finally stop looking at
everything as a big x86/intel/platform lump and understand that things
are very well structured and separated both at the hardware and at the
software level? 

> Do you mind providing the link? There were lots of discussions between
> you and Thomas. I failed to locate the exact mail when searching above
> keywords. 

In this thread: 20200821002424.119492231@linutronix.de and you were on
Cc

Thanks,

        tglx




* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-07  0:12                                   ` Jason Gunthorpe
@ 2020-11-07  1:42                                     ` Dan Williams
  2020-11-08 18:11                                     ` Raj, Ashok
  2020-11-08 18:47                                     ` Thomas Gleixner
  2 siblings, 0 replies; 123+ messages in thread
From: Dan Williams @ 2020-11-07  1:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Raj, Ashok, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul, Dey,
	Megha, maz, bhelgaas, tglx, alex.williamson, Pan, Jacob jun, Liu,
	Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Fri, Nov 6, 2020 at 4:12 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Nov 06, 2020 at 03:47:00PM -0800, Dan Williams wrote:
[..]
> The only sane way to implement this generically is for the VMM to
> provide a hypercall to obtain a real *working* addr/data pair(s) and
> then have the platform hand those out from
> pci_subdevice_msi_create_irq_domain().

Yeah, that seems a logical attach point for this magic. Appreciate you
taking the time to lay it out.


* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-07  0:12                                   ` Jason Gunthorpe
  2020-11-07  1:42                                     ` Dan Williams
@ 2020-11-08 18:11                                     ` Raj, Ashok
  2020-11-08 18:34                                       ` David Woodhouse
  2020-11-08 23:41                                       ` Jason Gunthorpe
  2020-11-08 18:47                                     ` Thomas Gleixner
  2 siblings, 2 replies; 123+ messages in thread
From: Raj, Ashok @ 2020-11-08 18:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul,
	Dey, Megha, maz, bhelgaas, tglx, alex.williamson, Pan, Jacob jun,
	Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

Hi Jason

Thanks, it's now clear what you had mentioned earlier.

I had a couple of questions/clarifications below. Thanks for working
through this.

On Fri, Nov 06, 2020 at 08:12:07PM -0400, Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 03:47:00PM -0800, Dan Williams wrote:
> 
> > Also feel free to straighten me out (Jason or Ashok) if I've botched
> > the understanding of this.
> 
> It is pretty simple when you get down to it.
> 
> We have a new kernel API that Thomas added:
> 
>   pci_subdevice_msi_create_irq_domain()
> 
> This creates an IRQ domain that hands out addr/data pairs that
> trigger interrupts.
> 
> On bare metal the addr/data pairs from the IRQ domain are programmed
> into the HW in some HW specific way by the device driver that calls
> the above function.
> 
> On (kvm) virtualization the addr/data pair the IRQ domain hands out
> doesn't work. It is some fake thing.

Is it really some fake thing? I thought the vCPU and vector are real
for a guest, and the VMM ensures that when interrupts are delivered they
are either:

1. Handled by VMM first and then injected to guest
2. Handled in a Posted Interrupt manner, and injected to guest
   when it resumes. It can be delivered directly if guest was running
   when the interrupt arrived.

> 
> To make this work on normal MSI/MSI-X the VMM implements emulation of
> the standard MSI/MSI-X programming and swaps the fake addr/data pair
> for a real one obtained from the hypervisor IRQ domain.
> 
> To "deal" with this issue the SIOV spec suggests to add a per-device
> PCI Capability that says "IMS works". Which means either:
>  - This is bare metal, so of course it works
>  - The VMM is trapping and emulating whatever the device specific IMS
>    programming is.
> 
> The idea being that a VMM can never advertise the IMS cap flag to the
> guest unless the VMM provides a device specific driver that does device
> specific emulation to capture the addr/data pair. Remember IMS doesn't
> say how to program the addr/data pair! Every device is unique!
> 
> On something like IDXD this emulation is not so hard, on something
> like mlx5 this is completely unworkable. Further we never do
> emulation on our devices, they always pass native hardware through,
> even for SIOV-like cases.

So is that true for interrupts too? Possibly you have the interrupt
entries sitting in memory resident on the device? Don't we need the
VMM to ensure they are brokered by VMM in either one of the two ways
above? What if the guest creates some addr in the 0xfee... range?
How do we take care of interrupt remapping and such without any VMM
assist?

It's probably a gap in my understanding.

> 
> In the end pci_subdevice_msi_create_irq_domain() is a platform
> function. Either it should work completely on every device with no
> device-specific emulation required in the VMM, or it should not work
> at all and return -EOPNOTSUPP.
> 
> The only sane way to implement this generically is for the VMM to
> provide a hypercall to obtain a real *working* addr/data pair(s) and
> then have the platform hand those out from
> pci_subdevice_msi_create_irq_domain(). 
> 
> All IMS device drivers will work correctly. No VMM device emulation is
> ever needed to translate addr/data pairs.
> 

That's true. Probably this can work the same even for MSIx types too then?

When we do interrupt remapping support in the guest, which would be
required if we support x2apic in the guest, I think this is something
we should look into more carefully to make this work.

One criterion that we generally tried to follow is that the driver that
runs in host and guest is the same, and if it needs some functionality,
make it work around some capability detection so the alternate path can
be plumbed in a generic way.

I agree with the overall idea and we should certainly take that into
consideration when we need IMS in guest support and in context of
interrupt remapping.

Hopefully I understood the overall concept. If I misunderstood any of
this please let me know.

Cheers,
Ashok


* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 18:11                                     ` Raj, Ashok
@ 2020-11-08 18:34                                       ` David Woodhouse
  2020-11-08 23:25                                         ` Raj, Ashok
  2020-11-08 23:41                                       ` Jason Gunthorpe
  1 sibling, 1 reply; 123+ messages in thread
From: David Woodhouse @ 2020-11-08 18:34 UTC (permalink / raw)
  To: Raj, Ashok, Jason Gunthorpe
  Cc: Dan Williams, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul,
	Dey, Megha, maz, bhelgaas, tglx, alex.williamson, Pan, Jacob jun,
	Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm


On Sun, 2020-11-08 at 10:11 -0800, Raj, Ashok wrote:
> Hi Jason
> 
> Thanks, its now clear what you had mentioned earlier.
> 
> I had couple questions/clarifications below. Thanks for working 
> through this.
> 
> On Fri, Nov 06, 2020 at 08:12:07PM -0400, Jason Gunthorpe wrote:
> > On Fri, Nov 06, 2020 at 03:47:00PM -0800, Dan Williams wrote:
> > 
> > > Also feel free to straighten me out (Jason or Ashok) if I've botched
> > > the understanding of this.
> > 
> > It is pretty simple when you get down to it.
> > 
> > We have a new kernel API that Thomas added:
> > 
> >   pci_subdevice_msi_create_irq_domain()
> > 
> > This creates an IRQ domain that hands out addr/data pairs that
> > trigger interrupts.
> > 
> > On bare metal the addr/data pairs from the IRQ domain are programmed
> > into the HW in some HW specific way by the device driver that calls
> > the above function.
> > 
> > On (kvm) virtualization the addr/data pair the IRQ domain hands out
> > doesn't work. It is some fake thing.
> 
> Is it really some fake thing? I thought the vCPU and vector are real
> for a guest, and VMM ensures when interrupts are delivered they are either.
> 
> 1. Handled by VMM first and then injected to guest
> 2. Handled in a Posted Interrupt manner, and injected to guest
>    when it resumes. It can be delivered directly if guest was running
>    when the interrupt arrived.
> 
> > 
> > To make this work on normal MSI/MSI-X the VMM implements emulation of
> > the standard MSI/MSI-X programming and swaps the fake addr/data pair
> > for a real one obtained from the hypervisor IRQ domain.
> > 
> > To "deal" with this issue the SIOV spec suggests to add a per-device
> > PCI Capability that says "IMS works". Which means either:
> >  - This is bare metal, so of course it works
> >  - The VMM is trapping and emulating whatever the device specific IMS
> >    programming is.
> > 
> > The idea being that a VMM can never advertise the IMS cap flag to the
> > guest unless the VMM provides a device specific driver that does device
> > specific emulation to capture the addr/data pair. Remember IMS doesn't
> > say how to program the addr/data pair! Every device is unique!
> > 
> > On something like IDXD this emulation is not so hard, on something
> > like mlx5 this is completely unworkable. Further we never do
> > emulation on our devices, they always pass native hardware through,
> > even for SIOV-like cases.
> 
> So is that true for interrupts too? Possibly you have the interrupt
> entries sitting in memory resident on the device? Don't we need the 
> VMM to ensure they are brokered by VMM in either one of the two ways 
> above? What if the guest creates some addr in the 0xfee... range
> how do we take care of interrupt remapping and such without any VMM 
> assist?
> 
> Its probably a gap in my understanding. 
> 
> > 
> > In the end pci_subdevice_msi_create_irq_domain() is a platform
> > function. Either it should work completely on every device with no
> > device-specific emulation required in the VMM, or it should not work
> > at all and return -EOPNOTSUPP.
> > 
> > The only sane way to implement this generically is for the VMM to
> > provide a hypercall to obtain a real *working* addr/data pair(s) and
> > then have the platform hand those out from
> > pci_subdevice_msi_create_irq_domain(). 
> > 
> > All IMS device drivers will work correctly. No VMM device emulation is
> > ever needed to translate addr/data pairs.
> > 
> 
> That's true. Probably this can work the same even for MSIx types too then?
> 
> When we do interrupt remapping support in guest which would be required 
> if we support x2apic in guest, I think this is something we should look into more 
> carefully to make this work.

No, interrupt remapping is not required for X2APIC in guests.

They can have X2APIC and up to 32768 CPUs without needing interrupt
remapping at all. Only if they want more than 32768 vCPUs, or to do
nested virtualisation and actually remap for the benefit of *their*
(L2+) guests would they need IR.
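
For the arithmetic behind that number, here is a sketch that assumes
the 15-bit "extended destination ID" layout for the MSI address (low 8
bits of the APIC ID in address bits 19:12, the upper 7 bits in address
bits 11:5). The layout is an assumption stated for illustration and the
helper names are made up; 15 bits gives 2^15 = 32768 distinct IDs.

```c
#include <stdint.h>

#define MSI_ADDR_BASE   0xfee00000u
#define EXT_DEST_BITS   15u
#define EXT_DEST_MAX    (1u << EXT_DEST_BITS)   /* number of encodable IDs */

/* Encode a 15-bit APIC ID into the low 32 bits of an MSI address. */
uint32_t msi_addr_from_apic_id(uint32_t apic_id)
{
    uint32_t lo = apic_id & 0xffu;          /* ID bits 7:0  -> addr 19:12 */
    uint32_t hi = (apic_id >> 8) & 0x7fu;   /* ID bits 14:8 -> addr 11:5  */

    return MSI_ADDR_BASE | (lo << 12) | (hi << 5);
}

/* Decode the 15-bit APIC ID back out of the MSI address. */
uint32_t apic_id_from_msi_addr(uint32_t addr)
{
    return ((addr >> 12) & 0xffu) | (((addr >> 5) & 0x7fu) << 8);
}
```

Address bit 4 is left untouched by this encoding, so it does not
collide with the format bit that the remapping hardware looks at.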






* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-07  0:12                                   ` Jason Gunthorpe
  2020-11-07  1:42                                     ` Dan Williams
  2020-11-08 18:11                                     ` Raj, Ashok
@ 2020-11-08 18:47                                     ` Thomas Gleixner
  2020-11-08 19:36                                       ` David Woodhouse
                                                         ` (2 more replies)
  2 siblings, 3 replies; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-08 18:47 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Raj, Ashok, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul, Dey,
	Megha, maz, bhelgaas, alex.williamson, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine, linux-kernel,
	linux-pci, kvm

On Fri, Nov 06 2020 at 20:12, Jason Gunthorpe wrote:
> All IMS device drivers will work correctly. No VMM device emulation is
> ever needed to translate addr/data pairs.
>
> Earlier in this thread Kevin said hyper-v is already working this way,
> even for MSI/MSI-X. To me this says it is fundamentally a KVM platform
> problem and it should not be solved by PCI capability flags.

I mostly agree but want to add a few clarifications about the
terminology and the boundaries because I think that is where a lot of
the confusion comes from.

Let me go back to the basic structure both at the hardware and at the
software level.

The basic structure is:

  [CPU] -- [Bridge] -- Bus -- [Device]

This applies to all kinds of buses where the bridge directly translates
into the CPUs address space. Now let's look at the boundaries:

                |
                |
  [CPU] -- [Bri | dge] -- Bus -- [Device]
                |   
                |

The boundary is in the middle of the bridge because the CPU side of the
bridge is obviously CPU and therefore architecture specific. The Bus
side of the bridge is architecture agnostic.

Now let's add an IOMMU:

  [CPU] -- [IOMMU] -- [Bridge] -- Bus -- [Device]

and in theory the boundary moves now to:

               |
               |
  [CPU] -- [IO | MMU] -- [Bridge] -- Bus -- [Device]
               |
               |

because with an IOMMU the bridge could become CPU and architecture
agnostic. In reality this is not the case as the bridge is still the
same thing.

Now let's look at MSI. As established above, the Bus and the Device are
CPU and architecture agnostic and the Device merely uses a composed
message which is stored at some place accessible to the device to send
that message when it raises an interrupt. So where is this message
composed?

The basic case:

                   |
                   |
  [CPU]    -- [Bri | dge] -- Bus -- [Device]
                   |
  Alloc +           
  Compose                   Store     Use

The Bridge is irrelevant here: whatever else it does, in the view of
the interrupt subsystem it is only involved in the transport.

The IOMMU case:

               |
               |
  [CPU] -- [IO | MMU] -- [Bridge] -- Bus -- [Device]
               |
            Alloc +
  Alloc     Compose                 Store     Use


That's exactly reflected in hierarchical irq domains:

                       |
                       |
  [CPU]        -- [Bri | dge] --    Bus    -- [Device]
                       |   
  Alloc +           
  Compose                         Store        Use

  Vectordomain                   Busdomain

and:

                     |
                     |
  [CPU]       -- [IO | MMU]  -- [Bridge] --    Bus    -- [Device]
                     |
                  Alloc +   
  Alloc           Compose                    Store       Use

  Vectordomain   Remapdomain                Busdomain


Now if we look at the virtualization scenario and device passthrough
then the structure in the guest view is not any different from the basic
case. This works with PCI-MSI[X] and the IDXD IMS variant because the
hypervisor can trap the access to the storage and translate the message:

                   |
                   |
  [CPU]    -- [Bri | dge] -- Bus -- [Device]
                   |
  Alloc +
  Compose                   Store     Use
                             |
                             | Trap
                             v
                             Hypervisor translates and stores

But obviously with an IMS storage location which is software controlled
by the guest side driver (the case Jason is interested in) the above
cannot work for obvious reasons.

That means the guest needs a way to ask the hypervisor for a proper
translation, i.e. a hypercall. Now where to do that? Looking at the
above remapping case it's pretty obvious:


                     |
                     |
  [CPU]       -- [VI | RT]  -- [Bridge] --    Bus    -- [Device]
                     |
  Alloc          "Compose"                   Store         Use

  Vectordomain   HCALLdomain                Busdomain
                 |        ^
                 |        |
                 v        | 
            Hypervisor    
               Alloc + Compose

Why? Because it reflects the boundaries and leaves the busdomain part
agnostic as it should be. And it works for _all_ variants of Busdomains.

Now the question which I can't answer is whether this can work correctly
in terms of isolation. If the IMS storage is in guest memory (queue
storage) then the guest driver can obviously write random crap into it
which the device will happily send. (For MSI and IDXD style IMS it
still can trap the store).

Is the IOMMU/Interrupt remapping unit able to catch such messages which
go outside the space to which the guest is allowed to signal? If yes,
problem solved. If no, then IMS storage in guest memory can't ever work.

Coming back to this:

> In the end pci_subdevice_msi_create_irq_domain() is a platform
> function. Either it should work completely on every device with no
> device-specific emulation required in the VMM, or it should not work
> at all and return -EOPNOTSUPP.

The subdevice domain is a 'Busdomain' according to the structure
above. It does not and should never have any clue about the underlying
system. It's in the agnostic part and always works. It simply does not
care what's underneath. So it won't return -EOPNOTSUPP.

What it has to do is to transport the IMS in queue memory requirement to
the underlying parent domain.

So in case that the HCALL domain is missing, the Vector domain needs to
return an error code on domain creation. If the HCALL domain is there
then the domain creation works and in case of actual interrupt
allocation the hypercall either returns a valid composed message or an
appropriate error code.
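
A minimal sketch of how that requirement could travel up the
hierarchy. This is not the kernel's irq_domain API; the types and the
function are hypothetical, and the "bare_metal" input stands for
whatever the guest OS managed to detect (see the catch below).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum dom_kind { DOM_VECTOR, DOM_REMAP, DOM_HCALL, DOM_BUS };

/* One level of the hierarchy; parent is NULL at the root. */
struct irq_dom {
    enum dom_kind kind;
    const struct irq_dom *parent;
};

/*
 * Walk from a bus domain towards the root and decide whether IMS kept in
 * queue memory can be served. Returns 0 on success or a negative errno.
 */
int dom_check_queue_ims(const struct irq_dom *d, bool bare_metal)
{
    for (; d != NULL; d = d->parent) {
        switch (d->kind) {
        case DOM_BUS:
        case DOM_REMAP:
            /* Agnostic levels only pass the requirement upwards. */
            break;
        case DOM_HCALL:
            /* The hypervisor composes a valid message. */
            return 0;
        case DOM_VECTOR:
            /* Without a HCALL domain only bare metal can handle this. */
            return bare_metal ? 0 : -EOPNOTSUPP;
        }
    }
    return -EINVAL; /* malformed hierarchy: no vector domain at the root */
}
```

The bus domain itself never produces the error; it only forwards the
requirement to its parent.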

But there's a catch:

This only works when the guest OS actually knows that it runs in a
VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
solved because from the guest OS view that's the same as running on bare
metal. Obviously on bare metal the Vector domain can and must handle
this.

So this needs some thought.

Thanks,

        tglx


* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 18:47                                     ` Thomas Gleixner
@ 2020-11-08 19:36                                       ` David Woodhouse
  2020-11-08 22:47                                         ` Thomas Gleixner
  2020-11-11 15:41                                         ` Christoph Hellwig
  2020-11-08 23:23                                       ` Jason Gunthorpe
  2020-11-08 23:58                                       ` Raj, Ashok
  2 siblings, 2 replies; 123+ messages in thread
From: David Woodhouse @ 2020-11-08 19:36 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe, Dan Williams
  Cc: Raj, Ashok, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul, Dey,
	Megha, maz, bhelgaas, alex.williamson, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine, linux-kernel,
	linux-pci, kvm


On Sun, 2020-11-08 at 19:47 +0100, Thomas Gleixner wrote:
> This only works when the guest OS actually knows that it runs in a
> VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
> solved because from the guest OS view that's the same as running on bare
> metal. Obviously on bare metal the Vector domain can and must handle
> this.
> 
> So this needs some thought.

The problem here is that Intel implemented interrupt remapping in a way
which is anathema to structured, ordered IRQ domains.

When a guest writes an MSI message (addr/data) to the MSI table of a
PCI device which has been assigned to that guest, it *doesn't* properly
inherit the MSI composition from a parent irqdomain which knows about
the (host-side) IOMMU.

What actually happens is the hypervisor *traps* the writes to the
device's MSI table, and translates them *then*. In *precisely* the
fashion which we're trying to avoid for IMS.

Now, you can imagine a world where it wasn't like this, where
Remappable Format MSI messages don't exist, and where we let guests
write native MSI messages to the device without trapping, and where the
IOMMU then sees the incoming interrupt and has to map the APIC ID to a
*virtual* CPU for that guest, based on the PCI source-id of the device.

In that world, IMS would work naturally. But that isn't how Intel
designed interrupt remapping. They *designed* to have to trap and
translate as the message is written to the device.

So it does look like we're going to need a hypercall interface to
compose an MSI message on behalf of the guest, for IMS to use. In fact
PCI devices assigned to a guest could use that too, and then we'd only
need to trap-and-remap any attempt to write a Compatibility Format MSI
to the device's MSI table, while letting Remappable Format messages get
written directly.

We'd also need a way for an OS running on bare metal to *know* that
it's on bare metal and can just compose MSI messages for itself. Since
we do expect bare metal to have an IOMMU, perhaps that is just a
feature flag on the IOMMU?

That or Intel needs to fix the IOMMU to do proper virtualisation and
actually translate "Compatibility Format" MSIs for a guest too.
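
To make the trap-only-Compatibility-Format idea concrete, a small
sketch. It assumes the VT-d address layout in which bit 4 of the MSI
address selects the interrupt format (1 = Remappable Format); the macro
and function names are hypothetical and describe a possible VMM policy,
not existing code.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSI_ADDR_BASE_MASK  0xfff00000u
#define MSI_ADDR_BASE       0xfee00000u
#define MSI_ADDR_FORMAT_RF  (1u << 4)   /* assumed: 1 = Remappable Format */

/* True if the low 32 address bits describe a Remappable Format message. */
bool msi_addr_is_remappable(uint32_t addr_lo)
{
    return (addr_lo & MSI_ADDR_BASE_MASK) == MSI_ADDR_BASE &&
           (addr_lo & MSI_ADDR_FORMAT_RF) != 0;
}

/*
 * Hypothetical VMM policy: a message obtained through the hypercall is
 * already in Remappable Format and can be written straight through; only
 * Compatibility Format writes still need trap-and-remap.
 */
bool vmm_must_remap_write(uint32_t addr_lo)
{
    return !msi_addr_is_remappable(addr_lo);
}
```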



* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-06 13:14                           ` Jason Gunthorpe
  2020-11-06 16:48                             ` Raj, Ashok
@ 2020-11-08 21:18                             ` Thomas Gleixner
  2020-11-08 22:09                               ` David Woodhouse
  1 sibling, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-08 21:18 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	alex.williamson, Pan, Jacob jun, Raj, Ashok, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, Williams, Dan J,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Fri, Nov 06 2020 at 09:14, Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 09:48:34AM +0000, Tian, Kevin wrote:
> For instance you could put a "disable IMS" flag in the ACPI tables, in
> the config space of the emulated root port, or any other areas that
> clearly belong to the platform.
>
> The OS logic would be
>  - If no IMS information found then use IMS (Bare metal)
>  - If the IMS disable flag is found then
>    - If (future) hypercall available and the OS knows how to use it
>      then use IMS
>    - If no hypercall found, or no OS knowledge, fail IMS

That does not work because an older hypervisor would not have that
disable flag and the guest kernel would assume to be on bare metal (if
no other indicators are there).
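
A minimal sketch of the objection, assuming the decision logic quoted
above. The Platform fields and guest_uses_ims() are hypothetical names
made up for illustration; neither the flag nor the hypercall exists
today.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    """What the guest kernel can observe. Both fields are hypothetical."""
    ims_disable_flag: bool      # proposed "disable IMS" platform flag
    hypercall_available: bool   # proposed (future) MSI-compose hypercall


def guest_uses_ims(p: Platform, os_knows_hypercall: bool = True) -> bool:
    # No IMS information found: the OS assumes bare metal and uses IMS.
    if not p.ims_disable_flag:
        return True
    # Flag found: IMS is only usable through the hypercall.
    return p.hypercall_available and os_knows_hypercall


bare_metal = Platform(ims_disable_flag=False, hypercall_available=False)
# An older hypervisor predates the flag, so it cannot set it.
old_hypervisor = Platform(ims_disable_flag=False, hypercall_available=False)
new_hypervisor = Platform(ims_disable_flag=True, hypercall_available=False)

for name, p in [("bare metal", bare_metal),
                ("old hypervisor", old_hypervisor),
                ("new hypervisor", new_hypervisor)]:
    print(name, "->", "use IMS" if guest_uses_ims(p) else "fail IMS")
```

The old hypervisor presents exactly the same inputs as bare metal, so
the guest enables IMS although nothing translates its messages.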

Thanks

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 21:18                             ` Thomas Gleixner
@ 2020-11-08 22:09                               ` David Woodhouse
  2020-11-08 22:52                                 ` Thomas Gleixner
  0 siblings, 1 reply; 123+ messages in thread
From: David Woodhouse @ 2020-11-08 22:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jason Gunthorpe, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul,
	Dey, Megha, maz, bhelgaas, alex.williamson, Pan, Jacob jun, Raj,
	Ashok, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony,
	jing.lin, Williams, Dan J, kwankhede, eric.auger, parav, rafael,
	netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain,
	Mona, dmaengine, linux-kernel, linux-pci, kvm



> On Fri, Nov 06 2020 at 09:14, Jason Gunthorpe wrote:
>> On Fri, Nov 06, 2020 at 09:48:34AM +0000, Tian, Kevin wrote:
>> For instance you could put a "disable IMS" flag in the ACPI tables, in
>> the config space of the emulated root port, or any other areas that
>> clearly belong to the platform.
>>
>> The OS logic would be
>>  - If no IMS information found then use IMS (Bare metal)
>>  - If the IMS disable flag is found then
>>    - If (future) hypercall available and the OS knows how to use it
>>      then use IMS
>>    - If no hypercall found, or no OS knowledge, fail IMS
>
> That does not work because an older hypervisor would not have that
> disable flag and the guest kernel would assume to be on bare metal (if
> no other indicators are there).

In the absence of a forward-thinking design from Intel perhaps we could
use the existence of an IOMMU with interrupt remapping and not caching
mode as the indication that it's bare metal?

-- 
dwmw2


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 19:36                                       ` David Woodhouse
@ 2020-11-08 22:47                                         ` Thomas Gleixner
  2020-11-08 23:29                                           ` Jason Gunthorpe
  2020-11-11 15:41                                         ` Christoph Hellwig
  1 sibling, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-08 22:47 UTC (permalink / raw)
  To: David Woodhouse, Jason Gunthorpe, Dan Williams
  Cc: Raj, Ashok, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul, Dey,
	Megha, maz, bhelgaas, alex.williamson, Pan, Jacob jun, Liu, Yi L,
	Lu, Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine, linux-kernel,
	linux-pci, kvm

On Sun, Nov 08 2020 at 19:36, David Woodhouse wrote:
> On Sun, 2020-11-08 at 19:47 +0100, Thomas Gleixner wrote:
>> So this needs some thought.
>
> The problem here is that Intel implemented interrupt remapping in a way
> which is anathema to structured, ordered IRQ domains.
>
> When a guest writes an MSI message (addr/data) to the MSI table of a
> PCI device which has been assigned to that guest, it *doesn't* properly
> inherit the MSI composition from a parent irqdomain which knows about
> the (host-side) IOMMU.
>
> What actually happens is the hypervisor *traps* the writes to the
> device's MSI table, and translates them *then*.

That's what I showed in the ascii art :)

> In *precisely* the fashion which we're trying to avoid for IMS.

At least for the IMS variant where the storage is not in trappable
device memory.

> Now, you can imagine a world where it wasn't like this, where
> Remappable Format MSI messages don't exist, and where we let guests
> write native MSI message to the device without trapping — and where the
> IOMMU then sees the incoming interrupt and has to map the APIC ID to a
> *virtual* CPU for that guest, based on the PCI source-id of the
> device.

That would be not convoluted enough and make too much sense.

> In that world, IMS would work naturally. But that isn't how Intel
> designed interrupt remapping. They *designed* to have to trap and
> translate as the message is written to the device.
>
> So it does look like we're going to need a hypercall interface to
> compose an MSI message on behalf of the guest, for IMS to use. In fact
> PCI devices assigned to a guest could use that too, and then we'd only
> need to trap-and-remap any attempt to write a Compatibility Format MSI
> to the device's MSI table, while letting Remappable Format messages get
> written directly.

Yes, if we have the HCALL domain then the message composed by the
hypervisor is valid for everything not only IMS. That's why I left out
any specifics on the Busdomain side. It does not matter which kind of
bus that is. The only mechanics which is provided by the busdomain is
to store the precomposed message and eventually provide mask/unmask at
that level.

> We'd also need a way for an OS running on bare metal to *know* that
> it's on bare metal and can just compose MSI messages for itself. Since
> we do expect bare metal to have an IOMMU, perhaps that is just a
> feature flag on the IOMMU?

There are still CPUs w/o IOMMU out there and new ones are shipped.

So you would basically mandate that IMS with memory storage can only
work on bare metal when the CPU has an IOMMU.

Jason said in [1]: "For x86 I think we could accept linking this to
IOMMU, if really necessary."

OTOH, what's the chance that a guest runs on something which

  1) Does not have X86_FEATURE_HYPERVISOR set in cpuid 1/ECX

and

  2) Cannot be identified as Xen domain

and

  3) Does not have a DMI vendor entry which identifies the
     virtualization solution (we don't use that today, but
     adding that table is trivial enough)

and

  4) Has such an IMS device passed through?

Possible, yes. Likely, no. Do we care?
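
The four conditions can be sketched as a predicate. This is only an
illustration: the function and parameter names are hypothetical, and
the DMI check does not exist in the kernel today, as noted above.

```python
def guest_detects_vm(cpuid_hypervisor_bit: bool,
                     is_xen_domain: bool,
                     dmi_vendor_is_virt: bool) -> bool:
    """Any single indicator is enough for the guest to know it is a VM."""
    return cpuid_hypervisor_bit or is_xen_domain or dmi_vendor_is_virt


def ims_silently_broken(cpuid_hypervisor_bit: bool,
                        is_xen_domain: bool,
                        dmi_vendor_is_virt: bool,
                        ims_device_passed_through: bool) -> bool:
    """True only when all four conditions from the list hold at once."""
    undetected = not guest_detects_vm(cpuid_hypervisor_bit,
                                      is_xen_domain,
                                      dmi_vendor_is_virt)
    return undetected and ims_device_passed_through


# A VMM that sets the hypervisor bit is always detected.
print(ims_silently_broken(True, False, False, True))
# A VMM that hides every indicator and passes an IMS device through.
print(ims_silently_broken(False, False, False, True))
```

Only the last combination misbehaves, which is why the case is possible
but unlikely.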

> That or Intel needs to fix the IOMMU to do proper virtualisation and
> actually translate "Compatibility Format" MSIs for a guest too.

Is that going to happen before I retire?

Thanks,

        tglx

[1] https://lore.kernel.org/r/20200822005125.GB1152540@nvidia.com

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 22:09                               ` David Woodhouse
@ 2020-11-08 22:52                                 ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-08 22:52 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Jason Gunthorpe, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul,
	Dey, Megha, maz, bhelgaas, alex.williamson, Pan, Jacob jun, Raj,
	Ashok, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony,
	jing.lin, Williams, Dan J, kwankhede, eric.auger, parav, rafael,
	netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain,
	Mona, dmaengine, linux-kernel, linux-pci, kvm

On Sun, Nov 08 2020 at 22:09, David Woodhouse wrote:

>> On Fri, Nov 06 2020 at 09:14, Jason Gunthorpe wrote:
>>> On Fri, Nov 06, 2020 at 09:48:34AM +0000, Tian, Kevin wrote:
>>> For instance you could put a "disable IMS" flag in the ACPI tables, in
>>> the config space of the emulated root port, or any other areas that
>>> clearly belong to the platform.
>>>
>>> The OS logic would be
>>>  - If no IMS information found then use IMS (Bare metal)
>>>  - If the IMS disable flag is found then
>>>    - If (future) hypercall available and the OS knows how to use it
>>>      then use IMS
>>>    - If no hypercall found, or no OS knowledge, fail IMS
>>
>> That does not work because an older hypervisor would not have that
>> disable flag and the guest kernel would assume to be on bare metal (if
>> no other indicators are there).
>
> In the absence of a forward-thinking design from Intel perhaps we could

Just to be fair the AMD interrupt remapping is not any better in that
regard.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 18:47                                     ` Thomas Gleixner
  2020-11-08 19:36                                       ` David Woodhouse
@ 2020-11-08 23:23                                       ` Jason Gunthorpe
  2020-11-08 23:36                                         ` Raj, Ashok
  2020-11-09  7:37                                         ` Tian, Kevin
  2020-11-08 23:58                                       ` Raj, Ashok
  2 siblings, 2 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-08 23:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dan Williams, Raj, Ashok, Tian, Kevin, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, jing.lin, kwankhede, eric.auger, parav, rafael, netanelg,
	shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona,
	dmaengine, linux-kernel, linux-pci, kvm

On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> 
> That means the guest needs a way to ask the hypervisor for a proper
> translation, i.e. a hypercall. Now where to do that? Looking at the
> above remapping case it's pretty obvious:
> 
> 
>                      |
>                      |
>   [CPU]       -- [VI | RT]  -- [Bridge] --    Bus    -- [Device]
>                      |
>   Alloc          "Compose"                   Store         Use
> 
>   Vectordomain   HCALLdomain                Busdomain
>                  |        ^
>                  |        |
>                  v        | 
>             Hypervisor    
>                Alloc + Compose

Yes, this describes what I have been thinking

> Now the question which I can't answer is whether this can work correctly
> in terms of isolation. If the IMS storage is in guest memory (queue
> storage) then the guest driver can obviously write random crap into it
> which the device will happily send. (For MSI and IDXD style IMS it
> still can trap the store).

There are four cases of interest here:

 1) Bare metal, PF and VF devices just deliver whatever addr/data pairs
    to the APIC. IMS works perfectly with pci_subdevice_msi_create_irq_domain()

 2) SRIOV VF assigned to the guest.

    The guest can cause any MemWr TLP to any addr/data pair
    and the iommu/platform/vmm is supposed to use the
    Bus/device/function to isolate & secure the interrupt address
    range.

    IMS can work in the guest if the guest knows the details of the
    address range and can make hypercalls to setup routing. So
    pci_subdevice_msi_create_irq_domain() works if the hypercalls
    exist and fails if they don't.

 3) SIOV sub device assigned to the guest.

    The difference between SIOV and SRIOV is the device must attach a
    PASID to every TLP triggered by the guest. Logically we'd expect
    when IMS is used in this situation the interrupt MemWr is tagged
    with bus/device/function/PASID to uniquely ID the guest and the same
    security protection scheme from #2 applies.

 4) SIOV sub device assigned to the guest, but with emulation.

    This SIOV device cannot tag interrupts with PASID so cannot do #2
    (or the platform cannot receive a PASID tagged interrupt message).

    Since the interrupts are being delivered with TLPs pointing at the
    hypervisor the only solution is for the hypervisor to exclusively
    control the interrupt table. MSI table like emulation for IMS is
    needed and the hypervisor will use pci_subdevice_msi_create_irq_domain()
    to get the real interrupts.

    pci_subdevice_msi_create_irq_domain() needs to return the 'fake'
    addr/data pairs which are actually an ABI between the guest and
    hypervisor carried in the hidden hypercall of the emulation.
    (ie it works like MSI works today)
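
The four cases can be summarised in a small sketch. The enum, the
function and the outcome strings are hypothetical labels chosen for
illustration, not kernel interfaces.

```python
from enum import Enum


class Case(Enum):
    BARE_METAL = 1          # 1) PF/VF on bare metal
    SRIOV_VF_IN_GUEST = 2   # 2) SRIOV VF assigned to the guest
    SIOV_PASID_TAGGED = 3   # 3) SIOV sub device, interrupt MemWr carries PASID
    SIOV_EMULATED = 4       # 4) SIOV sub device, interrupts not PASID tagged


def ims_outcome(case: Case, hypercall_available: bool) -> str:
    if case is Case.BARE_METAL:
        # The OS composes addr/data pairs itself.
        return "native"
    if case is Case.SIOV_EMULATED:
        # The hypervisor exclusively controls the interrupt table and
        # hands the guest 'fake' addr/data pairs, as MSI works today.
        return "emulated"
    # Cases 2 and 3: the guest needs hypercalls to set up routing.
    return "hypercall" if hypercall_available else "fail"


for case in Case:
    print(case.name, ims_outcome(case, hypercall_available=False))
```

Without hypercalls, only cases 1 and 4 give a working interrupt.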

IDXD is worrying about case #4, I think, but I didn't follow, in that
whole discussion about the IMS table layout, whether they PASID tag the
IMS MemWr or not. Ashok, can you clarify?

> Is the IOMMU/Interrupt remapping unit able to catch such messages which
> go outside the space to which the guest is allowed to signal to? If yes,
> problem solved. If no, then IMS storage in guest memory can't ever work.

Right. Only PASID on the interrupt messages can resolve this securely.

> So in case that the HCALL domain is missing, the Vector domain needs
> return an error code on domain creation. If the HCALL domain is there
> then the domain creation works and in case of actual interrupt
> allocation the hypercall either returns a valid composed message or an
> appropriate error code.

Yes
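
A rough sketch of that error handling, under the assumption that the
guest knows whether it runs in a VM. All names are hypothetical and the
bare-metal addr/data values are placeholders, not a real message.

```python
class DomainError(Exception):
    """Stands in for an error code returned by the irq domain layer."""


def create_vector_domain(in_vm: bool, hcall_domain_present: bool) -> str:
    if not in_vm:
        # Bare metal: the vector domain composes messages itself.
        return "vector"
    if not hcall_domain_present:
        # In a VM without the HCALL domain, creation must fail.
        raise DomainError("no HCALL domain available")
    return "hcall"


def allocate_interrupt(parent: str, hypercall=None):
    """Return an (addr, data) pair or raise DomainError."""
    if parent == "vector":
        return (0xFEE00000, 0x20)   # placeholder message
    msg = hypercall()               # hypervisor allocates and composes
    if msg is None:
        raise DomainError("hypervisor refused to compose a message")
    return msg


parent = create_vector_domain(in_vm=True, hcall_domain_present=True)
print(parent, allocate_interrupt(parent, lambda: (0xFEE01000, 0x31)))
```

The bus domain never sees the difference; it only stores whatever
message the parent hands back.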
 
> But there's a catch:
> 
> This only works when the guest OS actually knows that it runs in a
> VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
> solved because from the guest OS view that's the same as running on bare
> metal. Obviously on bare metal the Vector domain can and must handle
> this.

Yes

The flip side is that today, the way pci_subdevice_msi_create_irq_domain()
works, a VF using it on bare metal will succeed, and if that same VF is
assigned to a guest then pci_subdevice_msi_create_irq_domain()
succeeds but the interrupt never comes - so the driver is broken.

Yes, if we add some ACPI/etc flag that is not going to magically fix
old kvm's, but at least *something* exists that works right and
generically.

If we follow Intel's path then we need special KVM level support for
*every* device, PCI cap mangling and so on. Forever. Sounds horrible
to me..

This feels like one of these things where no matter what we do
something is broken. Picking the least breakage is the challenge here.

> So this needs some thought.

I think your HCALL diagram is the only sane architecture.

If we go that way then case #4 will still work; in this case the HCALL
will return addr/data pairs that conform to what the emulation
expects. Or fail if the VMM can't do emulation for the device.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 18:34                                       ` David Woodhouse
@ 2020-11-08 23:25                                         ` Raj, Ashok
  2020-11-10 14:19                                           ` Raj, Ashok
  0 siblings, 1 reply; 123+ messages in thread
From: Raj, Ashok @ 2020-11-08 23:25 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Jason Gunthorpe, Dan Williams, Tian, Kevin, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, tglx,
	alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar,
	Sanjay K, Luck, Tony, jing.lin, kwankhede, eric.auger, parav,
	rafael, netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel,
	Hossain, Mona, dmaengine, linux-kernel, linux-pci, kvm,
	Ashok Raj

On Sun, Nov 08, 2020 at 06:34:55PM +0000, David Woodhouse wrote:
> > 
> > When we do interrupt remapping support in guest which would be required 
> > if we support x2apic in guest, I think this is something we should look into more 
> > carefully to make this work.
> 
> No, interrupt remapping is not required for X2APIC in guests
> 
> They can have X2APIC and up to 32768 CPUs without needing interrupt

How is this made available today without interrupt remapping? 

I thought without IR, the destination ID is still limited to only 8 bits?

On native, even if you have fewer than 255 CPUs, the APIC IDs may be sparsely
distributed due to platform rules, so an x2APIC ID could need more than 8 bits.
Which is why the spec requires IR when x2apic is enabled.

> remapping at all. Only if they want more than 32768 vCPUs, or to do
> nested virtualisation and actually remap for the benefit of *their*
> (L2+) guests would they need IR.
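
The two limits discussed here follow from the width of the destination
ID in the MSI address. A sketch of the arithmetic, assuming the
Compatibility Format layout (destination ID in address bits 19:12) and
assuming that the 32768 figure comes from an extended destination ID
carrying 7 more bits in address bits 11:5; the helper names are made up.

```python
MSI_BASE = 0xFEE00000


def max_destinations(dest_id_bits: int) -> int:
    return 1 << dest_id_bits


def compose_msi_addr(dest_id: int) -> int:
    """Place a destination ID of up to 15 bits into an MSI address."""
    if not 0 <= dest_id < max_destinations(15):
        raise ValueError("destination ID does not fit in 15 bits")
    low = dest_id & 0xFF            # classic 8-bit field, bits 19:12
    high = (dest_id >> 8) & 0x7F    # assumed extension, bits 11:5
    return MSI_BASE | (low << 12) | (high << 5)


print(max_destinations(8))       # width without IR or any extension
print(max_destinations(15))      # width with the assumed 7 extra bits
print(hex(compose_msi_addr(0x100)))
```

With only 8 bits a sparse APIC ID above 255 cannot be encoded at all,
which is the point raised about native x2APIC.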

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 22:47                                         ` Thomas Gleixner
@ 2020-11-08 23:29                                           ` Jason Gunthorpe
  0 siblings, 0 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-08 23:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Woodhouse, Dan Williams, Raj, Ashok, Tian, Kevin, Jiang,
	Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar,
	Sanjay K, Luck, Tony, jing.lin, kwankhede, eric.auger, parav,
	rafael, netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel,
	Hossain, Mona, dmaengine, linux-kernel, linux-pci, kvm

On Sun, Nov 08, 2020 at 11:47:13PM +0100, Thomas Gleixner wrote:

> OTOH, what's the chance that a guest runs on something which
> 
>   1) Does not have X86_FEATURE_HYPERVISOR set in cpuid 1/ECX
> 
> and
> 
>   2) Cannot be identified as Xen domain
> 
> and
> 
>   3) Does not have a DMI vendor entry which identifies the
>      virtualization solution (we don't use that today, but
>      adding that table is trivial enough)
> 
> and
> 
>   4) Has such an IMS device passed through?
> 
> Possible, yes. Likely, no. Do we care?

This is exactly my thinking too. IMS is still very new, if we add some
platform flag to disable it then yes there are broken cases but enough
options for an unlucky user to deal with it:

 - Have their VMM set X86_FEATURE_HYPERVISOR
 - Update the VMM to set the global disable flag
 - Add some "disable_subdevice_msi" kernel command line flag in the guest

In exchange we get a much cleaner architecture for the next 10 years..

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 23:23                                       ` Jason Gunthorpe
@ 2020-11-08 23:36                                         ` Raj, Ashok
  2020-11-09  7:37                                         ` Tian, Kevin
  1 sibling, 0 replies; 123+ messages in thread
From: Raj, Ashok @ 2020-11-08 23:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Gleixner, Dan Williams, Tian, Kevin, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, jing.lin, kwankhede, eric.auger, parav, rafael, netanelg,
	shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona,
	dmaengine, linux-kernel, linux-pci, kvm, Ashok Raj

Hi Jason,

On Sun, Nov 08, 2020 at 07:23:41PM -0400, Jason Gunthorpe wrote:
> 
> IDXD is worrying about case #4, I think, but I didn't follow in that
> whole discussion about the IMS table layout if they PASID tag the IMS
> MemWr or not?? Ashok can you clarify?
> 

The PASID in the interrupt store is for the IDXD to verify the interrupt handle
that came with the ENQCMD. User applications can obtain an interrupt handle and
ask for interrupt to be generated for transactions submitted via ENQCMD.

IDXD will compare the PASID that came with ENQCMD and verify if the PASID matches
the one stored in the Interrupt Table before generating the MemWr.

So MemWr for interrupts remains unchanged for IDXD on the wire. PASID is present in interrupt
store because the value was programmed by user space, and needs OS/hardware to ensure 
the entity asking for interrupts has ownership for the interrupt handle.
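
A minimal sketch of the check described above, assuming a simplified
interrupt store entry. The class and function names are hypothetical
and do not reflect the real IDXD register layout.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ImsEntry:
    addr: int    # MSI address programmed into the entry
    data: int    # MSI data programmed into the entry
    pasid: int   # PASID recorded alongside the message


def interrupt_memwr(entry: ImsEntry,
                    enqcmd_pasid: int) -> Optional[Tuple[int, int]]:
    """Return the MemWr (addr, data) or None if the handle is not owned."""
    if entry.pasid != enqcmd_pasid:
        # The submitter does not own this interrupt handle.
        return None
    # The PASID is only used for the ownership check; the write that
    # goes out on the wire is a plain addr/data pair.
    return (entry.addr, entry.data)


entry = ImsEntry(addr=0xFEE01000, data=0x31, pasid=42)
print(interrupt_memwr(entry, enqcmd_pasid=42))
print(interrupt_memwr(entry, enqcmd_pasid=7))
```

The check protects one submitter from another on the device, but it
gives the IOMMU nothing to filter on, since the message is untagged.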

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 18:11                                     ` Raj, Ashok
  2020-11-08 18:34                                       ` David Woodhouse
@ 2020-11-08 23:41                                       ` Jason Gunthorpe
  2020-11-09  0:05                                         ` Raj, Ashok
  1 sibling, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-08 23:41 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Dan Williams, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul,
	Dey, Megha, maz, bhelgaas, tglx, alex.williamson, Pan, Jacob jun,
	Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Sun, Nov 08, 2020 at 10:11:24AM -0800, Raj, Ashok wrote:

> > On (kvm) virtualization the addr/data pair the IRQ domain hands out
> > doesn't work. It is some fake thing.
> 
> Is it really some fake thing? I thought the vCPU and vector are real
> for a guest, and VMM ensures when interrupts are delivered they are either.

It is fake in the sense it is programmed into no hardware.
 
It is real in the sense it is an ABI contract with the VMM.

> > On something like IDXD this emulation is not so hard, on something
> > like mlx5 this is completely unworkable. Further we never do
> > emulation on our devices, they always pass native hardware through,
> > even for SIOV-like cases.
> 
> So is that true for interrupts too? 

There is no *mlx5* emulation. We ride on the generic MSI emulation KVM
is doing.

> Possibly you have the interrupt entries sitting in memory resident
> on the device?

For SRIOV, yes. The appeal of IMS is to move away from that.

> Don't we need the VMM to ensure they are brokered by VMM in either
> one of the two ways above?

Yes, no matter what the VMM has to know the guest wants an interrupt
routed in and setup the VMM part of the equation. With SRIOV this is
all done with the MSI trapping.

> What if the guest creates some addr in the 0xfee... range how do we
> take care of interrupt remapping and such without any VMM assist?

Not sure I understand this?

> That's true. Probably this can work the same even for MSIx types too then?

Yes, once you have the ability to hypercall to create the addr/data
pair then it can work with MSI and the VMM can stop emulation. It
would be a nice bit of uniformity to close this, but switching the VMM
from legacy to new mode is going to be tricky, I fear.

> I agree with the overall idea and we should certainly take that into
> consideration when we need IMS in guest support and in context of
> interrupt remapping.

The issue with things, as they sit now, is SRIOV.

If any driver starts using pci_subdevice_msi_create_irq_domain() then
it fails if the VF is assigned to a guest with SRIOV. This is a real
and important use case for many devices today!

The "solution" can't be to go back and retroactively change every
shipping device to add PCI capability blocks, and ensure that every
existing VMM strips them out before assigning the device (including
Hyper-V!!)  :(

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 18:47                                     ` Thomas Gleixner
  2020-11-08 19:36                                       ` David Woodhouse
  2020-11-08 23:23                                       ` Jason Gunthorpe
@ 2020-11-08 23:58                                       ` Raj, Ashok
  2020-11-09  7:59                                         ` Tian, Kevin
  2020-11-09 11:21                                         ` Thomas Gleixner
  2 siblings, 2 replies; 123+ messages in thread
From: Raj, Ashok @ 2020-11-08 23:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jason Gunthorpe, Dan Williams, Tian, Kevin, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

Hi Thomas,

[-] Jing, she isn't working at Intel anymore.

Now this is getting compiled as a book :-).. Thanks a ton!

One question on the hypercall case that isn't immediately
clear to me.

On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> 
> 
> Now if we look at the virtualization scenario and device hand through
> then the structure in the guest view is not any different from the basic
> case. This works with PCI-MSI[X] and the IDXD IMS variant because the
> hypervisor can trap the access to the storage and translate the message:
> 
>                    |
>                    |
>   [CPU]    -- [Bri | dge] -- Bus -- [Device]
>                    |
>   Alloc +
>   Compose                   Store     Use
>                              |
>                              | Trap
>                              v
>                              Hypervisor translates and stores
> 

In the above case, the VMM is responsible for writing to the message
store, whether it is IMS or legacy MSI/MSI-X. The VMM handles
the writes to the device interrupt region and to the IRTE tables.

> But obviously with an IMS storage location which is software controlled
> by the guest side driver (the case Jason is interested in) the above
> cannot work for obvious reasons.
> 
> That means the guest needs a way to ask the hypervisor for a proper
> translation, i.e. a hypercall. Now where to do that? Looking at the
> above remapping case it's pretty obvious:
> 
> 
>                      |
>                      |
>   [CPU]       -- [VI | RT]  -- [Bridge] --    Bus    -- [Device]
>                      |
>   Alloc          "Compose"                   Store         Use
> 
>   Vectordomain   HCALLdomain                Busdomain
>                  |        ^
>                  |        |
>                  v        | 
>             Hypervisor    
>                Alloc + Compose
> 
> Why? Because it reflects the boundaries and leaves the busdomain part
> agnostic as it should be. And it works for _all_ variants of Busdomains.
> 
> Now the question which I can't answer is whether this can work correctly
> in terms of isolation. If the IMS storage is in guest memory (queue
> storage) then the guest driver can obviously write random crap into it
> which the device will happily send. (For MSI and IDXD style IMS it
> still can trap the store).

The isolation problem is not just the guest memory being used as interrupt
store, right? If the store to the device region is not trapped and controlled
by the VMM, there is no guarantee the guest OS has done the right thing?


Thinking about it, guest memory might be more problematic since it's not
trappable and the VMM can't enforce what is written. This is something that
needs more attention. But for now, for devices supporting memory on the
device, the trap and store by the VMM seems to satisfy the security
properties you highlight here.

> 
> Is the IOMMU/Interrupt remapping unit able to catch such messages which
> go outside the space to which the guest is allowed to signal to? If yes,
> problem solved. If no, then IMS storage in guest memory can't ever work.

This can probably work for SRIOV devices where the guest owns the entire device.
Interrupt remapping does have RID checks that catch an interrupt arriving at an
interrupt handle not allocated for that BDF.

But for SIOV devices there is no PASID filtering at the remap level since
interrupt messages don't carry PASID in the TLP.
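
The difference between the two cases can be sketched as follows. The
requester-ID values are invented for illustration and the check is
reduced to a plain equality test, which is a simplification of what an
interrupt remapping unit does.

```python
def rid(bus: int, dev: int, fn: int) -> int:
    """Pack bus/device/function into a 16-bit PCI requester ID."""
    return (bus << 8) | (dev << 3) | fn


def irte_accepts(allowed_rid: int, requester_rid: int) -> bool:
    """Simplified source-id check on a remapping table entry."""
    return allowed_rid == requester_rid


# SRIOV: every VF has its own requester ID.
vf_a, vf_b = rid(0x3B, 0, 1), rid(0x3B, 0, 2)
print(irte_accepts(allowed_rid=vf_a, requester_rid=vf_b))

# SIOV: sub devices share the parent function's requester ID and the
# interrupt message carries no PASID to tell them apart.
subdev_a = subdev_b = rid(0x3B, 0, 0)
print(irte_accepts(allowed_rid=subdev_a, requester_rid=subdev_b))
```

In the SIOV case the check passes for either sub device, so the RID
alone cannot confine a guest to its own interrupt entries.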

> 
> Coming back to this:
> 
> > In the end pci_subdevice_msi_create_irq_domain() is a platform
> > function. Either it should work completely on every device with no
> > device-specific emulation required in the VMM, or it should not work
> > at all and return -EOPNOTSUPP.
> 
> The subdevice domain is a 'Busdomain' according to the structure
> above. It does not and should never have any clue about the underlying
> system. It's in the agnostic part and always works. It simply does not
> care what's underneath. So it won't return -EOPNOTSUPP.
> 
> What it has to do is to transport the IMS in queue memory requirement to
> the underlying parent domain.
> 
> So in case that the HCALL domain is missing, the Vector domain needs
> return an error code on domain creation. If the HCALL domain is there
> then the domain creation works and in case of actual interrupt
> allocation the hypercall either returns a valid composed message or an
> appropriate error code.
> 
> But there's a catch:
> 
> This only works when the guest OS actually knows that it runs in a
> VM. If the guest can't figure that out, i.e. via CPUID, this cannot be

Precisely! It might work if the OS is new, but trap-and-emulate
seems both safe and workable for legacy OSes as well?


Cheers,
Ashok

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 23:41                                       ` Jason Gunthorpe
@ 2020-11-09  0:05                                         ` Raj, Ashok
  0 siblings, 0 replies; 123+ messages in thread
From: Raj, Ashok @ 2020-11-09  0:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul,
	Dey, Megha, maz, bhelgaas, tglx, alex.williamson, Pan, Jacob jun,
	Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine, linux-kernel,
	linux-pci, kvm, Ashok Raj

Hi Jason

On Sun, Nov 08, 2020 at 07:41:42PM -0400, Jason Gunthorpe wrote:
> On Sun, Nov 08, 2020 at 10:11:24AM -0800, Raj, Ashok wrote:
> 
> > > On (kvm) virtualization the addr/data pair the IRQ domain hands out
> > > doesn't work. It is some fake thing.
> > 
> > Is it really some fake thing? I thought the vCPU and vector are real
> > for a guest, and VMM ensures when interrupts are delivered they are either.
> 
> It is fake in the sense it is programmed into no hardware.
>  
> It is real in the sense it is an ABI contract with the VMM.

Ah.. it's clear now. That clears up my question below as well.

> 
> Yes, no matter what the VMM has to know the guest wants an interrupt
> routed in and setup the VMM part of the equation. With SRIOV this is
> all done with the MSI trapping.
> 
> > What if the guest creates some addr in the 0xfee... range how do we
> > take care of interrupt remapping and such without any VMM assist?
> 
> Not sure I understand this?
> 

My question was based on the misconception that interrupt entries are directly
written by the guest OS for mlx*. My concern was about security isolation if
the guest OS has full control of the device interrupt store.

I think you clarified that interrupts are still marshalled by the VMM
and not under direct control of the guest OS. That makes my question moot.

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-07  0:32                           ` Thomas Gleixner
@ 2020-11-09  5:25                             ` Tian, Kevin
  0 siblings, 0 replies; 123+ messages in thread
From: Tian, Kevin @ 2020-11-09  5:25 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe
  Cc: Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	alex.williamson, Pan, Jacob jun, Raj, Ashok, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin, Williams, Dan J,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm



> -----Original Message-----
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Saturday, November 7, 2020 8:32 AM
> To: Tian, Kevin <kevin.tian@intel.com>; Jason Gunthorpe <jgg@nvidia.com>
> Cc: Jiang, Dave <dave.jiang@intel.com>; Bjorn Helgaas <helgaas@kernel.org>;
> vkoul@kernel.org; Dey, Megha <megha.dey@intel.com>; maz@kernel.org;
> bhelgaas@google.com; alex.williamson@redhat.com; Pan, Jacob jun
> <jacob.jun.pan@intel.com>; Raj, Ashok <ashok.raj@intel.com>; Liu, Yi L
> <yi.l.liu@intel.com>; Lu, Baolu <baolu.lu@intel.com>; Kumar, Sanjay K
> <sanjay.k.kumar@intel.com>; Luck, Tony <tony.luck@intel.com>;
> jing.lin@intel.com; Williams, Dan J <dan.j.williams@intel.com>;
> kwankhede@nvidia.com; eric.auger@redhat.com; parav@mellanox.com;
> rafael@kernel.org; netanelg@mellanox.com; shahafs@mellanox.com;
> yan.y.zhao@linux.intel.com; pbonzini@redhat.com; Ortiz, Samuel
> <samuel.ortiz@intel.com>; Hossain, Mona <mona.hossain@intel.com>;
> dmaengine@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-pci@vger.kernel.org; kvm@vger.kernel.org
> Subject: RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
> 
> On Fri, Nov 06 2020 at 09:48, Kevin Tian wrote:
> >> From: Jason Gunthorpe <jgg@nvidia.com>
> >> On Wed, Nov 04, 2020 at 01:34:08PM +0000, Tian, Kevin wrote:
> >> The interrupt controller is responsible to create an addr/data pair
> >> for an interrupt message. It sets the message format and ensures it
> >> routes to the proper CPU interrupt handler. Everything about the
> >> addr/data pair is owned by the platform interrupt controller.
> >>
> >> Devices do not create interrupts. They only trigger the addr/data pair
> >> the platform gives them.
> >
> > I guess that we may just view it from different angles. On x86 platform,
> > a MSI/IMS capable device directly composes interrupt messages, with
> > addr/data pair filled by OS. If there is no IOMMU remapping enabled in
> > the middle, the message just hits the CPU. Your description possibly
> > is from software side, e.g. describing the hierarchical IRQ domain
> > concept?
> 
> No. The device composes nothing. If the interrupt is raised in the
> device then the MSI block sends the message which was composed by the OS
> and stored in the device's message store. For PCI/MSI that's the MSI or
> MSIX table and for IMS that's either on device memory (as IDXD uses) or
> some completely different location which Jason described.

Sorry for being inaccurate here. I actually meant the same thing as
you described, since I did mention that the addr/data pair is filled by
the OS. Unfortunately I mistakenly thought that 'compose' has a similar
meaning to 'send' in English, but clearly it does not; it is just about
the message content. And for sure I also agree with your other
clarifications regarding the architecture-independent nature of MSI.

Thanks
Kevin

> 
> This has absolutely nothing to do with the X86 platform. MSI is an
> architecture independent mechanism: Send whatever the OS put into the
> storage to raise an interrupt in the CPU. The device does not know
> whether that message is going to be intercepted by an interrupt
> remapping unit or not.
> 
> Stop claiming that any of this has anything to do with x86. It has
> absolutely nothing to do with x86 and looking at MSI from an x86
> perspective instead of looking at it from the architecture agnostic
> technical reality of MSI is the reason why we have this discussion at
> all.
> 
> We had a similar discussion vs. the way how IMS interrupts have to be
> dealt with in terms of irq domains. Can you finally stop looking at
> everything as a big x86/intel/platform lump and understand that things
> are very well structured and separated both at the hardware and at the
> software level?
> 
> > Do you mind providing the link? There were lots of discussions between
> > you and Thomas. I failed to locate the exact mail when searching above
> > keywords.
> 
> In this thread: 20200821002424.119492231@linutronix.de and you were on
> Cc
> 
> Thanks,
> 
>         tglx
> 


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 23:23                                       ` Jason Gunthorpe
  2020-11-08 23:36                                         ` Raj, Ashok
@ 2020-11-09  7:37                                         ` Tian, Kevin
  2020-11-09 16:46                                           ` Jason Gunthorpe
  1 sibling, 1 reply; 123+ messages in thread
From: Tian, Kevin @ 2020-11-09  7:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Thomas Gleixner
  Cc: Williams, Dan J, Raj, Ashok, Jiang, Dave, Bjorn Helgaas, vkoul,
	Dey, Megha, maz, bhelgaas, alex.williamson, Pan, Jacob jun, Liu,
	Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine, linux-kernel,
	linux-pci, kvm

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, November 9, 2020 7:24 AM
> 
> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> >
> > That means the guest needs a way to ask the hypervisor for a proper
> > translation, i.e. a hypercall. Now where to do that? Looking at the
> > above remapping case it's pretty obvious:
> >
> >
> >                      |
> >                      |
> >   [CPU]       -- [VI | RT]  -- [Bridge] --    Bus    -- [Device]
> >                      |
> >   Alloc          "Compose"                   Store         Use
> >
> >   Vectordomain   HCALLdomain                Busdomain
> >                  |        ^
> >                  |        |
> >                  v        |
> >             Hypervisor
> >                Alloc + Compose
> 
> Yes, this describes what I have been thinking

Agree

> 
> > Now the question which I can't answer is whether this can work correctly
> > in terms of isolation. If the IMS storage is in guest memory (queue
> > storage) then the guest driver can obviously write random crap into it
> > which the device will happily send. (For MSI and IDXD style IMS it
> > still can trap the store).
> 
> There are four cases of interest here:
> 
>  1) Bare metal, PF and VF devices just deliver whatever addr/data pairs
>     to the APIC. IMS works perfectly with
>     pci_subdevice_msi_create_irq_domain()
> 
>  2) SRIOV VF assigned to the guest.
> 
>     The guest can cause any MemWr TLP to any addr/data pair
>     and the iommu/platform/vmm is supposed to use the
>     Bus/device/function to isolate & secure the interrupt address
>     range.
> 
>     IMS can work in the guest if the guest knows the details of the
>     address range and can make hypercalls to setup routing. So
>     pci_subdevice_msi_create_irq_domain() works if the hypercalls
>     exist and fails if they don't.
> 
>  3) SIOV sub device assigned to the guest.
> 
>     The difference between SIOV and SRIOV is the device must attach a
>     PASID to every TLP triggered by the guest. Logically we'd expect
>     when IMS is used in this situation the interrupt MemWr is tagged
>     with bus/device/function/PASID to uniquely ID the guest and the same
>     security protection scheme from #2 applies.

Unfortunately no. Intel VT-d only treats a MemWr w/o PASID to 0xFEExxxxx
as an interrupt request. A MemWr w/ PASID, even to 0xFEExxxxx, is translated
normally through the DMA remapping page table. I don't know about other
IOMMU vendors, but at least on Intel platforms such a device would not get
the desired effect, since the IOMMU only guarantees interrupt isolation at
BDF level.
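
The routing rule described above can be sketched as a small standalone C
model. This is only an illustration of the behaviour stated in this mail, not
kernel or VT-d driver code; the names memwr_tlp, memwr_route and
classify_memwr() are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of how an upstream MemWr is routed, per this mail. */
enum memwr_route {
	ROUTE_INTERRUPT_REMAP,	/* handled as an interrupt request */
	ROUTE_DMA_REMAP,	/* translated through the DMA page tables */
};

struct memwr_tlp {
	uint64_t addr;		/* target address of the write */
	bool has_pasid;		/* TLP carries a PASID prefix */
	uint32_t pasid;		/* only meaningful when has_pasid is set */
};

/* True when addr falls into the 0xFEExxxxx window. */
static bool in_interrupt_range(uint64_t addr)
{
	return (addr & ~0xFFFFFULL) == 0xFEE00000ULL;
}

enum memwr_route classify_memwr(const struct memwr_tlp *tlp)
{
	/* Only writes without a PASID are treated as interrupt requests. */
	if (!tlp->has_pasid && in_interrupt_range(tlp->addr))
		return ROUTE_INTERRUPT_REMAP;

	/* Everything else, including PASID tagged writes to 0xFEExxxxx. */
	return ROUTE_DMA_REMAP;
}
```

In this model a PASID tagged write to 0xFEExxxxx never reaches interrupt
remapping, which is why the tagged-interrupt scheme from case #3 has no effect
on such a platform.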

Does your device already implement such a capability? We can bring this
request back to the hardware team.

> 
>  4) SIOV sub device assigned to the guest, but with emulation.
> 
>     This SIOV device cannot tag interrupts with PASID so cannot do #2
>     (or the platform cannot receive a PASID tagged interrupt message).
> 
>     Since the interrupts are being delivered with TLPs pointing at the
>     hypervisor the only solution is for the hypervisor to exclusively
>     control the interrupt table. MSI table like emulation for IMS is
>     needed and the hypervisor will use
>     pci_subdevice_msi_create_irq_domain() to get the real interrupts.
> 
>     pci_subdevice_msi_create_irq_domain() needs to return the 'fake'
>     addr/data pairs which are actually an ABI between the guest and
>     hypervisor carried in the hidden hypercall of the emulation.
>     (ie it works like MSI works today)
> 
> IDXD is worrying about case #4, I think, but I didn't follow in that
> whole discussion about the IMS table layout if they PASID tag the IMS
> MemWr or not?? Ashok can you clarify?
> 
> > Is the IOMMU/Interrupt remapping unit able to catch such messages which
> > go outside the space to which the guest is allowed to signal to? If yes,
> > problem solved. If no, then IMS storage in guest memory can't ever work.
> 
> Right. Only PASID on the interrupt messages can resolve this securely.
> 
> > So in case that the HCALL domain is missing, the Vector domain needs
> > return an error code on domain creation. If the HCALL domain is there
> > then the domain creation works and in case of actual interrupt
> > allocation the hypercall either returns a valid composed message or an
> > appropriate error code.
> 
> Yes
> 
> > But there's a catch:
> >
> > This only works when the guest OS actually knows that it runs in a
> > VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
> > solved because from the guest OS view that's the same as running on bare
> > metal. Obviously on bare metal the Vector domain can and must handle
> > this.
> 
> Yes
> 
> The flip side is today, the way pci_subdevice_msi_create_irq_domain()
> works a VF using it on baremetal will succeed and if that same VF is
> assigned to a guest then pci_subdevice_msi_create_irq_domain()
> succeeds but the interrupt never comes - so the driver is broken.

Yes, this is the main worry here. While all agree that using a hypercall is
the proper way to virtualize IMS, how to disable IMS when the hypercall is
not available is the more urgent need at the current stage.
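
The rule quoted from Thomas above (fail domain creation when running in a VM
without an HCALL domain) can be sketched as a standalone C model. The names
platform_info and ims_domain_create_check() are hypothetical, and the choice
of -EOPNOTSUPP as the error code is an assumption; nothing here is an
existing kernel interface.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical description of what the guest OS knows about its platform. */
struct platform_info {
	bool running_in_guest;		/* e.g. detected via CPUID */
	bool hcall_domain_present;	/* hypervisor offers the IMS hypercall */
};

/*
 * Returns 0 when an IMS irq domain may be created, or a negative errno
 * when it must be refused.
 */
int ims_domain_create_check(const struct platform_info *p)
{
	/* Bare metal: the vector domain handles everything itself. */
	if (!p->running_in_guest)
		return 0;

	/* Guest with hypercall support: messages come from the hypervisor. */
	if (p->hcall_domain_present)
		return 0;

	/* Guest without hypercall support: interrupts would never arrive. */
	return -EOPNOTSUPP;
}
```

The catch raised in the thread is visible here as well: a guest that cannot
detect that it is virtualized leaves running_in_guest false, takes the bare
metal path, and domain creation succeeds even though the interrupt never
comes.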

> 
> Yes, if we add some ACPI/etc flag that is not going to magically fix
> old kvm's, but at least *something* exists that works right and
> generically.

Agree. We can work together on this definition. 

Btw, in reality such an ACPI extension doesn't exist yet, and it will likely
take some time. In the meantime we already have pending usages
like IDXD. Do you suggest holding these patches until we get ASWG
to accept the extension, or accepting the Intel IMS cap as a vendor
specific mitigation to move forward while the platform flag is being
worked on? Anyway, the IMS cap is already defined and can help fix
some broken cases.

> 
> If we follow Intel's path then we need special KVM level support for
> *every* device, PCI cap mangling and so on. Forever. Sounds horrible
> to me..
> 
> This feels like one of these things where no matter what we do
> something is broken. Picking the least breakage is the challenge here.
> 
> > So this needs some thought.
> 
> I think your HCALL diagram is the only sane architecture.
> 
> If we go that way then case #4 will still work, in this case the HCALL
> will return addr/data pairs that conform to what the emulation
> expects. Or fail if the VMM can't do emulation for the device.
> 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 23:58                                       ` Raj, Ashok
@ 2020-11-09  7:59                                         ` Tian, Kevin
  2020-11-09 11:21                                         ` Thomas Gleixner
  1 sibling, 0 replies; 123+ messages in thread
From: Tian, Kevin @ 2020-11-09  7:59 UTC (permalink / raw)
  To: Raj, Ashok, Thomas Gleixner
  Cc: Jason Gunthorpe, Williams, Dan J, Jiang, Dave, Bjorn Helgaas,
	vkoul, Dey, Megha, maz, bhelgaas, alex.williamson, Pan,
	Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Raj, Ashok

> From: Raj, Ashok <ashok.raj@intel.com>
> Sent: Monday, November 9, 2020 7:59 AM
> 
> Hi Thomas,
> 
> [-] Jing, She isn't working at Intel anymore.
> 
> Now this is getting compiled as a book :-).. Thanks a ton!
> 
> One question on the hypercall case that isn't immediately
> clear to me.
> 
> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> >
> >
> > Now if we look at the virtualization scenario and device hand through
> > then the structure in the guest view is not any different from the basic
> > case. This works with PCI-MSI[X] and the IDXD IMS variant because the
> > hypervisor can trap the access to the storage and translate the message:
> >
> >                    |
> >                    |
> >   [CPU]    -- [Bri | dge] -- Bus -- [Device]
> >                    |
> >   Alloc +
> >   Compose                   Store     Use
> >                              |
> >                              | Trap
> >                              v
> >                              Hypervisor translates and stores
> >
> 
> The above case, VMM is responsible for writing to the message
> store. In both cases if its IMS or Legacy MSI/MSIx. VMM handles
> the writes to the device interrupt region and to the IRTE tables.
> 
> > But obviously with an IMS storage location which is software controlled
> > by the guest side driver (the case Jason is interested in) the above
> > cannot work for obvious reasons.
> >
> > That means the guest needs a way to ask the hypervisor for a proper
> > translation, i.e. a hypercall. Now where to do that? Looking at the
> > above remapping case it's pretty obvious:
> >
> >
> >                      |
> >                      |
> >   [CPU]       -- [VI | RT]  -- [Bridge] --    Bus    -- [Device]
> >                      |
> >   Alloc          "Compose"                   Store         Use
> >
> >   Vectordomain   HCALLdomain                Busdomain
> >                  |        ^
> >                  |        |
> >                  v        |
> >             Hypervisor
> >                Alloc + Compose
> >
> > Why? Because it reflects the boundaries and leaves the busdomain part
> > agnostic as it should be. And it works for _all_ variants of Busdomains.
> >
> > Now the question which I can't answer is whether this can work correctly
> > in terms of isolation. If the IMS storage is in guest memory (queue
> > storage) then the guest driver can obviously write random crap into it
> > which the device will happily send. (For MSI and IDXD style IMS it
> > still can trap the store).
> 
> The isolation problem is not just the guest memory being used as interrupt
> store right? If the Store to device region is not trapped and controlled by
> VMM, there is no guarantee the guest OS has done the right thing?
> 
> 
> Thinking about it, guest memory might be more problematic since it's not
> trappable and VMM can't enforce what is written. This is something that
> needs more attention. But for now the devices supporting memory on device
> the trap and store by VMM seems to satisfy the security properties you
> highlight here.
> 

Just want to clarify the trap part.

Guest memory is not trappable in Jason's example, which has queue/IMS
storage swapped between device and memory and requires a special command
to sync the state.

But there are also other forms of in-memory IMS implementation. E.g. some
devices serve work requests based on command buffers instead of HW work
queues. The command buffers are linked into per-process contexts (both in
memory), so IMS could similarly be stored in each context too. There is no
swap per se. The context is allocated by the driver and then registered with
the device through a mgmt. interface. When the mgmt. interface is mediated,
the hypervisor knows the IMS location and could mark it as read-only in the
EPT page table to enable trapping of guest writes. Of course this approach
is awkward if the complexity is paid just for virtualizing IMS.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 23:58                                       ` Raj, Ashok
  2020-11-09  7:59                                         ` Tian, Kevin
@ 2020-11-09 11:21                                         ` Thomas Gleixner
  2020-11-09 17:30                                           ` Jason Gunthorpe
  1 sibling, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-09 11:21 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Jason Gunthorpe, Dan Williams, Tian, Kevin, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

On Sun, Nov 08 2020 at 15:58, Ashok Raj wrote:
> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
>> 
>> 
>> Now if we look at the virtualization scenario and device hand through
>> then the structure in the guest view is not any different from the basic
>> case. This works with PCI-MSI[X] and the IDXD IMS variant because the
>> hypervisor can trap the access to the storage and translate the message:
>> 
>>                    |
>>                    |
>>   [CPU]    -- [Bri | dge] -- Bus -- [Device]
>>                    |
>>   Alloc +
>>   Compose                   Store     Use
>>                              |
>>                              | Trap
>>                              v
>>                              Hypervisor translates and stores
>> 
>
> The above case, VMM is responsible for writing to the message
> store. In both cases if its IMS or Legacy MSI/MSIx. VMM handles
> the writes to the device interrupt region and to the IRTE tables.

Yes, but that's just how it's done today and there is no real need to do
so.

>> Now the question which I can't answer is whether this can work correctly
>> in terms of isolation. If the IMS storage is in guest memory (queue
>> storage) then the guest driver can obviously write random crap into it
>> which the device will happily send. (For MSI and IDXD style IMS it
>> still can trap the store).
>
> The isolation problem is not just the guest memory being used as interrupt
> store right? If the Store to device region is not trapped and controlled by
> VMM, there is no guarantee the guest OS has done the right thing?
>
> Thinking about it, guest memory might be more problematic since it's not
> trappable and VMM can't enforce what is written. This is something that
> needs more attention. But for now the devices supporting memory on device
> the trap and store by VMM seems to satisfy the security properties you
> highlight here.

That's not the problem at all. The VMM is not responsible for the
correctness of the guest OS at all. All the VMM cares about is that the
guest cannot access anything which does not belong to the guest.

If the guest OS screws up the message (by stupidity or malice), then the
MSI sent from the passed through device has to be caught by the
IOMMU/remap unit if and _only_ if it writes to something which it is not
allowed to.

If it overwrites the guests memory then so be it. The VMM cannot prevent
the guest OS doing so by a stray pointer either. So why would it worry
about the MSI going into guest owned lala land?

>> Is the IOMMU/Interrupt remapping unit able to catch such messages which
>> go outside the space to which the guest is allowed to signal to? If yes,
>> problem solved. If no, then IMS storage in guest memory can't ever work.
>
> This can probably work for SRIOV devices where guest owns the entire device.
> interrupt remap does have RID checks if interrupt arrives at an Interrupt handle
> not allocated for that BDF.
>
> But for SIOV devices there is no PASID filtering at the remap level since
> interrupt messages don't carry PASID in the TLP.

PASID is irrelevant here.

If the device sends a message then the remap unit will see the requester
ID of the device and if the message it sends is not matching the remap
tables then it's caught and the guest is terminated. At least that's how
it should be.

>> But there's a catch:
>> 
>> This only works when the guest OS actually knows that it runs in a
>> VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
>
> Precisely! It might work if the OS is new, but for legacy the trap-emulate
> seems both safe and works for legacy as well?

Again, trap emulate does not work for IMS when the IMS store is software
managed guest memory and not part of the device. And that's the whole
reason why we are discussing this.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-09  7:37                                         ` Tian, Kevin
@ 2020-11-09 16:46                                           ` Jason Gunthorpe
  0 siblings, 0 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-09 16:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Thomas Gleixner, Williams, Dan J, Raj, Ashok, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Mon, Nov 09, 2020 at 07:37:03AM +0000, Tian, Kevin wrote:
> >  3) SIOV sub device assigned to the guest.
> > 
> >     The difference between SIOV and SRIOV is the device must attach a
> >     PASID to every TLP triggered by the guest. Logically we'd expect
> >     when IMS is used in this situation the interrupt MemWr is tagged
> >     with bus/device/function/PASID to uniquely ID the guest and the same
> >     security protection scheme from #2 applies.
> 
> Unfortunately no. Intel VT-d only treats MemWr w/o PASID to 0xFEExxxxx
> as interrupt request. MemWr w/ PASID, even to 0xFEE, is translated
> normally through DMA remapping page table. 

I've heard that current IOMMUs are limited as well, but IMHO, as I
describe, if you want full symmetry then you want to route interrupts
via PASID for SIOV. Otherwise the architecture is incomplete.

At least from a Linux and VMM perspective this should be planned
for. It is the only generic way to have a sub device assigned to a
guest and still have access to IMS.

> Does your device already implement such capability? We can bring this 
> request back to the hardware team. 

In some cases we can generate PASID tagged TLPs for interrupt
messages, if there was a reason to do that.

> Yes, this is the main worry here. While all agree that using hypercall is 
> the proper way to virtualize IMS, how to disable it when hypercall is
> not available is a more urgent demand at current stage.

Hopefully Thomas's note about checking for virtualization will help..

> btw in reality such ACPI extension doesn't exist yet, which likely will
> take some time. In the meantime we already have pending usages 
> like IDXD. Do you suggest holding these patches until we get ASWG 
> to accept the extension, or accept using Intel IMS cap as a vendor
> specific mitigation to move forward while the platform flag is being 
> worked on? Anyway the IMS cap is already defined and can help fix 
> some broken cases.

I think you need to sort something generic out, these half baked
architectures just make it some other team's problem.

Thomas's suggestion to check cpuid seems reasonably workable

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-09 11:21                                         ` Thomas Gleixner
@ 2020-11-09 17:30                                           ` Jason Gunthorpe
  2020-11-09 22:40                                             ` Raj, Ashok
  2020-11-09 22:42                                             ` Thomas Gleixner
  0 siblings, 2 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-09 17:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Raj, Ashok, Dan Williams, Tian, Kevin, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Mon, Nov 09, 2020 at 12:21:22PM +0100, Thomas Gleixner wrote:

> >> Is the IOMMU/Interrupt remapping unit able to catch such messages which
> >> go outside the space to which the guest is allowed to signal to? If yes,
> >> problem solved. If no, then IMS storage in guest memory can't ever work.
> >
> > This can probably work for SRIOV devices where guest owns the entire device.
> > interrupt remap does have RID checks if interrupt arrives at an Interrupt handle
> > not allocated for that BDF.
> >
> > But for SIOV devices there is no PASID filtering at the remap level since
> > interrupt messages don't carry PASID in the TLP.
> 
> PASID is irrelevant here.
> 
> If the device sends a message then the remap unit will see the requester
> ID of the device and if the message it sends is not matching the remap
> tables then it's caught and the guest is terminated. At least that's how
> it should be.

The SIOV case is to take a single RID and split it to multiple
VMs and also to the hypervisor. All these things concurrently use the
same RID, and the IOMMU can't tell them apart.

The hypervisor security domain owns TLPs with no PASID. Each PASID is
assigned to a VM.

For interrupts, today, they are all generated, with no PASID, to the
same RID. There is no way for remapping to protect against a guest
without also checking PASID.
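
The point can be illustrated with a small standalone C model that compares a
RID-only check with a hypothetical RID+PASID check. All names (ims_owner,
irq_msg, check_rid_only(), check_rid_and_pasid()) are made up for this
sketch, and the PASID-aware check models the proposed capability, not what
any current interrupt remapping unit is stated to do in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of who owns an interrupt entry. */
struct ims_owner {
	uint16_t rid;		/* requester ID (bus/device/function) */
	uint32_t pasid;		/* PASID assigned to the owning guest */
};

/* Hypothetical interrupt message as seen by the remapping unit. */
struct irq_msg {
	uint16_t rid;
	bool has_pasid;
	uint32_t pasid;
};

/* What a RID-only remapping unit can verify. */
bool check_rid_only(const struct ims_owner *o, const struct irq_msg *m)
{
	return o->rid == m->rid;
}

/* What a PASID-aware remapping unit could verify in addition. */
bool check_rid_and_pasid(const struct ims_owner *o, const struct irq_msg *m)
{
	return o->rid == m->rid && m->has_pasid && o->pasid == m->pasid;
}
```

With SIOV every guest and the hypervisor share one RID, so check_rid_only()
accepts a message no matter which of them programmed it. Only the second
check can tell the senders apart, and only when the message carries a PASID.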

The relevance of PASID is this:

> Again, trap emulate does not work for IMS when the IMS store is software
> managed guest memory and not part of the device. And that's the whole
> reason why we are discussing this.

With PASID tagged interrupts and an IOMMU interrupt remapping
capability that can trigger on PASID, then the platform can provide
the same level of security as SRIOV - the above is no problem.

The device ensures that all DMAs and all interrupts programmed by the
guest are PASID tagged and the platform provides security by checking
the PASID when delivering the interrupt. Intel IOMMU doesn't work this
way today, but it makes a lot of design sense.

Otherwise the interrupt is effectively delivered to the hypervisor. A
secure device can *never* allow a guest to specify an addr/data pair
for a non-PASID tagged TLP, so the device cannot offer IMS to the
guest.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-09 17:30                                           ` Jason Gunthorpe
@ 2020-11-09 22:40                                             ` Raj, Ashok
  2020-11-09 22:42                                             ` Thomas Gleixner
  1 sibling, 0 replies; 123+ messages in thread
From: Raj, Ashok @ 2020-11-09 22:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Gleixner, Dan Williams, Tian, Kevin, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

On Mon, Nov 09, 2020 at 01:30:34PM -0400, Jason Gunthorpe wrote:
> 
> > Again, trap emulate does not work for IMS when the IMS store is software
> > managed guest memory and not part of the device. And that's the whole
> > reason why we are discussing this.
> 
> With PASID tagged interrupts and an IOMMU interrupt remapping
> capability that can trigger on PASID, then the platform can provide
> the same level of security as SRIOV - the above is no problem.

You mean even if it's stored in memory, as long as the MemWr comes with
PASID, and the hypercall has provisioned the IRTE properly?

That seems like a possibility.

> 
> The device ensures that all DMAs and all interrupts programmed by the
> guest are PASID tagged and the platform provides security by checking
> the PASID when delivering the interrupt. Intel IOMMU doesn't work this
> way today, but it makes a lot of design sense.
> 
> Otherwise the interrupt is effectively delivered to the hypervisor. A
> secure device can *never* allow a guest to specify an addr/data pair
> for a non-PASID tagged TLP, so the device cannot offer IMS to the
> guest.

Right, it seems like that's a limitation today. 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-09 17:30                                           ` Jason Gunthorpe
  2020-11-09 22:40                                             ` Raj, Ashok
@ 2020-11-09 22:42                                             ` Thomas Gleixner
  2020-11-10  5:14                                               ` Raj, Ashok
  1 sibling, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-09 22:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Raj, Ashok, Dan Williams, Tian, Kevin, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> On Mon, Nov 09, 2020 at 12:21:22PM +0100, Thomas Gleixner wrote:
>> >> Is the IOMMU/Interrupt remapping unit able to catch such messages which
>> >> go outside the space to which the guest is allowed to signal to? If yes,
>> >> problem solved. If no, then IMS storage in guest memory can't ever
>> >> work.

> The SIOV case is to take a single RID and split it to multiple
> VMs and also to the hypervisor. All these things concurrently use the
> same RID, and the IOMMU can't tell them apart.
>
> The hypervisor security domain owns TLPs with no PASID. Each PASID is
> assigned to a VM.
>
> For interrupts, today, they are all generated, with no PASID, to the
> same RID. There is no way for remapping to protect against a guest
> without checking also PASID.
>
> The relevance of PASID is this:
>
>> Again, trap emulate does not work for IMS when the IMS store is software
>> managed guest memory and not part of the device. And that's the whole
>> reason why we are discussing this.
>
> With PASID tagged interrupts and an IOMMU interrupt remapping
> capability that can trigger on PASID, then the platform can provide
> the same level of security as SRIOV - the above is no problem.
>
> The device ensures that all DMAs and all interrupts programmed by the
> guest are PASID tagged and the platform provides security by checking
> the PASID when delivering the interrupt.

Correct.

> Intel IOMMU doesn't work this way today, but it makes a lot of design
> sense.

Right.

> Otherwise the interrupt is effectively delivered to the hypervisor. A
> secure device can *never* allow a guest to specify an addr/data pair
> for a non-PASID tagged TLP, so the device cannot offer IMS to the
> guest.

Ok. Let me summarize the current state of supported scenarios:

 1) SRIOV works with any form of IMS storage because it does not require
    PASID and the VF devices have unique requester ids, which allows the
    remap unit to sanity check the message.

 2) SIOV with IMS when the hypervisor can manage the IMS store
    exclusively.

So #2 prevents a device which handles IMS storage in queue memory from
utilizing IMS for SIOV in a guest, because the hypervisor cannot manage the
IMS message store and the guest can write arbitrary crap to it, which
violates the isolation principle.
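
The two supported scenarios can be condensed into a small standalone C
predicate. This is only a sketch of the summary above; virt_mode and
guest_ims_supported() are hypothetical names, and the hypervisor_owns_store
flag stands for "the hypervisor can manage the IMS store exclusively" (for
example by trapping writes to on-device storage).

```c
#include <stdbool.h>

/* Hypothetical description of how the device is handed to the guest. */
enum virt_mode {
	VIRT_SRIOV_VF,	/* VF with its own requester ID */
	VIRT_SIOV_ADI,	/* sub device sharing the parent's requester ID */
};

/*
 * Returns true when IMS can be offered to the guest without violating
 * isolation, following the two scenarios listed above.
 */
bool guest_ims_supported(enum virt_mode mode, bool hypervisor_owns_store)
{
	/* 1) SRIOV: the remap unit can sanity check by requester ID. */
	if (mode == VIRT_SRIOV_VF)
		return true;

	/* 2) SIOV: only when the hypervisor exclusively manages the store. */
	return hypervisor_owns_store;
}
```

A device that keeps IMS entries in guest-writable queue memory corresponds to
guest_ims_supported(VIRT_SIOV_ADI, false), which is the unsupported case
discussed here.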

And here is the relevant part of the SIOV spec:

 "IMS is managed by host driver software and is not accessible directly
  from guest or user-mode drivers.

  Within the device, IMS storage is not accessible from the ADIs. ADIs
  can request interrupt generation only through the device’s ‘Interrupt
  Message Generation Logic’, which allows an ADI to only generate
  interrupt messages that are associated with that specific ADI. These
  restrictions ensure that the host driver has complete control over
  which interrupt messages can be generated by each ADI.

  On Intel 64 architecture platforms, message signaled interrupts are
  issued as DWORD size untranslated memory writes without a PASID TLP
  Prefix, to address range 0xFEExxxxx. Since all memory requests
  generated by ADIs include a PASID TLP Prefix, it is not possible for
  an ADI to generate a DMA write that would be interpreted by the
  platform as an interrupt message."

That's the reductio ad absurdum for this sentence in the first paragraph
of the preceding chapter describing the concept of IMS:

  "IMS enables devices to store the interrupt messages for ADIs in a
   device-specific optimized manner without the scalability restrictions
   of the PCI Express defined MSI-X capability."

"Device-specific optimized manner" is either wishful thinking or
marketing induced verbal diarrhoea.

The current specification puts massive restrictions on IMS storage which
do _not_ allow optimizing it in a device specific manner, as
demonstrated in this discussion.

It also precludes obvious use cases like passing a full device to a
guest and letting the guest manage SIOV subdevices for containers or
nested guests.

TBH, to me this is just another hastily cobbled together, half thought
out misfeature cast in silicon. The proposed software support is
following exactly the same principle.

So before we go anywhere with this, I want to see a proper way forward
to support _all_ sensible use cases and to fulfil the promise of
"device-specific optimized manner" at the conceptual and specification
and also at the code level.

I'm not at all interested in rushing in support for a half-baked Intel
centric solution which other people have to clean up after the fact
(again).

IOW, it's time to go back to the drawing board.

Thanks,

        tglx



* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-09 22:42                                             ` Thomas Gleixner
@ 2020-11-10  5:14                                               ` Raj, Ashok
  2020-11-10 10:27                                                 ` Thomas Gleixner
  2020-11-10 14:19                                                 ` Jason Gunthorpe
  0 siblings, 2 replies; 123+ messages in thread
From: Raj, Ashok @ 2020-11-10  5:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jason Gunthorpe, Dan Williams, Tian, Kevin, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

Hi Thomas,

On Mon, Nov 09, 2020 at 11:42:29PM +0100, Thomas Gleixner wrote:
> On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> >
> > The relavance of PASID is this:
> >
> >> Again, trap emulate does not work for IMS when the IMS store is software
> >> managed guest memory and not part of the device. And that's the whole
> >> reason why we are discussing this.
> >
> > With PASID tagged interrupts and a IOMMU interrupt remapping
> > capability that can trigger on PASID, then the platform can provide
> > the same level of security as SRIOV - the above is no problem.
> >
> > The device ensures that all DMAs and all interrupts program by the
> > guest are PASID tagged and the platform provides security by checking
> > the PASID when delivering the interrupt.
> 
> Correct.
> 
> > Intel IOMMU doesn't work this way today, but it makes alot of design
> > sense.

The approach to IMS is a phased one.

#1 Allow a physical device to scale beyond the limits of PCIe MSIx.
   This follows the current methodology for guest interrupt programming
   and makes evolutionary rather than drastic changes.
#2 Long term, we should work together on enabling IMS in the guest, which
   requires changes in both the HW and SW eco-system.

For #1, the immediate need is to find a way to keep the guest from using
IMS due to current limitations. We have a couple of options.

a) CPUID based method to disallow IMS when running in a guest OS, limiting
   use to the existing virtual MSIx for guest devices. (Both you and Jason
   alluded to this.)
b) We can extend the DMAR table to have a flag for opt-out. On a real
   platform this flag is clear, and in a guest the VMM will ensure the
   vDMAR has this flag set. This is along the lines Jason alluded to:
   platform level and via ACPI methods. We have a similar use for
   x2apic_optout today.

I think a) is probably more generic.

For #2, the long term goal is to allow IMS in the guest for devices that
require it. This requires some extensive eco-system enabling.

- Extending HW to understand PASID-tagged interrupt messages.
- Appropriate extensions to IOMMU to enforce such PASID based isolation.

SW improvements:

- Hypercall to retrieve addr/data from host
- Ensure SW can guarantee that the interrupt address range will not
  be mapped in process space when SVM is in play. Otherwise it's hard to
  distinguish between DMA and interrupts. The OS needs to opt in to this
  behavior. Today we ensure that the 0xFEExxxxx range is carved out
  of the IOVA space.


Devices such as idxd that do not have these entries on page-boundaries for
isolation to permit direct programming from GuestOS will continue to use
trap-emulate as used today.

In the end, virtualizing IMS requires eco-system collaboration, and we are
very open to changing HW when all the relevant pieces are in place.

Until then, IMS will be restricted to host VMM only, and we can use the
methods above to prevent IMS in guest and continue to use the legacy
virtual MSIx.

> 
> Right.
> 
> > Otherwise the interrupt is effectively delivered to the hypervisor. A
> > secure device can *never* allow a guest to specify an addr/data pair
> > for a non-PASID tagged TLP, so the device cannot offer IMS to the
> > guest.
> 
> Ok. Let me summarize the current state of supported scenarios:
> 
>  1) SRIOV works with any form of IMS storage because it does not require
>     PASID and the VF devices have unique requester ids, which allows the
>     remap unit to sanity check the message.
> 
>  2) SIOV with IMS when the hypervisor can manage the IMS store
>     exclusively.

Today this is true for all interrupt types, MSI/MSIx/IMS.

> 
> So #2 prevents a device which handles IMS storage in queue memory to
> utilize IMS for SIOV in a guest because the hypervisor cannot manage the
> IMS message store and the guest can write arbitrary crap to it which
> violates the isolation principle.
> 
> And here is the relevant part of the SIOV spec:
> 
>  "IMS is managed by host driver software and is not accessible directly
>   from guest or user-mode drivers.
> 
>   Within the device, IMS storage is not accessible from the ADIs. ADIs
>   can request interrupt generation only through the device’s ‘Interrupt
>   Message Generation Logic’, which allows an ADI to only generate
>   interrupt messages that are associated with that specific ADI. These
>   restrictions ensure that the host driver has complete control over
>   which interrupt messages can be generated by each ADI.
> 
>   On Intel 64 architecture platforms, message signaled interrupts are
>   issued as DWORD size untranslated memory writes without a PASID TLP
>   Prefix, to address range 0xFEExxxxx. Since all memory requests
>   generated by ADIs include a PASID TLP Prefix, it is not possible for
>   an ADI to generate a DMA write that would be interpreted by the
>   platform as an interrupt message."
> 
> That's the reductio ad absurdum for this sentence in the first paragraph
> of the preceding chapter describing the concept of IMS:
> 
>   "IMS enables devices to store the interrupt messages for ADIs in a
>    device-specific optimized manner without the scalability restrictions
>    of the PCI Express defined MSI-X capability."
> 
> "Device-specific optimized manner" is either wishful thinking or
> marketing induced verbal diarrhoea.

No comment on the adjectives above :-)

> 
> The current specification puts massive restrictions on IMS storage which
> are _not_ allowing to optimize it in a device specific manner as
> demonstrated in this discussion.

IMS doesn't restrict this optimization, but to allow it requires more OS support as
you had mentioned.

> 
> It also precludes obvious use cases like passing a full device to a
> guest and let the guest manage SIOV subdevices for containers or nested
> guests.
> 
> TBH, to me this is just another hastily cobbled together half thought
> out misfeature cast in silicon. The proposed software support is
> following the exactly same principle.

Current IMS support adds incremental feature capability. It works pretty
much like everything that was created for MSIx, and just adds some device
flexibility.

Here are some reasons why PASID isn't used today for tagging interrupts.

Interrupt messages (as specified by MSI/MSI-X in PCI specification) are 
currently defined as DWORD DMA writes to a platform/architecture specific 
address (0xFEExxxxx on Intel platforms). Existing root-complexes detect
DWORD writes to 0xFEExxxxx (without a PASID in the transaction) as interrupt 
messages and route them to interrupt-remapping logic (as opposed to other 
DMA requests that are routed to IOMMU's DMA remapping logic). 
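
As a rough illustration of the detection rule described above (a sketch
only: struct mem_write and is_interrupt_message() are names made up for
this example and do not come from the kernel or from any specification):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a memory write TLP (illustration only). */
struct mem_write {
    uint64_t addr;      /* target address of the write */
    uint32_t len_bytes; /* payload length in bytes */
    bool     has_pasid; /* true if a PASID TLP prefix is present */
};

/*
 * Returns true if the write would be treated as an interrupt message:
 * a DWORD write, without PASID, to 0xFEExxxxx. Anything else is treated
 * as ordinary DMA and would go to DMA remapping instead.
 */
static bool is_interrupt_message(const struct mem_write *w)
{
    assert(w != NULL);
    return !w->has_pasid &&
           w->len_bytes == 4 &&
           (w->addr >> 20) == 0xFEE;
}
```

Under this rule a PASID-tagged write to the same address is plain DMA,
which is why an interrupt programmed by a guest cannot be told apart from
one programmed by the host.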

There are multiple tools (such as logic analyzers) and OEM test validation
harnesses that depend on such DWORD sized DMA writes with no PASID being
interrupt messages. One piece of feedback we received during the
development of the specification was to avoid impacting such tools
irrespective of whether MSI-X or IMS is used for interrupt message storage
(on the wire they follow the same format), and also to ensure
interoperability of devices supporting IMS across CPU vendors (who may not
support the PASID TLP prefix). This is one reason that led to interrupts
from IMS not using PASID (and matching the wire format of MSI/MSI-X
generated interrupts). The other problem was disambiguation between DMA to
SVM and interrupts.

> 
> So before we go anywhere with this, I want to see a proper way forward
> to support _all_ sensible use cases and to fulfil the promise of
> "device-specific optimized manner" at the conceptual and specification
> and also at the code level.
> 
> I'm not at all interested to rush in support for a half baken Intel
> centric solution which other people have to clean up after the fact
> (again).

Intel published the specification almost 2 years back and has
taken into account all the feedback received from the ecosystem
(both open-source and others), along with offering the specification
to be implemented by any vendor (both device and CPU vendors).
There are a few device vendors who are already implementing to the spec,
and it is being explored for support by other CPU vendors.

Cheers,
Ashok


* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-10  5:14                                               ` Raj, Ashok
@ 2020-11-10 10:27                                                 ` Thomas Gleixner
  2020-11-10 14:13                                                   ` Raj, Ashok
  2020-11-10 14:19                                                 ` Jason Gunthorpe
  1 sibling, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-10 10:27 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Jason Gunthorpe, Dan Williams, Tian, Kevin, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

Ashok,

On Mon, Nov 09 2020 at 21:14, Ashok Raj wrote:
> On Mon, Nov 09, 2020 at 11:42:29PM +0100, Thomas Gleixner wrote:
>> On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> Approach to IMS is more of a phased approach. 
>
> #1 Allow physical device to scale beyond limits of PCIe MSIx
>    Follows current methodology for guest interrupt programming and
>    evolutionary changes rather than drastic.

Trapping MSI[X] writes is there because it allows handing a device to an
unmodified guest OS and handling the case where the MSI[X] entries
storage cannot be mapped exclusively to the guest.

But aside from this, it's not required if the storage can be mapped
exclusively, the guest is hypervisor aware and can get a host composed
message via a hypercall. That works for physical functions and SRIOV,
but not for SIOV.

> #2 Long term we should work together on enabling IMS in guest which
>    requires changes in both HW and SW eco-system.
>
> For #1, the immediate need is to find a way to limit guest from using IMS
> due to current limitations. We have couple options.
>
> a) CPUID based method to disallow IMS when running in a guest OS. Limiting
>    use to existing virtual MSIx to guest devices. (Both you/Jason alluded)
> b) We can extend DMAR table to have a flag for opt-out. So in real platform
>    this flag is clear and in guest VMM will ensure vDMAR will have this flag
>    set. Along the lines as Jason alluded, platform level and via ACPI
>    methods. We have similar use for x2apic_optout today.
>
> Think a) is probably more generic.

But incomplete, as I explained before. If the VMM does not set the
hypervisor bit in CPUID then the guest OS assumes it is running on bare
metal. It needs more than just relying on CPUID.
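
To make the gap concrete, here is a minimal sketch. The hypervisor-present
bit is CPUID leaf 1, ECX bit 31; everything else (the function name, the
platform_ims_optout parameter standing in for a vDMAR style opt-out flag,
and combining both checks in one place) is an assumption for illustration,
not an existing interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID.1:ECX bit 31 is the "hypervisor present" bit. */
#define CPUID_1_ECX_HYPERVISOR (1u << 31)

static_assert(CPUID_1_ECX_HYPERVISOR == 0x80000000u, "bit 31 expected");

/*
 * Hypothetical policy check: IMS is only considered usable when the
 * platform did not opt out AND CPUID does not report a hypervisor.
 * platform_ims_optout is an assumed flag, modelled on the x2apic_optout
 * idea mentioned in the thread; no such flag exists today.
 */
static bool ims_allowed_sketch(uint32_t cpuid_1_ecx, bool platform_ims_optout)
{
    if (platform_ims_optout)
        return false;
    return !(cpuid_1_ecx & CPUID_1_ECX_HYPERVISOR);
}
```

With only the CPUID test, a guest whose VMM hides the hypervisor bit
(ECX bit 31 clear) passes the check, which is the incompleteness pointed
out above; a separate platform level signal closes that hole in this
sketch.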

Aside from that, neither Jason nor I said that IMS cannot be supported
in a guest. PF and VF IMS can and have to be supported. SIOV is a
different story due to the PASID requirement which obviously needs to be
managed host side and needs HW changes.

> From SW improvements:
>
> - Hypercall to retrieve addr/data from host

You need to have that even for the non SIOV case in order to hand in a
full device which has the IMS storage in queue memory.

> Devices such as idxd that do not have these entries on page-boundaries for
> isolation to permit direct programming from GuestOS will continue to use
> trap-emulate as used today.

That's a restriction of that particular hardware.

> Until then, IMS will be restricted to host VMM only, and we can use the
> methods above to prevent IMS in guest and continue to use the legacy
> virtual MSIx.

SIOV IMS.

But as things stand now, not even PF/VF pass through is possible. This
might not be an issue for IDXD, but it's an issue in general and this
wants to be thought of _now_ before we put a lot of infrastructure
into place which then needs to be ripped apart again.

>> The current specification puts massive restrictions on IMS storage which
>> are _not_ allowing to optimize it in a device specific manner as
>> demonstrated in this discussion.
>
> IMS doesn't restrict this optimization, but to allow it requires more
> OS support as you had mentioned.

Right, IMS per se does not put a restriction on it.

The specification and the HW limitations on the remapping unit put that
restriction into place.

OS support is an obvious requirement, but OS support cannot make
the restrictions of HW go away magically.

But again, we need to think about the path forward _now_.

Just slapping some 'works for IDXD' solution into place can severely
restrict the options for going beyond these limitations simply because
we have to support that 'works for IDXD thing' forever.

Thanks,

        tglx


* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-10 10:27                                                 ` Thomas Gleixner
@ 2020-11-10 14:13                                                   ` Raj, Ashok
  2020-11-10 14:23                                                     ` Jason Gunthorpe
  2020-11-11  7:14                                                     ` Tian, Kevin
  0 siblings, 2 replies; 123+ messages in thread
From: Raj, Ashok @ 2020-11-10 14:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jason Gunthorpe, Dan Williams, Tian, Kevin, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

Thomas,

With all these interrupt message storms ;-), I'm missing how to move towards
an end goal.

On Tue, Nov 10, 2020 at 11:27:29AM +0100, Thomas Gleixner wrote:
> Ashok,
> 
> On Mon, Nov 09 2020 at 21:14, Ashok Raj wrote:
> > On Mon, Nov 09, 2020 at 11:42:29PM +0100, Thomas Gleixner wrote:
> >> On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> > Approach to IMS is more of a phased approach. 
> >
> > #1 Allow physical device to scale beyond limits of PCIe MSIx
> >    Follows current methodology for guest interrupt programming and
> >    evolutionary changes rather than drastic.
> 
> Trapping MSI[X] writes is there because it allows to hand a device to an
> unmodified guest OS and to handle the case where the MSI[X] entries
> storage cannot be mapped exclusively to the guest.
> 
> But aside of this, it's not required if the storage can be mapped
> exclusively, the guest is hypervisor aware and can get a host composed
> message via a hypercall. That works for physical functions and SRIOV,
> but not for SIOV.

It would greatly help if you could put down what you see as blocking
progress in the following areas.

Address Gaps in Spec:

Specs can accommodate change after review, as shown by the number of ECNs
that go on with PCIe ;-). Please add what you would like to see in the
spec if you believe there is a gap today.

Hardware Gaps?
- PASID tagged Interrupts.
- IOMMU Support for PASID based IR.

As I had called out, there are a lot of moving parts, and this requires
more attention.

OS Gaps?
- Lack of ability to identify if platform can use IMS.
- Lack of hypercall.

We will always have devices that have more interrupts, but whose use needs
neither direct manipulation of IMS by the guest nor more interrupts than
PCIe allows in a guest. These devices can scale by adding another
sub-device, and you get another block of 2048 if needed.

This isn't just for idxd, as I mentioned earlier, there are vendors other
than Intel already working on this. In all cases the need for guest direct
manipulation of interrupt store hasn't come up. From the discussion, it
seems like there are devices today or in future that will require direct
manipulation of interrupt store in the guest. This needs additional work
in both the device hardware providing the right plumbing and OS work to
comprehend those.

Cheers,
Ashok


* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-10  5:14                                               ` Raj, Ashok
  2020-11-10 10:27                                                 ` Thomas Gleixner
@ 2020-11-10 14:19                                                 ` Jason Gunthorpe
  2020-11-11  2:35                                                   ` Tian, Kevin
  1 sibling, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-10 14:19 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Thomas Gleixner, Dan Williams, Tian, Kevin, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Mon, Nov 09, 2020 at 09:14:12PM -0800, Raj, Ashok wrote:

> There are multiple tools (such as logic analyzers) and OEM test validation 
> harnesses that depend on such DWORD sized DMA writes with no PASID as interrupt
> messages. One of the feedback we had received in the development of the
> specification was to avoid impacting such tools irrespective of
> MSI-X or IMS

This is a really bad reason to make a poor decision for system
security. Relying on trapping/emulation increases the attack surface
and complexity of the VMM and the device which now have to create this
artificial split, which does not exist in SRIOV.

Hopefully we won't see devices get this wrong, but any path that
allows the guest to cause the device to create TLPs outside its IOMMU
containment is worrisome from a security standpoint.

> was used for interrupt message storage (on the wire they follow the
> same format), and also to ensure interoperability of devices
> supporting IMS across CPU vendors (who may not support PASID TLP
> prefix).  This is one reason that led to interrupts from IMS to not
> use PASID (and match the wire format of MSI/MSI-X generated
> interrupts).  The other problem was disambiguation between DMA to
> SVM v/s interrupts.

This is a defect in the IOMMU, not something fundamental.

The IOMMU needs to know if the interrupt range is active or not for
each PASID. Process based SVA will, of course, not enable interrupts
on the PASID, VM Guest based PASID will.
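
A sketch of what such a per-PASID decision could look like. This is
purely hypothetical: no IOMMU works this way today (as noted earlier in
the thread), and struct pasid_entry, irq_range_active and
route_pasid_write() are invented for this example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-PASID state; not an existing IOMMU structure. */
struct pasid_entry {
    uint32_t pasid;
    bool     irq_range_active; /* set for VM guest PASIDs, clear for process SVA */
};

enum route { ROUTE_DMA_REMAP, ROUTE_IRQ_REMAP };

/*
 * Decide where a PASID-tagged write goes. A write to 0xFEExxxxx is only
 * an interrupt if the interrupt range is active for that PASID; for a
 * process based SVA PASID the same address is ordinary memory.
 */
static enum route route_pasid_write(const struct pasid_entry *e, uint64_t addr)
{
    assert(e != NULL);
    bool in_irq_range = (addr >> 20) == 0xFEE;

    return (in_irq_range && e->irq_range_active) ? ROUTE_IRQ_REMAP
                                                 : ROUTE_DMA_REMAP;
}
```

The point of the flag is that the DMA versus interrupt ambiguity is
resolved per PASID rather than by the absence of a PASID prefix.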

> Intel had published the specification almost 2 years back and have
> comprehended all the feedback received from the ecosystem 
> (both open-source and others), along with offering the specification 
> to be implemented by any vendors (both device and CPU vendors). 
> There are few device vendors who are implementing to the spec already and 
> are being explored for support by other CPU vendors

Which is why it is such a shame that including PASID in the MSI was
deliberately skipped in the document; the ecosystem could have been
much more aligned with this solution by now :(

Jason


* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 23:25                                         ` Raj, Ashok
@ 2020-11-10 14:19                                           ` Raj, Ashok
  2020-11-10 14:41                                             ` David Woodhouse
  0 siblings, 1 reply; 123+ messages in thread
From: Raj, Ashok @ 2020-11-10 14:19 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Jason Gunthorpe, Dan Williams, Tian, Kevin, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, tglx,
	alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar,
	Sanjay K, Luck, Tony, jing.lin, kwankhede, eric.auger, parav,
	rafael, netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel,
	Hossain, Mona, dmaengine, linux-kernel, linux-pci, kvm,
	Ashok Raj

Hi David

I didn't follow the support for 32768 CPUs in guest without IR support.

Can you tell me how that is done?

On Sun, Nov 08, 2020 at 03:25:57PM -0800, Ashok Raj wrote:
> On Sun, Nov 08, 2020 at 06:34:55PM +0000, David Woodhouse wrote:
> > > 
> > > When we do interrupt remapping support in guest which would be required 
> > > if we support x2apic in guest, I think this is something we should look into more 
> > > carefully to make this work.
> > 
> > No, interrupt remapping is not required for X2APIC in guests
> > 
> > They can have X2APIC and up to 32768 CPUs without needing interrupt
> 
> How is this made available today without interrupt remapping? 
> 
> I thought without IR, the destination ID is still limited to only 8 bits?
> 
> On native, even if you have less than 255 cpu's but the APICID are sparsly 
> distributed due to platform rules, the x2apic id could be more than 8 bits. 
> Which is why the spec requires IR when x2apic is enabled.
> 
> > remapping at all. Only if they want more than 32768 vCPUs, or to do
> > nested virtualisation and actually remap for the benefit of *their*
> > (L2+) guests would they need IR.


* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-10 14:13                                                   ` Raj, Ashok
@ 2020-11-10 14:23                                                     ` Jason Gunthorpe
  2020-11-11  2:17                                                       ` Tian, Kevin
  2020-11-11  7:14                                                     ` Tian, Kevin
  1 sibling, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-10 14:23 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Thomas Gleixner, Dan Williams, Tian, Kevin, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Tue, Nov 10, 2020 at 06:13:23AM -0800, Raj, Ashok wrote:

> This isn't just for idxd, as I mentioned earlier, there are vendors other
> than Intel already working on this. In all cases the need for guest direct
> manipulation of interrupt store hasn't come up. From the discussion, it
> seems like there are devices today or in future that will require direct
> manipulation of interrupt store in the guest. This needs additional work
> in both the device hardware providing the right plumbing and OS work to
> comprehend those.

We'd want to see SRIOV's assigned to guests to be able to use
IMS. This allows a SRIOV instance in a guest to spawn SIOV's which is
useful.

SIOV's assigned to guests could use IMS, but the use cases we see in
the short term can be handled by using SRIOV instead.

I would expect in general for SIOV to use MSI-X emulation to expose
interrupts - it would be really weird for a SIOV emulator to do
something else and we should probably discourage that.

Jason


* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-10 14:19                                           ` Raj, Ashok
@ 2020-11-10 14:41                                             ` David Woodhouse
  0 siblings, 0 replies; 123+ messages in thread
From: David Woodhouse @ 2020-11-10 14:41 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: David Woodhouse, Jason Gunthorpe, Dan Williams, Tian, Kevin,
	Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	tglx, alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu,
	Kumar, Sanjay K, Luck, Tony, jing.lin, kwankhede, eric.auger,
	parav, rafael, netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz,
	Samuel, Hossain, Mona, dmaengine, linux-kernel, linux-pci, kvm



> Hi David
>
> I did't follow the support for 32768 CPUs in guest without IR support.
>
> Can you tell me how that is done?



Using bits 11-5 of the MSI address bits (the other 7 bits of "Extended
Destination ID" that aren't the Remappable Format indicator).

And physical addressing mode, which is no loss for external interrupts
since they're all unicast dest_Fixed these days anyway.
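
Roughly, the address composition looks like the sketch below. The bit
positions follow the description above (destination ID bits 0-7 in address
bits 19-12, bits 8-14 in address bits 11-5, physical destination mode);
the function name is made up for illustration and is not the kernel's
implementation:

```c
#include <assert.h>
#include <stdint.h>

#define MSI_ADDR_BASE 0xFEE00000u

/*
 * Compose the low 32 bits of an MSI address for a 15-bit destination ID
 * without interrupt remapping (sketch only).
 *   bits 19:12  destination ID bits 0-7
 *   bits 11:5   destination ID bits 8-14 ("Extended Destination ID")
 *   bit  4      left clear (Remappable Format indicator)
 *   bit  3      redirection hint = 0
 *   bit  2      destination mode = 0 (physical)
 */
static uint32_t msi_addr_ext_dest(uint32_t apic_id)
{
    assert(apic_id < (1u << 15)); /* 15 bits, hence 32768 CPUs */

    uint32_t addr = MSI_ADDR_BASE;

    addr |= (apic_id & 0xFFu) << 12;
    addr |= ((apic_id >> 8) & 0x7Fu) << 5;
    return addr;
}
```

For example, APIC ID 0x100 only sets bit 5 (address 0xFEE00020), and the
largest ID 0x7FFF yields 0xFEEFFFE0.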


-- 
dwmw2



* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-10 14:23                                                     ` Jason Gunthorpe
@ 2020-11-11  2:17                                                       ` Tian, Kevin
  2020-11-12 13:46                                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Tian, Kevin @ 2020-11-11  2:17 UTC (permalink / raw)
  To: Jason Gunthorpe, Raj, Ashok
  Cc: Thomas Gleixner, Williams, Dan J, Jiang, Dave, Bjorn Helgaas,
	vkoul, Dey, Megha, maz, bhelgaas, alex.williamson, Pan,
	Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 10, 2020 10:24 PM
> 
> On Tue, Nov 10, 2020 at 06:13:23AM -0800, Raj, Ashok wrote:
> 
> > This isn't just for idxd, as I mentioned earlier, there are vendors other
> > than Intel already working on this. In all cases the need for guest direct
> > manipulation of interrupt store hasn't come up. From the discussion, it
> > seems like there are devices today or in future that will require direct
> > manipulation of interrupt store in the guest. This needs additional work
> > in both the device hardware providing the right plumbing and OS work to
> > comprehend those.
> 
> We'd want to see SRIOV's assigned to guests to be able to use
> IMS. This allows a SRIOV instance in a guest to spawn SIOV's which is
> useful.

Does your VF support both MSI/IMS or IMS only? If it is the former can't
we adopt a phased approach or parallel effort between forcing guest
to use MSI and adding hypercall to enable IMS on VF? Finding a way
to disable IMS is anyway required per earlier discussion when hypercall
is not available, and it could still provide a functional though suboptimal
model for such VFs.

> 
> SIOV's assigned to guests could use IMS, but the use cases we see in
> the short term can be handled by using SRIOV instead.
> 
> I would expect in general for SIOV to use MSI-X emulation to expose
> interrupts - it would be really weird for a SIOV emulator to do
> something else and we should probably discourage that.
> 

I agree with this point. This makes the hardware gaps in the IOMMU and
root complex less of an immediate blocker, to be addressed in the long
term.

Thanks
Kevin


* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-10 14:19                                                 ` Jason Gunthorpe
@ 2020-11-11  2:35                                                   ` Tian, Kevin
  0 siblings, 0 replies; 123+ messages in thread
From: Tian, Kevin @ 2020-11-11  2:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Raj, Ashok
  Cc: Thomas Gleixner, Williams, Dan J, Jiang, Dave, Bjorn Helgaas,
	vkoul, Dey, Megha, maz, bhelgaas, alex.williamson, Pan,
	Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 10, 2020 10:19 PM
> On Mon, Nov 09, 2020 at 09:14:12PM -0800, Raj, Ashok wrote:
> 
> > was used for interrupt message storage (on the wire they follow the
> > same format), and also to ensure interoperability of devices
> > supporting IMS across CPU vendors (who may not support PASID TLP
> > prefix).  This is one reason that led to interrupts from IMS to not
> > use PASID (and match the wire format of MSI/MSI-X generated
> > interrupts).  The other problem was disambiguation between DMA to
> > SVM v/s interrupts.
> 
> This is a defect in the IOMMU, not something fundamental.
> 
> The IOMMU needs to know if the interrupt range is active or not for
> each PASID. Process based SVA will, of course, not enable interrupts
> on the PASID, VM Guest based PASID will.
> 

Unfortunately it's more than that. The interrupt message is first
recognized at the root complex today and then routed to the IOMMU, unlike
other DMA requests. I'm not saying it's an unsolvable limitation, but I
just want to point out that to achieve such a goal there are more things
to be considered beyond the IOMMU.

Thanks
Kevin


* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-10 14:13                                                   ` Raj, Ashok
  2020-11-10 14:23                                                     ` Jason Gunthorpe
@ 2020-11-11  7:14                                                     ` Tian, Kevin
  2020-11-12 19:32                                                       ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 123+ messages in thread
From: Tian, Kevin @ 2020-11-11  7:14 UTC (permalink / raw)
  To: Raj, Ashok, Thomas Gleixner
  Cc: Jason Gunthorpe, Williams, Dan J, Jiang, Dave, Bjorn Helgaas,
	vkoul, Dey, Megha, maz, bhelgaas, alex.williamson, Pan,
	Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Raj, Ashok

> From: Raj, Ashok <ashok.raj@intel.com>
> Sent: Tuesday, November 10, 2020 10:13 PM
> 
> Thomas,
> 
> With all these interrupt message storms ;-), I'm missing how to move
> towards an end goal.
> 
> On Tue, Nov 10, 2020 at 11:27:29AM +0100, Thomas Gleixner wrote:
> > Ashok,
> >
> > On Mon, Nov 09 2020 at 21:14, Ashok Raj wrote:
> > > On Mon, Nov 09, 2020 at 11:42:29PM +0100, Thomas Gleixner wrote:
> > >> On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> > > Approach to IMS is more of a phased approach.
> > >
> > > #1 Allow physical device to scale beyond limits of PCIe MSIx
> > >    Follows current methodology for guest interrupt programming and
> > >    evolutionary changes rather than drastic.
> >
> > Trapping MSI[X] writes is there because it allows to hand a device to an
> > unmodified guest OS and to handle the case where the MSI[X] entries
> > storage cannot be mapped exclusively to the guest.
> >
> > But aside of this, it's not required if the storage can be mapped
> > exclusively, the guest is hypervisor aware and can get a host composed
> > message via a hypercall. That works for physical functions and SRIOV,
> > but not for SIOV.
> 
> It would greatly help if you can put down what you see is blocking
> to move forward in the following areas.
> 

Agree. We really need some guidance on how to move forward. I think all
people in this thread are now aligned that this is not an Intel- or
IDXD-specific thing, e.g. we need an architectural solution, enabling IMS on
PF/VF is important, etc. What we are not sure about is whether we need to
complete all requirements in one batch, or could evolve step by step as long
as the growth path is clearly defined.

IMHO finding a way to disable IMS in the guest is more important than supporting
IMS on PF/VF, since the latter requires a hypercall which is not available
in all scenarios. Even if Linux includes hypercall support for all existing archs
and hypervisors, it could still run as an unmodified guest on a new hypervisor
before that hypervisor gets its enlightenments into Linux. So it is more
pressing to find a way to force MSI/MSI-X inside the guest, as that keeps
such PFs/VFs functional, though without the full scalability benefits of IMS.

If such a two-step plan can be agreed on, then the next open question is how to
disable IMS in the guest. We need a sane solution when checking in the initial
host-only IMS support. Several options have been discussed in this thread:

1. An industry standard (e.g. a vendor-agnostic ACPI flag) followed by all
platforms, hypervisors and OSes. It will require collaboration beyond the
Linux community;

2. IOMMU-vendor-specific standards (DMAR, IORT, etc.) to report whether
IMS is allowed, implying that IMS is tied to the IOMMU. This tradeoff is
acceptable since IMS alone cannot make SIOV work, which relies on the
IOMMU anyway. This might also be an easier path forward, as it does
not even require waiting for all vendors to extend their tables together.
On a physical platform the FW always reports IMS as 'allowed' and there is
time to change it. On a virtual platform the hypervisor can choose to hide
IMS in three ways:
	a) do not expose IOMMU
	b) expose IOMMU, but using the old format
	c) expose IOMMU, using the new format with IMS reported 'disallowed'

a/b can support a legacy software stack well.

However, there is one potential issue with options 1/2. The virtual ACPI
table is constructed at VM creation time, likely based on whether a
PV interrupt controller is exposed to this guest. However, in most cases the
hypervisor doesn't know, when the VM is being created, which guest OS will
run and whether it will use the PV controller. If IMS is marked as
'allowed' in the virtual DMAR table, an unmodified guest might just go and
enable it as if it were on the native platform. Maybe what we really need is
a flag to tell the guest that although IMS is available, it cannot be used with
traditional interrupt controllers?

3. Use IOMMU 'caching mode' as the hint of running as a guest and disable
IMS by default whenever 'caching mode' is detected. IIRC all IOMMU vendors
provide such a capability for constructing shadow IOMMU page tables. Later,
when hypercall support is detected for a specific hypervisor/arch, that path
can override the IOMMU hint to enable IMS.

Unlike the first two options, this would be a Linux-specific policy, but
self-contained. Other guest OSes may not follow it though.

4. Use CPUID to detect running as a guest. But as Thomas pointed out, this
approach is less reliable as not all hypervisors do it this way.
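
To make option 3 concrete, here is a minimal sketch of the policy it
describes: caching mode acts as the "running as guest" hint, and a detected
hypercall path can override it. The struct, its fields and ims_allowed() are
hypothetical names for illustration only, not existing kernel interfaces.

```c
#include <stdbool.h>

/* Hypothetical inputs; none of these names exist in the kernel. */
struct ims_env {
	bool iommu_caching_mode;   /* caching mode set: treated as "running as guest" */
	bool hypercall_available;  /* hypervisor/arch specific IMS hypercall detected */
};

/* Option 3: caching mode disables IMS unless a hypercall path overrides it. */
static bool ims_allowed(const struct ims_env *env)
{
	if (!env->iommu_caching_mode)
		return true;               /* no guest hint: assume native platform */
	return env->hypercall_available;   /* guest: only with hypervisor support */
}
```

Note that a guest whose vIOMMU does not report caching mode would be treated
as native under this sketch, which is the same kind of hole the other
heuristics have.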

Thoughts?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-08 19:36                                       ` David Woodhouse
  2020-11-08 22:47                                         ` Thomas Gleixner
@ 2020-11-11 15:41                                         ` Christoph Hellwig
  2020-11-11 16:09                                           ` Raj, Ashok
  1 sibling, 1 reply; 123+ messages in thread
From: Christoph Hellwig @ 2020-11-11 15:41 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Thomas Gleixner, Jason Gunthorpe, Dan Williams, Raj, Ashok, Tian,
	Kevin, Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz,
	bhelgaas, alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu,
	Kumar, Sanjay K, Luck, Tony, jing.lin, kwankhede, eric.auger,
	parav, rafael, netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz,
	Samuel, Hossain, Mona, dmaengine, linux-kernel, linux-pci, kvm

On Sun, Nov 08, 2020 at 07:36:34PM +0000, David Woodhouse wrote:
> So it does look like we're going to need a hypercall interface to
> compose an MSI message on behalf of the guest, for IMS to use. In fact
> PCI devices assigned to a guest could use that too, and then we'd only
> need to trap-and-remap any attempt to write a Compatibility Format MSI
> to the device's MSI table, while letting Remappable Format messages get
> written directly.
> 
> We'd also need a way for an OS running on bare metal to *know* that
> it's on bare metal and can just compose MSI messages for itself. Since
> we do expect bare metal to have an IOMMU, perhaps that is just a
> feature flag on the IOMMU?

Have the platform firmware advertise if it needs native or virtualized
IMS handling.  If it advertises neither don't support IMS?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-11 15:41                                         ` Christoph Hellwig
@ 2020-11-11 16:09                                           ` Raj, Ashok
  2020-11-11 22:27                                             ` Thomas Gleixner
  0 siblings, 1 reply; 123+ messages in thread
From: Raj, Ashok @ 2020-11-11 16:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Woodhouse, Thomas Gleixner, Jason Gunthorpe, Dan Williams,
	Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz,
	bhelgaas, alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu,
	Kumar, Sanjay K, Luck, Tony, jing.lin, kwankhede, eric.auger,
	parav, rafael, netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz,
	Samuel, Hossain, Mona, dmaengine, linux-kernel, linux-pci, kvm,
	Ashok Raj

On Wed, Nov 11, 2020 at 03:41:59PM +0000, Christoph Hellwig wrote:
> On Sun, Nov 08, 2020 at 07:36:34PM +0000, David Woodhouse wrote:
> > So it does look like we're going to need a hypercall interface to
> > compose an MSI message on behalf of the guest, for IMS to use. In fact
> > PCI devices assigned to a guest could use that too, and then we'd only
> > need to trap-and-remap any attempt to write a Compatibility Format MSI
> > to the device's MSI table, while letting Remappable Format messages get
> > written directly.
> > 
> > We'd also need a way for an OS running on bare metal to *know* that
> > it's on bare metal and can just compose MSI messages for itself. Since
> > we do expect bare metal to have an IOMMU, perhaps that is just a
> > feature flag on the IOMMU?
> 
> Have the platform firmware advertise if it needs native or virtualized
> IMS handling.  If it advertises neither don't support IMS?

The platform hint can be easily accomplished via DMAR table flags. We could
have an IMS_OPTOUT flag (similar to the x2apic opt-out flag); when it is 0,
the platform is native and IMS is supported.

When a vIOMMU is presented to the guest, the virtual DMAR table will have this
flag set to 1, indicating to the guest OS that native IMS isn't supported.
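
A rough sketch of how such a flag could be tested follows. The first two
defines mirror the existing VT-d DMAR flag bits as I understand them;
DMAR_IMS_OPT_OUT, its bit position and the helper name are purely
hypothetical and only illustrate the proposal.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMAR_INTR_REMAP      0x1  /* existing DMAR flag bit */
#define DMAR_X2APIC_OPT_OUT  0x2  /* existing DMAR flag bit */
#define DMAR_IMS_OPT_OUT     0x8  /* HYPOTHETICAL: bit chosen for illustration */

/* Opt-out semantics: a clear bit means "native, IMS is supported". */
static bool dmar_native_ims_supported(uint8_t dmar_flags)
{
	return !(dmar_flags & DMAR_IMS_OPT_OUT);
}
```

A virtual DMAR table built by software that predates the flag leaves the bit
at 0, so the guest would conclude that native IMS is supported.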

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-11 16:09                                           ` Raj, Ashok
@ 2020-11-11 22:27                                             ` Thomas Gleixner
  2020-11-11 23:03                                               ` Raj, Ashok
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-11 22:27 UTC (permalink / raw)
  To: Raj, Ashok, Christoph Hellwig
  Cc: David Woodhouse, Jason Gunthorpe, Dan Williams, Tian, Kevin,
	Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar,
	Sanjay K, Luck, Tony, jing.lin, kwankhede, eric.auger, parav,
	rafael, netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel,
	Hossain, Mona, dmaengine, linux-kernel, linux-pci, kvm,
	Ashok Raj

On Wed, Nov 11 2020 at 08:09, Ashok Raj wrote:
> On Wed, Nov 11, 2020 at 03:41:59PM +0000, Christoph Hellwig wrote:
>> On Sun, Nov 08, 2020 at 07:36:34PM +0000, David Woodhouse wrote:
>> > So it does look like we're going to need a hypercall interface to
>> > compose an MSI message on behalf of the guest, for IMS to use. In fact
>> > PCI devices assigned to a guest could use that too, and then we'd only
>> > need to trap-and-remap any attempt to write a Compatibility Format MSI
>> > to the device's MSI table, while letting Remappable Format messages get
>> > written directly.
>> > 
>> > We'd also need a way for an OS running on bare metal to *know* that
>> > it's on bare metal and can just compose MSI messages for itself. Since
>> > we do expect bare metal to have an IOMMU, perhaps that is just a
>> > feature flag on the IOMMU?
>> 
>> Have the platform firmware advertise if it needs native or virtualized
>> IMS handling.  If it advertises neither don't support IMS?
>
> The platform hint can be easily accomplished via DMAR table flags. We could
> have an IMS_OPTOUT(similar to x2apic optout flag) flag, when 0 it's native
> and IMS is supported.
>
> When vIOMMU is presented to guest, virtual DMAR table will have this flag
> set to 1. Indicates to GuestOS, native IMS isn't supported.

These opt-out bits suck by definition. It comes all back to the fact
that the whole virt thing didn't have a hardware defined way to tell
that the OS runs in a VM and not on bare metal. It wouldn't have been
rocket science to do so.

And because that does not exist, we need magic opt-out bits for every
other piece of functionality which gets added. Can we please stop this
and provide a well defined way to tell the OS whether it runs on bare
metal or not?

The point is that you really want opt-in bits so that decisions come
down to

     if (!virt || virt->supports_X)

which is the obvious sane and safe logic. But sure, why am I asking for
sane and safe in the context of virtualization?
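
Spelled out as a compilable sketch, with a NULL pointer standing for "not
virtualized". struct virt_info, supports_ims and feature_usable() are
hypothetical names; no such interface exists today.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of the hypervisor the OS runs under. */
struct virt_info {
	bool supports_ims;  /* opt-in bit: set only by a VMM that knows the feature */
};

/* virt == NULL means bare metal; otherwise the VMM must have opted in. */
static bool feature_usable(const struct virt_info *virt)
{
	return !virt || virt->supports_ims;
}
```

A VMM which predates the feature never sets the bit, so the default is the
safe one: the guest does not use the feature.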

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-11 22:27                                             ` Thomas Gleixner
@ 2020-11-11 23:03                                               ` Raj, Ashok
  2020-11-12  1:13                                                 ` Thomas Gleixner
  2020-11-12 13:10                                                 ` Jason Gunthorpe
  0 siblings, 2 replies; 123+ messages in thread
From: Raj, Ashok @ 2020-11-11 23:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Hellwig, David Woodhouse, Jason Gunthorpe,
	Dan Williams, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul,
	Dey, Megha, maz, bhelgaas, alex.williamson, Pan, Jacob jun, Liu,
	Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

On Wed, Nov 11, 2020 at 11:27:28PM +0100, Thomas Gleixner wrote:
> On Wed, Nov 11 2020 at 08:09, Ashok Raj wrote:
> >> > We'd also need a way for an OS running on bare metal to *know* that
> >> > it's on bare metal and can just compose MSI messages for itself. Since
> >> > we do expect bare metal to have an IOMMU, perhaps that is just a
> >> > feature flag on the IOMMU?
> >> 
> >> Have the platform firmware advertise if it needs native or virtualized
> >> IMS handling.  If it advertises neither don't support IMS?
> >
> > The platform hint can be easily accomplished via DMAR table flags. We could
> > have an IMS_OPTOUT(similar to x2apic optout flag) flag, when 0 it's native
> > and IMS is supported.
> >
> > When vIOMMU is presented to guest, virtual DMAR table will have this flag
> > set to 1. Indicates to GuestOS, native IMS isn't supported.
> 
> These opt-out bits suck by definition. It comes all back to the fact
> that the whole virt thing didn't have a hardware defined way to tell
> that the OS runs in a VM and not on bare metal. It wouldn't have been
> rocket science to do so.

I'm sure everybody dislikes them (hate being a strong word :-)).
Consider the DVSEC capability: real hardware always sets it to 1 for the
IMS capability.

By default the DVSEC is not presented to the guest even when the full PF is
presented to the guest. I believe VFIO only builds and presents known standard
capabilities and specific extended capabilities. I'm a bit weak here, but maybe
@AlexWilliamson can confirm if I'm off track.

This tells the driver in the guest that IMS is not available, so it will not
make those new dev_msi calls.

Only if the VMM has built support to expose IMS for this device can guest SW
even see DVSEC.SIOV.IMS=1. This also means the required plumbing, say a
vIOMMU or a hypercall, has been provisioned, and the administrator knows the
guest is compatible with these options.
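
As a sketch of what "the guest cannot even see the DVSEC" means in practice,
the following walks the PCIe extended capability list in a raw 4K config
space image and looks for a matching DVSEC. The extended capability ID 0x23
for DVSEC comes from the PCIe spec; the Intel vendor ID / DVSEC ID 5 pairing
for SIOV and the function names are assumptions for illustration.

```c
#include <stdint.h>

#define PCI_CFG_SPACE_EXP_SIZE  4096
#define PCI_EXT_CAP_START       0x100
#define PCI_EXT_CAP_ID_DVSEC    0x23

/* Assumptions for illustration: Intel vendor ID and SIOV DVSEC ID 5. */
#define DVSEC_VENDOR_INTEL      0x8086
#define DVSEC_ID_SIOV           0x5

/* Config space is little-endian; assemble the dword byte by byte. */
static uint32_t cfg_read32(const uint8_t *cfg, unsigned int off)
{
	return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/* Returns the offset of the matching DVSEC, or 0 if none is visible. */
static unsigned int find_dvsec(const uint8_t *cfg, uint16_t vendor, uint16_t id)
{
	unsigned int pos = PCI_EXT_CAP_START;
	int ttl = (PCI_CFG_SPACE_EXP_SIZE - PCI_EXT_CAP_START) / 8;

	while (pos >= PCI_EXT_CAP_START &&
	       pos <= PCI_CFG_SPACE_EXP_SIZE - 12 && ttl-- > 0) {
		uint32_t hdr = cfg_read32(cfg, pos);

		if (hdr == 0 || hdr == 0xffffffff)
			break;                          /* empty or absent list */
		if ((hdr & 0xffff) == PCI_EXT_CAP_ID_DVSEC) {
			uint32_t h1 = cfg_read32(cfg, pos + 4);  /* DVSEC vendor ID */
			uint32_t h2 = cfg_read32(cfg, pos + 8);  /* DVSEC ID */

			if ((h1 & 0xffff) == vendor && (h2 & 0xffff) == id)
				return pos;
		}
		pos = (hdr >> 20) & 0xffc;              /* next capability offset */
	}
	return 0;
}
```

If the VMM filters the DVSEC out of the virtual config space, find_dvsec()
returns 0 and a guest driver written this way stays on MSI/MSI-X.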

There may be better ways to do this. If this has to be done differently
we certainly can and will do so.

> 
> And because that does not exist, we need magic opt-out bits for every
> other piece of functionality which gets added. Can we please stop this
> and provide a well defined way to tell the OS whether it runs on bare
> metal or not?
> 
> The point is that you really want opt-in bits so that decisions come
> down to

How would we opt in when the feature is not available? You need some way to
tell that the capability is available in the guest, but then there is no reason
to opt in... it's ready for use, isn't it?

> 
>      if (!virt || virt->supports_X)

The closest thing that comes to mind is the CPUID bits, but you mentioned
in an earlier mail that they aren't reliable if the VMM didn't set them.
If you want platform-level generic support:

- DMAR table opt-outs, which you mentioned are ugly
- We could use caching mode, but it's not a platform-level thing, and it is
  vendor specific. I'm not sure if other vendors have a similar feature. If there
  is a generic capability, we could expose via the IOMMU APIs whether we are on
  a virtual or real platform.
> 
> which is the obvious sane and safe logic. But sure, why am I asking for
> sane and safe in the context of virtualization?

We can pick how to solve this; we are just waiting for you to tell us which
mechanism you prefer that's less painful and architecturally acceptable for
virtualization and Linux. We are all ears!

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-11 23:03                                               ` Raj, Ashok
@ 2020-11-12  1:13                                                 ` Thomas Gleixner
  2020-11-12 13:10                                                 ` Jason Gunthorpe
  1 sibling, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-12  1:13 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Christoph Hellwig, David Woodhouse, Jason Gunthorpe,
	Dan Williams, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul,
	Dey, Megha, maz, bhelgaas, alex.williamson, Pan, Jacob jun, Liu,
	Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

Ashok,

On Wed, Nov 11 2020 at 15:03, Ashok Raj wrote:
> On Wed, Nov 11, 2020 at 11:27:28PM +0100, Thomas Gleixner wrote:
>> which is the obvious sane and safe logic. But sure, why am I asking for
>> sane and safe in the context of virtualization?
>
> We can pick how to solve this, and just waiting for you to tell, what
> mechanism you prefer that's less painful and architecturally acceptable for
> virtualization and linux. We are all ears!

Obviously we can't turn the time back. The point I was trying to make is
that the general approach of just bolting things on top of the existing
maze is bad in general.

Opt-out bits are error prone simply because anything which exists before
that point does not know that it should set that bit. Obvious, right?

CPUID bits are 'Feature available' and not 'Feature no longer
available' for a reason.

So with the introduction of VT this stringent road was left and the
approach was: Don't tell the guest OS that it's not running on bare
metal.

That's a perfectly fine approach for running existing legacy OSes which
do not care at all because they don't know about anything of this
newfangled stuff.

But it falls flat on its nose for anything which comes past that
point simply because there is no reliable way to tell in which context
the OS runs.

The VMM can decide not to set, or may not support setting, the
software CPUID bit which tells the guest OS that it does NOT run on bare
metal, and still hand in newfangled PCI devices for which the guest OS
happens to have a driver, which then falls flat on its nose because some
magic functionality is not there.

So we have the following matrix:

VMM   		Guest OS
Old             Old             -> Fine, does not support any of that
New             Old             -> Fine, does not support any of that
New             New             -> Fine, works as expected
Old             New             -> FAIL

To fix this we have to come up with heuristics again to figure out which
context we are running in and whether some magic feature can be
supported or not:

probably_on_bare_metal()
{
        if (CPUID(FEATURE_HYPERVISOR))
                return false;
        if (dmi_match_hypervisor_vendor())
                return false;

        return PROBABLY_RUNNING_ON_BARE_METAL;
}

Yes, it works probably in most cases, but it still works by chance and
that's what I really hate about this; indeed 'hate' is not a strong
enough word.

Why on earth did VT not introduce a reliable way (instruction, CPUID
leaf, MSR, whatever) which can't be manipulated by the VMM to let the OS
figure out where it runs?

Just because the general approach to these problems is: We can fix that
in software.

No, you can't fix inconsistency in software at all.

This is not the first time that we tell HW folks to stop this 'Fix this
in software' attitude which has caused more problems than it solved.

And you can argue in circles until you are blue, that inconsistency is
not going away. 

Every time new (mis)features are added which need the OS to be aware of
whether it runs on bare metal or in a VM, we have this unsolvable dance
of requiring that the underlying VMM tells the guest OS NOT to use
them instead of having the guest OS make the simple decision:

   if (!definitely_on_bare_metal())
   	return -ENOTSUPP;

or with a newer version of the guest OS:

   if (!definitely_on_bare_metal() && !hypervisor->supportsthis())
   	return -ENOTSUPP;

I'm halfway content to go with the above probably_on_bare_metal()
function as a replacement for definitely_on_bare_metal() to go forward,
but only for the very simple reason that this is the only option we
have.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-11 23:03                                               ` Raj, Ashok
  2020-11-12  1:13                                                 ` Thomas Gleixner
@ 2020-11-12 13:10                                                 ` Jason Gunthorpe
  1 sibling, 0 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-12 13:10 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Thomas Gleixner, Christoph Hellwig, David Woodhouse,
	Dan Williams, Tian, Kevin, Jiang, Dave, Bjorn Helgaas, vkoul,
	Dey, Megha, maz, bhelgaas, alex.williamson, Pan, Jacob jun, Liu,
	Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony, jing.lin,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Wed, Nov 11, 2020 at 03:03:21PM -0800, Raj, Ashok wrote:

> By default the DVSEC is not presented to guest even when the full PF is
> presented to guest. I believe VFIO only builds and presents known standard
> capabilities and specific extended capabilities. I'm a bit weak but maybe
> @AlexWilliamson can confirm if I'm off track.

This also needs to work on Hyper-V and all other cases, you can't just
assume everything is vfio and kvm.

It is horrible to ask people to go back and retroactively change their
config space in a device just to work around all the design failings
Thomas eloquently describes :(

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-11  2:17                                                       ` Tian, Kevin
@ 2020-11-12 13:46                                                         ` Jason Gunthorpe
  0 siblings, 0 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-12 13:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Raj, Ashok, Thomas Gleixner, Williams, Dan J, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Wed, Nov 11, 2020 at 02:17:48AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, November 10, 2020 10:24 PM
> > 
> > On Tue, Nov 10, 2020 at 06:13:23AM -0800, Raj, Ashok wrote:
> > 
> > > This isn't just for idxd, as I mentioned earlier, there are vendors other
> > > than Intel already working on this. In all cases the need for guest direct
> > > manipulation of interrupt store hasn't come up. From the discussion, it
> > > seems like there are devices today or in future that will require direct
> > > manipulation of interrupt store in the guest. This needs additional work
> > > in both the device hardware providing the right plumbing and OS work to
> > > comprehend those.
> > 
> > We'd want to see SRIOV's assigned to guests to be able to use
> > IMS. This allows a SRIOV instance in a guest to spawn SIOV's which is
> > useful.
> 
> Does your VF support both MSI/IMS or IMS only? 

Of course VF's support MSI..

> If it is the former can't we adopt a phased approach or parallel
> effort between forcing guest to use MSI and adding hypercall to
> enable IMS on VF? Finding a way to disable IMS is anyway required
> per earlier discussion when hypercall is not available, and it could
> still provide a functional though suboptimal model for such VFs.

Sure, I view that as the bare minimum

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-11  7:14                                                     ` Tian, Kevin
@ 2020-11-12 19:32                                                       ` Konrad Rzeszutek Wilk
  2020-11-12 22:42                                                         ` Thomas Gleixner
  0 siblings, 1 reply; 123+ messages in thread
From: Konrad Rzeszutek Wilk @ 2020-11-12 19:32 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Raj, Ashok, Thomas Gleixner, Jason Gunthorpe, Williams, Dan J,
	Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar,
	Sanjay K, Luck, Tony, kwankhede, eric.auger, parav, rafael,
	netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain,
	Mona, dmaengine, linux-kernel, linux-pci, kvm

.monster snip..

> 4. Using CPUID to detect running as guest. But as Thomas pointed out, this
> approach is less reliable as not all hypervisors do this way.

Is that truly true? It is the first time I see the argument that extra
steps are needed and that checking for X86_FEATURE_HYPERVISOR is not enough.

Or is it more "Some hypervisor probably forgot about it, so let's make sure we patch
over that possible hole?"


Also is there anything in this spec that precludes this from working
on non-X86 architectures, say ARM systems?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-12 19:32                                                       ` Konrad Rzeszutek Wilk
@ 2020-11-12 22:42                                                         ` Thomas Gleixner
  2020-11-13  2:42                                                           ` Tian, Kevin
  2020-11-14 10:34                                                           ` Christoph Hellwig
  0 siblings, 2 replies; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-12 22:42 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Tian, Kevin
  Cc: Raj, Ashok, Jason Gunthorpe, Williams, Dan J, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Thu, Nov 12 2020 at 14:32, Konrad Rzeszutek Wilk wrote:
>> 4. Using CPUID to detect running as guest. But as Thomas pointed out, this
>> approach is less reliable as not all hypervisors do this way.
>
> Is that truly true? It is the first time I see the argument that extra
> steps are needed and that checking for X86_FEATURE_HYPERVISOR is not enough.
>
> Or is it more "Some hypervisor probably forgot about it, so lets make sure we patch
> over that possible hole?"

Nothing enforces that bit to be set. The bit is a pure software
convention and was proposed by VMWare in 2008 with the following
changelog:

 "This patch proposes to use a cpuid interface to detect if we are
  running on an hypervisor.

  The discovery of a hypervisor is determined by bit 31 of CPUID#1_ECX,
  which is defined to be "hypervisor present bit". For a VM, the bit is
  1, otherwise it is set to 0. This bit is not officially documented by
  either Intel/AMD yet, but they plan to do so some time soon, in the
  meanwhile they have promised to keep it reserved for virtualization."

The reserved promise seems to hold. AMD's APM has it documented; the
Intel SDM does not.

Also the kernel side of KVM does not enforce that bit, it's up to the user
space management to set it.

And yes, I've tripped over this with some hypervisors and even qemu KVM
failed to set it in the early days because it was masked with host CPUID
trimming as there the bit is obviously 0.

The DMI vendor name is a pretty good final check when the bit is 0. The
strings I'm aware of are:

QEMU, Bochs, KVM, Xen, VMware, VMW, VMware Inc., innotek GmbH, Oracle
Corporation, Parallels, BHYVE, Microsoft Corporation

which is not complete but better than nothing ;)
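
Put together as a userspace-style sketch: the CPUID check reads bit 31 of
CPUID leaf 1 ECX via the compiler's <cpuid.h>, and the DMI check compares a
vendor string against the list above. The function names, the choice of an
exact string match, and passing the vendor string in from the caller (e.g.
read from /sys/class/dmi/id/sys_vendor) are assumptions for illustration,
not how the kernel does it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* DMI vendor strings from the list above (explicitly not complete). */
static const char *const hv_dmi_vendors[] = {
	"QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
	"innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
	"Microsoft Corporation",
};

static bool dmi_match_hypervisor_vendor(const char *vendor)
{
	size_t n = sizeof(hv_dmi_vendors) / sizeof(hv_dmi_vendors[0]);

	for (size_t i = 0; i < n; i++)
		if (strcmp(vendor, hv_dmi_vendors[i]) == 0)
			return true;
	return false;
}

/* Bit 31 of CPUID.1:ECX, the software-defined "hypervisor present" bit. */
static bool cpuid_hypervisor_bit(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return (ecx >> 31) & 1;
#endif
	return false;  /* not x86, or leaf 1 unavailable: no information */
}

/* Heuristic only: a VMM can leave both hints unset. */
static bool probably_on_bare_metal(const char *vendor)
{
	if (cpuid_hypervisor_bit())
		return false;
	if (vendor && dmi_match_hypervisor_vendor(vendor))
		return false;
	return true;
}
```

Some of these strings can also show up on physical machines from the same
vendors, so a real implementation would need more care than an exact match
on one DMI field.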

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-12 22:42                                                         ` Thomas Gleixner
@ 2020-11-13  2:42                                                           ` Tian, Kevin
  2020-11-13 12:57                                                             ` Jason Gunthorpe
  2020-11-13 13:32                                                             ` Thomas Gleixner
  2020-11-14 10:34                                                           ` Christoph Hellwig
  1 sibling, 2 replies; 123+ messages in thread
From: Tian, Kevin @ 2020-11-13  2:42 UTC (permalink / raw)
  To: Thomas Gleixner, Wilk, Konrad
  Cc: Raj, Ashok, Jason Gunthorpe, Williams, Dan J, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Friday, November 13, 2020 6:43 AM
> 
> On Thu, Nov 12 2020 at 14:32, Konrad Rzeszutek Wilk wrote:
> >> 4. Using CPUID to detect running as guest. But as Thomas pointed out, this
> >> approach is less reliable as not all hypervisors do this way.
> >
> > Is that truly true? It is the first time I see the argument that extra
> > steps are needed and that checking for X86_FEATURE_HYPERVISOR is not
> > enough.
> >
> > Or is it more "Some hypervisor probably forgot about it, so lets make sure
> > we patch over that possible hole?"
> 
> Nothing enforces that bit to be set. The bit is a pure software
> convention and was proposed by VMWare in 2008 with the following
> changelog:
> 
>  "This patch proposes to use a cpuid interface to detect if we are
>   running on an hypervisor.
> 
>   The discovery of a hypervisor is determined by bit 31 of CPUID#1_ECX,
>   which is defined to be "hypervisor present bit". For a VM, the bit is
>   1, otherwise it is set to 0. This bit is not officially documented by
>   either Intel/AMD yet, but they plan to do so some time soon, in the
>   meanwhile they have promised to keep it reserved for virtualization."
> 
> The reserved promise seems to hold. AMDs APM has it documented. The
> Intel SDM not so.
> 
> Also the kernel side of KVM does not enforce that bit, it's up to the user
> space management to set it.
> 
> And yes, I've tripped over this with some hypervisors and even qemu KVM
> failed to set it in the early days because it was masked with host CPUID
> trimming as there the bit is obviously 0.
> 
> DMI vendor name is pretty good final check when the bit is 0. The
> strings I'm aware of are:
> 
> QEMU, Bochs, KVM, Xen, VMware, VMW, VMware Inc., innotek GmbH,
> Oracle Corporation, Parallels, BHYVE, Microsoft Corporation
> 
> which is not complete but better than nothing ;)
> 
> Thanks,
> 
>         tglx

Hi, Thomas,

CPUID#1_ECX is an x86 thing. Do we need to figure out
probably_on_bare_metal for every architecture at once, or is it OK to
handle only the x86 arch at this stage? Based on previous discussions
IMS is just one piece of multiple technologies to enable SIOV-like
scalability. Arch-specific enablement beyond IMS (e.g. the
IOMMU part) will be required for such scaled usage anyway, so we
may just leave IMS disabled for non-x86 and wait until then to
figure out an arch-specific probably_on_bare_metal?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-13  2:42                                                           ` Tian, Kevin
@ 2020-11-13 12:57                                                             ` Jason Gunthorpe
  2020-11-13 13:32                                                             ` Thomas Gleixner
  1 sibling, 0 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-13 12:57 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Thomas Gleixner, Wilk, Konrad, Raj, Ashok, Williams, Dan J,
	Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar,
	Sanjay K, Luck, Tony, kwankhede, eric.auger, parav, rafael,
	netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain,
	Mona, dmaengine, linux-kernel, linux-pci, kvm

On Fri, Nov 13, 2020 at 02:42:02AM +0000, Tian, Kevin wrote:

> CPUID#1_ECX is a x86 thing. Do we need to figure out
> probably_on_bare_metal for every architecture altogether, or is it OK to just
> handle it for x86 arch at this stage? Based on previous discussions 
> ims is just one piece of multiple technologies to enable SIOV-like
> scalability. Ideally arch-specific enablement beyond ims (e.g. the 
> IOMMU part) will be required for such scaled usage thus we 
> may just leave ims disabled for non-x86 and wait until that time to 
> figure out arch specific probably_on_bare_metal?

At the very least you need to ensure that
pci_subdevice_msi_create_irq_domain() fails entirely on other
architectures until they can sort out these sorts of issues..

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-13  2:42                                                           ` Tian, Kevin
  2020-11-13 12:57                                                             ` Jason Gunthorpe
@ 2020-11-13 13:32                                                             ` Thomas Gleixner
  2020-11-13 16:12                                                               ` Luck, Tony
  1 sibling, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-13 13:32 UTC (permalink / raw)
  To: Tian, Kevin, Wilk, Konrad
  Cc: Raj, Ashok, Jason Gunthorpe, Williams, Dan J, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Marc Zyngier

On Fri, Nov 13 2020 at 02:42, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
> CPUID#1_ECX is a x86 thing. Do we need to figure out
> probably_on_bare_metal for every architecture altogether, or is it OK to just
> handle it for x86 arch at this stage? Based on previous discussions 
> ims is just one piece of multiple technologies to enable SIOV-like
> scalability. Ideally arch-specific enablement beyond ims (e.g. the 
> IOMMU part) will be required for such scaled usage thus we 
> may just leave ims disabled for non-x86 and wait until that time to 
> figure out arch specific probably_on_bare_metal?

Of course this is not only an x86 problem. Every architecture which
supports virtualization has the same issue. ARM(64) has no way to tell
for sure whether the machine runs bare metal either. No idea about the
other architectures.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-13 13:32                                                             ` Thomas Gleixner
@ 2020-11-13 16:12                                                               ` Luck, Tony
  2020-11-13 17:38                                                                 ` Raj, Ashok
  0 siblings, 1 reply; 123+ messages in thread
From: Luck, Tony @ 2020-11-13 16:12 UTC (permalink / raw)
  To: Thomas Gleixner, Tian, Kevin, Wilk, Konrad
  Cc: Raj, Ashok, Jason Gunthorpe, Williams, Dan J, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, kwankhede,
	eric.auger, parav, rafael, netanelg, shahafs, yan.y.zhao,
	pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine, linux-kernel,
	linux-pci, kvm, Marc Zyngier

> Of course this is not only an x86 problem. Every architecture which
> supports virtualization has the same issue. ARM(64) has no way to tell
> for sure whether the machine runs bare metal either. No idea about the
> other architectures.

Sounds like a hypervisor problem. If the VMM provides perfect emulation
of every weird quirk of h/w, then it is OK to let the guest believe that it is
running on bare metal.

If it isn't perfect, then it should make sure the guest knows *for sure*, so that
the guest can take appropriate actions to avoid the sharp edges.

-Tony

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-13 16:12                                                               ` Luck, Tony
@ 2020-11-13 17:38                                                                 ` Raj, Ashok
  0 siblings, 0 replies; 123+ messages in thread
From: Raj, Ashok @ 2020-11-13 17:38 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Thomas Gleixner, Tian, Kevin, Wilk, Konrad, Jason Gunthorpe,
	Williams, Dan J, Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha,
	maz, bhelgaas, alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, kwankhede, eric.auger, parav, rafael,
	netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain,
	Mona, dmaengine, linux-kernel, linux-pci, kvm, Ashok Raj,
	andrew.cooper3

On Fri, Nov 13, 2020 at 08:12:39AM -0800, Luck, Tony wrote:
> > Of course this is not only an x86 problem. Every architecture which
> > supports virtualization has the same issue. ARM(64) has no way to tell
> > for sure whether the machine runs bare metal either. No idea about the
> > other architectures.
> 
> Sounds like a hypervisor problem. If the VMM provides perfect emulation
> of every weird quirk of h/w, then it is OK to let the guest believe that it is
> running on bare metal.

That's true, which is why there isn't an immutable bit in cpuid or
elsewhere telling you that you are running under a hypervisor. Providing
something like that would make certain features not virtualizable.
Apparently, before we had faulting cpuid, what you had in the guest was
the real raw cpuid.

Disclaimer: I'm not saying this is perfect, I'm just replaying the reason
behind it. Not trying to defend it... flames > /dev/null
> 
> If it isn't perfect, then it should make sure the guest knows *for sure*, so that
> the guest can take appropriate actions to avoid the sharp edges.
> 

There are indeed 2 problems to solve.

1. How does device driver know if device is IMS capable.

   IMS is a device attribute. Each vendor can provide its own method to
   provide that indication. One such mechanism is the DVSEC.SIOV.IMS
   property. Some might believe this is for use only by Intel. For DVSEC I
   don't believe there is such a connection as there is for the device
   vendor ID in the standard header. TBH, there are other device vendors
   using the exact same method to indicate SIOV and IMS properties. What a
   DVSEC vendor ID states is "As defined by Vendor X".

   The reason we chose config space over something in device specific mmio
   is that VFIO, being the one common mechanism today, only exposes known
   standard and some extended headers to the guest. When we expose a full
   PF, the guest doesn't see the DVSEC, so drivers know this isn't
   available.

   This is our mechanism to stop drivers from calling
   pci_ims_array_create_msi_irq_domain(). It may not be perfect for all
   devices, as it is a device specific mechanism. For the devices under
   consideration, which follow the SIOV spec, it meets the spirit of the
   requirement even without #2 below. When devices have no way to detect
   this, #2 is required as a second way to block IMS.

2. How does platform component (IOMMU) inform if they can support all forms
   of IMS. (On device, or in memory). 
   
   On device would require some form of trap/emulate. Legacy MSIx already
   has that solved, but for device specific storage you need some
   additional work.

   When it's system memory (say IMS is in GPA space), you need some form of
   hypercall. There is no way around it since we can't intercept. Yes, you
   can maybe map those as RO and trap, but it's not pretty.

   To solve this, rather than a generic platform capability, maybe we should
   flip this to the IOMMU instead, because that's the one that offers this
   capability today.

   iommu_ims_supported() 
   	When platform has no IOMMU or no hypervisor calls, it returns
	false. So device driver can tell, even if it supports IMS
	capability deduction, does the platform support IMS.
   
        On platforms where iommu supports capability.

	Either there is a vIOMMU with a Virtual Command Register that can
	provide a way to get the interrupt handle similar to what you would
	get from an hypercall for instance. Or there is a real hypercall
	that supports giving the guest OS the physical IRTE handle. 
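
A minimal userspace sketch of the check described above, not kernel code.
The struct and its field names are made up for illustration, and it
assumes that a native (non-virtual) IOMMU counts as supported, which the
description above leaves open:

```c
#include <stdbool.h>

/* Hypothetical description of the platform; not a kernel structure. */
struct platform_caps {
	bool has_iommu;          /* an IOMMU (real or virtual) is present */
	bool is_virtual_iommu;   /* the IOMMU is emulated, i.e. we run in a guest */
	bool has_vcmd_handle;    /* vIOMMU virtual command interface hands out interrupt handles */
	bool has_irte_hypercall; /* hypercall returns the physical IRTE handle */
};

/* Sketch of the proposed iommu_ims_supported() platform check. */
static bool iommu_ims_supported(const struct platform_caps *caps)
{
	if (!caps->has_iommu)
		return false;

	/* Assumption: a native IOMMU needs no extra help. */
	if (!caps->is_virtual_iommu)
		return true;

	/* In a guest, either mechanism must hand out a usable interrupt handle. */
	return caps->has_vcmd_handle || caps->has_irte_hypercall;
}
```

With this shape a guest that has a vIOMMU but neither the virtual command
interface nor the hypercall gets 'false', which is the case that has to be
blocked today.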


-- 
Cheers,
Ashok

[Forgiveness is the attribute of the STRONG - Gandhi]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-12 22:42                                                         ` Thomas Gleixner
  2020-11-13  2:42                                                           ` Tian, Kevin
@ 2020-11-14 10:34                                                           ` Christoph Hellwig
  2020-11-14 21:18                                                             ` Raj, Ashok
  1 sibling, 1 reply; 123+ messages in thread
From: Christoph Hellwig @ 2020-11-14 10:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Konrad Rzeszutek Wilk, Tian, Kevin, Raj, Ashok, Jason Gunthorpe,
	Williams, Dan J, Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha,
	maz, bhelgaas, alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, kwankhede, eric.auger, parav,
	rafael, netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel,
	Hossain, Mona, dmaengine, linux-kernel, linux-pci, kvm

On Thu, Nov 12, 2020 at 11:42:46PM +0100, Thomas Gleixner wrote:
> DMI vendor name is pretty good final check when the bit is 0. The
> strings I'm aware of are:
> 
> QEMU, Bochs, KVM, Xen, VMware, VMW, VMware Inc., innotek GmbH, Oracle
> Corporation, Parallels, BHYVE, Microsoft Corporation
> 
> which is not complete but better than nothing ;)

Which is why I really think we need explicit opt-ins for "native"
SIOV handling and for paravirtualized SIOV handling, with the kernel
not offering support at all without either or a manual override on
the command line.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-14 10:34                                                           ` Christoph Hellwig
@ 2020-11-14 21:18                                                             ` Raj, Ashok
  2020-11-15 11:26                                                               ` Thomas Gleixner
  2020-11-16  8:25                                                               ` Christoph Hellwig
  0 siblings, 2 replies; 123+ messages in thread
From: Raj, Ashok @ 2020-11-14 21:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Thomas Gleixner, Konrad Rzeszutek Wilk, Tian, Kevin,
	Jason Gunthorpe, Williams, Dan J, Jiang, Dave, Bjorn Helgaas,
	vkoul, Dey, Megha, maz, bhelgaas, alex.williamson, Pan,
	Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

On Sat, Nov 14, 2020 at 10:34:30AM +0000, Christoph Hellwig wrote:
> On Thu, Nov 12, 2020 at 11:42:46PM +0100, Thomas Gleixner wrote:
> > DMI vendor name is pretty good final check when the bit is 0. The
> > strings I'm aware of are:
> > 
> > QEMU, Bochs, KVM, Xen, VMware, VMW, VMware Inc., innotek GmbH, Oracle
> > Corporation, Parallels, BHYVE, Microsoft Corporation
> > 
> > which is not complete but better than nothing ;)
> 
> Which is why I really think we need explicit opt-ins for "native"
> SIOV handling and for paravirtualized SIOV handling, with the kernel
> not offering support at all without either or a manual override on
> the command line.

opt-in by device or kernel? The way we are planning to support this is:

Device support for IMS - Can discover in device specific means
Kernel support for IMS. - Supported by IOMMU driver.

each driver can check 

if (dev_supports_ims() && iommu_supports_ims()) {
	/* Then IMS is supported in the platform.*/
}


Until we have vIOMMU support or a hypercall, iommu_supports_ims() will
check X86_FEATURE_HYPERVISOR in addition to the platform IDs Thomas
mentioned, or on Intel platforms check for cap.caching_mode=1, and
return false if any of them indicates a guest.

When we add support for getting a native interrupt handle then we will plumb that
appropriately.

Does this match what you wanted?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-14 21:18                                                             ` Raj, Ashok
@ 2020-11-15 11:26                                                               ` Thomas Gleixner
  2020-11-15 19:31                                                                 ` Raj, Ashok
  2020-11-16  8:25                                                               ` Christoph Hellwig
  1 sibling, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-15 11:26 UTC (permalink / raw)
  To: Raj, Ashok, Christoph Hellwig
  Cc: Konrad Rzeszutek Wilk, Tian, Kevin, Jason Gunthorpe, Williams,
	Dan J, Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz,
	bhelgaas, alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu,
	Kumar, Sanjay K, Luck, Tony, kwankhede, eric.auger, parav,
	rafael, netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel,
	Hossain, Mona, dmaengine, linux-kernel, linux-pci, kvm,
	Ashok Raj

On Sat, Nov 14 2020 at 13:18, Ashok Raj wrote:
> On Sat, Nov 14, 2020 at 10:34:30AM +0000, Christoph Hellwig wrote:
>> On Thu, Nov 12, 2020 at 11:42:46PM +0100, Thomas Gleixner wrote:
>> Which is why I really think we need explicit opt-ins for "native"
>> SIOV handling and for paravirtualized SIOV handling, with the kernel
>> not offering support at all without either or a manual override on
>> the command line.
>
> opt-in by device or kernel? The way we are planning to support this is:
>
> Device support for IMS - Can discover in device specific means
> Kernel support for IMS. - Supported by IOMMU driver.

And why exactly do we have to enforce IOMMU support? Please stop looking
at IMS purely from the IDXD perspective. We are talking about the
general concept here and not about the restricted Intel universe.

> each driver can check 
>
> if (dev_supports_ims() && iommu_supports_ims()) {
> 	/* Then IMS is supported in the platform.*/
> }

Please forget this 'each driver can check'. That's just wrong.

The only thing the driver has to check is whether the device supports
IMS or not. Everything else has to be handled by the underlying
infrastructure.

That's pretty much the same thing like PCI/MSI[X]. The driver does not
have to check 'device_has_msix() && platform_supports_msix()'. Enabling
MSI[X] will simply fail if it's not supported.

So for IMS creating the underlying irqdomain has to fail when the
platform does not support it and the driver can act upon the fail and
fallback to MSI[X] or just refuse to load when IMS is required for the
device to be functional.
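
A rough userspace sketch of that flow, just to illustrate the shape. All
names here are hypothetical stand-ins and not the proposed kernel API; the
only point is that the driver checks the device capability and then acts
on whether domain creation succeeded:

```c
#include <stdbool.h>
#include <stddef.h>

enum irq_mode { IRQ_MODE_IMS, IRQ_MODE_MSIX, IRQ_MODE_NONE };

/* Hypothetical stand-in for an irqdomain. */
struct irq_domain_stub { int unused; };

/*
 * Hypothetical stand-in for IMS irqdomain creation. The infrastructure,
 * not the driver, decides whether the platform can support IMS and
 * returns NULL when it cannot.
 */
static struct irq_domain_stub *ims_create_domain(bool platform_supports_ims)
{
	static struct irq_domain_stub domain;

	return platform_supports_ims ? &domain : NULL;
}

/* Driver side: check the device, try IMS, fall back or refuse to load. */
static enum irq_mode driver_setup_irqs(bool dev_has_ims,
				       bool platform_supports_ims,
				       bool ims_required)
{
	if (dev_has_ims && ims_create_domain(platform_supports_ims))
		return IRQ_MODE_IMS;

	if (ims_required)
		return IRQ_MODE_NONE;	/* refuse to load */

	return IRQ_MODE_MSIX;		/* fall back to MSI[X] */
}
```

The platform_supports_ims argument only exists to drive the stub; a real
driver would never see it.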

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-15 11:26                                                               ` Thomas Gleixner
@ 2020-11-15 19:31                                                                 ` Raj, Ashok
  2020-11-15 22:11                                                                   ` Thomas Gleixner
  0 siblings, 1 reply; 123+ messages in thread
From: Raj, Ashok @ 2020-11-15 19:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Hellwig, Konrad Rzeszutek Wilk, Tian, Kevin,
	Jason Gunthorpe, Williams, Dan J, Jiang, Dave, Bjorn Helgaas,
	vkoul, Dey, Megha, maz, bhelgaas, alex.williamson, Pan,
	Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

On Sun, Nov 15, 2020 at 12:26:22PM +0100, Thomas Gleixner wrote:
> On Sat, Nov 14 2020 at 13:18, Ashok Raj wrote:
> > On Sat, Nov 14, 2020 at 10:34:30AM +0000, Christoph Hellwig wrote:
> >> On Thu, Nov 12, 2020 at 11:42:46PM +0100, Thomas Gleixner wrote:
> >> Which is why I really think we need explicit opt-ins for "native"
> >> SIOV handling and for paravirtualized SIOV handling, with the kernel
> >> not offering support at all without either or a manual override on
> >> the command line.
> >
> > opt-in by device or kernel? The way we are planning to support this is:
> >
> > Device support for IMS - Can discover in device specific means
> > Kernel support for IMS. - Supported by IOMMU driver.
> 
> And why exactly do we have to enforce IOMMU support? Please stop looking
> at IMS purely from the IDXD perspective. We are talking about the
> general concept here and not about the restricted Intel universe.

I think you have mentioned it almost every reply :-)..Got that! Point taken
several emails ago!! :-)

I didn't mean just for idxd, I said for *ANY* device driver that wants to
use IMS.

> 
> > each driver can check 
> >
> > if (dev_supports_ims() && iommu_supports_ims()) {
> > 	/* Then IMS is supported in the platform.*/
> > }
> 
> Please forget this 'each driver can check'. That's just wrong.

Ok.

> 
> The only thing the driver has to check is whether the device supports
> IMS or not. Everything else has to be handled by the underlying
> infrastructure.

That's pretty much the same thing.. I guess you wanted the check for
"Does infrastructure support IMS" to be someplace else, instead
of the device driver checking it. That's perfectly fine.

Until we support this natively via hypercall or vIOMMU we can use your
variant of finding out whether you are on bare metal to decide support for IMS.

As you highlighted below:

https://lore.kernel.org/lkml/877dqrnzr3.fsf@nanos.tec.linutronix.de/

probably_on_bare_metal()
{
	if (CPUID(FEATURE_HYPERVISOR))
		return false;
	if (dmi_match_hypervisor_vendor())
		return false;

	return PROBABLY_RUNNING_ON_BARE_METAL;
}

The above is all we need for now and will work in almost all cases. 
We will move forward with just the above in the next series.
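
For illustration only, a self-contained userspace sketch of the same
heuristic. The vendor list is the (incomplete) one from earlier in this
thread; the helper names are made up, and an exact string compare against
a single DMI vendor string is a simplification:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* CPUID leaf 1, ECX bit 31: the "hypervisor present" software convention. */
#define CPUID_1_ECX_HYPERVISOR (UINT32_C(1) << 31)

/* DMI vendor strings listed earlier in the thread; known to be incomplete. */
static const char *const hv_dmi_vendors[] = {
	"QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
	"innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
	"Microsoft Corporation",
};

/* Hypothetical helper: exact match of a DMI vendor string against the list. */
static bool dmi_vendor_is_hypervisor(const char *vendor)
{
	size_t i;

	if (!vendor)
		return false;

	for (i = 0; i < sizeof(hv_dmi_vendors) / sizeof(hv_dmi_vendors[0]); i++) {
		if (strcmp(vendor, hv_dmi_vendors[i]) == 0)
			return true;
	}
	return false;
}

/* Heuristic only: a VMM that sets neither indicator still looks like bare metal. */
static bool probably_on_bare_metal(uint32_t cpuid_1_ecx, const char *dmi_vendor)
{
	if (cpuid_1_ecx & CPUID_1_ECX_HYPERVISOR)
		return false;
	if (dmi_vendor_is_hypervisor(dmi_vendor))
		return false;
	return true;
}
```

In userspace on x86 the ECX value can be read with __get_cpuid() from
GCC/Clang's <cpuid.h>, and the vendor string from
/sys/class/dmi/id/sys_vendor.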

Below is for future consideration.

Even the above isn't foolproof if the HYPERVISOR feature flag isn't set
and the DMI string doesn't match, say on some new hypervisor. The only way
we can figure that out is:

- If no iommu support, or iommu can tell if this is a virtualized iommu.
The presence of caching_mode is one such indication for Intel. 

PS: Other IOMMU's must have something like this to support virtualization.
    I'm not saying this is an Intel only feature just in case you interpret
    it that way! I'm only saying if there is a mechanism to distinguish
    native vs emulated platform.

When the vIOMMU supports getting a native interrupt handle via a virtual
command interface for Intel IOMMUs, or some equivalent when other vendors
provide such a capability, then even without a hypercall a virtualized
IOMMU can provide the same solution.

If we support a hypercall then it's more generic, so it would cover
all platforms/vendors. Certainly the most scalable long term
solution.


Cheers,
Ashok

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-15 19:31                                                                 ` Raj, Ashok
@ 2020-11-15 22:11                                                                   ` Thomas Gleixner
  2020-11-16  0:22                                                                     ` Raj, Ashok
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-15 22:11 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Christoph Hellwig, Konrad Rzeszutek Wilk, Tian, Kevin,
	Jason Gunthorpe, Williams, Dan J, Jiang, Dave, Bjorn Helgaas,
	vkoul, Dey, Megha, maz, bhelgaas, alex.williamson, Pan,
	Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

On Sun, Nov 15 2020 at 11:31, Ashok Raj wrote:
> On Sun, Nov 15, 2020 at 12:26:22PM +0100, Thomas Gleixner wrote:
>> > opt-in by device or kernel? The way we are planning to support this is:
>> >
>> > Device support for IMS - Can discover in device specific means
>> > Kernel support for IMS. - Supported by IOMMU driver.
>> 
>> And why exactly do we have to enforce IOMMU support? Please stop looking
>> at IMS purely from the IDXD perspective. We are talking about the
>> general concept here and not about the restricted Intel universe.
>
> I think you have mentioned it almost every reply :-)..Got that! Point taken
> several emails ago!! :-)

You sure? I _try_ to not mention it again then. No promise though. :)

> I didn't mean just for idxd, I said for *ANY* device driver that wants to
> use IMS.

Which is wrong. Again:

A) For PF/VF on bare metal there is absolutely no IOMMU dependency
   because it does not have a PASID requirement. It's just an
   alternative solution to MSI[X], which allows optimizations like
   storing the message in driver managed queue memory or lifting the
   restriction of 2048 interrupts per device. Nothing else.

B) For PF/VF in a guest the IOMMU dependency of IMS is a red herring.
   There is no direct dependency on the IOMMU.

   The problem is the inability of the VMM to trap the message write to
   the IMS storage if the storage is in guest driver managed memory.
   This can be solved with either

   - a hypercall which translates the guest MSI message
   or
   - a vIOMMU which uses a hypercall or whatever to translate the guest
     MSI message

C) Subdevices ala mdev are a different story. They require PASID which
   enforces IOMMU and the IMS part is not managed by the users anyway.
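
Cases A-C can be written down as a small decision function. This is only
a sketch in plain C with made-up names; the extra "VMM trap" outcome for
device memory storage in a guest is taken from the MSI[X] comparison and
is not a separate case in the list above:

```c
#include <stdbool.h>

enum ims_user { IMS_PF_VF, IMS_SUBDEVICE };
enum ims_storage { IMS_STORE_DEVICE_MEMORY, IMS_STORE_DRIVER_MEMORY };

enum ims_need {
	IMS_NEED_NOTHING,         /* A: just an alternative to MSI[X] */
	IMS_NEED_VMM_TRAP,        /* guest, device memory: VMM traps the write */
	IMS_NEED_MSG_TRANSLATION, /* B: hypercall or vIOMMU translates the message */
	IMS_NEED_IOMMU_PASID,     /* C: subdevices require PASID, hence IOMMU */
};

/* What has to be in place before IMS can be used in a given setup. */
static enum ims_need ims_requirement(enum ims_user user, bool in_guest,
				     enum ims_storage storage)
{
	if (user == IMS_SUBDEVICE)
		return IMS_NEED_IOMMU_PASID;

	if (!in_guest)
		return IMS_NEED_NOTHING;

	/* The VMM cannot trap writes to guest driver managed memory. */
	if (storage == IMS_STORE_DRIVER_MEMORY)
		return IMS_NEED_MSG_TRANSLATION;

	return IMS_NEED_VMM_TRAP;
}
```

Note that the IOMMU only shows up for the subdevice case; for PF/VF the
question is whether the message write can be intercepted or translated.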

So we have a couple of problems to solve:

  1) Figure out whether the OS runs on bare metal

     There is no reliable answer to that, so we either:

      - Use heuristics and assume that failure is unlikely and in case
        of failure blame the incompetence of VMM authors and/or
        sysadmins

     or
     
      - Default to IMS disabled and let the sysadmin enable it via
        command line option.

        If the kernel detects that it runs in a VM it yells and disables
        it unless the OS and the hypervisor agree to provide support for
        that scenario (see #2).

        That fails as well if the sysadmin does so when the OS runs on
        a VMM which is not identifiable, but at least we can rightfully
        blame the sysadmin in that case.

     or

      - Declare that IMS always depends on IOMMU

        I personally don't care, but people working on these kinds of
        devices already said that they want to avoid it when possible.
        
        If you want to go that route, then please talk to those folks
        and ask them to agree in public.

     You also need to take into account that this must work on all
     architectures which support virtualization because IMS is
     architecture independent.

  2) Guest support for PF/VF

     Again we have several scenarios depending on the IMS storage
     type.

      - If the storage type is device memory then it's pretty much the
        same as MSI[X] just a different location.

      - If the storage is in driver managed memory then this needs
        #1 plus guest OS and hypervisor support (hypercall/vIOMMU)
        
  3) Guest support for PF/VF and guest managed subdevice (mdev)

     Depends on #1 and #2 and is an orthogonal problem if I'm not
     missing something.
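
To make the second option under #1 concrete, a minimal sketch of the
decision in plain C. The names are hypothetical, and it assumes that
agreed OS/hypervisor support is sufficient in a VM without the command
line opt-in, which is not spelled out above:

```c
#include <stdbool.h>

enum ims_decision {
	IMS_OFF_DEFAULT, /* no opt-in on the command line */
	IMS_OFF_IN_VM,   /* VM detected without agreed support: yell and disable */
	IMS_ON,
};

/*
 * Default-off policy: on what looks like bare metal the sysadmin has to
 * opt in. In a detected VM the opt-in is ignored and IMS stays off
 * unless OS and hypervisor agree to support it.
 */
static enum ims_decision ims_decide(bool cmdline_opt_in, bool vm_detected,
				    bool guest_support_agreed)
{
	if (vm_detected)
		return guest_support_agreed ? IMS_ON : IMS_OFF_IN_VM;

	return cmdline_opt_in ? IMS_ON : IMS_OFF_DEFAULT;
}
```

The failure mode stays the same as described: with an unidentifiable VMM
vm_detected is false, so an opt-in enables IMS where it cannot work.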

To move forward we need to make a decision about #1 and #2 now.

This needs to be well thought out as changing it after the fact is
going to be a nightmare.

/me grudgingly refrains from mentioning the obvious once more.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-15 22:11                                                                   ` Thomas Gleixner
@ 2020-11-16  0:22                                                                     ` Raj, Ashok
  2020-11-16  7:31                                                                       ` Tian, Kevin
  0 siblings, 1 reply; 123+ messages in thread
From: Raj, Ashok @ 2020-11-16  0:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Hellwig, Konrad Rzeszutek Wilk, Tian, Kevin,
	Jason Gunthorpe, Williams, Dan J, Jiang, Dave, Bjorn Helgaas,
	vkoul, Dey, Megha, maz, bhelgaas, alex.williamson, Pan,
	Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck, Tony,
	kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm, Ashok Raj

On Sun, Nov 15, 2020 at 11:11:27PM +0100, Thomas Gleixner wrote:
> On Sun, Nov 15 2020 at 11:31, Ashok Raj wrote:
> > On Sun, Nov 15, 2020 at 12:26:22PM +0100, Thomas Gleixner wrote:
> >> > opt-in by device or kernel? The way we are planning to support this is:
> >> >
> >> > Device support for IMS - Can discover in device specific means
> >> > Kernel support for IMS. - Supported by IOMMU driver.
> >> 
> >> And why exactly do we have to enforce IOMMU support? Please stop looking
> >> at IMS purely from the IDXD perspective. We are talking about the
> >> general concept here and not about the restricted Intel universe.
> >
> > I think you have mentioned it almost every reply :-)..Got that! Point taken
> > several emails ago!! :-)
> 
> You sure? I _try_ to not mention it again then. No promise though. :)

Hey.. anything that's entertaining go for it :-)

> 
> > I didn't mean just for idxd, I said for *ANY* device driver that wants to
> > use IMS.
> 
> Which is wrong. Again:
> 
> A) For PF/VF on bare metal there is absolutely no IOMMU dependency
>    because it does not have a PASID requirement. It's just an
>    alternative solution to MSI[X], which allows optimizations like
>    storing the message in driver managed queue memory or lifting the
>    restriction of 2048 interrupts per device. Nothing else.

You are right.. my eyes were clouded by virtualization.. no dependency for
native absolutely.

> 
> B) For PF/VF in a guest the IOMMU dependency of IMS is a red herring.
>    There is no direct dependency on the IOMMU.
> 
>    The problem is the inability of the VMM to trap the message write to
>    the IMS storage if the storage is in guest driver managed memory.
>    This can be solved with either
> 
>    - a hypercall which translates the guest MSI message
>    or
>    - a vIOMMU which uses a hypercall or whatever to translate the guest
>      MSI message
> 
> C) Subdevices ala mdev are a different story. They require PASID which
>    enforces IOMMU and the IMS part is not managed by the users anyway.

You are right again :)

The subdevices require PASID & IOMMU in native, but inside the guest there
is no need for an IOMMU unless you want to build SVM on top. Subdevices
work without any vIOMMU or hypercall in the guest: because they look like
normal PCI devices, we can map their interrupts to legacy MSIx.

> 
> So we have a couple of problems to solve:
> 
>   1) Figure out whether the OS runs on bare metal
> 
>      There is no reliable answer to that, so we either:
> 
>       - Use heuristics and assume that failure is unlikely and in case
>         of failure blame the incompetence of VMM authors and/or
>         sysadmins
> 
>      or
>      
>       - Default to IMS disabled and let the sysadmin enable it via
>         command line option.
> 
>         If the kernel detects that it runs in a VM it yells and disables
>         it unless the OS and the hypervisor agree to provide support for
>         that scenario (see #2).
> 
>         That fails as well if the sysadmin does so when the OS runs on
>         a VMM which is not identifiable, but at least we can rightfully
>         blame the sysadmin in that case.

cmdline isn't nice, best to have this functional out of box.

> 
>      or
> 
>       - Declare that IMS always depends on IOMMU

As you had mentioned IMS has no real dependency on IOMMU in native.

We just need to make sure that, if running in a guest, we have support
for it plumbed.

> 
>         I personally don't care, but people working on these kinds of
>         devices already said that they want to avoid it when possible.
>         
>         If you want to go that route, then please talk to those folks
>         and ask them to agree in public.
> 
>      You also need to take into account that this must work on all
>      architectures which support virtualization because IMS is
>      architecture independent.

What you suggest makes perfect sense. We can certainly get buy-in from
the iommu list and have this coordinated between all existing iommu variants.

> 
>   2) Guest support for PF/VF
> 
>      Again we have several scenarios depending on the IMS storage
>      type.
> 
>       - If the storage type is device memory then it's pretty much the
>         same as MSI[X] just a different location.

True, but we still need some special handling for trapping those mmio
accesses. For MSIx, VFIO already traps them and everything is
pre-plumbed. For IMS it isn't as seamless as it is for MSIx.

> 
>       - If the storage is in driver managed memory then this needs
>         #1 plus guest OS and hypervisor support (hypercall/vIOMMU)

Violent agreement here :-)

>         
>   3) Guest support for PF/VF and guest managed subdevice (mdev)
> 
>      Depends on #1 and #2 and is an orthogonal problem if I'm not
>      missing something.
> 
> To move forward we need to make a decision about #1 and #2 now.

Mostly in agreement, except that mdev (the currently considered use case)
has no need for IMS in the guest. (Don't get me wrong, I'm not saying some
odd device managing sub-devices wouldn't need IMS in addition to the 2048
MSIx emulation.)
> 
> This needs to be well thought out as changing it after the fact is
> going to be a nightmare.
> 
> /me grudgingly refrains from mentioning the obvious once more.
> 

So this isn't an idxd and Intel only thing :-)... 

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-16  0:22                                                                     ` Raj, Ashok
@ 2020-11-16  7:31                                                                       ` Tian, Kevin
  2020-11-16 15:46                                                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Tian, Kevin @ 2020-11-16  7:31 UTC (permalink / raw)
  To: Raj, Ashok, Thomas Gleixner
  Cc: Christoph Hellwig, Wilk, Konrad, Jason Gunthorpe, Williams,
	Dan J, Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz,
	bhelgaas, alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu,
	Kumar, Sanjay K, Luck, Tony, kwankhede, eric.auger, parav,
	rafael, netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel,
	Hossain, Mona, dmaengine, linux-kernel, linux-pci, kvm, Raj,
	Ashok

> From: Raj, Ashok <ashok.raj@intel.com>
> Sent: Monday, November 16, 2020 8:23 AM
> 
> On Sun, Nov 15, 2020 at 11:11:27PM +0100, Thomas Gleixner wrote:
> > On Sun, Nov 15 2020 at 11:31, Ashok Raj wrote:
> > > On Sun, Nov 15, 2020 at 12:26:22PM +0100, Thomas Gleixner wrote:
> > >> > opt-in by device or kernel? The way we are planning to support this is:
> > >> >
> > >> > Device support for IMS - Can discover in device specific means
> > >> > Kernel support for IMS. - Supported by IOMMU driver.
> > >>
> > >> And why exactly do we have to enforce IOMMU support? Please stop
> > >> looking at IMS purely from the IDXD perspective. We are talking
> > >> about the general concept here and not about the restricted Intel
> > >> universe.
> > >
> > > I think you have mentioned it almost every reply :-)..Got that! Point taken
> > > several emails ago!! :-)
> >
> > You sure? I _try_ to not mention it again then. No promise though. :)
> 
> Hey.. anything that's entertaining go for it :-)
> 
> >
> > > I didn't mean just for idxd, I said for *ANY* device driver that wants to
> > > use IMS.
> >
> > Which is wrong. Again:
> >
> > A) For PF/VF on bare metal there is absolutely no IOMMU dependency
> >    because it does not have a PASID requirement. It's just an
> >    alternative solution to MSI[X], which allows optimizations like
> > >    storing the message in driver managed queue memory or lifting the
> >    restriction of 2048 interrupts per device. Nothing else.
> 
> You are right.. my eyes were clouded by virtualization.. no dependency for
> native absolutely.
> 
> >
> > B) For PF/VF in a guest the IOMMU dependency of IMS is a red herring.
> >    There is no direct dependency on the IOMMU.
> >
> >    The problem is the inability of the VMM to trap the message write to
> >    the IMS storage if the storage is in guest driver managed memory.
> >    This can be solved with either
> >
> >    - a hypercall which translates the guest MSI message
> >    or
> >    - a vIOMMU which uses a hypercall or whatever to translate the guest
> >      MSI message
> >
> > C) Subdevices ala mdev are a different story. They require PASID which
> >    enforces IOMMU and the IMS part is not managed by the users anyway.
> 
> You are right again :)
> 
> The subdevices require PASID & IOMMU in native, but inside the guest
> there is no need for IOMMU unless you want to build SVM on top.
> subdevices work without any vIOMMU or hypercall in the guest. Only
> because they look like normal PCI devices we could map interrupts to
> legacy MSIx.

Guest managed subdevices on PF/VF requires vIOMMU. Anyway I think
Thomas was just pointing out that subdevices are the only category out
of above three which may have business tied to IOMMU. 😊

> 
> >
> > So we have a couple of problems to solve:
> >
> >   1) Figure out whether the OS runs on bare metal
> >
> >      There is no reliable answer to that, so we either:
> >
> >       - Use heuristics and assume that failure is unlikely and in case
> >         of failure blame the incompetence of VMM authors and/or
> >         sysadmins
> >
> >      or
> >
> >       - Default to IMS disabled and let the sysadmin enable it via
> >         command line option.
> >
> >         If the kernel detects to run in a VM it yells and disables it
> >         unless the OS and the hypervisor agree to provide support for
> >         that scenario (see #2).
> >
> >         That fails as well if the sysadmin does so when the OS runs on
> >         a VMM which is not identifiable, but at least we can rightfully
> >         blame the sysadmin in that case.
> 
> cmdline isn't nice, best to have this functional out of box.
> 
> >
> >      or
> >
> >       - Declare that IMS always depends on IOMMU
> 
> As you had mentioned IMS has no real dependency on IOMMU in native.
> 
> we just need to make sure if running in guest we have support for it
> plumbed.
> 
> >
> >         I personally don't care, but people working on these kinds of
> >         devices already said that they want to avoid it when possible.
> >
> >         If you want to go that route, then please talk to those folks
> >         and ask them to agree in public.
> >
> >      You also need to take into account that this must work on all
> >      architectures which support virtualization because IMS is
> >      architecture independent.
> 
> What you suggest makes perfect sense. We can certainly get buy-in from the
> iommu list and have this coordinated between all existing iommu variants.

Does a hybrid scheme sound good here?

- Add a cmdline parameter: ims=[auto|on|off], with 'auto' as default;

- if ims=auto:

    * If arch doesn't implement probably_on_bare_metal, disallow ims;

    * If probably_on_bare_metal returns false, disallow ims;
	# (future) if hypercall is supported, allow ims;

    * If probably_on_bare_metal returns true, allow ims, with the caveat
      that the check may be fooled when running on an old hypervisor.
      Sysadmin may need to double-confirm by other means;
	# (future) if definitely_on_bare_metal is supported, no caveat;

- if ims=on:

    * If probably_on_bare_metal returns false, yell and disable it until
      hypercall is supported;

    * In all other cases allow ims. Sysadmin is to blame for any
      failure, as setting this implies that extra confirmation has been
      done;

- if ims=off, then leave it off.

It's not necessary to claim a strict dependency between ims and iommu.
Instead, we could leave iommu as an arch specific check where it
applies:

probably_on_bare_metal()
{
	if (CPUID(FEATURE_HYPERVISOR))
		return false;
	if (dmi_match_hypervisor_vendor())
		return false;
	if (iommu_existing() && iommu_in_guest())
		return false;

	return PROBABLY_RUNNING_ON_BARE_METAL;
}
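
To make the policy above a bit more concrete, here is a rough user-space
C sketch of the decision table. It is only an illustration under stated
assumptions: the names ims_mode, bare_metal_hint and ims_allowed are made
up, the bare-metal hint and the hypercall flag are passed in as plain
values instead of being probed, and none of this is existing kernel API.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical encoding of the proposed ims=[auto|on|off] parameter. */
enum ims_mode { IMS_AUTO, IMS_ON, IMS_OFF };

/* Hypothetical result of the arch specific bare-metal check. */
enum bare_metal_hint {
	BM_UNKNOWN,		/* arch does not implement the check */
	BM_PROBABLY_NOT,	/* check returned false: looks like a guest */
	BM_PROBABLY_YES,	/* check returned true: probably bare metal */
};

/*
 * Decide whether IMS may be used. 'hypercall' stands for the future
 * case where OS and hypervisor agree on a hypercall for guest IMS.
 */
bool ims_allowed(enum ims_mode mode, enum bare_metal_hint hint,
		 bool hypercall)
{
	switch (mode) {
	case IMS_OFF:
		return false;
	case IMS_ON:
		/* Yell and disable in a detected guest without hypercall. */
		if (hint == BM_PROBABLY_NOT && !hypercall) {
			fprintf(stderr, "ims=on ignored: guest without hypercall support\n");
			return false;
		}
		/* Everything else is on the sysadmin. */
		return true;
	case IMS_AUTO:
		if (hint == BM_PROBABLY_YES)
			return true;	/* allowed, with the old-hypervisor caveat */
		if (hint == BM_PROBABLY_NOT)
			return hypercall;	/* (future) guest with hypercall */
		return false;		/* no arch check: disallow */
	}
	return false;
}
```

For example, ims_allowed(IMS_AUTO, BM_UNKNOWN, false) is false, while
ims_allowed(IMS_ON, BM_UNKNOWN, false) is true, which is the one place
where 'on' and 'auto' differ when the arch has no check at all.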
 
> 
> >
> >   2) Guest support for PF/VF
> >
> >      Again we have several scenarios depending on the IMS storage
> >      type.
> >
> >       - If the storage type is device memory then it's pretty much the
> >         same as MSI[X] just a different location.
> 
> True, but we still need some special handling for trapping those MMIO
> accesses. For MSIx, VFIO already traps them and everything is pre-plumbed;
> for IMS it isn't as seamless.

Yes. So what about tying guest IMS to hypercall even when emulation
is possible on some devices?  It's difficult for the guest to know that
its IMS is emulated by the hypervisor. Adopting a unified policy for all
IMS-capable devices might be an easier path.

> 
> >
> >       - If the storage is in driver managed memory then this needs
> >         #1 plus guest OS and hypervisor support (hypercall/vIOMMU)
> 
> Violent agreement here :-)
> 
> >
> >   3) Guest support for PF/VF and guest managed subdevice (mdev)
> >
> >      Depends on #1 and #2 and is an orthogonal problem if I'm not
> >      missing something.
> >
> > To move forward we need to make a decision about #1 and #2 now.
> 
> Mostly in agreement, except that mdev (the currently considered use case)
> has no need for IMS in the guest. (Don't get me wrong, I'm not saying some
> odd device managing sub-devices would need IMS in addition to the 2048
> MSIx emulation.)
> >
> > This needs to be well thought out as changing it after the fact is
> > going to be a nightmare.
> >
> > /me grudgingly refrains from mentioning the obvious once more.
> >
> 
> So this isn't an idxd and Intel only thing :-)...
> 
> Cheers,
> Ashok

Thanks
Kevin


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-14 21:18                                                             ` Raj, Ashok
  2020-11-15 11:26                                                               ` Thomas Gleixner
@ 2020-11-16  8:25                                                               ` Christoph Hellwig
  1 sibling, 0 replies; 123+ messages in thread
From: Christoph Hellwig @ 2020-11-16  8:25 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Christoph Hellwig, Thomas Gleixner, Konrad Rzeszutek Wilk, Tian,
	Kevin, Jason Gunthorpe, Williams, Dan J, Jiang, Dave,
	Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas, alex.williamson,
	Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar, Sanjay K, Luck,
	Tony, kwankhede, eric.auger, parav, rafael, netanelg, shahafs,
	yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain, Mona, dmaengine,
	linux-kernel, linux-pci, kvm

On Sat, Nov 14, 2020 at 01:18:37PM -0800, Raj, Ashok wrote:
> On Sat, Nov 14, 2020 at 10:34:30AM +0000, Christoph Hellwig wrote:
> > On Thu, Nov 12, 2020 at 11:42:46PM +0100, Thomas Gleixner wrote:
> > > DMI vendor name is pretty good final check when the bit is 0. The
> > > strings I'm aware of are:
> > > 
> > > QEMU, Bochs, KVM, Xen, VMware, VMW, VMware Inc., innotek GmbH, Oracle
> > > Corporation, Parallels, BHYVE, Microsoft Corporation
> > > 
> > > which is not complete but better than nothing ;)
> > 
> > Which is why I really think we need explicit opt-ins for "native"
> > SIOV handling and for paravirtualized SIOV handling, with the kernel
> > not offering support at all without either or a manual override on
> > the command line.
> 
> opt-in by device or kernel? The way we are planning to support this is:

opt-in by the platform.  Not sure if an ACPI interface or something else
would be best.  But basically the kernel needs to be able to query:

Does this platform claim to support IMS, and if yes how.  If there is no
answer we need to assume the platform doesn't.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-16  7:31                                                                       ` Tian, Kevin
@ 2020-11-16 15:46                                                                         ` Jason Gunthorpe
  2020-11-16 17:56                                                                           ` Thomas Gleixner
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-16 15:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Raj, Ashok, Thomas Gleixner, Christoph Hellwig, Wilk, Konrad,
	Williams, Dan J, Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha,
	maz, bhelgaas, alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, kwankhede, eric.auger, parav,
	rafael, netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel,
	Hossain, Mona, dmaengine, linux-kernel, linux-pci, kvm

On Mon, Nov 16, 2020 at 07:31:49AM +0000, Tian, Kevin wrote:

> > The subdevices require PASID & IOMMU in native, but inside the guest
> > there is no need for IOMMU unless you want to build SVM on top.
> > subdevices work without any vIOMMU or hypercall in the guest. Only
> > because they look like normal PCI devices we could map interrupts to
> > legacy MSIx.
> 
> Guest managed subdevices on PF/VF requires vIOMMU. 

Why? I've never heard we need vIOMMU for our existing SRIOV flows in
VMs??

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-16 15:46                                                                         ` Jason Gunthorpe
@ 2020-11-16 17:56                                                                           ` Thomas Gleixner
  2020-11-16 18:02                                                                             ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-16 17:56 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: Raj, Ashok, Christoph Hellwig, Wilk, Konrad, Williams, Dan J,
	Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar,
	Sanjay K, Luck, Tony, kwankhede, eric.auger, parav, rafael,
	netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain,
	Mona, dmaengine, linux-kernel, linux-pci, kvm

On Mon, Nov 16 2020 at 11:46, Jason Gunthorpe wrote:

> On Mon, Nov 16, 2020 at 07:31:49AM +0000, Tian, Kevin wrote:
>
>> > The subdevices require PASID & IOMMU in native, but inside the guest
>> > there is no need for IOMMU unless you want to build SVM on top.
>> > subdevices work without any vIOMMU or hypercall in the guest. Only
>> > because they look like normal PCI devices we could map interrupts to
>> > legacy MSIx.
>> 
>> Guest managed subdevices on PF/VF requires vIOMMU. 
>
> Why? I've never heard we need vIOMMU for our existing SRIOV flows in
> VMs??

Handing PF/VF into the guest does not require it.

But if the PF/VF driver in the guest wants to create and manage the
magic mdev subdevices which require PASID support then you surely need
it.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-16 17:56                                                                           ` Thomas Gleixner
@ 2020-11-16 18:02                                                                             ` Jason Gunthorpe
  2020-11-16 20:37                                                                               ` Thomas Gleixner
  2020-11-16 23:51                                                                               ` Tian, Kevin
  0 siblings, 2 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2020-11-16 18:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tian, Kevin, Raj, Ashok, Christoph Hellwig, Wilk, Konrad,
	Williams, Dan J, Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha,
	maz, bhelgaas, alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, kwankhede, eric.auger, parav,
	rafael, netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel,
	Hossain, Mona, dmaengine, linux-kernel, linux-pci, kvm

On Mon, Nov 16, 2020 at 06:56:33PM +0100, Thomas Gleixner wrote:
> On Mon, Nov 16 2020 at 11:46, Jason Gunthorpe wrote:
> 
> > On Mon, Nov 16, 2020 at 07:31:49AM +0000, Tian, Kevin wrote:
> >
> >> > The subdevices require PASID & IOMMU in native, but inside the guest
> >> > there is no need for IOMMU unless you want to build SVM on top.
> >> > subdevices work without any vIOMMU or hypercall in the guest. Only
> >> > because they look like normal PCI devices we could map interrupts to
> >> > legacy MSIx.
> >> 
> >> Guest managed subdevices on PF/VF requires vIOMMU. 
> >
> > Why? I've never heard we need vIOMMU for our existing SRIOV flows in
> > VMs??
> 
> Handing PF/VF into the guest does not require it.
> 
> But if the PF/VF driver in the guest wants to create and manage the
> magic mdev subdevices which require PASID support then you surely need
> it.

'magic mdevs' are only one reason to use IMS in a guest. On mlx5 we
might want to use IMS for VDPA devices. mlx5 can spawn a VDPA device
in a guest, against an 'ADI', without ever requiring an IOMMU to do it.

We don't even need IOMMU in the hypervisor to create the ADI, mlx5 has
an internal secure IOMMU that can be used instead of the platform
IOMMU.

Not saying this is a major use case, or a reason not to link things to
IOMMU detection, but let's be clear that a hard need for IOMMU is
another IDXD thing, not general.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-16 18:02                                                                             ` Jason Gunthorpe
@ 2020-11-16 20:37                                                                               ` Thomas Gleixner
  2020-11-16 23:51                                                                               ` Tian, Kevin
  1 sibling, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-16 20:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Raj, Ashok, Christoph Hellwig, Wilk, Konrad,
	Williams, Dan J, Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha,
	maz, bhelgaas, alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu,
	Baolu, Kumar, Sanjay K, Luck, Tony, kwankhede, eric.auger, parav,
	rafael, netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel,
	Hossain, Mona, dmaengine, linux-kernel, linux-pci, kvm

On Mon, Nov 16 2020 at 14:02, Jason Gunthorpe wrote:
> On Mon, Nov 16, 2020 at 06:56:33PM +0100, Thomas Gleixner wrote:
>> On Mon, Nov 16 2020 at 11:46, Jason Gunthorpe wrote:
>> 
>> > On Mon, Nov 16, 2020 at 07:31:49AM +0000, Tian, Kevin wrote:
>> >
>> >> > The subdevices require PASID & IOMMU in native, but inside the guest
>> >> > there is no need for IOMMU unless you want to build SVM on top.
>> >> > subdevices work without any vIOMMU or hypercall in the guest. Only
>> >> > because they look like normal PCI devices we could map interrupts to
>> >> > legacy MSIx.
>> >> 
>> >> Guest managed subdevices on PF/VF requires vIOMMU. 
>> >
>> > Why? I've never heard we need vIOMMU for our existing SRIOV flows in
>> > VMs??
>> 
>> Handing PF/VF into the guest does not require it.
>> 
>> But if the PF/VF driver in the guest wants to create and manage the
>> magic mdev subdevices which require PASID support then you surely need
>> it.
>
> 'magic mdevs' are only one reason to use IMS in a guest. On mlx5 we
> might want to use IMS for VDPA devices. mlx5 can spawn a VDPA device
> in a guest, against an 'ADI', without ever requiring an IOMMU to do it.
>
> We don't even need IOMMU in the hypervisor to create the ADI, mlx5 has
> an internal secure IOMMU that can be used instead of the platform
> IOMMU.
>
> Not saying this is a major use case, or a reason not to link things to
> IOMMU detection, but let's be clear that a hard need for IOMMU is
> another IDXD thing, not general.

Fair enough.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-16 18:02                                                                             ` Jason Gunthorpe
  2020-11-16 20:37                                                                               ` Thomas Gleixner
@ 2020-11-16 23:51                                                                               ` Tian, Kevin
  2020-11-17  9:21                                                                                 ` Thomas Gleixner
  1 sibling, 1 reply; 123+ messages in thread
From: Tian, Kevin @ 2020-11-16 23:51 UTC (permalink / raw)
  To: Jason Gunthorpe, Thomas Gleixner
  Cc: Raj, Ashok, Christoph Hellwig, Wilk, Konrad, Williams, Dan J,
	Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar,
	Sanjay K, Luck, Tony, kwankhede, eric.auger, parav, rafael,
	netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain,
	Mona, dmaengine, linux-kernel, linux-pci, kvm

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 17, 2020 2:03 AM
> 
> On Mon, Nov 16, 2020 at 06:56:33PM +0100, Thomas Gleixner wrote:
> > On Mon, Nov 16 2020 at 11:46, Jason Gunthorpe wrote:
> >
> > > On Mon, Nov 16, 2020 at 07:31:49AM +0000, Tian, Kevin wrote:
> > >
> > >> > The subdevices require PASID & IOMMU in native, but inside the
> > >> > guest there is no need for IOMMU unless you want to build SVM on
> > >> > top. subdevices work without any vIOMMU or hypercall in the
> > >> > guest. Only because they look like normal PCI devices we could
> > >> > map interrupts to legacy MSIx.
> > >>
> > >> Guest managed subdevices on PF/VF requires vIOMMU.
> > >
> > > Why? I've never heard we need vIOMMU for our existing SRIOV flows in
> > > VMs??
> >
> > Handing PF/VF into the guest does not require it.
> >
> > But if the PF/VF driver in the guest wants to create and manage the
> > magic mdev subdevices which require PASID support then you surely need
> > it.
> 
> 'magic mdevs' are only one reason to use IMS in a guest. On mlx5 we
> might want to use IMS for VDPA devices. mlx5 can spawn a VDPA device
> in a guest, against an 'ADI', without ever requiring an IOMMU to do it.
> 
> We don't even need IOMMU in the hypervisor to create the ADI, mlx5 has
> an internal secure IOMMU that can be used instead of the platform
> IOMMU.
> 
> Not saying this is a major use case, or a reason not to link things to
> IOMMU detection, but let's be clear that a hard need for IOMMU is
> another IDXD thing, not general.
> 

I should have said "may require" in the original post. One thing that I
obviously mixed up is that the requirement is PASID-granular interrupt
isolation in the physical IOMMU, not in the virtual IOMMU. But anyway, I
didn't attempt to use the above to build a hard need for IOMMU, just the
opposite when looking at all three cases together.

btw Jason/Thomas, what do you think about the proposal down in this
thread (ims=[auto|on|off])? Does it sound like a good tradeoff to move forward?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
  2020-11-16 23:51                                                                               ` Tian, Kevin
@ 2020-11-17  9:21                                                                                 ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2020-11-17  9:21 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Raj, Ashok, Christoph Hellwig, Wilk, Konrad, Williams, Dan J,
	Jiang, Dave, Bjorn Helgaas, vkoul, Dey, Megha, maz, bhelgaas,
	alex.williamson, Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Kumar,
	Sanjay K, Luck, Tony, kwankhede, eric.auger, parav, rafael,
	netanelg, shahafs, yan.y.zhao, pbonzini, Ortiz, Samuel, Hossain,
	Mona, dmaengine, linux-kernel, linux-pci, kvm

On Mon, Nov 16 2020 at 23:51, Kevin Tian wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
> btw Jason/Thomas, what do you think about the proposal down in this
> thread (ims=[auto|on|off])? Does it sound like a good tradeoff to move forward?

What does it solve? It defaults to auto and then you still need to solve
the problem of figuring out whether it's safe to use it or not.

The command line option is not a solution per se. It's the last resort
when the logic which decides whether IMS can be used or not fails to do
the right thing. Nothing more.

We have clearly outlined what needs to be done and you can come up with
as many magic bullets as you want, they won't make the real problems go
away.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

Thread overview: 123+ messages
2020-10-30 18:50 [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
2020-10-30 18:50 ` [PATCH v4 01/17] irqchip: Add IMS (Interrupt Message Store) driver Dave Jiang
2020-10-30 22:01   ` Thomas Gleixner
2020-10-30 18:51 ` [PATCH v4 02/17] iommu/vt-d: Add DEV-MSI support Dave Jiang
2020-10-30 20:31   ` Thomas Gleixner
2020-10-30 20:52     ` Dave Jiang
2020-10-30 18:51 ` [PATCH v4 03/17] dmaengine: idxd: add theory of operation documentation for idxd mdev Dave Jiang
2020-10-30 18:51 ` [PATCH v4 04/17] dmaengine: idxd: add support for readonly config devices Dave Jiang
2020-10-30 18:51 ` [PATCH v4 05/17] dmaengine: idxd: add interrupt handle request support Dave Jiang
2020-10-30 18:51 ` [PATCH v4 06/17] PCI: add SIOV and IMS capability detection Dave Jiang
2020-10-30 19:51   ` Bjorn Helgaas
2020-10-30 21:20     ` Dave Jiang
2020-10-30 21:50       ` Bjorn Helgaas
2020-10-30 22:45       ` Jason Gunthorpe
2020-10-30 22:49         ` Dave Jiang
2020-11-02 13:21           ` Jason Gunthorpe
2020-11-03  2:49             ` Tian, Kevin
2020-11-03 12:43               ` Jason Gunthorpe
2020-11-04  3:41                 ` Tian, Kevin
2020-11-04 12:40                   ` Jason Gunthorpe
2020-11-04 13:34                     ` Tian, Kevin
2020-11-04 13:54                       ` Jason Gunthorpe
2020-11-06  9:48                         ` Tian, Kevin
2020-11-06 13:14                           ` Jason Gunthorpe
2020-11-06 16:48                             ` Raj, Ashok
2020-11-06 17:51                               ` Jason Gunthorpe
2020-11-06 23:47                                 ` Dan Williams
2020-11-07  0:12                                   ` Jason Gunthorpe
2020-11-07  1:42                                     ` Dan Williams
2020-11-08 18:11                                     ` Raj, Ashok
2020-11-08 18:34                                       ` David Woodhouse
2020-11-08 23:25                                         ` Raj, Ashok
2020-11-10 14:19                                           ` Raj, Ashok
2020-11-10 14:41                                             ` David Woodhouse
2020-11-08 23:41                                       ` Jason Gunthorpe
2020-11-09  0:05                                         ` Raj, Ashok
2020-11-08 18:47                                     ` Thomas Gleixner
2020-11-08 19:36                                       ` David Woodhouse
2020-11-08 22:47                                         ` Thomas Gleixner
2020-11-08 23:29                                           ` Jason Gunthorpe
2020-11-11 15:41                                         ` Christoph Hellwig
2020-11-11 16:09                                           ` Raj, Ashok
2020-11-11 22:27                                             ` Thomas Gleixner
2020-11-11 23:03                                               ` Raj, Ashok
2020-11-12  1:13                                                 ` Thomas Gleixner
2020-11-12 13:10                                                 ` Jason Gunthorpe
2020-11-08 23:23                                       ` Jason Gunthorpe
2020-11-08 23:36                                         ` Raj, Ashok
2020-11-09  7:37                                         ` Tian, Kevin
2020-11-09 16:46                                           ` Jason Gunthorpe
2020-11-08 23:58                                       ` Raj, Ashok
2020-11-09  7:59                                         ` Tian, Kevin
2020-11-09 11:21                                         ` Thomas Gleixner
2020-11-09 17:30                                           ` Jason Gunthorpe
2020-11-09 22:40                                             ` Raj, Ashok
2020-11-09 22:42                                             ` Thomas Gleixner
2020-11-10  5:14                                               ` Raj, Ashok
2020-11-10 10:27                                                 ` Thomas Gleixner
2020-11-10 14:13                                                   ` Raj, Ashok
2020-11-10 14:23                                                     ` Jason Gunthorpe
2020-11-11  2:17                                                       ` Tian, Kevin
2020-11-12 13:46                                                         ` Jason Gunthorpe
2020-11-11  7:14                                                     ` Tian, Kevin
2020-11-12 19:32                                                       ` Konrad Rzeszutek Wilk
2020-11-12 22:42                                                         ` Thomas Gleixner
2020-11-13  2:42                                                           ` Tian, Kevin
2020-11-13 12:57                                                             ` Jason Gunthorpe
2020-11-13 13:32                                                             ` Thomas Gleixner
2020-11-13 16:12                                                               ` Luck, Tony
2020-11-13 17:38                                                                 ` Raj, Ashok
2020-11-14 10:34                                                           ` Christoph Hellwig
2020-11-14 21:18                                                             ` Raj, Ashok
2020-11-15 11:26                                                               ` Thomas Gleixner
2020-11-15 19:31                                                                 ` Raj, Ashok
2020-11-15 22:11                                                                   ` Thomas Gleixner
2020-11-16  0:22                                                                     ` Raj, Ashok
2020-11-16  7:31                                                                       ` Tian, Kevin
2020-11-16 15:46                                                                         ` Jason Gunthorpe
2020-11-16 17:56                                                                           ` Thomas Gleixner
2020-11-16 18:02                                                                             ` Jason Gunthorpe
2020-11-16 20:37                                                                               ` Thomas Gleixner
2020-11-16 23:51                                                                               ` Tian, Kevin
2020-11-17  9:21                                                                                 ` Thomas Gleixner
2020-11-16  8:25                                                               ` Christoph Hellwig
2020-11-10 14:19                                                 ` Jason Gunthorpe
2020-11-11  2:35                                                   ` Tian, Kevin
2020-11-08 21:18                             ` Thomas Gleixner
2020-11-08 22:09                               ` David Woodhouse
2020-11-08 22:52                                 ` Thomas Gleixner
2020-11-07  0:32                           ` Thomas Gleixner
2020-11-09  5:25                             ` Tian, Kevin
2020-10-30 18:51 ` [PATCH v4 07/17] dmaengine: idxd: add IMS support in base driver Dave Jiang
2020-10-30 18:51 ` [PATCH v4 08/17] dmaengine: idxd: add device support functions in prep for mdev Dave Jiang
2020-10-30 18:51 ` [PATCH v4 09/17] dmaengine: idxd: add basic mdev registration and helper functions Dave Jiang
2020-10-30 18:51 ` [PATCH v4 10/17] dmaengine: idxd: add emulation rw routines Dave Jiang
2020-10-30 18:52 ` [PATCH v4 11/17] dmaengine: idxd: prep for virtual device commands Dave Jiang
2020-10-30 18:52 ` [PATCH v4 12/17] dmaengine: idxd: virtual device commands emulation Dave Jiang
2020-10-30 18:52 ` [PATCH v4 13/17] dmaengine: idxd: ims setup for the vdcm Dave Jiang
2020-10-30 21:26   ` Thomas Gleixner
2020-10-30 18:52 ` [PATCH v4 14/17] dmaengine: idxd: add mdev type as a new wq type Dave Jiang
2020-10-30 18:52 ` [PATCH v4 15/17] dmaengine: idxd: add dedicated wq mdev type Dave Jiang
2020-10-30 18:52 ` [PATCH v4 16/17] dmaengine: idxd: add new wq state for mdev Dave Jiang
2020-10-30 18:52 ` [PATCH v4 17/17] dmaengine: idxd: add error notification from host driver to mediated device Dave Jiang
2020-10-30 18:58 ` [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver Jason Gunthorpe
2020-10-30 19:13   ` Dave Jiang
2020-10-30 19:17     ` Jason Gunthorpe
2020-10-30 19:23       ` Raj, Ashok
2020-10-30 19:30         ` Jason Gunthorpe
2020-10-30 20:43           ` Raj, Ashok
2020-10-30 22:54             ` Jason Gunthorpe
2020-10-31  2:50             ` Thomas Gleixner
2020-10-31 23:53               ` Raj, Ashok
2020-11-02 13:20                 ` Jason Gunthorpe
2020-11-02 16:20                   ` Raj, Ashok
2020-11-02 17:19                     ` Jason Gunthorpe
2020-11-02 18:18                       ` Dave Jiang
2020-11-02 18:26                         ` Jason Gunthorpe
2020-11-02 18:38                           ` Dan Williams
2020-11-02 18:51                             ` Jason Gunthorpe
2020-11-02 19:26                               ` Dan Williams
2020-10-30 20:48 ` Thomas Gleixner
2020-10-30 20:59   ` Dave Jiang
2020-10-30 22:10     ` Thomas Gleixner
