linux-hyperv.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC v1 0/8] MSHV: add PV-IOMMU driver
@ 2021-07-09 11:43 Wei Liu
  2021-07-09 11:43 ` [RFC v1 1/8] x86/hyperv: export hv_build_pci_dev_id Wei Liu
                   ` (7 more replies)
  0 siblings, 8 replies; 35+ messages in thread
From: Wei Liu @ 2021-07-09 11:43 UTC (permalink / raw)
  To: Linux on Hyper-V List
  Cc: virtualization, Linux Kernel List, Michael Kelley,
	Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves, kumarpraveen,
	pasha.tatashin, Wei Liu

Hi all

Device passthrough is a critial feature for a virtualization stack. When
designing this feature for MSHV support on Linux, one important
considration is to not deviate from Linux's default VFIO stack. VFIO
relies on an IOMMU or IOMMUs in the system to manipulate DMA mappings.

In this series an IOMMU driver is implemented using a set of hypercall
interfaces provided by the Microsoft Hypervisor. At this stage only DMA
remapping is implemented. Interrupt remapping will come later.

With this series I'm able to passthrough an NVMe drive to a guest with VFIO on
a modified version of Cloud Hypervisor. From users' point of view, nothing
needs changing. Cloud Hypervisor and Rust-VMM changes, which depend on the new
kernel UAPIs from this series, will be upstreamed too.

This series is built on top of Nuno and Vineeth's patches [0][1].

The meat is in the patch named "mshv: add paravirtualized IOMMU
support".

The in-kernel device framework and the VFIO bridge device are heavily
inspired by KVM's code. I pondered whether it would be worth refactoring
the code in KVM but decided against that route for two reasons: 1. it
allowed faster prototyping and 2. I was not sure if that's something KVM
community would agree to.

For the VT-D changes, what we're after is to build the RMRR regions list
so that reserved regions are respected. Instead of doing a bad job
myself, I decided to piggy-back on Intel's own code. AMD support is to
be added until we have an AMD system.

Comments are welcome.

Thanks,
Wei.

[0] https://lore.kernel.org/linux-hyperv/1622241819-21155-1-git-send-email-nunodasneves@linux.microsoft.com/
[1] https://lore.kernel.org/linux-hyperv/cover.1622654100.git.viremana@linux.microsoft.com/

Wei Liu (8):
  x86/hyperv: export hv_build_pci_dev_id
  asm-generic/hyperv: add device domain definitions
  intel/vt-d: make DMAR table parsing code more flexible
  intel/vt-d: export intel_iommu_get_resv_regions
  mshv: add paravirtualized IOMMU support
  mshv: command line option to skip devices in PV-IOMMU
  mshv: implement in-kernel device framework
  mshv: add vfio bridge device

 Documentation/virt/mshv/api.rst     |  12 +
 arch/x86/hyperv/irqdomain.c         |   3 +-
 arch/x86/include/asm/mshyperv.h     |   1 +
 drivers/hv/Kconfig                  |   4 +
 drivers/hv/Makefile                 |   2 +-
 drivers/hv/mshv_main.c              | 186 ++++++++
 drivers/hv/vfio.c                   | 244 ++++++++++
 drivers/hv/vfio.h                   |  18 +
 drivers/iommu/Kconfig               |  14 +
 drivers/iommu/hyperv-iommu.c        | 673 ++++++++++++++++++++++++++++
 drivers/iommu/intel/dmar.c          |  38 +-
 drivers/iommu/intel/iommu.c         |   7 +-
 drivers/iommu/intel/irq_remapping.c |   2 +-
 include/asm-generic/hyperv-tlfs.h   | 144 ++++++
 include/linux/dmar.h                |   2 +-
 include/linux/intel-iommu.h         |   4 +
 include/linux/mshv.h                |  57 +++
 include/uapi/linux/mshv.h           |  36 ++
 18 files changed, 1429 insertions(+), 18 deletions(-)
 create mode 100644 drivers/hv/vfio.c
 create mode 100644 drivers/hv/vfio.h

-- 
2.30.2


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC v1 1/8] x86/hyperv: export hv_build_pci_dev_id
  2021-07-09 11:43 [RFC v1 0/8] MSHV: add PV-IOMMU driver Wei Liu
@ 2021-07-09 11:43 ` Wei Liu
  2021-07-09 11:43 ` [RFC v1 2/8] asm-generic/hyperv: add device domain definitions Wei Liu
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 35+ messages in thread
From: Wei Liu @ 2021-07-09 11:43 UTC (permalink / raw)
  To: Linux on Hyper-V List
  Cc: virtualization, Linux Kernel List, Michael Kelley,
	Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves, kumarpraveen,
	pasha.tatashin, Wei Liu, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Dexuan Cui, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin

Signed-off-by: Wei Liu <wei.liu@kernel.org>
---
 arch/x86/hyperv/irqdomain.c     | 3 ++-
 arch/x86/include/asm/mshyperv.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
index 514fc64e23d5..5ea7a5145ea9 100644
--- a/arch/x86/hyperv/irqdomain.c
+++ b/arch/x86/hyperv/irqdomain.c
@@ -110,7 +110,7 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
 	return 0;
 }
 
-static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
+union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
 {
 	union hv_device_id dev_id;
 	struct rid_data data = {
@@ -168,6 +168,7 @@ static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
 
 	return dev_id;
 }
+EXPORT_SYMBOL_GPL(hv_build_pci_dev_id);
 
 static int hv_map_msi_interrupt(struct pci_dev *dev, int cpu, int vector,
 				struct hv_interrupt_entry *entry)
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 21edef600729..b8f6b21e1fa5 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -196,6 +196,7 @@ struct irq_domain *hv_create_pci_msi_domain(void);
 int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
 		struct hv_interrupt_entry *entry);
 int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
+union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev);
 
 #else /* CONFIG_HYPERV */
 static inline void hyperv_init(void) {}
-- 
2.30.2


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC v1 2/8] asm-generic/hyperv: add device domain definitions
  2021-07-09 11:43 [RFC v1 0/8] MSHV: add PV-IOMMU driver Wei Liu
  2021-07-09 11:43 ` [RFC v1 1/8] x86/hyperv: export hv_build_pci_dev_id Wei Liu
@ 2021-07-09 11:43 ` Wei Liu
  2021-07-09 11:43 ` [RFC v1 3/8] intel/vt-d: make DMAR table parsing code more flexible Wei Liu
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 35+ messages in thread
From: Wei Liu @ 2021-07-09 11:43 UTC (permalink / raw)
  To: Linux on Hyper-V List
  Cc: virtualization, Linux Kernel List, Michael Kelley,
	Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves, kumarpraveen,
	pasha.tatashin, Wei Liu, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Dexuan Cui, Arnd Bergmann,
	open list:GENERIC INCLUDE/ASM HEADER FILES

These definitions are for the IOMMU device domain interfaces exposed by
Microsoft Hyperivsor. We will use them to implement DMA remapping.

Signed-off-by: Wei Liu <wei.liu@kernel.org>
---
 include/asm-generic/hyperv-tlfs.h | 144 ++++++++++++++++++++++++++++++
 1 file changed, 144 insertions(+)

diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index c1d3424f9e00..a39b0e6393f2 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -93,6 +93,13 @@
 #define HV_ENABLE_EXTENDED_HYPERCALLS		BIT(20)
 #define HV_ISOLATION				BIT(22)
 
+/*
+ * Group D features
+ */
+#define HV_DEVICE_DOMAIN_AVAILABLE		BIT(24)
+#define HV_S1_DEVICE_DOMAIN_AVAILABLE		BIT(25)
+
+
 /*
  * TSC page layout.
  */
@@ -179,6 +186,17 @@ struct ms_hyperv_tsc_page {
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0
 #define HVCALL_GET_GPA_PAGES_ACCESS_STATES 0x00c9
+#define HVCALL_CREATE_DEVICE_DOMAIN		0x00b1
+#define HVCALL_ATTACH_DEVICE_DOMAIN		0x00b2
+#define HVCALL_MAP_DEVICE_GPA_PAGES		0x00b3
+#define HVCALL_UNMAP_DEVICE_GPA_PAGES		0x00b4
+#define HVCALL_DETACH_DEVICE_DOMAIN		0x00c4
+#define HVCALL_DELETE_DEVICE_DOMAIN		0x00c5
+#define HVCALL_QUERY_DEVICE_DOMAIN		0x00c6
+#define HVCALL_MAP_SPARSE_DEVICE_GPA_PAGES	0x00c7
+#define HVCALL_UNMAP_SPARSE_DEVICE_GPA_PAGES	0x00c8
+#define HVCALL_CONFIGURE_DEVICE_DOMAIN		0x00ce
+#define HVCALL_FLUSH_DEVICE_DOMAIN		0x00d0
 #define HVCALL_MAP_VP_STATE_PAGE			0x00e1
 #define HVCALL_GET_VP_STATE				0x00e3
 #define HVCALL_SET_VP_STATE				0x00e4
@@ -1069,4 +1087,130 @@ union hv_disconnect_port {
 		u32 reserved: 31;
 	};
 } __packed;
+
+/* Device domain types */
+#define HV_DEVICE_DOMAIN_TYPE_S2  0 /* Stage 2 domain */
+#define HV_DEVICE_DOMAIN_TYPE_S1  1 /* Stage 1 domain */
+#define HV_DEVICE_DOMAIN_TYPE_SOC 2 /* SOC domain */
+
+/* ID for stage 2 default domain and NULL domain */
+#define HV_DEVICE_DOMAIN_ID_S2_DEFAULT 0
+#define HV_DEVICE_DOMAIN_ID_S2_NULL    0xFFFFFFFFULL
+
+union hv_device_domain_id {
+	u64 as_uint64;
+	struct {
+		u32 type: 4;
+		u32 reserved: 28;
+		u32 id;
+	};
+} __packed;
+
+union hv_input_vtl {
+	u8 as_uint8;
+	struct {
+		u8 target_vtl: 4;
+		u8 use_target_vtl: 1;
+		u8 reserved_z: 3;
+	};
+} __packed;
+
+struct hv_input_device_domain {
+	u64 partition_id;
+	union hv_input_vtl owner_vtl;
+	u8 padding[7];
+	union hv_device_domain_id domain_id;
+} __packed;
+
+union hv_create_device_domain_flags {
+	u32 as_uint32;
+	struct {
+		u32 forward_progress_required: 1;
+		u32 inherit_owning_vtl: 1;
+		u32 reserved: 30;
+	};
+} __packed;
+
+struct hv_input_create_device_domain {
+	struct hv_input_device_domain device_domain;
+	union hv_create_device_domain_flags create_device_domain_flags;
+} __packed;
+
+struct hv_input_delete_device_domain {
+	struct hv_input_device_domain device_domain;
+} __packed;
+
+struct hv_input_attach_device_domain {
+	struct hv_input_device_domain device_domain;
+	union hv_device_id device_id;
+} __packed;
+
+struct hv_input_detach_device_domain {
+	u64 partition_id;
+	union hv_device_id device_id;
+} __packed;
+
+struct hv_input_map_device_gpa_pages {
+	struct hv_input_device_domain device_domain;
+	union hv_input_vtl target_vtl;
+	u8 padding[3];
+	u32 map_flags;
+	u64 target_device_va_base;
+	u64 gpa_page_list[];
+} __packed;
+
+struct hv_input_unmap_device_gpa_pages {
+	struct hv_input_device_domain device_domain;
+	u64 target_device_va_base;
+} __packed;
+
+struct hv_input_query_device_domain {
+	struct hv_input_device_domain device_domain;
+	u64 target_device_va_list[];
+} __packed;
+
+struct hv_device_va_mapping {
+	u64 target_device_va;
+	u64 gpa_page_number;
+} __packed;
+
+struct hv_input_map_sparse_device_gpa_pages {
+	struct hv_input_device_domain device_domain;
+	union hv_input_vtl target_vtl;
+	u32 map_flags;
+	struct hv_device_va_mapping page_list[];
+} __packed;
+
+struct hv_input_unmap_sparse_device_gpa_pages {
+	struct hv_input_device_domain device_domain;
+	u64 target_device_va_list[];
+} __packed;
+
+struct hv_device_domain_settings_x64 {
+	struct {
+		/*
+		 * Enable translations. If not enabled, all transaction bypass
+		 * S1 translations.
+		 */
+		u64 translation_enabled: 1;
+		u64 reserved: 63;
+	} flags;
+	/* Page attribute table */
+	u64 pat;
+	/* Address of translation table */
+	u64 page_table_root;
+} __packed;
+
+struct hv_device_domain_settings {
+	union {
+		struct hv_device_domain_settings_x64 x64;
+		/* ARM TBD */
+	};
+} __packed;
+
+struct hv_input_configure_device_domain {
+	struct hv_input_device_domain device_domain;
+	struct hv_device_domain_settings settings;
+} __packed;
+
 #endif
-- 
2.30.2


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC v1 3/8] intel/vt-d: make DMAR table parsing code more flexible
  2021-07-09 11:43 [RFC v1 0/8] MSHV: add PV-IOMMU driver Wei Liu
  2021-07-09 11:43 ` [RFC v1 1/8] x86/hyperv: export hv_build_pci_dev_id Wei Liu
  2021-07-09 11:43 ` [RFC v1 2/8] asm-generic/hyperv: add device domain definitions Wei Liu
@ 2021-07-09 11:43 ` Wei Liu
  2021-07-09 12:56   ` Robin Murphy
  2021-07-09 11:43 ` [RFC v1 4/8] intel/vt-d: export intel_iommu_get_resv_regions Wei Liu
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 35+ messages in thread
From: Wei Liu @ 2021-07-09 11:43 UTC (permalink / raw)
  To: Linux on Hyper-V List
  Cc: virtualization, Linux Kernel List, Michael Kelley,
	Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves, kumarpraveen,
	pasha.tatashin, Wei Liu, David Woodhouse, Lu Baolu, Joerg Roedel,
	Will Deacon, Yian Chen, open list:INTEL IOMMU (VT-d)

Microsoft Hypervisor provides a set of hypercalls to manage device
domains. The root kernel should parse the DMAR so that it can program
the IOMMU (with hypercalls) correctly.

The DMAR code was designed to work with Intel IOMMU only. Add two more
parameters to make it useful to Microsoft Hypervisor. Microsoft
Hypervisor does not need the DMAR parsing code to allocate an Intel
IOMMU structure; it also wishes to always reparse the DMAR table even
after it has been parsed before.

Adjust Intel IOMMU code to use the new dmar_table_init. There should be
no functional change to Intel IOMMU code.

Signed-off-by: Wei Liu <wei.liu@kernel.org>
---
We may be able to combine alloc and force_parse?
---
 drivers/iommu/intel/dmar.c          | 38 ++++++++++++++++++++---------
 drivers/iommu/intel/iommu.c         |  2 +-
 drivers/iommu/intel/irq_remapping.c |  2 +-
 include/linux/dmar.h                |  2 +-
 4 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 84057cb9596c..bd72f47c728b 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -408,7 +408,8 @@ dmar_find_dmaru(struct acpi_dmar_hardware_unit *drhd)
  * structure which uniquely represent one DMA remapping hardware unit
  * present in the platform
  */
-static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
+static int dmar_parse_one_drhd_internal(struct acpi_dmar_header *header,
+		void *arg, bool alloc)
 {
 	struct acpi_dmar_hardware_unit *drhd;
 	struct dmar_drhd_unit *dmaru;
@@ -440,12 +441,14 @@ static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
 		return -ENOMEM;
 	}
 
-	ret = alloc_iommu(dmaru);
-	if (ret) {
-		dmar_free_dev_scope(&dmaru->devices,
-				    &dmaru->devices_cnt);
-		kfree(dmaru);
-		return ret;
+	if (alloc) {
+		ret = alloc_iommu(dmaru);
+		if (ret) {
+			dmar_free_dev_scope(&dmaru->devices,
+					    &dmaru->devices_cnt);
+			kfree(dmaru);
+			return ret;
+		}
 	}
 	dmar_register_drhd_unit(dmaru);
 
@@ -456,6 +459,16 @@ static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
 	return 0;
 }
 
+static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
+{
+	return dmar_parse_one_drhd_internal(header, arg, true);
+}
+
+int dmar_parse_one_drhd_noalloc(struct acpi_dmar_header *header, void *arg)
+{
+	return dmar_parse_one_drhd_internal(header, arg, false);
+}
+
 static void dmar_free_drhd(struct dmar_drhd_unit *dmaru)
 {
 	if (dmaru->devices && dmaru->devices_cnt)
@@ -633,7 +646,7 @@ static inline int dmar_walk_dmar_table(struct acpi_table_dmar *dmar,
  * parse_dmar_table - parses the DMA reporting table
  */
 static int __init
-parse_dmar_table(void)
+parse_dmar_table(bool alloc)
 {
 	struct acpi_table_dmar *dmar;
 	int drhd_count = 0;
@@ -650,6 +663,9 @@ parse_dmar_table(void)
 		.cb[ACPI_DMAR_TYPE_SATC] = &dmar_parse_one_satc,
 	};
 
+	if (!alloc)
+		cb.cb[ACPI_DMAR_TYPE_HARDWARE_UNIT] = &dmar_parse_one_drhd_noalloc;
+
 	/*
 	 * Do it again, earlier dmar_tbl mapping could be mapped with
 	 * fixed map.
@@ -840,13 +856,13 @@ void __init dmar_register_bus_notifier(void)
 }
 
 
-int __init dmar_table_init(void)
+int __init dmar_table_init(bool alloc, bool force_parse)
 {
 	static int dmar_table_initialized;
 	int ret;
 
-	if (dmar_table_initialized == 0) {
-		ret = parse_dmar_table();
+	if (dmar_table_initialized == 0 || force_parse) {
+		ret = parse_dmar_table(alloc);
 		if (ret < 0) {
 			if (ret != -ENODEV)
 				pr_info("Parse DMAR table failure.\n");
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index be35284a2016..a4294d310b93 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4310,7 +4310,7 @@ int __init intel_iommu_init(void)
 	}
 
 	down_write(&dmar_global_lock);
-	if (dmar_table_init()) {
+	if (dmar_table_init(true, false)) {
 		if (force_on)
 			panic("tboot: Failed to initialize DMAR table\n");
 		goto out_free_dmar;
diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index f912fe45bea2..0e8abef862e4 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -732,7 +732,7 @@ static int __init intel_prepare_irq_remapping(void)
 		return -ENODEV;
 	}
 
-	if (dmar_table_init() < 0)
+	if (dmar_table_init(true, false) < 0)
 		return -ENODEV;
 
 	if (intel_cap_audit(CAP_AUDIT_STATIC_IRQR, NULL))
diff --git a/include/linux/dmar.h b/include/linux/dmar.h
index e04436a7ff27..f88535d41a6e 100644
--- a/include/linux/dmar.h
+++ b/include/linux/dmar.h
@@ -106,7 +106,7 @@ static inline bool dmar_rcu_check(void)
 	for_each_dev_scope((devs), (cnt), (i), (tmp))			\
 		if (!(tmp)) { continue; } else
 
-extern int dmar_table_init(void);
+extern int dmar_table_init(bool alloc, bool force_parse);
 extern int dmar_dev_scope_init(void);
 extern void dmar_register_bus_notifier(void);
 extern int dmar_parse_dev_scope(void *start, void *end, int *cnt,
-- 
2.30.2


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC v1 4/8] intel/vt-d: export intel_iommu_get_resv_regions
  2021-07-09 11:43 [RFC v1 0/8] MSHV: add PV-IOMMU driver Wei Liu
                   ` (2 preceding siblings ...)
  2021-07-09 11:43 ` [RFC v1 3/8] intel/vt-d: make DMAR table parsing code more flexible Wei Liu
@ 2021-07-09 11:43 ` Wei Liu
  2021-07-09 14:17   ` Lu Baolu
  2021-07-09 11:43 ` [RFC v1 5/8] mshv: add paravirtualized IOMMU support Wei Liu
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 35+ messages in thread
From: Wei Liu @ 2021-07-09 11:43 UTC (permalink / raw)
  To: Linux on Hyper-V List
  Cc: virtualization, Linux Kernel List, Michael Kelley,
	Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves, kumarpraveen,
	pasha.tatashin, Wei Liu, David Woodhouse, Lu Baolu, Joerg Roedel,
	Will Deacon, open list:INTEL IOMMU (VT-d)

When Microsoft Hypervisor runs on Intel platforms it needs to know the
reserved regions to program devices correctly. There is no reason to
duplicate intel_iommu_get_resv_regions. Export it.

Signed-off-by: Wei Liu <wei.liu@kernel.org>
---
 drivers/iommu/intel/iommu.c | 5 +++--
 include/linux/intel-iommu.h | 4 ++++
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index a4294d310b93..01973bc20080 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5176,8 +5176,8 @@ static void intel_iommu_probe_finalize(struct device *dev)
 		set_dma_ops(dev, NULL);
 }
 
-static void intel_iommu_get_resv_regions(struct device *device,
-					 struct list_head *head)
+void intel_iommu_get_resv_regions(struct device *device,
+				 struct list_head *head)
 {
 	int prot = DMA_PTE_READ | DMA_PTE_WRITE;
 	struct iommu_resv_region *reg;
@@ -5232,6 +5232,7 @@ static void intel_iommu_get_resv_regions(struct device *device,
 		return;
 	list_add_tail(&reg->list, head);
 }
+EXPORT_SYMBOL_GPL(intel_iommu_get_resv_regions);
 
 int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev)
 {
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 03faf20a6817..f91869f765bc 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -814,6 +814,8 @@ extern int iommu_calculate_max_sagaw(struct intel_iommu *iommu);
 extern int dmar_disabled;
 extern int intel_iommu_enabled;
 extern int intel_iommu_gfx_mapped;
+extern void intel_iommu_get_resv_regions(struct device *device,
+				 struct list_head *head);
 #else
 static inline int iommu_calculate_agaw(struct intel_iommu *iommu)
 {
@@ -825,6 +827,8 @@ static inline int iommu_calculate_max_sagaw(struct intel_iommu *iommu)
 }
 #define dmar_disabled	(1)
 #define intel_iommu_enabled (0)
+static inline void intel_iommu_get_resv_regions(struct device *device,
+				 struct list_head *head) {}
 #endif
 
 #endif
-- 
2.30.2


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC v1 5/8] mshv: add paravirtualized IOMMU support
  2021-07-09 11:43 [RFC v1 0/8] MSHV: add PV-IOMMU driver Wei Liu
                   ` (3 preceding siblings ...)
  2021-07-09 11:43 ` [RFC v1 4/8] intel/vt-d: export intel_iommu_get_resv_regions Wei Liu
@ 2021-07-09 11:43 ` Wei Liu
  2021-08-03 18:40   ` Praveen Kumar
  2021-07-09 11:43 ` [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU Wei Liu
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 35+ messages in thread
From: Wei Liu @ 2021-07-09 11:43 UTC (permalink / raw)
  To: Linux on Hyper-V List
  Cc: virtualization, Linux Kernel List, Michael Kelley,
	Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves, kumarpraveen,
	pasha.tatashin, Wei Liu, Joerg Roedel, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	open list:IOMMU DRIVERS

Microsoft Hypervisor provides a set of hypercalls to manage device
domains. Implement a type-1 IOMMU using those hypercalls.

Implement DMA remapping as the first step for this driver. Interrupt
remapping will come in a later stage.

Signed-off-by: Wei Liu <wei.liu@kernel.org>
---
 drivers/iommu/Kconfig        |  14 +
 drivers/iommu/hyperv-iommu.c | 628 +++++++++++++++++++++++++++++++++++
 2 files changed, 642 insertions(+)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 1f111b399bca..e230c4536825 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -397,6 +397,20 @@ config HYPERV_IOMMU
 	  Stub IOMMU driver to handle IRQs as to allow Hyper-V Linux
 	  guests to run with x2APIC mode enabled.
 
+config HYPERV_ROOT_PVIOMMU
+	bool "Microsoft Hypervisor paravirtualized IOMMU support"
+	depends on HYPERV && X86
+	select IOMMU_API
+	select IOMMU_DMA
+	select DMA_OPS
+	select IOMMU_IOVA
+	select INTEL_IOMMU
+	select DMAR_TABLE
+	default HYPERV
+	help
+	  A paravirtualized IOMMU for Microsoft Hypervisor. It supports DMA
+	  remapping.
+
 config VIRTIO_IOMMU
 	tristate "Virtio IOMMU driver"
 	depends on VIRTIO
diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index e285a220c913..043dcff06511 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -12,7 +12,12 @@
 #include <linux/interrupt.h>
 #include <linux/irq.h>
 #include <linux/iommu.h>
+#include <linux/dma-iommu.h>
 #include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/interval_tree.h>
+#include <linux/dma-map-ops.h>
+#include <linux/dmar.h>
 
 #include <asm/apic.h>
 #include <asm/cpu.h>
@@ -21,6 +26,9 @@
 #include <asm/irq_remapping.h>
 #include <asm/hypervisor.h>
 #include <asm/mshyperv.h>
+#include <asm/iommu_table.h>
+
+#include <linux/intel-iommu.h>
 
 #include "irq_remapping.h"
 
@@ -338,3 +346,623 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
 };
 
 #endif
+
+#ifdef CONFIG_HYPERV_ROOT_PVIOMMU
+
+/* DMA remapping support */
+struct hv_iommu_domain {
+	struct iommu_domain domain;
+	struct hv_iommu_dev *hv_iommu;
+
+	struct hv_input_device_domain device_domain;
+
+	spinlock_t mappings_lock;
+	struct rb_root_cached mappings;
+
+	u32 map_flags;
+	u64 pgsize_bitmap;
+};
+
+/* What hardware IOMMU? */
+enum {
+	INVALID = 0,
+	INTEL, /* Only Intel for now */
+} hw_iommu_type = { INVALID };
+
+static struct hv_iommu_domain hv_identity_domain, hv_null_domain;
+
+#define to_hv_iommu_domain(d) \
+	container_of(d, struct hv_iommu_domain, domain)
+
+struct hv_iommu_mapping {
+	phys_addr_t paddr;
+	struct interval_tree_node iova;
+	u32 flags;
+};
+
+struct hv_iommu_dev {
+	struct iommu_device iommu;
+
+	struct ida domain_ids;
+
+	/* Device configuration */
+	struct iommu_domain_geometry geometry;
+	u64 first_domain;
+	u64 last_domain;
+
+	u32 map_flags;
+	u64 pgsize_bitmap;
+};
+
+static struct hv_iommu_dev *hv_iommu_device;
+
+struct hv_iommu_endpoint {
+	struct device *dev;
+	struct hv_iommu_dev *hv_iommu;
+	struct hv_iommu_domain *domain;
+};
+
+static void __init hv_initialize_special_domains(void)
+{
+	struct hv_iommu_domain *domain;
+
+	/* Default passthrough domain */
+	domain = &hv_identity_domain;
+
+	memset(domain, 0, sizeof(*domain));
+
+	domain->device_domain.partition_id = HV_PARTITION_ID_SELF;
+	domain->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+	domain->device_domain.domain_id.id = HV_DEVICE_DOMAIN_ID_S2_DEFAULT;
+
+	domain->domain.geometry = hv_iommu_device->geometry;
+
+	/* NULL domain that blocks all DMA transactions */
+	domain = &hv_null_domain;
+
+	memset(domain, 0, sizeof(*domain));
+
+	domain->device_domain.partition_id = HV_PARTITION_ID_SELF;
+	domain->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+	domain->device_domain.domain_id.id = HV_DEVICE_DOMAIN_ID_S2_NULL;
+
+	domain->domain.geometry = hv_iommu_device->geometry;
+}
+
+static bool is_identity_domain(struct hv_iommu_domain *d)
+{
+	return d->device_domain.domain_id.id == HV_DEVICE_DOMAIN_ID_S2_DEFAULT;
+}
+
+static bool is_null_domain(struct hv_iommu_domain *d)
+{
+	return d->device_domain.domain_id.id == HV_DEVICE_DOMAIN_ID_S2_NULL;
+}
+
+static struct iommu_domain *hv_iommu_domain_alloc(unsigned type)
+{
+	struct hv_iommu_domain *domain;
+	int ret;
+	u64 status;
+	unsigned long flags;
+	struct hv_input_create_device_domain *input;
+
+	if (type == IOMMU_DOMAIN_IDENTITY)
+		return &hv_identity_domain.domain;
+
+	if (type == IOMMU_DOMAIN_BLOCKED)
+		return &hv_null_domain.domain;
+
+	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+	if (!domain)
+		goto out;
+
+	spin_lock_init(&domain->mappings_lock);
+	domain->mappings = RB_ROOT_CACHED;
+
+	if (type == IOMMU_DOMAIN_DMA &&
+		iommu_get_dma_cookie(&domain->domain)) {
+		goto out_free;
+	}
+
+	ret = ida_alloc_range(&hv_iommu_device->domain_ids,
+			hv_iommu_device->first_domain, hv_iommu_device->last_domain,
+			GFP_KERNEL);
+	if (ret < 0)
+		goto out_put_cookie;
+
+	domain->device_domain.partition_id = HV_PARTITION_ID_SELF;
+	domain->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+	domain->device_domain.domain_id.id = ret;
+
+	domain->hv_iommu = hv_iommu_device;
+	domain->map_flags = hv_iommu_device->map_flags;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+
+	input->device_domain = domain->device_domain;
+
+	input->create_device_domain_flags.forward_progress_required = 1;
+	input->create_device_domain_flags.inherit_owning_vtl = 0;
+
+	status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL);
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status)) {
+		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
+		goto out_free_id;
+	}
+
+	domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;
+	domain->domain.geometry = hv_iommu_device->geometry;
+
+	return &domain->domain;
+
+out_free_id:
+	ida_free(&hv_iommu_device->domain_ids, domain->device_domain.domain_id.id);
+out_put_cookie:
+	iommu_put_dma_cookie(&domain->domain);
+out_free:
+	kfree(domain);
+out:
+	return NULL;
+}
+
+static void hv_iommu_domain_free(struct iommu_domain *d)
+{
+	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
+	unsigned long flags;
+	u64 status;
+	struct hv_input_delete_device_domain *input;
+
+	if (is_identity_domain(domain) || is_null_domain(domain))
+		return;
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+
+	input->device_domain= domain->device_domain;
+
+	status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input, NULL);
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
+
+	ida_free(&domain->hv_iommu->domain_ids, domain->device_domain.domain_id.id);
+
+	iommu_put_dma_cookie(d);
+
+	kfree(domain);
+}
+
+static int hv_iommu_attach_dev(struct iommu_domain *d, struct device *dev)
+{
+	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
+	u64 status;
+	unsigned long flags;
+	struct hv_input_attach_device_domain *input;
+	struct pci_dev *pdev;
+	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
+
+	/* Only allow PCI devices for now */
+	if (!dev_is_pci(dev))
+		return -EINVAL;
+
+	pdev = to_pci_dev(dev);
+
+	dev_dbg(dev, "Attaching (%strusted) to %d\n", pdev->untrusted ? "un" : "",
+		domain->device_domain.domain_id.id);
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+
+	input->device_domain = domain->device_domain;
+	input->device_id = hv_build_pci_dev_id(pdev);
+
+	status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
+	else
+		vdev->domain = domain;
+
+	return hv_status_to_errno(status);
+}
+
+static void hv_iommu_detach_dev(struct iommu_domain *d, struct device *dev)
+{
+	u64 status;
+	unsigned long flags;
+	struct hv_input_detach_device_domain *input;
+	struct pci_dev *pdev;
+	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
+	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
+
+	/* See the attach function, only PCI devices for now */
+	if (!dev_is_pci(dev))
+		return;
+
+	pdev = to_pci_dev(dev);
+
+	dev_dbg(dev, "Detaching from %d\n", domain->device_domain.domain_id.id);
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+
+	input->partition_id = HV_PARTITION_ID_SELF;
+	input->device_id = hv_build_pci_dev_id(pdev);
+
+	status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
+
+	vdev->domain = NULL;
+}
+
+static int hv_iommu_add_mapping(struct hv_iommu_domain *domain, unsigned long iova,
+		phys_addr_t paddr, size_t size, u32 flags)
+{
+	unsigned long irqflags;
+	struct hv_iommu_mapping *mapping;
+
+	mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
+	if (!mapping)
+		return -ENOMEM;
+
+	mapping->paddr = paddr;
+	mapping->iova.start = iova;
+	mapping->iova.last = iova + size - 1;
+	mapping->flags = flags;
+
+	spin_lock_irqsave(&domain->mappings_lock, irqflags);
+	interval_tree_insert(&mapping->iova, &domain->mappings);
+	spin_unlock_irqrestore(&domain->mappings_lock, irqflags);
+
+	return 0;
+}
+
+static size_t hv_iommu_del_mappings(struct hv_iommu_domain *domain,
+		unsigned long iova, size_t size)
+{
+	unsigned long flags;
+	size_t unmapped = 0;
+	unsigned long last = iova + size - 1;
+	struct hv_iommu_mapping *mapping = NULL;
+	struct interval_tree_node *node, *next;
+
+	spin_lock_irqsave(&domain->mappings_lock, flags);
+	next = interval_tree_iter_first(&domain->mappings, iova, last);
+	while (next) {
+		node = next;
+		mapping = container_of(node, struct hv_iommu_mapping, iova);
+		next = interval_tree_iter_next(node, iova, last);
+
+		/* Trying to split a mapping? Not supported for now. */
+		if (mapping->iova.start < iova)
+			break;
+
+		unmapped += mapping->iova.last - mapping->iova.start + 1;
+
+		interval_tree_remove(node, &domain->mappings);
+		kfree(mapping);
+	}
+	spin_unlock_irqrestore(&domain->mappings_lock, flags);
+
+	return unmapped;
+}
+
+static int hv_iommu_map(struct iommu_domain *d, unsigned long iova,
+			phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+{
+	u32 map_flags;
+	unsigned long flags, pfn, npages;
+	int ret, i;
+	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
+	struct hv_input_map_device_gpa_pages *input;
+	u64 status;
+
+	/* Reject size that's not a whole page */
+	if (size & ~HV_HYP_PAGE_MASK)
+		return -EINVAL;
+
+	map_flags = HV_MAP_GPA_READABLE; /* Always required */
+	map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
+
+	ret = hv_iommu_add_mapping(domain, iova, paddr, size, flags);
+	if (ret)
+		return ret;
+
+	npages = size >> HV_HYP_PAGE_SHIFT;
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+
+	input->device_domain = domain->device_domain;
+	input->map_flags = map_flags;
+	input->target_device_va_base = iova;
+
+	pfn = paddr >> HV_HYP_PAGE_SHIFT;
+	for (i = 0; i < npages; i++) {
+		input->gpa_page_list[i] = pfn;
+		pfn += 1;
+	}
+
+	status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES, npages, 0,
+			input, NULL);
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status)) {
+		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
+		hv_iommu_del_mappings(domain, iova, size);
+	}
+
+	return hv_status_to_errno(status);
+}
+
+static size_t hv_iommu_unmap(struct iommu_domain *d, unsigned long iova,
+			   size_t size, struct iommu_iotlb_gather *gather)
+{
+	size_t unmapped;
+	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
+	unsigned long flags, npages;
+	struct hv_input_unmap_device_gpa_pages *input;
+	u64 status;
+
+	unmapped = hv_iommu_del_mappings(domain, iova, size);
+	if (unmapped < size)
+		return 0;
+
+	npages = size >> HV_HYP_PAGE_SHIFT;
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+
+	input->device_domain = domain->device_domain;
+	input->target_device_va_base = iova;
+
+	/* Unmap `npages` pages starting from VA base */
+	status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
+			0, input, NULL);
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
+
+	return hv_result_success(status) ? unmapped : 0;
+}
+
+static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain *d,
+				       dma_addr_t iova)
+{
+	u64 paddr = 0;
+	unsigned long flags;
+	struct hv_iommu_mapping *mapping;
+	struct interval_tree_node *node;
+	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
+
+	spin_lock_irqsave(&domain->mappings_lock, flags);
+	node = interval_tree_iter_first(&domain->mappings, iova, iova);
+	if (node) {
+		mapping = container_of(node, struct hv_iommu_mapping, iova);
+		paddr = mapping->paddr + (iova - mapping->iova.start);
+	}
+	spin_unlock_irqrestore(&domain->mappings_lock, flags);
+
+	return paddr;
+}
+
+static struct iommu_device *hv_iommu_probe_device(struct device *dev)
+{
+	struct hv_iommu_endpoint *vdev;
+
+	if (!dev_is_pci(dev))
+		return ERR_PTR(-ENODEV);
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev)
+		return ERR_PTR(-ENOMEM);
+
+	vdev->dev = dev;
+	vdev->hv_iommu = hv_iommu_device;
+	dev_iommu_priv_set(dev, vdev);
+
+	return &vdev->hv_iommu->iommu;
+}
+
+static void hv_iommu_probe_finalize(struct device *dev)
+{
+	struct iommu_domain *d = iommu_get_domain_for_dev(dev);
+
+	if (d && d->type == IOMMU_DOMAIN_DMA)
+		iommu_setup_dma_ops(dev, 1 << PAGE_SHIFT, 0);
+	else
+		set_dma_ops(dev, NULL);
+}
+
+static void hv_iommu_release_device(struct device *dev)
+{
+	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
+
+	/* Need to detach device from device domain if necessary. */
+	if (vdev->domain)
+		hv_iommu_detach_dev(&vdev->domain->domain, dev);
+
+	dev_iommu_priv_set(dev, NULL);
+	set_dma_ops(dev, NULL);
+
+	kfree(vdev);
+}
+
+static struct iommu_group *hv_iommu_device_group(struct device *dev)
+{
+	if (dev_is_pci(dev))
+		return pci_device_group(dev);
+	else
+		return generic_device_group(dev);
+}
+
+/*
+ * This bitmap is used to advertise the page sizes our hardware support to the
+ * IOMMU core, which will then use this information to split physically
+ * contiguous memory regions it is mapping into page sizes that we support.
+ *
+ * Given the size of the PFN array we can accommodate less than 512 4KiB pages.
+ */
+#define HV_IOMMU_PGSIZES (SZ_4K | SZ_1M)
+
+static void hv_iommu_get_resv_regions(struct device *dev,
+		struct list_head *head)
+{
+	switch (hw_iommu_type) {
+	case INTEL:
+		intel_iommu_get_resv_regions(dev, head);
+		break;
+	default:
+		/* Do nothing */;
+	}
+}
+
+static int hv_iommu_def_domain_type(struct device *dev)
+{
+	/* The hypervisor has created a default passthrough domain */
+	return IOMMU_DOMAIN_IDENTITY;
+}
+
+static struct iommu_ops hv_iommu_ops = {
+	.domain_alloc     = hv_iommu_domain_alloc,
+	.domain_free      = hv_iommu_domain_free,
+	.attach_dev       = hv_iommu_attach_dev,
+	.detach_dev       = hv_iommu_detach_dev,
+	.map              = hv_iommu_map,
+	.unmap            = hv_iommu_unmap,
+	.iova_to_phys     = hv_iommu_iova_to_phys,
+	.probe_device     = hv_iommu_probe_device,
+	.probe_finalize   = hv_iommu_probe_finalize,
+	.release_device   = hv_iommu_release_device,
+	.def_domain_type  = hv_iommu_def_domain_type,
+	.device_group     = hv_iommu_device_group,
+	.get_resv_regions = hv_iommu_get_resv_regions,
+	.put_resv_regions = generic_iommu_put_resv_regions,
+	.pgsize_bitmap    = HV_IOMMU_PGSIZES,
+	.owner            = THIS_MODULE,
+};
+
+static void __init hv_initalize_resv_regions_intel(void)
+{
+	int ret;
+
+	down_write(&dmar_global_lock);
+	if (dmar_table_init(false, true)) {
+		pr_err("Failed to initialize DMAR table\n");
+		up_write(&dmar_global_lock);
+		return;
+	}
+
+	ret = dmar_dev_scope_init();
+	if (ret)
+		pr_err("Failed to initialize device scope\n");
+
+	up_write(&dmar_global_lock);
+
+	hw_iommu_type = INTEL;
+}
+
+static void __init hv_initialize_resv_regions(void)
+{
+	hv_initalize_resv_regions_intel();
+}
+
+int __init hv_iommu_init(void)
+{
+	int ret = 0;
+	struct hv_iommu_dev *hv_iommu = NULL;
+
+	if (!hv_is_hyperv_initialized())
+		return -ENODEV;
+
+	hv_initialize_resv_regions();
+
+	hv_iommu = kzalloc(sizeof(*hv_iommu), GFP_KERNEL);
+	if (!hv_iommu)
+		return -ENOMEM;
+
+	ida_init(&hv_iommu->domain_ids);
+	hv_iommu->first_domain = HV_DEVICE_DOMAIN_ID_S2_DEFAULT + 1;
+	hv_iommu->last_domain = HV_DEVICE_DOMAIN_ID_S2_NULL - 1;
+
+	hv_iommu->geometry = (struct iommu_domain_geometry) {
+		.aperture_start = 0,
+		.aperture_end = -1UL,
+		.force_aperture = true,
+	};
+
+	hv_iommu->map_flags = IOMMU_READ | IOMMU_WRITE;
+	hv_iommu->pgsize_bitmap = HV_IOMMU_PGSIZES;
+
+	ret = iommu_device_sysfs_add(&hv_iommu->iommu, NULL, NULL, "%s", "hv-iommu");
+	if (ret) {
+		pr_err("iommu_device_sysfs_add failed: %d\n", ret);
+		goto err_free;
+	}
+
+	ret = iommu_device_register(&hv_iommu->iommu, &hv_iommu_ops, NULL);
+	if (ret) {
+		pr_err("iommu_device_register failed: %d\n", ret);
+		goto err_sysfs_remove;
+	}
+
+	/* This must come before bus_set_iommu because it calls into the hooks. */
+	hv_iommu_device = hv_iommu;
+	hv_initialize_special_domains();
+
+#ifdef CONFIG_PCI
+	ret = bus_set_iommu(&pci_bus_type, &hv_iommu_ops);
+	if (ret) {
+		pr_err("bus_set_iommu failed: %d\n", ret);
+		goto err_unregister;
+	}
+#endif
+	pr_info("Microsoft Hypervisor IOMMU initialized\n");
+
+	return 0;
+
+#ifdef CONFIG_PCI
+err_unregister:
+	iommu_device_unregister(&hv_iommu->iommu);
+#endif
+err_sysfs_remove:
+	iommu_device_sysfs_remove(&hv_iommu->iommu);
+err_free:
+	kfree(hv_iommu);
+	return ret;
+}
+
+int __init hv_iommu_detect(void)
+{
+	if (!(ms_hyperv.misc_features & HV_DEVICE_DOMAIN_AVAILABLE))
+		return -ENODEV;
+
+	iommu_detected = 1;
+	x86_init.iommu.iommu_init = hv_iommu_init;
+
+	return 1;
+}
+IOMMU_INIT_POST(hv_iommu_detect);
+
+#endif /* CONFIG_HYPERV_ROOT_PVIOMMU */
-- 
2.30.2


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU
  2021-07-09 11:43 [RFC v1 0/8] MSHV: add PV-IOMMU driver Wei Liu
                   ` (4 preceding siblings ...)
  2021-07-09 11:43 ` [RFC v1 5/8] mshv: add paravirtualized IOMMU support Wei Liu
@ 2021-07-09 11:43 ` Wei Liu
  2021-07-09 12:46   ` Robin Murphy
  2021-08-03 18:50   ` Praveen Kumar
  2021-07-09 11:43 ` [RFC v1 7/8] mshv: implement in-kernel device framework Wei Liu
  2021-07-09 11:43 ` [RFC v1 8/8] mshv: add vfio bridge device Wei Liu
  7 siblings, 2 replies; 35+ messages in thread
From: Wei Liu @ 2021-07-09 11:43 UTC (permalink / raw)
  To: Linux on Hyper-V List
  Cc: virtualization, Linux Kernel List, Michael Kelley,
	Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves, kumarpraveen,
	pasha.tatashin, Wei Liu, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Dexuan Cui, Joerg Roedel, Will Deacon,
	open list:IOMMU DRIVERS

Some devices may have been claimed by the hypervisor already. One such
example is a user can assign a NIC for debugging purpose.

Ideally Linux should be able to tell retrieve that information, but
there is no way to do that yet. And designing that new mechanism is
going to take time.

Provide a command line option for skipping devices. This is a stopgap
solution, so it is intentionally undocumented. Hopefully we can retire
it in the future.

Signed-off-by: Wei Liu <wei.liu@kernel.org>
---
 drivers/iommu/hyperv-iommu.c | 45 ++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index 043dcff06511..353da5036387 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -349,6 +349,16 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
 
 #ifdef CONFIG_HYPERV_ROOT_PVIOMMU
 
+/* The IOMMU will not claim these PCI devices. */
+static char *pci_devs_to_skip;
+static int __init mshv_iommu_setup_skip(char *str) {
+	pci_devs_to_skip = str;
+
+	return 0;
+}
+/* mshv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
+__setup("mshv_iommu_skip=", mshv_iommu_setup_skip);
+
 /* DMA remapping support */
 struct hv_iommu_domain {
 	struct iommu_domain domain;
@@ -774,6 +784,41 @@ static struct iommu_device *hv_iommu_probe_device(struct device *dev)
 	if (!dev_is_pci(dev))
 		return ERR_PTR(-ENODEV);
 
+	/*
+	 * Skip the PCI device specified in `pci_devs_to_skip`. This is a
+	 * temporary solution until we figure out a way to extract information
+	 * from the hypervisor what devices it is already using.
+	 */
+	if (pci_devs_to_skip && *pci_devs_to_skip) {
+		int pos = 0;
+		int parsed;
+		int segment, bus, slot, func;
+		struct pci_dev *pdev = to_pci_dev(dev);
+
+		do {
+			parsed = 0;
+
+			sscanf(pci_devs_to_skip + pos,
+				" (%x:%x:%x.%x) %n",
+				&segment, &bus, &slot, &func, &parsed);
+
+			if (parsed <= 0)
+				break;
+
+			if (pci_domain_nr(pdev->bus) == segment &&
+				pdev->bus->number == bus &&
+				PCI_SLOT(pdev->devfn) == slot &&
+				PCI_FUNC(pdev->devfn) == func)
+			{
+				dev_info(dev, "skipped by MSHV IOMMU\n");
+				return ERR_PTR(-ENODEV);
+			}
+
+			pos += parsed;
+
+		} while (pci_devs_to_skip[pos]);
+	}
+
 	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
 	if (!vdev)
 		return ERR_PTR(-ENOMEM);
-- 
2.30.2


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC v1 7/8] mshv: implement in-kernel device framework
  2021-07-09 11:43 [RFC v1 0/8] MSHV: add PV-IOMMU driver Wei Liu
                   ` (5 preceding siblings ...)
  2021-07-09 11:43 ` [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU Wei Liu
@ 2021-07-09 11:43 ` Wei Liu
  2021-07-09 13:02   ` Matthew Wilcox
  2021-08-03 19:12   ` Praveen Kumar
  2021-07-09 11:43 ` [RFC v1 8/8] mshv: add vfio bridge device Wei Liu
  7 siblings, 2 replies; 35+ messages in thread
From: Wei Liu @ 2021-07-09 11:43 UTC (permalink / raw)
  To: Linux on Hyper-V List
  Cc: virtualization, Linux Kernel List, Michael Kelley,
	Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves, kumarpraveen,
	pasha.tatashin, Wei Liu, Jonathan Corbet, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	Lillian Grassin-Drake, Muminul Islam, open list:DOCUMENTATION

This is basically the same code adopted from KVM. The main user case is
the future MSHV-VFIO bridge device. We don't have any plan to support
in-kernel device emulation yet, but it wouldn't hurt to make the code
more flexible.

Signed-off-by: Wei Liu <wei.liu@kernel.org>
---
 Documentation/virt/mshv/api.rst |  12 +++
 drivers/hv/mshv_main.c          | 181 ++++++++++++++++++++++++++++++++
 include/linux/mshv.h            |  57 ++++++++++
 include/uapi/linux/mshv.h       |  36 +++++++
 4 files changed, 286 insertions(+)

diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
index 56a6edfcfe29..7d35dd589831 100644
--- a/Documentation/virt/mshv/api.rst
+++ b/Documentation/virt/mshv/api.rst
@@ -170,4 +170,16 @@ Can be used to get/set various properties of a partition.
 Some properties can only be set at partition creation. For these, there are
 parameters in MSHV_CREATE_PARTITION.
 
+3.12 MSHV_CREATE_DEVICE
+-----------------------
+:Type: partition ioctl
+:Parameters: struct mshv_create_device
+:Returns: 0 on success
+
+Can be used to create an in-kernel device.
+
+If the MSHV_CREATE_DEVICE_TEST flag is set, only test whether the
+device type is supported (not necessarily whether it can be created
+in the current vm).
 
+Currently only supports VFIO type device.
diff --git a/drivers/hv/mshv_main.c b/drivers/hv/mshv_main.c
index 4cbc520471e4..84c774a561de 100644
--- a/drivers/hv/mshv_main.c
+++ b/drivers/hv/mshv_main.c
@@ -20,6 +20,8 @@
 #include <linux/random.h>
 #include <linux/mshv.h>
 #include <linux/mshv_eventfd.h>
+#include <linux/hyperv.h>
+#include <linux/nospec.h>
 #include <asm/mshyperv.h>
 
 #include "mshv.h"
@@ -33,6 +35,7 @@ static int mshv_vp_release(struct inode *inode, struct file *filp);
 static long mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
 static struct mshv_partition *mshv_partition_get(struct mshv_partition *partition);
 static void mshv_partition_put(struct mshv_partition *partition);
+static void mshv_partition_put_no_destroy(struct mshv_partition *partition);
 static int mshv_partition_release(struct inode *inode, struct file *filp);
 static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
 static int mshv_dev_open(struct inode *inode, struct file *filp);
@@ -912,6 +915,172 @@ mshv_partition_ioctl_set_msi_routing(struct mshv_partition *partition,
 	return ret;
 }
 
+static int mshv_device_ioctl_attr(struct mshv_device *dev,
+				 int (*accessor)(struct mshv_device *dev,
+						 struct mshv_device_attr *attr),
+				 unsigned long arg)
+{
+	struct mshv_device_attr attr;
+
+	if (!accessor)
+		return -EPERM;
+
+	if (copy_from_user(&attr, (void __user *)arg, sizeof(attr)))
+		return -EFAULT;
+
+	return accessor(dev, &attr);
+}
+
+static long mshv_device_ioctl(struct file *filp, unsigned int ioctl,
+			      unsigned long arg)
+{
+	struct mshv_device *dev = filp->private_data;
+
+	switch (ioctl) {
+	case MSHV_SET_DEVICE_ATTR:
+		return mshv_device_ioctl_attr(dev, dev->ops->set_attr, arg);
+	case MSHV_GET_DEVICE_ATTR:
+		return mshv_device_ioctl_attr(dev, dev->ops->get_attr, arg);
+	case MSHV_HAS_DEVICE_ATTR:
+		return mshv_device_ioctl_attr(dev, dev->ops->has_attr, arg);
+	default:
+		if (dev->ops->ioctl)
+			return dev->ops->ioctl(dev, ioctl, arg);
+
+		return -ENOTTY;
+	}
+}
+
+static int mshv_device_release(struct inode *inode, struct file *filp)
+{
+	struct mshv_device *dev = filp->private_data;
+	struct mshv_partition *partition = dev->partition;
+
+	if (dev->ops->release) {
+		mutex_lock(&partition->mutex);
+		list_del(&dev->partition_node);
+		dev->ops->release(dev);
+		mutex_unlock(&partition->mutex);
+	}
+
+	mshv_partition_put(partition);
+	return 0;
+}
+
+static const struct file_operations mshv_device_fops = {
+	.unlocked_ioctl = mshv_device_ioctl,
+	.release = mshv_device_release,
+};
+
+static const struct mshv_device_ops *mshv_device_ops_table[MSHV_DEV_TYPE_MAX];
+
+int mshv_register_device_ops(const struct mshv_device_ops *ops, u32 type)
+{
+	if (type >= ARRAY_SIZE(mshv_device_ops_table))
+		return -ENOSPC;
+
+	if (mshv_device_ops_table[type] != NULL)
+		return -EEXIST;
+
+	mshv_device_ops_table[type] = ops;
+	return 0;
+}
+
+void mshv_unregister_device_ops(u32 type)
+{
+	if (type >= ARRAY_SIZE(mshv_device_ops_table))
+		return;
+	mshv_device_ops_table[type] = NULL;
+}
+
+static long
+mshv_partition_ioctl_create_device(struct mshv_partition *partition,
+	void __user *user_args)
+{
+	long r;
+	struct mshv_create_device tmp, *cd;
+	struct mshv_device *dev;
+	const struct mshv_device_ops *ops;
+	int type;
+
+	if (copy_from_user(&tmp, user_args, sizeof(tmp))) {
+		r = -EFAULT;
+		goto out;
+	}
+
+	cd = &tmp;
+
+	if (cd->type >= ARRAY_SIZE(mshv_device_ops_table)) {
+		r = -ENODEV;
+		goto out;
+	}
+
+	type = array_index_nospec(cd->type, ARRAY_SIZE(mshv_device_ops_table));
+	ops = mshv_device_ops_table[type];
+	if (ops == NULL) {
+		r = -ENODEV;
+		goto out;
+	}
+
+	if (cd->flags & MSHV_CREATE_DEVICE_TEST) {
+		r = 0;
+		goto out;
+	}
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL_ACCOUNT);
+	if (!dev) {
+		r = -ENOMEM;
+		goto out;
+	}
+
+	dev->ops = ops;
+	dev->partition = partition;
+
+	r = ops->create(dev, type);
+	if (r < 0) {
+		kfree(dev);
+		goto out;
+	}
+
+	list_add(&dev->partition_node, &partition->devices);
+
+	if (ops->init)
+		ops->init(dev);
+
+	mshv_partition_get(partition);
+	r = anon_inode_getfd(ops->name, &mshv_device_fops, dev, O_RDWR | O_CLOEXEC);
+	if (r < 0) {
+		mshv_partition_put_no_destroy(partition);
+		list_del(&dev->partition_node);
+		ops->destroy(dev);
+		goto out;
+	}
+
+	cd->fd = r;
+	r = 0;
+
+	if (copy_to_user(user_args, &tmp, sizeof(tmp))) {
+		r = -EFAULT;
+		goto out;
+	}
+out:
+	return r;
+}
+
+static void mshv_destroy_devices(struct mshv_partition *partition)
+{
+	struct mshv_device *dev, *tmp;
+
+	/*
+	 * No need to take any lock since at this point nobody else can
+	 * reference this partition.
+	 */
+	list_for_each_entry_safe(dev, tmp, &partition->devices, partition_node) {
+		list_del(&dev->partition_node);
+		dev->ops->destroy(dev);
+	}
+}
+
 static long
 mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 {
@@ -965,6 +1134,9 @@ mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 	case MSHV_GET_GPA_ACCESS_STATES:
 		ret = mshv_partition_ioctl_get_gpa_access_state(partition,
 							   (void __user *)arg);
+	case MSHV_CREATE_DEVICE:
+		ret = mshv_partition_ioctl_create_device(partition,
+							 (void __user *)arg);
 		break;
 	default:
 		ret = -ENOTTY;
@@ -1033,6 +1205,7 @@ destroy_partition(struct mshv_partition *partition)
 		vfree(region->pages);
 	}
 
+	mshv_destroy_devices(partition);
 	mshv_free_msi_routing(partition);
 	kfree(partition);
 }
@@ -1052,6 +1225,12 @@ mshv_partition_put(struct mshv_partition *partition)
 		destroy_partition(partition);
 }
 
+static void
+mshv_partition_put_no_destroy(struct mshv_partition *partition)
+{
+	WARN_ON(refcount_dec_and_test(&partition->ref_count));
+}
+
 static int
 mshv_partition_release(struct inode *inode, struct file *filp)
 {
@@ -1122,6 +1301,8 @@ mshv_ioctl_create_partition(void __user *user_arg)
 
 	INIT_HLIST_HEAD(&partition->irq_ack_notifier_list);
 
+	INIT_LIST_HEAD(&partition->devices);
+
 	fd = get_unused_fd_flags(O_CLOEXEC);
 	if (fd < 0) {
 		ret = fd;
diff --git a/include/linux/mshv.h b/include/linux/mshv.h
index fc655b60c5cd..c557ffeec90c 100644
--- a/include/linux/mshv.h
+++ b/include/linux/mshv.h
@@ -59,6 +59,8 @@ struct mshv_partition {
 	struct srcu_struct irq_srcu;
 	struct hlist_head irq_ack_notifier_list;
 
+	struct list_head devices;
+
 	struct {
 		spinlock_t        lock;
 		struct list_head  items;
@@ -121,4 +123,59 @@ struct mshv {
 	} partitions;
 };
 
+struct mshv_device {
+	const struct mshv_device_ops *ops;
+	struct mshv_partition *partition;
+	void *private;
+	struct list_head partition_node;
+
+};
+
+/* create, destroy, and name are mandatory */
+struct mshv_device_ops {
+	const char *name;
+
+	/*
+	 * create is called holding partition->mutex and any operations not suitable
+	 * to do while holding the lock should be deferred to init (see
+	 * below).
+	 */
+	int (*create)(struct mshv_device *dev, u32 type);
+
+	/*
+	 * init is called after create if create is successful and is called
+	 * outside of holding partition->mutex.
+	 */
+	void (*init)(struct mshv_device *dev);
+
+	/*
+	 * Destroy is responsible for freeing dev.
+	 *
+	 * Destroy may be called before or after destructors are called
+	 * on emulated I/O regions, depending on whether a reference is
+	 * held by a vcpu or other mshv component that gets destroyed
+	 * after the emulated I/O.
+	 */
+	void (*destroy)(struct mshv_device *dev);
+
+	/*
+	 * Release is an alternative method to free the device. It is
+	 * called when the device file descriptor is closed. Once
+	 * release is called, the destroy method will not be called
+	 * anymore as the device is removed from the device list of
+	 * the VM. partition->mutex is held.
+	 */
+	void (*release)(struct mshv_device *dev);
+
+	int (*set_attr)(struct mshv_device *dev, struct mshv_device_attr *attr);
+	int (*get_attr)(struct mshv_device *dev, struct mshv_device_attr *attr);
+	int (*has_attr)(struct mshv_device *dev, struct mshv_device_attr *attr);
+	long (*ioctl)(struct mshv_device *dev, unsigned int ioctl,
+		      unsigned long arg);
+	int (*mmap)(struct mshv_device *dev, struct vm_area_struct *vma);
+};
+
+int mshv_register_device_ops(const struct mshv_device_ops *ops, u32 type);
+void mshv_unregister_device_ops(u32 type);
+
 #endif
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index cba318ee7cf5..ed110109492f 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -158,6 +158,14 @@ struct mshv_msi_routing {
 #define MSHV_SET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0B, struct mshv_vp_state)
 #define MSHV_TRANSLATE_GVA	_IOWR(MSHV_IOCTL, 0x0E, struct mshv_translate_gva)
 
+/* ioctl for device fd */
+#define MSHV_CREATE_DEVICE	  _IOWR(MSHV_IOCTL, 0x13, struct mshv_create_device)
+
+/* ioctls for fds returned by MSHV_CREATE_DEVICE */
+#define MSHV_SET_DEVICE_ATTR	  _IOW(MSHV_IOCTL, 0x14, struct mshv_device_attr)
+#define MSHV_GET_DEVICE_ATTR	  _IOW(MSHV_IOCTL, 0x15, struct mshv_device_attr)
+#define MSHV_HAS_DEVICE_ATTR	  _IOW(MSHV_IOCTL, 0x16, struct mshv_device_attr)
+
 /* register page mapping example:
  * struct hv_vp_register_page *regs = mmap(NULL,
  *					   4096,
@@ -184,4 +192,32 @@ union mshv_partition_property_page_access_tracking_config {
 	__u64 as_uint64;
 } __packed;
 
+/*
+ * Device control API.
+ */
+#define MSHV_CREATE_DEVICE_TEST		1
+
+struct mshv_create_device {
+	__u32	type;	/* in: MSHV_DEV_TYPE_xxx */
+	__u32	fd;	/* out: device handle */
+	__u32	flags;	/* in: MSHV_CREATE_DEVICE_xxx */
+};
+
+#define  MSHV_DEV_VFIO_GROUP			1
+#define   MSHV_DEV_VFIO_GROUP_ADD			1
+#define   MSHV_DEV_VFIO_GROUP_DEL			2
+
+enum mshv_device_type {
+	MSHV_DEV_TYPE_VFIO,
+#define MSHV_DEV_TYPE_VFIO		MSHV_DEV_TYPE_VFIO
+	MSHV_DEV_TYPE_MAX,
+};
+
+struct mshv_device_attr {
+	__u32	flags;		/* no flags currently defined */
+	__u32	group;		/* device-defined */
+	__u64	attr;		/* group-defined */
+	__u64	addr;		/* userspace address of attr data */
+};
+
 #endif
-- 
2.30.2


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC v1 8/8] mshv: add vfio bridge device
  2021-07-09 11:43 [RFC v1 0/8] MSHV: add PV-IOMMU driver Wei Liu
                   ` (6 preceding siblings ...)
  2021-07-09 11:43 ` [RFC v1 7/8] mshv: implement in-kernel device framework Wei Liu
@ 2021-07-09 11:43 ` Wei Liu
  2021-08-03 19:27   ` Praveen Kumar
  7 siblings, 1 reply; 35+ messages in thread
From: Wei Liu @ 2021-07-09 11:43 UTC (permalink / raw)
  To: Linux on Hyper-V List
  Cc: virtualization, Linux Kernel List, Michael Kelley,
	Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves, kumarpraveen,
	pasha.tatashin, Wei Liu, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Dexuan Cui

The main purpose of this device at this stage is to hold a reference to
the vfio_group to avoid it disappearing under our feet.

Signed-off-by: Wei Liu <wei.liu@kernel.org>
---
Is there value in unifying with KVM?
---
 drivers/hv/Kconfig     |   4 +
 drivers/hv/Makefile    |   2 +-
 drivers/hv/mshv_main.c |   5 +
 drivers/hv/vfio.c      | 244 +++++++++++++++++++++++++++++++++++++++++
 drivers/hv/vfio.h      |  18 +++
 5 files changed, 272 insertions(+), 1 deletion(-)
 create mode 100644 drivers/hv/vfio.c
 create mode 100644 drivers/hv/vfio.h

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 3bf911aac5c7..d3542a818ede 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -2,6 +2,9 @@
 
 menu "Microsoft Hyper-V guest support"
 
+config MSHV_VFIO
+	bool
+
 config HYPERV
 	tristate "Microsoft Hyper-V client drivers"
 	depends on X86 && ACPI && X86_LOCAL_APIC && HYPERVISOR_GUEST
@@ -31,6 +34,7 @@ config HYPERV_ROOT_API
 	tristate "Microsoft Hypervisor root partition interfaces: /dev/mshv"
 	depends on HYPERV
 	select EVENTFD
+	select MSHV_VFIO
 	help
 	  Provides access to interfaces for managing guest virtual machines
 	  running under the Microsoft Hypervisor.
diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index 370d126252ef..c2ae17076b68 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -14,4 +14,4 @@ hv_vmbus-$(CONFIG_HYPERV_TESTING)	+= hv_debugfs.o
 hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_fcopy.o hv_utils_transport.o
 
 mshv-y                          += mshv_main.o hv_call.o hv_synic.o hv_portid_table.o  \
-					hv_eventfd.o mshv_msi.o
+					hv_eventfd.o mshv_msi.o vfio.o
diff --git a/drivers/hv/mshv_main.c b/drivers/hv/mshv_main.c
index 84c774a561de..49bc7442921f 100644
--- a/drivers/hv/mshv_main.c
+++ b/drivers/hv/mshv_main.c
@@ -25,6 +25,7 @@
 #include <asm/mshyperv.h>
 
 #include "mshv.h"
+#include "vfio.h"
 
 MODULE_AUTHOR("Microsoft");
 MODULE_LICENSE("GPL");
@@ -1439,6 +1440,8 @@ __init mshv_init(void)
 	if (mshv_irqfd_wq_init())
 		mshv_irqfd_wq_cleanup();
 
+	mshv_vfio_ops_init();
+
 	return 0;
 }
 
@@ -1452,6 +1455,8 @@ __exit mshv_exit(void)
 
 	hv_port_table_fini();
 
+	mshv_vfio_ops_exit();
+
 	misc_deregister(&mshv_dev);
 }
 
diff --git a/drivers/hv/vfio.c b/drivers/hv/vfio.c
new file mode 100644
index 000000000000..281374a4386e
--- /dev/null
+++ b/drivers/hv/vfio.c
@@ -0,0 +1,244 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VFIO-MSHV bridge pseudo device
+ *
+ * Heavily inspired by the VFIO-KVM bridge pseudo device.
+ * Copyright (C) 2013 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ */
+
+#include <linux/errno.h>
+#include <linux/file.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/mshv.h>
+
+#include "vfio.h"
+
+
+struct mshv_vfio_group {
+	struct list_head node;
+	struct vfio_group *vfio_group;
+};
+
+struct mshv_vfio {
+	struct list_head group_list;
+	struct mutex lock;
+};
+
+static struct vfio_group *mshv_vfio_group_get_external_user(struct file *filep)
+{
+	struct vfio_group *vfio_group;
+	struct vfio_group *(*fn)(struct file *);
+
+	fn = symbol_get(vfio_group_get_external_user);
+	if (!fn)
+		return ERR_PTR(-EINVAL);
+
+	vfio_group = fn(filep);
+
+	symbol_put(vfio_group_get_external_user);
+
+	return vfio_group;
+}
+
+static bool mshv_vfio_external_group_match_file(struct vfio_group *group,
+						struct file *filep)
+{
+	bool ret, (*fn)(struct vfio_group *, struct file *);
+
+	fn = symbol_get(vfio_external_group_match_file);
+	if (!fn)
+		return false;
+
+	ret = fn(group, filep);
+
+	symbol_put(vfio_external_group_match_file);
+
+	return ret;
+}
+
+static void mshv_vfio_group_put_external_user(struct vfio_group *vfio_group)
+{
+	void (*fn)(struct vfio_group *);
+
+	fn = symbol_get(vfio_group_put_external_user);
+	if (!fn)
+		return;
+
+	fn(vfio_group);
+
+	symbol_put(vfio_group_put_external_user);
+}
+
+static int mshv_vfio_set_group(struct mshv_device *dev, long attr, u64 arg)
+{
+	struct mshv_vfio *mv = dev->private;
+	struct vfio_group *vfio_group;
+	struct mshv_vfio_group *mvg;
+	int32_t __user *argp = (int32_t __user *)(unsigned long)arg;
+	struct fd f;
+	int32_t fd;
+	int ret;
+
+	switch (attr) {
+	case MSHV_DEV_VFIO_GROUP_ADD:
+		if (get_user(fd, argp))
+			return -EFAULT;
+
+		f = fdget(fd);
+		if (!f.file)
+			return -EBADF;
+
+		vfio_group = mshv_vfio_group_get_external_user(f.file);
+		fdput(f);
+
+		if (IS_ERR(vfio_group))
+			return PTR_ERR(vfio_group);
+
+		mutex_lock(&mv->lock);
+
+		list_for_each_entry(mvg, &mv->group_list, node) {
+			if (mvg->vfio_group == vfio_group) {
+				mutex_unlock(&mv->lock);
+				mshv_vfio_group_put_external_user(vfio_group);
+				return -EEXIST;
+			}
+		}
+
+		mvg = kzalloc(sizeof(*mvg), GFP_KERNEL_ACCOUNT);
+		if (!mvg) {
+			mutex_unlock(&mv->lock);
+			mshv_vfio_group_put_external_user(vfio_group);
+			return -ENOMEM;
+		}
+
+		list_add_tail(&mvg->node, &mv->group_list);
+		mvg->vfio_group = vfio_group;
+
+		mutex_unlock(&mv->lock);
+
+		return 0;
+
+	case MSHV_DEV_VFIO_GROUP_DEL:
+		if (get_user(fd, argp))
+			return -EFAULT;
+
+		f = fdget(fd);
+		if (!f.file)
+			return -EBADF;
+
+		ret = -ENOENT;
+
+		mutex_lock(&mv->lock);
+
+		list_for_each_entry(mvg, &mv->group_list, node) {
+			if (!mshv_vfio_external_group_match_file(mvg->vfio_group,
+								 f.file))
+				continue;
+
+			list_del(&mvg->node);
+			mshv_vfio_group_put_external_user(mvg->vfio_group);
+			kfree(mvg);
+			ret = 0;
+			break;
+		}
+
+		mutex_unlock(&mv->lock);
+
+		fdput(f);
+
+		return ret;
+	}
+
+	return -ENXIO;
+}
+
+static int mshv_vfio_set_attr(struct mshv_device *dev,
+			      struct mshv_device_attr *attr)
+{
+	switch (attr->group) {
+	case MSHV_DEV_VFIO_GROUP:
+		return mshv_vfio_set_group(dev, attr->attr, attr->addr);
+	}
+
+	return -ENXIO;
+}
+
+static int mshv_vfio_has_attr(struct mshv_device *dev,
+			      struct mshv_device_attr *attr)
+{
+	switch (attr->group) {
+	case MSHV_DEV_VFIO_GROUP:
+		switch (attr->attr) {
+		case MSHV_DEV_VFIO_GROUP_ADD:
+		case MSHV_DEV_VFIO_GROUP_DEL:
+			return 0;
+		}
+
+		break;
+	}
+
+	return -ENXIO;
+}
+
+static void mshv_vfio_destroy(struct mshv_device *dev)
+{
+	struct mshv_vfio *mv = dev->private;
+	struct mshv_vfio_group *mvg, *tmp;
+
+	list_for_each_entry_safe(mvg, tmp, &mv->group_list, node) {
+		mshv_vfio_group_put_external_user(mvg->vfio_group);
+		list_del(&mvg->node);
+		kfree(mvg);
+	}
+
+	kfree(mv);
+	kfree(dev);
+}
+
+static int mshv_vfio_create(struct mshv_device *dev, u32 type);
+
+static struct mshv_device_ops mshv_vfio_ops = {
+	.name = "mshv-vfio",
+	.create = mshv_vfio_create,
+	.destroy = mshv_vfio_destroy,
+	.set_attr = mshv_vfio_set_attr,
+	.has_attr = mshv_vfio_has_attr,
+};
+
+static int mshv_vfio_create(struct mshv_device *dev, u32 type)
+{
+	struct mshv_device *tmp;
+	struct mshv_vfio *mv;
+
+	/* Only one VFIO "device" per VM */
+	list_for_each_entry(tmp, &dev->partition->devices, partition_node)
+		if (tmp->ops == &mshv_vfio_ops)
+			return -EBUSY;
+
+	mv = kzalloc(sizeof(*mv), GFP_KERNEL_ACCOUNT);
+	if (!mv)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&mv->group_list);
+	mutex_init(&mv->lock);
+
+	dev->private = mv;
+
+	return 0;
+}
+
+int mshv_vfio_ops_init(void)
+{
+	return mshv_register_device_ops(&mshv_vfio_ops, MSHV_DEV_TYPE_VFIO);
+}
+
+void mshv_vfio_ops_exit(void)
+{
+	mshv_unregister_device_ops(MSHV_DEV_TYPE_VFIO);
+}
diff --git a/drivers/hv/vfio.h b/drivers/hv/vfio.h
new file mode 100644
index 000000000000..0544476e6629
--- /dev/null
+++ b/drivers/hv/vfio.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __MSHV_VFIO_H
+#define __MSHV_VFIO_H
+
+#ifdef CONFIG_MSHV_VFIO
+int mshv_vfio_ops_init(void);
+void mshv_vfio_ops_exit(void);
+#else
+static inline int mshv_vfio_ops_init(void)
+{
+	return 0;
+}
+static inline void mshv_vfio_ops_exit(void)
+{
+}
+#endif
+
+#endif
-- 
2.30.2


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU
  2021-07-09 11:43 ` [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU Wei Liu
@ 2021-07-09 12:46   ` Robin Murphy
  2021-07-09 13:34     ` Wei Liu
  2021-08-03 18:50   ` Praveen Kumar
  1 sibling, 1 reply; 35+ messages in thread
From: Robin Murphy @ 2021-07-09 12:46 UTC (permalink / raw)
  To: Wei Liu, Linux on Hyper-V List
  Cc: K. Y. Srinivasan, Stephen Hemminger, pasha.tatashin,
	kumarpraveen, Will Deacon, Haiyang Zhang, Dexuan Cui,
	Linux Kernel List, Michael Kelley, open list:IOMMU DRIVERS,
	Nuno Das Neves, Sunil Muthuswamy, virtualization, Vineeth Pillai

On 2021-07-09 12:43, Wei Liu wrote:
> Some devices may have been claimed by the hypervisor already. One such
> example is a user can assign a NIC for debugging purpose.
> 
> Ideally Linux should be able to tell retrieve that information, but
> there is no way to do that yet. And designing that new mechanism is
> going to take time.
> 
> Provide a command line option for skipping devices. This is a stopgap
> solution, so it is intentionally undocumented. Hopefully we can retire
> it in the future.

Huh? If the host is using a device, why the heck is it exposing any 
knowledge of that device to the guest at all, let alone allowing the 
guest to do anything that could affect its operation!?

Robin.

> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> ---
>   drivers/iommu/hyperv-iommu.c | 45 ++++++++++++++++++++++++++++++++++++
>   1 file changed, 45 insertions(+)
> 
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> index 043dcff06511..353da5036387 100644
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -349,6 +349,16 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
>   
>   #ifdef CONFIG_HYPERV_ROOT_PVIOMMU
>   
> +/* The IOMMU will not claim these PCI devices. */
> +static char *pci_devs_to_skip;
> +static int __init mshv_iommu_setup_skip(char *str) {
> +	pci_devs_to_skip = str;
> +
> +	return 0;
> +}
> +/* mshv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
> +__setup("mshv_iommu_skip=", mshv_iommu_setup_skip);
> +
>   /* DMA remapping support */
>   struct hv_iommu_domain {
>   	struct iommu_domain domain;
> @@ -774,6 +784,41 @@ static struct iommu_device *hv_iommu_probe_device(struct device *dev)
>   	if (!dev_is_pci(dev))
>   		return ERR_PTR(-ENODEV);
>   
> +	/*
> +	 * Skip the PCI device specified in `pci_devs_to_skip`. This is a
> +	 * temporary solution until we figure out a way to extract information
> +	 * from the hypervisor what devices it is already using.
> +	 */
> +	if (pci_devs_to_skip && *pci_devs_to_skip) {
> +		int pos = 0;
> +		int parsed;
> +		int segment, bus, slot, func;
> +		struct pci_dev *pdev = to_pci_dev(dev);
> +
> +		do {
> +			parsed = 0;
> +
> +			sscanf(pci_devs_to_skip + pos,
> +				" (%x:%x:%x.%x) %n",
> +				&segment, &bus, &slot, &func, &parsed);
> +
> +			if (parsed <= 0)
> +				break;
> +
> +			if (pci_domain_nr(pdev->bus) == segment &&
> +				pdev->bus->number == bus &&
> +				PCI_SLOT(pdev->devfn) == slot &&
> +				PCI_FUNC(pdev->devfn) == func)
> +			{
> +				dev_info(dev, "skipped by MSHV IOMMU\n");
> +				return ERR_PTR(-ENODEV);
> +			}
> +
> +			pos += parsed;
> +
> +		} while (pci_devs_to_skip[pos]);
> +	}
> +
>   	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
>   	if (!vdev)
>   		return ERR_PTR(-ENOMEM);
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 3/8] intel/vt-d: make DMAR table parsing code more flexible
  2021-07-09 11:43 ` [RFC v1 3/8] intel/vt-d: make DMAR table parsing code more flexible Wei Liu
@ 2021-07-09 12:56   ` Robin Murphy
  2021-07-09 13:42     ` Wei Liu
  0 siblings, 1 reply; 35+ messages in thread
From: Robin Murphy @ 2021-07-09 12:56 UTC (permalink / raw)
  To: Wei Liu, Linux on Hyper-V List
  Cc: pasha.tatashin, Will Deacon, kumarpraveen, David Woodhouse,
	Linux Kernel List, Michael Kelley, open list:INTEL IOMMU VT-d,
	Nuno Das Neves, Sunil Muthuswamy, virtualization, Vineeth Pillai,
	Jean-Philippe Brucker

On 2021-07-09 12:43, Wei Liu wrote:
> Microsoft Hypervisor provides a set of hypercalls to manage device
> domains. The root kernel should parse the DMAR so that it can program
> the IOMMU (with hypercalls) correctly.
> 
> The DMAR code was designed to work with Intel IOMMU only. Add two more
> parameters to make it useful to Microsoft Hypervisor. Microsoft
> Hypervisor does not need the DMAR parsing code to allocate an Intel
> IOMMU structure; it also wishes to always reparse the DMAR table even
> after it has been parsed before.

We've recently defined the VIOT table for describing paravirtualised 
IOMMUs - would it make more sense to extend that to support the 
Microsoft implementation than to abuse a hardware-specific table? Am I 
right in assuming said hypervisor isn't intended to only ever run on 
Intel hardware?

Robin.

> Adjust Intel IOMMU code to use the new dmar_table_init. There should be
> no functional change to Intel IOMMU code.
> 
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> ---
> We may be able to combine alloc and force_parse?
> ---
>   drivers/iommu/intel/dmar.c          | 38 ++++++++++++++++++++---------
>   drivers/iommu/intel/iommu.c         |  2 +-
>   drivers/iommu/intel/irq_remapping.c |  2 +-
>   include/linux/dmar.h                |  2 +-
>   4 files changed, 30 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
> index 84057cb9596c..bd72f47c728b 100644
> --- a/drivers/iommu/intel/dmar.c
> +++ b/drivers/iommu/intel/dmar.c
> @@ -408,7 +408,8 @@ dmar_find_dmaru(struct acpi_dmar_hardware_unit *drhd)
>    * structure which uniquely represent one DMA remapping hardware unit
>    * present in the platform
>    */
> -static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
> +static int dmar_parse_one_drhd_internal(struct acpi_dmar_header *header,
> +		void *arg, bool alloc)
>   {
>   	struct acpi_dmar_hardware_unit *drhd;
>   	struct dmar_drhd_unit *dmaru;
> @@ -440,12 +441,14 @@ static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
>   		return -ENOMEM;
>   	}
>   
> -	ret = alloc_iommu(dmaru);
> -	if (ret) {
> -		dmar_free_dev_scope(&dmaru->devices,
> -				    &dmaru->devices_cnt);
> -		kfree(dmaru);
> -		return ret;
> +	if (alloc) {
> +		ret = alloc_iommu(dmaru);
> +		if (ret) {
> +			dmar_free_dev_scope(&dmaru->devices,
> +					    &dmaru->devices_cnt);
> +			kfree(dmaru);
> +			return ret;
> +		}
>   	}
>   	dmar_register_drhd_unit(dmaru);
>   
> @@ -456,6 +459,16 @@ static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
>   	return 0;
>   }
>   
> +static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
> +{
> +	return dmar_parse_one_drhd_internal(header, arg, true);
> +}
> +
> +int dmar_parse_one_drhd_noalloc(struct acpi_dmar_header *header, void *arg)
> +{
> +	return dmar_parse_one_drhd_internal(header, arg, false);
> +}
> +
>   static void dmar_free_drhd(struct dmar_drhd_unit *dmaru)
>   {
>   	if (dmaru->devices && dmaru->devices_cnt)
> @@ -633,7 +646,7 @@ static inline int dmar_walk_dmar_table(struct acpi_table_dmar *dmar,
>    * parse_dmar_table - parses the DMA reporting table
>    */
>   static int __init
> -parse_dmar_table(void)
> +parse_dmar_table(bool alloc)
>   {
>   	struct acpi_table_dmar *dmar;
>   	int drhd_count = 0;
> @@ -650,6 +663,9 @@ parse_dmar_table(void)
>   		.cb[ACPI_DMAR_TYPE_SATC] = &dmar_parse_one_satc,
>   	};
>   
> +	if (!alloc)
> +		cb.cb[ACPI_DMAR_TYPE_HARDWARE_UNIT] = &dmar_parse_one_drhd_noalloc;
> +
>   	/*
>   	 * Do it again, earlier dmar_tbl mapping could be mapped with
>   	 * fixed map.
> @@ -840,13 +856,13 @@ void __init dmar_register_bus_notifier(void)
>   }
>   
>   
> -int __init dmar_table_init(void)
> +int __init dmar_table_init(bool alloc, bool force_parse)
>   {
>   	static int dmar_table_initialized;
>   	int ret;
>   
> -	if (dmar_table_initialized == 0) {
> -		ret = parse_dmar_table();
> +	if (dmar_table_initialized == 0 || force_parse) {
> +		ret = parse_dmar_table(alloc);
>   		if (ret < 0) {
>   			if (ret != -ENODEV)
>   				pr_info("Parse DMAR table failure.\n");
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index be35284a2016..a4294d310b93 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -4310,7 +4310,7 @@ int __init intel_iommu_init(void)
>   	}
>   
>   	down_write(&dmar_global_lock);
> -	if (dmar_table_init()) {
> +	if (dmar_table_init(true, false)) {
>   		if (force_on)
>   			panic("tboot: Failed to initialize DMAR table\n");
>   		goto out_free_dmar;
> diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
> index f912fe45bea2..0e8abef862e4 100644
> --- a/drivers/iommu/intel/irq_remapping.c
> +++ b/drivers/iommu/intel/irq_remapping.c
> @@ -732,7 +732,7 @@ static int __init intel_prepare_irq_remapping(void)
>   		return -ENODEV;
>   	}
>   
> -	if (dmar_table_init() < 0)
> +	if (dmar_table_init(true, false) < 0)
>   		return -ENODEV;
>   
>   	if (intel_cap_audit(CAP_AUDIT_STATIC_IRQR, NULL))
> diff --git a/include/linux/dmar.h b/include/linux/dmar.h
> index e04436a7ff27..f88535d41a6e 100644
> --- a/include/linux/dmar.h
> +++ b/include/linux/dmar.h
> @@ -106,7 +106,7 @@ static inline bool dmar_rcu_check(void)
>   	for_each_dev_scope((devs), (cnt), (i), (tmp))			\
>   		if (!(tmp)) { continue; } else
>   
> -extern int dmar_table_init(void);
> +extern int dmar_table_init(bool alloc, bool force_parse);
>   extern int dmar_dev_scope_init(void);
>   extern void dmar_register_bus_notifier(void);
>   extern int dmar_parse_dev_scope(void *start, void *end, int *cnt,
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 7/8] mshv: implement in-kernel device framework
  2021-07-09 11:43 ` [RFC v1 7/8] mshv: implement in-kernel device framework Wei Liu
@ 2021-07-09 13:02   ` Matthew Wilcox
  2021-07-09 13:50     ` Wei Liu
  2021-08-03 19:12   ` Praveen Kumar
  1 sibling, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2021-07-09 13:02 UTC (permalink / raw)
  To: Wei Liu
  Cc: Linux on Hyper-V List, virtualization, Linux Kernel List,
	Michael Kelley, Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves,
	kumarpraveen, pasha.tatashin, Jonathan Corbet, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	Lillian Grassin-Drake, Muminul Islam, open list:DOCUMENTATION

On Fri, Jul 09, 2021 at 11:43:38AM +0000, Wei Liu wrote:
> +static long
> +mshv_partition_ioctl_create_device(struct mshv_partition *partition,
> +	void __user *user_args)
> +{
[...]
> +	mshv_partition_get(partition);
> +	r = anon_inode_getfd(ops->name, &mshv_device_fops, dev, O_RDWR | O_CLOEXEC);
> +	if (r < 0) {
> +		mshv_partition_put_no_destroy(partition);
> +		list_del(&dev->partition_node);
> +		ops->destroy(dev);
> +		goto out;
> +	}
> +
> +	cd->fd = r;
> +	r = 0;

Why return the fd in memory instead of returning the fd as the return
value from the ioctl?

> +	if (copy_to_user(user_args, &tmp, sizeof(tmp))) {
> +		r = -EFAULT;
> +		goto out;
> +	}

... this could then disappear.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU
  2021-07-09 12:46   ` Robin Murphy
@ 2021-07-09 13:34     ` Wei Liu
  0 siblings, 0 replies; 35+ messages in thread
From: Wei Liu @ 2021-07-09 13:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Wei Liu, Linux on Hyper-V List, K. Y. Srinivasan,
	Stephen Hemminger, pasha.tatashin, kumarpraveen, Will Deacon,
	Haiyang Zhang, Dexuan Cui, Linux Kernel List, Michael Kelley,
	open list:IOMMU DRIVERS, Nuno Das Neves, Sunil Muthuswamy,
	virtualization, Vineeth Pillai

On Fri, Jul 09, 2021 at 01:46:19PM +0100, Robin Murphy wrote:
> On 2021-07-09 12:43, Wei Liu wrote:
> > Some devices may have been claimed by the hypervisor already. One such
> > example is a user can assign a NIC for debugging purpose.
> > 
> > Ideally Linux should be able to tell retrieve that information, but
> > there is no way to do that yet. And designing that new mechanism is
> > going to take time.
> > 
> > Provide a command line option for skipping devices. This is a stopgap
> > solution, so it is intentionally undocumented. Hopefully we can retire
> > it in the future.
> 
> Huh? If the host is using a device, why the heck is it exposing any
> knowledge of that device to the guest at all, let alone allowing the guest
> to do anything that could affect its operation!?

The host in this setup consists of the hypervisor, the root kernel and a
bunch of user space programs.

Root is not an ordinary guest. It does need to know all the hardware to
manage the platform. Hypervisor does not claim more devices than it
needs to, nor does it try to hide hardware details from the root.

The hypervisor can protect itself just fine. Any attempt to use the
already claimed devices will be blocked or rejected, so are the attempts
to attach them to device domains.

That, however, leads to some interesting interactions between the
hypervisor and Linux kernel.  When kernel initializes IOMMU during boot,
it will try to attach all devices in one go. Any failure there will
cause kernel to detach the already attached devices. That's not fatal to
kernel, and is only a minor annoyance to our current use case, because
the default domain is a passthrough domain anyway. It will become
problematic once we switch the default domain to a DMA domain to further
tighten security during Linux boot.

Wei.

> 
> Robin.
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 3/8] intel/vt-d: make DMAR table parsing code more flexible
  2021-07-09 12:56   ` Robin Murphy
@ 2021-07-09 13:42     ` Wei Liu
  0 siblings, 0 replies; 35+ messages in thread
From: Wei Liu @ 2021-07-09 13:42 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Wei Liu, Linux on Hyper-V List, pasha.tatashin, Will Deacon,
	kumarpraveen, David Woodhouse, Linux Kernel List, Michael Kelley,
	open list:INTEL IOMMU VT-d, Nuno Das Neves, Sunil Muthuswamy,
	virtualization, Vineeth Pillai, Jean-Philippe Brucker

On Fri, Jul 09, 2021 at 01:56:46PM +0100, Robin Murphy wrote:
> On 2021-07-09 12:43, Wei Liu wrote:
> > Microsoft Hypervisor provides a set of hypercalls to manage device
> > domains. The root kernel should parse the DMAR so that it can program
> > the IOMMU (with hypercalls) correctly.
> > 
> > The DMAR code was designed to work with Intel IOMMU only. Add two more
> > parameters to make it useful to Microsoft Hypervisor. Microsoft
> > Hypervisor does not need the DMAR parsing code to allocate an Intel
> > IOMMU structure; it also wishes to always reparse the DMAR table even
> > after it has been parsed before.
> 
> We've recently defined the VIOT table for describing paravirtualised IOMMUs
> - would it make more sense to extend that to support the Microsoft
> implementation than to abuse a hardware-specific table? Am I right in

I searched for VIOT and believed I found the correct link
https://lwn.net/Articles/859291/. My understanding is based on the
reading of that series.

VIOT is useful. I think it solves the problem for guests.

It does not solve the problem we have though. The DMAR tables are not
conjured up by some backend software running on the host side. They are
the real tables provided by the firmware. The kernel here is part of the
host setup, dealing with physical hardware.

No matter how much I wish all vendors unified their tables, I don't see
how that's going to happen for readily available servers. :-(

> assuming said hypervisor isn't intended to only ever run on Intel hardware?

Yes, that's correct. We also plan to add support AMD and ARM64.

Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 7/8] mshv: implement in-kernel device framework
  2021-07-09 13:02   ` Matthew Wilcox
@ 2021-07-09 13:50     ` Wei Liu
  2021-07-09 15:32       ` Matthew Wilcox
  0 siblings, 1 reply; 35+ messages in thread
From: Wei Liu @ 2021-07-09 13:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Wei Liu, Linux on Hyper-V List, virtualization,
	Linux Kernel List, Michael Kelley, Vineeth Pillai,
	Sunil Muthuswamy, Nuno Das Neves, kumarpraveen, pasha.tatashin,
	Jonathan Corbet, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Dexuan Cui, Lillian Grassin-Drake,
	Muminul Islam, open list:DOCUMENTATION

On Fri, Jul 09, 2021 at 02:02:04PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 09, 2021 at 11:43:38AM +0000, Wei Liu wrote:
> > +static long
> > +mshv_partition_ioctl_create_device(struct mshv_partition *partition,
> > +	void __user *user_args)
> > +{
> [...]
> > +	mshv_partition_get(partition);
> > +	r = anon_inode_getfd(ops->name, &mshv_device_fops, dev, O_RDWR | O_CLOEXEC);
> > +	if (r < 0) {
> > +		mshv_partition_put_no_destroy(partition);
> > +		list_del(&dev->partition_node);
> > +		ops->destroy(dev);
> > +		goto out;
> > +	}
> > +
> > +	cd->fd = r;
> > +	r = 0;
> 
> Why return the fd in memory instead of returning the fd as the return
> value from the ioctl?
> 
> > +	if (copy_to_user(user_args, &tmp, sizeof(tmp))) {
> > +		r = -EFAULT;
> > +		goto out;
> > +	}
> 
> ... this could then disappear.

Thanks for your comment, Matthew.

This is intentionally because I didn't want to deviate from KVM's API.
The fewer differences the better.

Wei.


> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 4/8] intel/vt-d: export intel_iommu_get_resv_regions
  2021-07-09 11:43 ` [RFC v1 4/8] intel/vt-d: export intel_iommu_get_resv_regions Wei Liu
@ 2021-07-09 14:17   ` Lu Baolu
  2021-07-09 14:21     ` Wei Liu
  0 siblings, 1 reply; 35+ messages in thread
From: Lu Baolu @ 2021-07-09 14:17 UTC (permalink / raw)
  To: Wei Liu, Linux on Hyper-V List
  Cc: baolu.lu, virtualization, Linux Kernel List, Michael Kelley,
	Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves, kumarpraveen,
	pasha.tatashin, David Woodhouse, Joerg Roedel, Will Deacon,
	open list:INTEL IOMMU (VT-d)

On 2021/7/9 19:43, Wei Liu wrote:
> When Microsoft Hypervisor runs on Intel platforms it needs to know the
> reserved regions to program devices correctly. There is no reason to
> duplicate intel_iommu_get_resv_regions. Export it.

Why not using iommu_get_resv_regions()?

Best regards,
baolu

> 
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> ---
>   drivers/iommu/intel/iommu.c | 5 +++--
>   include/linux/intel-iommu.h | 4 ++++
>   2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index a4294d310b93..01973bc20080 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -5176,8 +5176,8 @@ static void intel_iommu_probe_finalize(struct device *dev)
>   		set_dma_ops(dev, NULL);
>   }
>   
> -static void intel_iommu_get_resv_regions(struct device *device,
> -					 struct list_head *head)
> +void intel_iommu_get_resv_regions(struct device *device,
> +				 struct list_head *head)
>   {
>   	int prot = DMA_PTE_READ | DMA_PTE_WRITE;
>   	struct iommu_resv_region *reg;
> @@ -5232,6 +5232,7 @@ static void intel_iommu_get_resv_regions(struct device *device,
>   		return;
>   	list_add_tail(&reg->list, head);
>   }
> +EXPORT_SYMBOL_GPL(intel_iommu_get_resv_regions);
>   
>   int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev)
>   {
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 03faf20a6817..f91869f765bc 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -814,6 +814,8 @@ extern int iommu_calculate_max_sagaw(struct intel_iommu *iommu);
>   extern int dmar_disabled;
>   extern int intel_iommu_enabled;
>   extern int intel_iommu_gfx_mapped;
> +extern void intel_iommu_get_resv_regions(struct device *device,
> +				 struct list_head *head);
>   #else
>   static inline int iommu_calculate_agaw(struct intel_iommu *iommu)
>   {
> @@ -825,6 +827,8 @@ static inline int iommu_calculate_max_sagaw(struct intel_iommu *iommu)
>   }
>   #define dmar_disabled	(1)
>   #define intel_iommu_enabled (0)
> +static inline void intel_iommu_get_resv_regions(struct device *device,
> +				 struct list_head *head) {}
>   #endif
>   
>   #endif
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 4/8] intel/vt-d: export intel_iommu_get_resv_regions
  2021-07-09 14:17   ` Lu Baolu
@ 2021-07-09 14:21     ` Wei Liu
  0 siblings, 0 replies; 35+ messages in thread
From: Wei Liu @ 2021-07-09 14:21 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Wei Liu, Linux on Hyper-V List, virtualization,
	Linux Kernel List, Michael Kelley, Vineeth Pillai,
	Sunil Muthuswamy, Nuno Das Neves, kumarpraveen, pasha.tatashin,
	David Woodhouse, Joerg Roedel, Will Deacon,
	open list:INTEL IOMMU (VT-d)

On Fri, Jul 09, 2021 at 10:17:25PM +0800, Lu Baolu wrote:
> On 2021/7/9 19:43, Wei Liu wrote:
> > When Microsoft Hypervisor runs on Intel platforms it needs to know the
> > reserved regions to program devices correctly. There is no reason to
> > duplicate intel_iommu_get_resv_regions. Export it.
> 
> Why not using iommu_get_resv_regions()?

That calls into ops->get_resv_regions.

In this patch series, get_resv_regions is hv_iommu_resv_regions, which
wants to use intel_iommu_get_resv_regions when it detects the underlying
hardware platform is from Intel.

Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 7/8] mshv: implement in-kernel device framework
  2021-07-09 13:50     ` Wei Liu
@ 2021-07-09 15:32       ` Matthew Wilcox
  2021-07-09 16:27         ` Wei Liu
  0 siblings, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2021-07-09 15:32 UTC (permalink / raw)
  To: Wei Liu
  Cc: Linux on Hyper-V List, virtualization, Linux Kernel List,
	Michael Kelley, Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves,
	kumarpraveen, pasha.tatashin, Jonathan Corbet, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	Lillian Grassin-Drake, Muminul Islam, open list:DOCUMENTATION

On Fri, Jul 09, 2021 at 01:50:13PM +0000, Wei Liu wrote:
> On Fri, Jul 09, 2021 at 02:02:04PM +0100, Matthew Wilcox wrote:
> > On Fri, Jul 09, 2021 at 11:43:38AM +0000, Wei Liu wrote:
> > > +static long
> > > +mshv_partition_ioctl_create_device(struct mshv_partition *partition,
> > > +	void __user *user_args)
> > > +{
> > [...]
> > > +	mshv_partition_get(partition);
> > > +	r = anon_inode_getfd(ops->name, &mshv_device_fops, dev, O_RDWR | O_CLOEXEC);
> > > +	if (r < 0) {
> > > +		mshv_partition_put_no_destroy(partition);
> > > +		list_del(&dev->partition_node);
> > > +		ops->destroy(dev);
> > > +		goto out;
> > > +	}
> > > +
> > > +	cd->fd = r;
> > > +	r = 0;
> > 
> > Why return the fd in memory instead of returning the fd as the return
> > value from the ioctl?
> > 
> > > +	if (copy_to_user(user_args, &tmp, sizeof(tmp))) {
> > > +		r = -EFAULT;
> > > +		goto out;
> > > +	}
> > 
> > ... this could then disappear.
> 
> Thanks for your comment, Matthew.
> 
> This is intentionally because I didn't want to deviate from KVM's API.
> The fewer differences the better.

Then don't define your own structure.  Use theirs.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 7/8] mshv: implement in-kernel device framework
  2021-07-09 15:32       ` Matthew Wilcox
@ 2021-07-09 16:27         ` Wei Liu
  2021-07-09 16:38           ` Matthew Wilcox
  0 siblings, 1 reply; 35+ messages in thread
From: Wei Liu @ 2021-07-09 16:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Wei Liu, Linux on Hyper-V List, virtualization,
	Linux Kernel List, Michael Kelley, Vineeth Pillai,
	Sunil Muthuswamy, Nuno Das Neves, kumarpraveen, pasha.tatashin,
	Jonathan Corbet, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Dexuan Cui, Lillian Grassin-Drake,
	Muminul Islam, open list:DOCUMENTATION

On Fri, Jul 09, 2021 at 04:32:48PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 09, 2021 at 01:50:13PM +0000, Wei Liu wrote:
> > On Fri, Jul 09, 2021 at 02:02:04PM +0100, Matthew Wilcox wrote:
> > > On Fri, Jul 09, 2021 at 11:43:38AM +0000, Wei Liu wrote:
> > > > +static long
> > > > +mshv_partition_ioctl_create_device(struct mshv_partition *partition,
> > > > +	void __user *user_args)
> > > > +{
> > > [...]
> > > > +	mshv_partition_get(partition);
> > > > +	r = anon_inode_getfd(ops->name, &mshv_device_fops, dev, O_RDWR | O_CLOEXEC);
> > > > +	if (r < 0) {
> > > > +		mshv_partition_put_no_destroy(partition);
> > > > +		list_del(&dev->partition_node);
> > > > +		ops->destroy(dev);
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	cd->fd = r;
> > > > +	r = 0;
> > > 
> > > Why return the fd in memory instead of returning the fd as the return
> > > value from the ioctl?
> > > 
> > > > +	if (copy_to_user(user_args, &tmp, sizeof(tmp))) {
> > > > +		r = -EFAULT;
> > > > +		goto out;
> > > > +	}
> > > 
> > > ... this could then disappear.
> > 
> > Thanks for your comment, Matthew.
> > 
> > This is intentionally because I didn't want to deviate from KVM's API.
> > The fewer differences the better.
> 
> Then don't define your own structure.  Use theirs.

I specifically mentioned in the cover letter I didn't do it because I
was not sure if that would be acceptable. I guess I will find out.

Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 7/8] mshv: implement in-kernel device framework
  2021-07-09 16:27         ` Wei Liu
@ 2021-07-09 16:38           ` Matthew Wilcox
  2021-07-09 19:14             ` Wei Liu
  0 siblings, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2021-07-09 16:38 UTC (permalink / raw)
  To: Wei Liu
  Cc: Linux on Hyper-V List, virtualization, Linux Kernel List,
	Michael Kelley, Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves,
	kumarpraveen, pasha.tatashin, Jonathan Corbet, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	Lillian Grassin-Drake, Muminul Islam, open list:DOCUMENTATION

On Fri, Jul 09, 2021 at 04:27:32PM +0000, Wei Liu wrote:
> > Then don't define your own structure.  Use theirs.
> 
> I specifically mentioned in the cover letter I didn't do it because I
> was not sure if that would be acceptable. I guess I will find out.

I only got patch 7/8.  You can't blame me for not reading 0/8 if you
didn't send me 0/8.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 7/8] mshv: implement in-kernel device framework
  2021-07-09 16:38           ` Matthew Wilcox
@ 2021-07-09 19:14             ` Wei Liu
  2021-07-09 19:48               ` Matthew Wilcox
  0 siblings, 1 reply; 35+ messages in thread
From: Wei Liu @ 2021-07-09 19:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Wei Liu, Linux on Hyper-V List, virtualization,
	Linux Kernel List, Michael Kelley, Vineeth Pillai,
	Sunil Muthuswamy, Nuno Das Neves, kumarpraveen, pasha.tatashin,
	Jonathan Corbet, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Dexuan Cui, Lillian Grassin-Drake,
	Muminul Islam, open list:DOCUMENTATION

specifically On Fri, Jul 09, 2021 at 05:38:24PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 09, 2021 at 04:27:32PM +0000, Wei Liu wrote:
> > > Then don't define your own structure.  Use theirs.
> > 
> > I specifically mentioned in the cover letter I didn't do it because I
> > was not sure if that would be acceptable. I guess I will find out.
> 
> I only got patch 7/8.  You can't blame me for not reading 0/8 if you
> didn't send me 0/8.

To be clear I did not blame you.

You were not CC'ed on this patch, so presumably you got it via one of
the mailing lists. I'm not sure why you only got this one patch. Perhaps
if you wait a bit you will get the rest.

But it is all good. I appreciate the fact that you took the time to go
over this patch and voiced your opinion.

Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 7/8] mshv: implement in-kernel device framework
  2021-07-09 19:14             ` Wei Liu
@ 2021-07-09 19:48               ` Matthew Wilcox
  2021-07-09 20:11                 ` Wei Liu
  0 siblings, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2021-07-09 19:48 UTC (permalink / raw)
  To: Wei Liu
  Cc: Linux on Hyper-V List, virtualization, Linux Kernel List,
	Michael Kelley, Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves,
	kumarpraveen, pasha.tatashin, Jonathan Corbet, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	Lillian Grassin-Drake, Muminul Islam, open list:DOCUMENTATION

On Fri, Jul 09, 2021 at 07:14:05PM +0000, Wei Liu wrote:
> You were not CC'ed on this patch, so presumably you got it via one of
> the mailing lists. I'm not sure why you only got this one patch. Perhaps
> if you wait a bit you will get the rest.

No, I won't.  You only cc'd linux-doc on this one patch and not on any
of the others.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 7/8] mshv: implement in-kernel device framework
  2021-07-09 19:48               ` Matthew Wilcox
@ 2021-07-09 20:11                 ` Wei Liu
  0 siblings, 0 replies; 35+ messages in thread
From: Wei Liu @ 2021-07-09 20:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Wei Liu, Linux on Hyper-V List, virtualization,
	Linux Kernel List, Michael Kelley, Vineeth Pillai,
	Sunil Muthuswamy, Nuno Das Neves, kumarpraveen, pasha.tatashin,
	Jonathan Corbet, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Dexuan Cui, Lillian Grassin-Drake,
	Muminul Islam, open list:DOCUMENTATION

On Fri, Jul 09, 2021 at 08:48:36PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 09, 2021 at 07:14:05PM +0000, Wei Liu wrote:
> > You were not CC'ed on this patch, so presumably you got it via one of
> > the mailing lists. I'm not sure why you only got this one patch. Perhaps
> > if you wait a bit you will get the rest.
> 
> No, I won't.  You only cc'd linux-doc on this one patch and not on any
> of the others.

I see. That's because get_maintainers.pl only detected changes to the
doc in this file.

If you're interested in having a look at the other patches, they can
be fetched via at the following url. 

https://lore.kernel.org/linux-hyperv/20210709114339.3467637-1-wei.liu@kernel.org/

Thanks for your attention.

Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 5/8] mshv: add paravirtualized IOMMU support
  2021-07-09 11:43 ` [RFC v1 5/8] mshv: add paravirtualized IOMMU support Wei Liu
@ 2021-08-03 18:40   ` Praveen Kumar
  2021-08-03 21:47     ` Wei Liu
  0 siblings, 1 reply; 35+ messages in thread
From: Praveen Kumar @ 2021-08-03 18:40 UTC (permalink / raw)
  To: Wei Liu, Linux on Hyper-V List
  Cc: virtualization, Linux Kernel List, Michael Kelley,
	Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves, pasha.tatashin,
	Joerg Roedel, Will Deacon, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Dexuan Cui, open list:IOMMU DRIVERS

On 09-07-2021 17:13, Wei Liu wrote:
> +static void hv_iommu_domain_free(struct iommu_domain *d)
> +{
> +	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> +	unsigned long flags;
> +	u64 status;
> +	struct hv_input_delete_device_domain *input;
> +
> +	if (is_identity_domain(domain) || is_null_domain(domain))
> +		return;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain= domain->device_domain;
> +
> +	status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input, NULL);
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);

Is it OK to deallocate the resources, if hypercall has failed ?
Do we have any specific error code EBUSY (kind of) which we need to wait upon ?

> +
> +	ida_free(&domain->hv_iommu->domain_ids, domain->device_domain.domain_id.id);
> +
> +	iommu_put_dma_cookie(d);
> +
> +	kfree(domain);
> +}
> +
> +static int hv_iommu_attach_dev(struct iommu_domain *d, struct device *dev)
> +{
> +	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_attach_device_domain *input;
> +	struct pci_dev *pdev;
> +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +
> +	/* Only allow PCI devices for now */
> +	if (!dev_is_pci(dev))
> +		return -EINVAL;
> +
> +	pdev = to_pci_dev(dev);
> +
> +	dev_dbg(dev, "Attaching (%strusted) to %d\n", pdev->untrusted ? "un" : "",
> +		domain->device_domain.domain_id.id);
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain = domain->device_domain;
> +	input->device_id = hv_build_pci_dev_id(pdev);
> +
> +	status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);

Does it make sense to vdev->domain = NULL ?

> +	else
> +		vdev->domain = domain;
> +
> +	return hv_status_to_errno(status);
> +}
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *d, struct device *dev)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_detach_device_domain *input;
> +	struct pci_dev *pdev;
> +	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +
> +	/* See the attach function, only PCI devices for now */
> +	if (!dev_is_pci(dev))
> +		return;
> +
> +	pdev = to_pci_dev(dev);
> +
> +	dev_dbg(dev, "Detaching from %d\n", domain->device_domain.domain_id.id);
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->partition_id = HV_PARTITION_ID_SELF;
> +	input->device_id = hv_build_pci_dev_id(pdev);
> +
> +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +
> +	vdev->domain = NULL;
> +}
> +
> +static int hv_iommu_add_mapping(struct hv_iommu_domain *domain, unsigned long iova,
> +		phys_addr_t paddr, size_t size, u32 flags)
> +{
> +	unsigned long irqflags;
> +	struct hv_iommu_mapping *mapping;
> +
> +	mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
> +	if (!mapping)
> +		return -ENOMEM;
> +
> +	mapping->paddr = paddr;
> +	mapping->iova.start = iova;
> +	mapping->iova.last = iova + size - 1;
> +	mapping->flags = flags;
> +
> +	spin_lock_irqsave(&domain->mappings_lock, irqflags);
> +	interval_tree_insert(&mapping->iova, &domain->mappings);
> +	spin_unlock_irqrestore(&domain->mappings_lock, irqflags);
> +
> +	return 0;
> +}
> +
> +static size_t hv_iommu_del_mappings(struct hv_iommu_domain *domain,
> +		unsigned long iova, size_t size)
> +{
> +	unsigned long flags;
> +	size_t unmapped = 0;
> +	unsigned long last = iova + size - 1;
> +	struct hv_iommu_mapping *mapping = NULL;
> +	struct interval_tree_node *node, *next;
> +
> +	spin_lock_irqsave(&domain->mappings_lock, flags);
> +	next = interval_tree_iter_first(&domain->mappings, iova, last);
> +	while (next) {
> +		node = next;
> +		mapping = container_of(node, struct hv_iommu_mapping, iova);
> +		next = interval_tree_iter_next(node, iova, last);
> +
> +		/* Trying to split a mapping? Not supported for now. */
> +		if (mapping->iova.start < iova)
> +			break;
> +
> +		unmapped += mapping->iova.last - mapping->iova.start + 1;
> +
> +		interval_tree_remove(node, &domain->mappings);
> +		kfree(mapping);
> +	}
> +	spin_unlock_irqrestore(&domain->mappings_lock, flags);
> +
> +	return unmapped;
> +}
> +
> +static int hv_iommu_map(struct iommu_domain *d, unsigned long iova,
> +			phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
> +{
> +	u32 map_flags;
> +	unsigned long flags, pfn, npages;
> +	int ret, i;
> +	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> +	struct hv_input_map_device_gpa_pages *input;
> +	u64 status;
> +
> +	/* Reject size that's not a whole page */
> +	if (size & ~HV_HYP_PAGE_MASK)
> +		return -EINVAL;
> +
> +	map_flags = HV_MAP_GPA_READABLE; /* Always required */
> +	map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
> +
> +	ret = hv_iommu_add_mapping(domain, iova, paddr, size, flags);
> +	if (ret)
> +		return ret;
> +
> +	npages = size >> HV_HYP_PAGE_SHIFT;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain = domain->device_domain;
> +	input->map_flags = map_flags;
> +	input->target_device_va_base = iova;
> +
> +	pfn = paddr >> HV_HYP_PAGE_SHIFT;
> +	for (i = 0; i < npages; i++) {
> +		input->gpa_page_list[i] = pfn;
> +		pfn += 1;
> +	}
> +
> +	status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES, npages, 0,
> +			input, NULL);
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status)) {
> +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +		hv_iommu_del_mappings(domain, iova, size);
> +	}
> +
> +	return hv_status_to_errno(status);
> +}
> +
> +static size_t hv_iommu_unmap(struct iommu_domain *d, unsigned long iova,
> +			   size_t size, struct iommu_iotlb_gather *gather)
> +{
> +	size_t unmapped;
> +	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> +	unsigned long flags, npages;
> +	struct hv_input_unmap_device_gpa_pages *input;
> +	u64 status;
> +
> +	unmapped = hv_iommu_del_mappings(domain, iova, size);
> +	if (unmapped < size)
> +		return 0;

Is there a case where unmapped > 0 && unmapped < size ?

> +
> +	npages = size >> HV_HYP_PAGE_SHIFT;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain = domain->device_domain;
> +	input->target_device_va_base = iova;
> +
> +	/* Unmap `npages` pages starting from VA base */
> +	status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
> +			0, input, NULL);
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +
> +	return hv_result_success(status) ? unmapped : 0;
> +}
> +

Regards,

~Praveen.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU
  2021-07-09 11:43 ` [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU Wei Liu
  2021-07-09 12:46   ` Robin Murphy
@ 2021-08-03 18:50   ` Praveen Kumar
  2021-08-03 21:56     ` Wei Liu
  1 sibling, 1 reply; 35+ messages in thread
From: Praveen Kumar @ 2021-08-03 18:50 UTC (permalink / raw)
  To: Wei Liu, Linux on Hyper-V List
  Cc: virtualization, Linux Kernel List, Michael Kelley,
	Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves, pasha.tatashin,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	Joerg Roedel, Will Deacon, open list:IOMMU DRIVERS

On 09-07-2021 17:13, Wei Liu wrote:
> Some devices may have been claimed by the hypervisor already. One such
> example is a user can assign a NIC for debugging purpose.
> 
> Ideally Linux should be able to tell retrieve that information, but
> there is no way to do that yet. And designing that new mechanism is
> going to take time.
> 
> Provide a command line option for skipping devices. This is a stopgap
> solution, so it is intentionally undocumented. Hopefully we can retire
> it in the future.
> 
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> ---
>  drivers/iommu/hyperv-iommu.c | 45 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> index 043dcff06511..353da5036387 100644
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -349,6 +349,16 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
>  
>  #ifdef CONFIG_HYPERV_ROOT_PVIOMMU
>  
> +/* The IOMMU will not claim these PCI devices. */
> +static char *pci_devs_to_skip;
> +static int __init mshv_iommu_setup_skip(char *str) {
> +	pci_devs_to_skip = str;
> +
> +	return 0;
> +}
> +/* mshv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
> +__setup("mshv_iommu_skip=", mshv_iommu_setup_skip);
> +
>  /* DMA remapping support */
>  struct hv_iommu_domain {
>  	struct iommu_domain domain;
> @@ -774,6 +784,41 @@ static struct iommu_device *hv_iommu_probe_device(struct device *dev)
>  	if (!dev_is_pci(dev))
>  		return ERR_PTR(-ENODEV);
>  
> +	/*
> +	 * Skip the PCI device specified in `pci_devs_to_skip`. This is a
> +	 * temporary solution until we figure out a way to extract information
> +	 * from the hypervisor what devices it is already using.
> +	 */
> +	if (pci_devs_to_skip && *pci_devs_to_skip) {
> +		int pos = 0;
> +		int parsed;
> +		int segment, bus, slot, func;
> +		struct pci_dev *pdev = to_pci_dev(dev);
> +
> +		do {
> +			parsed = 0;
> +
> +			sscanf(pci_devs_to_skip + pos,
> +				" (%x:%x:%x.%x) %n",
> +				&segment, &bus, &slot, &func, &parsed);
> +
> +			if (parsed <= 0)
> +				break;
> +
> +			if (pci_domain_nr(pdev->bus) == segment &&
> +				pdev->bus->number == bus &&
> +				PCI_SLOT(pdev->devfn) == slot &&
> +				PCI_FUNC(pdev->devfn) == func)
> +			{
> +				dev_info(dev, "skipped by MSHV IOMMU\n");
> +				return ERR_PTR(-ENODEV);
> +			}
> +
> +			pos += parsed;
> +
> +		} while (pci_devs_to_skip[pos]);

Is there a possibility of pci_devs_to_skip + pos > sizeof(pci_devs_to_skip) and also a valid memory ?
I would recommend to have a check of size as well before accessing the array content, just to be safer accessing any memory.

> +	}
> +
>  	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
>  	if (!vdev)
>  		return ERR_PTR(-ENOMEM);
> 

Regards,

~Praveen.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 7/8] mshv: implement in-kernel device framework
  2021-07-09 11:43 ` [RFC v1 7/8] mshv: implement in-kernel device framework Wei Liu
  2021-07-09 13:02   ` Matthew Wilcox
@ 2021-08-03 19:12   ` Praveen Kumar
  2021-08-03 22:04     ` Wei Liu
  1 sibling, 1 reply; 35+ messages in thread
From: Praveen Kumar @ 2021-08-03 19:12 UTC (permalink / raw)
  To: Wei Liu, Linux on Hyper-V List
  Cc: virtualization, Linux Kernel List, Michael Kelley,
	Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves, pasha.tatashin,
	Jonathan Corbet, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Dexuan Cui, Lillian Grassin-Drake,
	Muminul Islam, open list:DOCUMENTATION

On 09-07-2021 17:13, Wei Liu wrote:
> This is basically the same code adopted from KVM. The main user case is
> the future MSHV-VFIO bridge device. We don't have any plan to support
> in-kernel device emulation yet, but it wouldn't hurt to make the code
> more flexible.
> 
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> ---
>  Documentation/virt/mshv/api.rst |  12 +++
>  drivers/hv/mshv_main.c          | 181 ++++++++++++++++++++++++++++++++
>  include/linux/mshv.h            |  57 ++++++++++
>  include/uapi/linux/mshv.h       |  36 +++++++
>  4 files changed, 286 insertions(+)
> 
> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
> index 56a6edfcfe29..7d35dd589831 100644
> --- a/Documentation/virt/mshv/api.rst
> +++ b/Documentation/virt/mshv/api.rst
> @@ -170,4 +170,16 @@ Can be used to get/set various properties of a partition.
>  Some properties can only be set at partition creation. For these, there are
>  parameters in MSHV_CREATE_PARTITION.
>  
> +3.12 MSHV_CREATE_DEVICE
> +-----------------------
> +:Type: partition ioctl
> +:Parameters: struct mshv_create_device
> +:Returns: 0 on success
> +
> +Can be used to create an in-kernel device.
> +
> +If the MSHV_CREATE_DEVICE_TEST flag is set, only test whether the
> +device type is supported (not necessarily whether it can be created
> +in the current vm).
>  
> +Currently only supports VFIO type device.
> diff --git a/drivers/hv/mshv_main.c b/drivers/hv/mshv_main.c
> index 4cbc520471e4..84c774a561de 100644
> --- a/drivers/hv/mshv_main.c
> +++ b/drivers/hv/mshv_main.c
> @@ -20,6 +20,8 @@
>  #include <linux/random.h>
>  #include <linux/mshv.h>
>  #include <linux/mshv_eventfd.h>
> +#include <linux/hyperv.h>
> +#include <linux/nospec.h>
>  #include <asm/mshyperv.h>
>  
>  #include "mshv.h"
> @@ -33,6 +35,7 @@ static int mshv_vp_release(struct inode *inode, struct file *filp);
>  static long mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
>  static struct mshv_partition *mshv_partition_get(struct mshv_partition *partition);
>  static void mshv_partition_put(struct mshv_partition *partition);
> +static void mshv_partition_put_no_destroy(struct mshv_partition *partition);
>  static int mshv_partition_release(struct inode *inode, struct file *filp);
>  static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
>  static int mshv_dev_open(struct inode *inode, struct file *filp);
> @@ -912,6 +915,172 @@ mshv_partition_ioctl_set_msi_routing(struct mshv_partition *partition,
>  	return ret;
>  }
>  
> +static int mshv_device_ioctl_attr(struct mshv_device *dev,
> +				 int (*accessor)(struct mshv_device *dev,
> +						 struct mshv_device_attr *attr),
> +				 unsigned long arg)
> +{
> +	struct mshv_device_attr attr;
> +
> +	if (!accessor)
> +		return -EPERM;
> +
> +	if (copy_from_user(&attr, (void __user *)arg, sizeof(attr)))
> +		return -EFAULT;
> +
> +	return accessor(dev, &attr);
> +}
> +
> +static long mshv_device_ioctl(struct file *filp, unsigned int ioctl,
> +			      unsigned long arg)
> +{
> +	struct mshv_device *dev = filp->private_data;
> +
> +	switch (ioctl) {
> +	case MSHV_SET_DEVICE_ATTR:
> +		return mshv_device_ioctl_attr(dev, dev->ops->set_attr, arg);
> +	case MSHV_GET_DEVICE_ATTR:
> +		return mshv_device_ioctl_attr(dev, dev->ops->get_attr, arg);
> +	case MSHV_HAS_DEVICE_ATTR:
> +		return mshv_device_ioctl_attr(dev, dev->ops->has_attr, arg);
> +	default:
> +		if (dev->ops->ioctl)
> +			return dev->ops->ioctl(dev, ioctl, arg);
> +
> +		return -ENOTTY;
> +	}

Have seen some static analyzer tool cribbing here of not returning any error.
If you feel OK, please move the 'return -ENOTTY' down after switch block. Thanks.

> +}
> +
> +static int mshv_device_release(struct inode *inode, struct file *filp)
> +{
> +	struct mshv_device *dev = filp->private_data;
> +	struct mshv_partition *partition = dev->partition;
> +
> +	if (dev->ops->release) {
> +		mutex_lock(&partition->mutex);
> +		list_del(&dev->partition_node);
> +		dev->ops->release(dev);
> +		mutex_unlock(&partition->mutex);
> +	}
> +
> +	mshv_partition_put(partition);
> +	return 0;
> +}
> +
> +static const struct file_operations mshv_device_fops = {
> +	.unlocked_ioctl = mshv_device_ioctl,
> +	.release = mshv_device_release,
> +};
> +
> +static const struct mshv_device_ops *mshv_device_ops_table[MSHV_DEV_TYPE_MAX];
> +
> +int mshv_register_device_ops(const struct mshv_device_ops *ops, u32 type)
> +{
> +	if (type >= ARRAY_SIZE(mshv_device_ops_table))
> +		return -ENOSPC;
> +
> +	if (mshv_device_ops_table[type] != NULL)
> +		return -EEXIST;
> +
> +	mshv_device_ops_table[type] = ops;
> +	return 0;
> +}
> +
> +void mshv_unregister_device_ops(u32 type)
> +{
> +	if (type >= ARRAY_SIZE(mshv_device_ops_table))
> +		return;
> +	mshv_device_ops_table[type] = NULL;
> +}
> +
> +static long
> +mshv_partition_ioctl_create_device(struct mshv_partition *partition,
> +	void __user *user_args)
> +{
> +	long r;
> +	struct mshv_create_device tmp, *cd;
> +	struct mshv_device *dev;
> +	const struct mshv_device_ops *ops;
> +	int type;
> +
> +	if (copy_from_user(&tmp, user_args, sizeof(tmp))) {
> +		r = -EFAULT;
> +		goto out;
> +	}
> +
> +	cd = &tmp;
> +
> +	if (cd->type >= ARRAY_SIZE(mshv_device_ops_table)) {
> +		r = -ENODEV;
> +		goto out;
> +	}
> +
> +	type = array_index_nospec(cd->type, ARRAY_SIZE(mshv_device_ops_table));
> +	ops = mshv_device_ops_table[type];
> +	if (ops == NULL) {
> +		r = -ENODEV;
> +		goto out;
> +	}
> +
> +	if (cd->flags & MSHV_CREATE_DEVICE_TEST) {
> +		r = 0;
> +		goto out;
> +	}
> +
> +	dev = kzalloc(sizeof(*dev), GFP_KERNEL_ACCOUNT);
> +	if (!dev) {
> +		r = -ENOMEM;
> +		goto out;
> +	}
> +
> +	dev->ops = ops;
> +	dev->partition = partition;
> +
> +	r = ops->create(dev, type);
> +	if (r < 0) {
> +		kfree(dev);
> +		goto out;
> +	}
> +
> +	list_add(&dev->partition_node, &partition->devices);
> +
> +	if (ops->init)
> +		ops->init(dev);
> +
> +	mshv_partition_get(partition);
> +	r = anon_inode_getfd(ops->name, &mshv_device_fops, dev, O_RDWR | O_CLOEXEC);
> +	if (r < 0) {
> +		mshv_partition_put_no_destroy(partition);
> +		list_del(&dev->partition_node);
> +		ops->destroy(dev);

I hope ops->destroy will free dev as well ?

> +		goto out;
> +	}
> +
> +	cd->fd = r;
> +	r = 0;
> +
> +	if (copy_to_user(user_args, &tmp, sizeof(tmp))) {
> +		r = -EFAULT;

I don't think we will be cleaning up anything ? Or do we need to?
> +		goto out;
> +	}
> +out:
> +	return r;
> +}
> +

Regards,

~Praveen.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 8/8] mshv: add vfio bridge device
  2021-07-09 11:43 ` [RFC v1 8/8] mshv: add vfio bridge device Wei Liu
@ 2021-08-03 19:27   ` Praveen Kumar
  2021-08-10 10:52     ` Wei Liu
  0 siblings, 1 reply; 35+ messages in thread
From: Praveen Kumar @ 2021-08-03 19:27 UTC (permalink / raw)
  To: Wei Liu, Linux on Hyper-V List
  Cc: virtualization, Linux Kernel List, Michael Kelley,
	Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves, pasha.tatashin,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Dexuan Cui

On 09-07-2021 17:13, Wei Liu wrote:
> +
> +static int mshv_vfio_set_group(struct mshv_device *dev, long attr, u64 arg)
> +{
> +	struct mshv_vfio *mv = dev->private;
> +	struct vfio_group *vfio_group;
> +	struct mshv_vfio_group *mvg;
> +	int32_t __user *argp = (int32_t __user *)(unsigned long)arg;
> +	struct fd f;
> +	int32_t fd;
> +	int ret;
> +
> +	switch (attr) {
> +	case MSHV_DEV_VFIO_GROUP_ADD:
> +		if (get_user(fd, argp))
> +			return -EFAULT;
> +
> +		f = fdget(fd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = mshv_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +
> +		mutex_lock(&mv->lock);
> +
> +		list_for_each_entry(mvg, &mv->group_list, node) {
> +			if (mvg->vfio_group == vfio_group) {
> +				mutex_unlock(&mv->lock);
> +				mshv_vfio_group_put_external_user(vfio_group);
> +				return -EEXIST;
> +			}
> +		}
> +
> +		mvg = kzalloc(sizeof(*mvg), GFP_KERNEL_ACCOUNT);
> +		if (!mvg) {
> +			mutex_unlock(&mv->lock);
> +			mshv_vfio_group_put_external_user(vfio_group);
> +			return -ENOMEM;
> +		}
> +
> +		list_add_tail(&mvg->node, &mv->group_list);
> +		mvg->vfio_group = vfio_group;
> +
> +		mutex_unlock(&mv->lock);
> +
> +		return 0;
> +
> +	case MSHV_DEV_VFIO_GROUP_DEL:
> +		if (get_user(fd, argp))
> +			return -EFAULT;
> +
> +		f = fdget(fd);
> +		if (!f.file)
> +			return -EBADF;

Can we move these both checks above switch statement and do fdput accordingly under both case statement accordingly?

> +
> +		ret = -ENOENT;
> +
> +		mutex_lock(&mv->lock);
> +
> +		list_for_each_entry(mvg, &mv->group_list, node) {
> +			if (!mshv_vfio_external_group_match_file(mvg->vfio_group,
> +								 f.file))
> +				continue;
> +
> +			list_del(&mvg->node);
> +			mshv_vfio_group_put_external_user(mvg->vfio_group);
> +			kfree(mvg);
> +			ret = 0;
> +			break;
> +		}
> +
> +		mutex_unlock(&mv->lock);
> +
> +		fdput(f);
> +
> +		return ret;
> +	}
> +
> +	return -ENXIO;
> +}
> +
> +static int mshv_vfio_set_attr(struct mshv_device *dev,
> +			      struct mshv_device_attr *attr)
> +{
> +	switch (attr->group) {
> +	case MSHV_DEV_VFIO_GROUP:
> +		return mshv_vfio_set_group(dev, attr->attr, attr->addr);
> +	}
> +
> +	return -ENXIO;
> +}
> +
> +static int mshv_vfio_has_attr(struct mshv_device *dev,
> +			      struct mshv_device_attr *attr)
> +{
> +	switch (attr->group) {
> +	case MSHV_DEV_VFIO_GROUP:
> +		switch (attr->attr) {
> +		case MSHV_DEV_VFIO_GROUP_ADD:
> +		case MSHV_DEV_VFIO_GROUP_DEL:
> +			return 0;
> +		}
> +
> +		break;

do we need this break statement ? If not, lets remove it.
> +	}
> +
> +	return -ENXIO;
> +}
> +
> +static void mshv_vfio_destroy(struct mshv_device *dev)
> +{
> +	struct mshv_vfio *mv = dev->private;
> +	struct mshv_vfio_group *mvg, *tmp;
> +
> +	list_for_each_entry_safe(mvg, tmp, &mv->group_list, node) {
> +		mshv_vfio_group_put_external_user(mvg->vfio_group);
> +		list_del(&mvg->node);
> +		kfree(mvg);
> +	}
> +
> +	kfree(mv);
> +	kfree(dev);

We are freeing up dev. Please ignore my comment in caller patch. Thanks.

> +}
> +
> +static int mshv_vfio_create(struct mshv_device *dev, u32 type);
> +
> +static struct mshv_device_ops mshv_vfio_ops = {
> +	.name = "mshv-vfio",
> +	.create = mshv_vfio_create,
> +	.destroy = mshv_vfio_destroy,
> +	.set_attr = mshv_vfio_set_attr,
> +	.has_attr = mshv_vfio_has_attr,
> +};

Regards,

~Praveen.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 5/8] mshv: add paravirtualized IOMMU support
  2021-08-03 18:40   ` Praveen Kumar
@ 2021-08-03 21:47     ` Wei Liu
  2021-08-04  6:43       ` Praveen Kumar
  0 siblings, 1 reply; 35+ messages in thread
From: Wei Liu @ 2021-08-03 21:47 UTC (permalink / raw)
  To: Praveen Kumar
  Cc: Wei Liu, Linux on Hyper-V List, virtualization,
	Linux Kernel List, Michael Kelley, Vineeth Pillai,
	Sunil Muthuswamy, Nuno Das Neves, pasha.tatashin, Joerg Roedel,
	Will Deacon, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Dexuan Cui, open list:IOMMU DRIVERS

On Wed, Aug 04, 2021 at 12:10:45AM +0530, Praveen Kumar wrote:
> On 09-07-2021 17:13, Wei Liu wrote:
> > +static void hv_iommu_domain_free(struct iommu_domain *d)
> > +{
> > +	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> > +	unsigned long flags;
> > +	u64 status;
> > +	struct hv_input_delete_device_domain *input;
> > +
> > +	if (is_identity_domain(domain) || is_null_domain(domain))
> > +		return;
> > +
> > +	local_irq_save(flags);
> > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > +	memset(input, 0, sizeof(*input));
> > +
> > +	input->device_domain= domain->device_domain;
> > +
> > +	status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input, NULL);
> > +
> > +	local_irq_restore(flags);
> > +
> > +	if (!hv_result_success(status))
> > +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> 
> Is it OK to deallocate the resources, if hypercall has failed ?

It should be fine. We leak some resources in the hypervisor, but Linux
is in a rather wedged state anyway. Refusing to free up resources in
Linux does not much good.

> Do we have any specific error code EBUSY (kind of) which we need to wait upon ?
> 

I have not found a circumstance that can happen.

> > +
> > +	ida_free(&domain->hv_iommu->domain_ids, domain->device_domain.domain_id.id);
> > +
> > +	iommu_put_dma_cookie(d);
> > +
> > +	kfree(domain);
> > +}
> > +
> > +static int hv_iommu_attach_dev(struct iommu_domain *d, struct device *dev)
> > +{
> > +	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> > +	u64 status;
> > +	unsigned long flags;
> > +	struct hv_input_attach_device_domain *input;
> > +	struct pci_dev *pdev;
> > +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> > +
> > +	/* Only allow PCI devices for now */
> > +	if (!dev_is_pci(dev))
> > +		return -EINVAL;
> > +
> > +	pdev = to_pci_dev(dev);
> > +
> > +	dev_dbg(dev, "Attaching (%strusted) to %d\n", pdev->untrusted ? "un" : "",
> > +		domain->device_domain.domain_id.id);
> > +
> > +	local_irq_save(flags);
> > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > +	memset(input, 0, sizeof(*input));
> > +
> > +	input->device_domain = domain->device_domain;
> > +	input->device_id = hv_build_pci_dev_id(pdev);
> > +
> > +	status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
> > +	local_irq_restore(flags);
> > +
> > +	if (!hv_result_success(status))
> > +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> 
> Does it make sense to vdev->domain = NULL ?
>

It is already NULL -- there is no other code path that sets it and the
detach path always sets the field to NULL.

> > +	else
> > +		vdev->domain = domain;
> > +
> > +	return hv_status_to_errno(status);
> > +}
> > +
[...]
> > +static size_t hv_iommu_unmap(struct iommu_domain *d, unsigned long iova,
> > +			   size_t size, struct iommu_iotlb_gather *gather)
> > +{
> > +	size_t unmapped;
> > +	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> > +	unsigned long flags, npages;
> > +	struct hv_input_unmap_device_gpa_pages *input;
> > +	u64 status;
> > +
> > +	unmapped = hv_iommu_del_mappings(domain, iova, size);
> > +	if (unmapped < size)
> > +		return 0;
> 
> Is there a case where unmapped > 0 && unmapped < size ?
> 

There could be such a case -- hv_iommu_del_mappings' return value is >= 0.
Is there a problem with this predicate?

Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU
  2021-08-03 18:50   ` Praveen Kumar
@ 2021-08-03 21:56     ` Wei Liu
  2021-08-04  7:03       ` Praveen Kumar
  0 siblings, 1 reply; 35+ messages in thread
From: Wei Liu @ 2021-08-03 21:56 UTC (permalink / raw)
  To: Praveen Kumar
  Cc: Wei Liu, Linux on Hyper-V List, virtualization,
	Linux Kernel List, Michael Kelley, Vineeth Pillai,
	Sunil Muthuswamy, Nuno Das Neves, pasha.tatashin,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	Joerg Roedel, Will Deacon, open list:IOMMU DRIVERS

On Wed, Aug 04, 2021 at 12:20:42AM +0530, Praveen Kumar wrote:
> On 09-07-2021 17:13, Wei Liu wrote:
> > Some devices may have been claimed by the hypervisor already. One such
> > example is a user can assign a NIC for debugging purpose.
> > 
> > Ideally Linux should be able to tell retrieve that information, but
> > there is no way to do that yet. And designing that new mechanism is
> > going to take time.
> > 
> > Provide a command line option for skipping devices. This is a stopgap
> > solution, so it is intentionally undocumented. Hopefully we can retire
> > it in the future.
> > 
> > Signed-off-by: Wei Liu <wei.liu@kernel.org>
> > ---
> >  drivers/iommu/hyperv-iommu.c | 45 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 45 insertions(+)
> > 
> > diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> > index 043dcff06511..353da5036387 100644
> > --- a/drivers/iommu/hyperv-iommu.c
> > +++ b/drivers/iommu/hyperv-iommu.c
> > @@ -349,6 +349,16 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
> >  
> >  #ifdef CONFIG_HYPERV_ROOT_PVIOMMU
> >  
> > +/* The IOMMU will not claim these PCI devices. */
> > +static char *pci_devs_to_skip;
> > +static int __init mshv_iommu_setup_skip(char *str) {
> > +	pci_devs_to_skip = str;
> > +
> > +	return 0;
> > +}
> > +/* mshv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
> > +__setup("mshv_iommu_skip=", mshv_iommu_setup_skip);
> > +
> >  /* DMA remapping support */
> >  struct hv_iommu_domain {
> >  	struct iommu_domain domain;
> > @@ -774,6 +784,41 @@ static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> >  	if (!dev_is_pci(dev))
> >  		return ERR_PTR(-ENODEV);
> >  
> > +	/*
> > +	 * Skip the PCI device specified in `pci_devs_to_skip`. This is a
> > +	 * temporary solution until we figure out a way to extract information
> > +	 * from the hypervisor what devices it is already using.
> > +	 */
> > +	if (pci_devs_to_skip && *pci_devs_to_skip) {
> > +		int pos = 0;
> > +		int parsed;
> > +		int segment, bus, slot, func;
> > +		struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > +		do {
> > +			parsed = 0;
> > +
> > +			sscanf(pci_devs_to_skip + pos,
> > +				" (%x:%x:%x.%x) %n",
> > +				&segment, &bus, &slot, &func, &parsed);
> > +
> > +			if (parsed <= 0)
> > +				break;
> > +
> > +			if (pci_domain_nr(pdev->bus) == segment &&
> > +				pdev->bus->number == bus &&
> > +				PCI_SLOT(pdev->devfn) == slot &&
> > +				PCI_FUNC(pdev->devfn) == func)
> > +			{
> > +				dev_info(dev, "skipped by MSHV IOMMU\n");
> > +				return ERR_PTR(-ENODEV);
> > +			}
> > +
> > +			pos += parsed;
> > +
> > +		} while (pci_devs_to_skip[pos]);
> 
> Is there a possibility of pci_devs_to_skip + pos > sizeof(pci_devs_to_skip)
> and also a valid memory ?

pos should point to the last parsed position. If parsing fails pos does
not get updated and the code breaks out of the loop. If parsing is
success pos should point to either the start of next element of '\0'
(end of string). To me this is good enough.

> I would recommend to have a check of size as well before accessing the
> array content, just to be safer accessing any memory.
> 

What check do you have in mind?

Wei.

> > +	}
> > +
> >  	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> >  	if (!vdev)
> >  		return ERR_PTR(-ENOMEM);
> > 
> 
> Regards,
> 
> ~Praveen.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 7/8] mshv: implement in-kernel device framework
  2021-08-03 19:12   ` Praveen Kumar
@ 2021-08-03 22:04     ` Wei Liu
  0 siblings, 0 replies; 35+ messages in thread
From: Wei Liu @ 2021-08-03 22:04 UTC (permalink / raw)
  To: Praveen Kumar
  Cc: Wei Liu, Linux on Hyper-V List, virtualization,
	Linux Kernel List, Michael Kelley, Vineeth Pillai,
	Sunil Muthuswamy, Nuno Das Neves, pasha.tatashin,
	Jonathan Corbet, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Dexuan Cui, Lillian Grassin-Drake,
	Muminul Islam, open list:DOCUMENTATION

On Wed, Aug 04, 2021 at 12:42:22AM +0530, Praveen Kumar wrote:
> On 09-07-2021 17:13, Wei Liu wrote:
[...]
> > +static long mshv_device_ioctl(struct file *filp, unsigned int ioctl,
> > +			      unsigned long arg)
> > +{
> > +	struct mshv_device *dev = filp->private_data;
> > +
> > +	switch (ioctl) {
> > +	case MSHV_SET_DEVICE_ATTR:
> > +		return mshv_device_ioctl_attr(dev, dev->ops->set_attr, arg);
> > +	case MSHV_GET_DEVICE_ATTR:
> > +		return mshv_device_ioctl_attr(dev, dev->ops->get_attr, arg);
> > +	case MSHV_HAS_DEVICE_ATTR:
> > +		return mshv_device_ioctl_attr(dev, dev->ops->has_attr, arg);
> > +	default:
> > +		if (dev->ops->ioctl)
> > +			return dev->ops->ioctl(dev, ioctl, arg);
> > +
> > +		return -ENOTTY;
> > +	}
> 
> Have seen some static analyzer tool cribbing here of not returning any error.
> If you feel OK, please move the 'return -ENOTTY' down after switch block. Thanks.
> 

Fair point. I will make the change.

> > +}
> > +
[...]
> > +static long
> > +mshv_partition_ioctl_create_device(struct mshv_partition *partition,
> > +	void __user *user_args)
> > +{
> > +	long r;
> > +	struct mshv_create_device tmp, *cd;
> > +	struct mshv_device *dev;
> > +	const struct mshv_device_ops *ops;
> > +	int type;
> > +
> > +	if (copy_from_user(&tmp, user_args, sizeof(tmp))) {
> > +		r = -EFAULT;
> > +		goto out;
> > +	}
> > +
> > +	cd = &tmp;
> > +
> > +	if (cd->type >= ARRAY_SIZE(mshv_device_ops_table)) {
> > +		r = -ENODEV;
> > +		goto out;
> > +	}
> > +
> > +	type = array_index_nospec(cd->type, ARRAY_SIZE(mshv_device_ops_table));
> > +	ops = mshv_device_ops_table[type];
> > +	if (ops == NULL) {
> > +		r = -ENODEV;
> > +		goto out;
> > +	}
> > +
> > +	if (cd->flags & MSHV_CREATE_DEVICE_TEST) {
> > +		r = 0;
> > +		goto out;
> > +	}
> > +
> > +	dev = kzalloc(sizeof(*dev), GFP_KERNEL_ACCOUNT);
> > +	if (!dev) {
> > +		r = -ENOMEM;
> > +		goto out;
> > +	}
> > +
> > +	dev->ops = ops;
> > +	dev->partition = partition;
> > +
> > +	r = ops->create(dev, type);
> > +	if (r < 0) {
> > +		kfree(dev);
> > +		goto out;
> > +	}
> > +
> > +	list_add(&dev->partition_node, &partition->devices);
> > +
> > +	if (ops->init)
> > +		ops->init(dev);
> > +
> > +	mshv_partition_get(partition);
> > +	r = anon_inode_getfd(ops->name, &mshv_device_fops, dev, O_RDWR | O_CLOEXEC);
> > +	if (r < 0) {
> > +		mshv_partition_put_no_destroy(partition);
> > +		list_del(&dev->partition_node);
> > +		ops->destroy(dev);
> 
> I hope ops->destroy will free dev as well ?

Yes. It is clearly written in the preceding comment of that hook. I hope
that's prominent enough.

> 
> > +		goto out;
> > +	}
> > +
> > +	cd->fd = r;
> > +	r = 0;
> > +
> > +	if (copy_to_user(user_args, &tmp, sizeof(tmp))) {
> > +		r = -EFAULT;
> 
> I don't think we will be cleaning up anything ? Or do we need to?

No need. Whatever residuals left will be cleaned up once the VM is
destroyed.

Wei.

> > +		goto out;
> > +	}
> > +out:
> > +	return r;
> > +}
> > +
> 
> Regards,
> 
> ~Praveen.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 5/8] mshv: add paravirtualized IOMMU support
  2021-08-03 21:47     ` Wei Liu
@ 2021-08-04  6:43       ` Praveen Kumar
  2021-08-10 10:46         ` Wei Liu
  0 siblings, 1 reply; 35+ messages in thread
From: Praveen Kumar @ 2021-08-04  6:43 UTC (permalink / raw)
  To: Wei Liu
  Cc: Linux on Hyper-V List, virtualization, Linux Kernel List,
	Michael Kelley, Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves,
	pasha.tatashin, Joerg Roedel, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	open list:IOMMU DRIVERS

On 04-08-2021 03:17, Wei Liu wrote:
>>> +static size_t hv_iommu_unmap(struct iommu_domain *d, unsigned long iova,
>>> +			   size_t size, struct iommu_iotlb_gather *gather)
>>> +{
>>> +	size_t unmapped;
>>> +	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
>>> +	unsigned long flags, npages;
>>> +	struct hv_input_unmap_device_gpa_pages *input;
>>> +	u64 status;
>>> +
>>> +	unmapped = hv_iommu_del_mappings(domain, iova, size);
>>> +	if (unmapped < size)
>>> +		return 0;
>> Is there a case where unmapped > 0 && unmapped < size ?
>>
> There could be such a case -- hv_iommu_del_mappings' return value is >= 0.
> Is there a problem with this predicate?

What I understand, if we are unmapping and return 0, means nothing was unmapped, and will that not cause any corruption or illegal access of unmapped memory later?
From __iommu_unmap
...
    13         while (unmapped < size) {
    12                 size_t pgsize = iommu_pgsize(domain, iova, size - unmapped);
    11
    10                 unmapped_page = ops->unmap(domain, iova, pgsize, iotlb_gather);
     9                 if (!unmapped_page)
     8                         break;		<<< we just break here, thinking there is nothing unmapped, but actually hv_iommu_del_mappings has removed some pages.
     7
     6                 pr_debug("unmapped: iova 0x%lx size 0x%zx\n",
     5                         ¦iova, unmapped_page);
     4
     3                 iova += unmapped_page;
     2                 unmapped += unmapped_page;
     1         }
...

Am I missing something ?

Regards,

~Praveen.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU
  2021-08-03 21:56     ` Wei Liu
@ 2021-08-04  7:03       ` Praveen Kumar
  2021-08-10 10:04         ` Wei Liu
  0 siblings, 1 reply; 35+ messages in thread
From: Praveen Kumar @ 2021-08-04  7:03 UTC (permalink / raw)
  To: Wei Liu
  Cc: Linux on Hyper-V List, virtualization, Linux Kernel List,
	Michael Kelley, Vineeth Pillai, Sunil Muthuswamy, Nuno Das Neves,
	pasha.tatashin, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Dexuan Cui, Joerg Roedel, Will Deacon,
	open list:IOMMU DRIVERS

On 04-08-2021 03:26, Wei Liu wrote:
>>>  	struct iommu_domain domain;
>>> @@ -774,6 +784,41 @@ static struct iommu_device *hv_iommu_probe_device(struct device *dev)
>>>  	if (!dev_is_pci(dev))
>>>  		return ERR_PTR(-ENODEV);
>>>  
>>> +	/*
>>> +	 * Skip the PCI device specified in `pci_devs_to_skip`. This is a
>>> +	 * temporary solution until we figure out a way to extract information
>>> +	 * from the hypervisor what devices it is already using.
>>> +	 */
>>> +	if (pci_devs_to_skip && *pci_devs_to_skip) {
>>> +		int pos = 0;
>>> +		int parsed;
>>> +		int segment, bus, slot, func;
>>> +		struct pci_dev *pdev = to_pci_dev(dev);
>>> +
>>> +		do {
>>> +			parsed = 0;
>>> +
>>> +			sscanf(pci_devs_to_skip + pos,
>>> +				" (%x:%x:%x.%x) %n",
>>> +				&segment, &bus, &slot, &func, &parsed);
>>> +
>>> +			if (parsed <= 0)
>>> +				break;
>>> +
>>> +			if (pci_domain_nr(pdev->bus) == segment &&
>>> +				pdev->bus->number == bus &&
>>> +				PCI_SLOT(pdev->devfn) == slot &&
>>> +				PCI_FUNC(pdev->devfn) == func)
>>> +			{
>>> +				dev_info(dev, "skipped by MSHV IOMMU\n");
>>> +				return ERR_PTR(-ENODEV);
>>> +			}
>>> +
>>> +			pos += parsed;
>>> +
>>> +		} while (pci_devs_to_skip[pos]);
>>
>> Is there a possibility of pci_devs_to_skip + pos > sizeof(pci_devs_to_skip)
>> and also a valid memory ?
> 
> pos should point to the last parsed position. If parsing fails pos does
> not get updated and the code breaks out of the loop. If parsing is
> success pos should point to either the start of next element of '\0'
> (end of string). To me this is good enough.

The point is, hypothetically the address to pci_devs_to_skip + pos can be valid address (later to '\0'), and thus there is a possibility, that parsing may not fail.
Another, there is also a possibility of sscanf faulting accessing the illegal address, if pci_devs_to_skip[pos] turns out to be not NULL or valid address.

> 
>> I would recommend to have a check of size as well before accessing the
>> array content, just to be safer accessing any memory.
>>
> 
> What check do you have in mind?

Something like,
size_t len = strlen(pci_devs_to_skip);
do {

	len -= parsed;
} while (len);

OR

do {
...
	pos += parsed;
} while (pos < len);

Further, I'm also fine with the existing code, if you think this won't break and already been taken care. Thanks.

Regards,

~Praveen.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU
  2021-08-04  7:03       ` Praveen Kumar
@ 2021-08-10 10:04         ` Wei Liu
  0 siblings, 0 replies; 35+ messages in thread
From: Wei Liu @ 2021-08-10 10:04 UTC (permalink / raw)
  To: Praveen Kumar
  Cc: Wei Liu, Linux on Hyper-V List, virtualization,
	Linux Kernel List, Michael Kelley, Vineeth Pillai,
	Sunil Muthuswamy, Nuno Das Neves, pasha.tatashin,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	Joerg Roedel, Will Deacon, open list:IOMMU DRIVERS

On Wed, Aug 04, 2021 at 12:33:54PM +0530, Praveen Kumar wrote:
> On 04-08-2021 03:26, Wei Liu wrote:
> >>>  	struct iommu_domain domain;
> >>> @@ -774,6 +784,41 @@ static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> >>>  	if (!dev_is_pci(dev))
> >>>  		return ERR_PTR(-ENODEV);
> >>>  
> >>> +	/*
> >>> +	 * Skip the PCI device specified in `pci_devs_to_skip`. This is a
> >>> +	 * temporary solution until we figure out a way to extract information
> >>> +	 * from the hypervisor what devices it is already using.
> >>> +	 */
> >>> +	if (pci_devs_to_skip && *pci_devs_to_skip) {
> >>> +		int pos = 0;
> >>> +		int parsed;
> >>> +		int segment, bus, slot, func;
> >>> +		struct pci_dev *pdev = to_pci_dev(dev);
> >>> +
> >>> +		do {
> >>> +			parsed = 0;
> >>> +
> >>> +			sscanf(pci_devs_to_skip + pos,
> >>> +				" (%x:%x:%x.%x) %n",
> >>> +				&segment, &bus, &slot, &func, &parsed);
> >>> +
> >>> +			if (parsed <= 0)
> >>> +				break;
> >>> +
> >>> +			if (pci_domain_nr(pdev->bus) == segment &&
> >>> +				pdev->bus->number == bus &&
> >>> +				PCI_SLOT(pdev->devfn) == slot &&
> >>> +				PCI_FUNC(pdev->devfn) == func)
> >>> +			{
> >>> +				dev_info(dev, "skipped by MSHV IOMMU\n");
> >>> +				return ERR_PTR(-ENODEV);
> >>> +			}
> >>> +
> >>> +			pos += parsed;
> >>> +
> >>> +		} while (pci_devs_to_skip[pos]);
> >>
> >> Is there a possibility of pci_devs_to_skip + pos > sizeof(pci_devs_to_skip)
> >> and also a valid memory ?
> > 
> > pos should point to the last parsed position. If parsing fails pos does
> > not get updated and the code breaks out of the loop. If parsing is
> > success pos should point to either the start of next element of '\0'
> > (end of string). To me this is good enough.
> 
> The point is, hypothetically the address to pci_devs_to_skip + pos can
> be valid address (later to '\0'), and thus there is a possibility,
> that parsing may not fail.

Have you found an example how at any given point in time
pci_devs_to_skip + pos can point outside of user provided string?

> Another, there is also a possibility of sscanf faulting accessing the
> illegal address, if pci_devs_to_skip[pos] turns out to be not NULL or
> valid address.

That depends on pci_devs_to_skip + pos can point to an invalid address
in the first place, so that goes back to the question above.

> 
> > 
> >> I would recommend to have a check of size as well before accessing the
> >> array content, just to be safer accessing any memory.
> >>
> > 
> > What check do you have in mind?
> 
> Something like,
> size_t len = strlen(pci_devs_to_skip);
> do {
> 
> 	len -= parsed;
> } while (len);
> 
> OR
> 
> do {
> ...
> 	pos += parsed;
> } while (pos < len);
> 
> Further, I'm also fine with the existing code, if you think this won't
> break and already been taken care. Thanks.

But in the loop somewhere you will still need to parse pci_devs_to_skip
+ some_offset. The new code structure does not remove that, right?

Given this is for debugging and is supposed to be temporary, I think the
code is good enough. But I want to make sure if there is anything I
missed.

Wei.

> 
> Regards,
> 
> ~Praveen.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 5/8] mshv: add paravirtualized IOMMU support
  2021-08-04  6:43       ` Praveen Kumar
@ 2021-08-10 10:46         ` Wei Liu
  0 siblings, 0 replies; 35+ messages in thread
From: Wei Liu @ 2021-08-10 10:46 UTC (permalink / raw)
  To: Praveen Kumar
  Cc: Wei Liu, Linux on Hyper-V List, virtualization,
	Linux Kernel List, Michael Kelley, Vineeth Pillai,
	Sunil Muthuswamy, Nuno Das Neves, pasha.tatashin, Joerg Roedel,
	Will Deacon, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Dexuan Cui, open list:IOMMU DRIVERS

On Wed, Aug 04, 2021 at 12:13:42PM +0530, Praveen Kumar wrote:
> On 04-08-2021 03:17, Wei Liu wrote:
> >>> +static size_t hv_iommu_unmap(struct iommu_domain *d, unsigned long iova,
> >>> +			   size_t size, struct iommu_iotlb_gather *gather)
> >>> +{
> >>> +	size_t unmapped;
> >>> +	struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> >>> +	unsigned long flags, npages;
> >>> +	struct hv_input_unmap_device_gpa_pages *input;
> >>> +	u64 status;
> >>> +
> >>> +	unmapped = hv_iommu_del_mappings(domain, iova, size);
> >>> +	if (unmapped < size)
> >>> +		return 0;
> >> Is there a case where unmapped > 0 && unmapped < size ?
> >>
> > There could be such a case -- hv_iommu_del_mappings' return value is >= 0.
> > Is there a problem with this predicate?
> 
> What I understand, if we are unmapping and return 0, means nothing was
> unmapped, and will that not cause any corruption or illegal access of
> unmapped memory later?  From __iommu_unmap

Those pages are not really unmapped. The hypercall is skipped.

> ...
>     13         while (unmapped < size) {
>     12                 size_t pgsize = iommu_pgsize(domain, iova, size - unmapped);
>     11
>     10                 unmapped_page = ops->unmap(domain, iova, pgsize, iotlb_gather);
>      9                 if (!unmapped_page)
>      8                         break;		<<< we just break here, thinking there is nothing unmapped, but actually hv_iommu_del_mappings has removed some pages.
>      7
>      6                 pr_debug("unmapped: iova 0x%lx size 0x%zx\n",
>      5                         ¦iova, unmapped_page);
>      4
>      3                 iova += unmapped_page;
>      2                 unmapped += unmapped_page;
>      1         }
> ...
> 
> Am I missing something ?
> 
> Regards,
> 
> ~Praveen.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v1 8/8] mshv: add vfio bridge device
  2021-08-03 19:27   ` Praveen Kumar
@ 2021-08-10 10:52     ` Wei Liu
  0 siblings, 0 replies; 35+ messages in thread
From: Wei Liu @ 2021-08-10 10:52 UTC (permalink / raw)
  To: Praveen Kumar
  Cc: Wei Liu, Linux on Hyper-V List, virtualization,
	Linux Kernel List, Michael Kelley, Vineeth Pillai,
	Sunil Muthuswamy, Nuno Das Neves, pasha.tatashin,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Dexuan Cui

On Wed, Aug 04, 2021 at 12:57:03AM +0530, Praveen Kumar wrote:
> On 09-07-2021 17:13, Wei Liu wrote:
> > +
> > +static int mshv_vfio_set_group(struct mshv_device *dev, long attr, u64 arg)
> > +{
> > +	struct mshv_vfio *mv = dev->private;
> > +	struct vfio_group *vfio_group;
> > +	struct mshv_vfio_group *mvg;
> > +	int32_t __user *argp = (int32_t __user *)(unsigned long)arg;
> > +	struct fd f;
> > +	int32_t fd;
> > +	int ret;
> > +
> > +	switch (attr) {
> > +	case MSHV_DEV_VFIO_GROUP_ADD:
> > +		if (get_user(fd, argp))
> > +			return -EFAULT;
> > +
> > +		f = fdget(fd);
> > +		if (!f.file)
> > +			return -EBADF;
> > +
> > +		vfio_group = mshv_vfio_group_get_external_user(f.file);
> > +		fdput(f);
> > +
> > +		if (IS_ERR(vfio_group))
> > +			return PTR_ERR(vfio_group);
> > +
> > +		mutex_lock(&mv->lock);
> > +
> > +		list_for_each_entry(mvg, &mv->group_list, node) {
> > +			if (mvg->vfio_group == vfio_group) {
> > +				mutex_unlock(&mv->lock);
> > +				mshv_vfio_group_put_external_user(vfio_group);
> > +				return -EEXIST;
> > +			}
> > +		}
> > +
> > +		mvg = kzalloc(sizeof(*mvg), GFP_KERNEL_ACCOUNT);
> > +		if (!mvg) {
> > +			mutex_unlock(&mv->lock);
> > +			mshv_vfio_group_put_external_user(vfio_group);
> > +			return -ENOMEM;
> > +		}
> > +
> > +		list_add_tail(&mvg->node, &mv->group_list);
> > +		mvg->vfio_group = vfio_group;
> > +
> > +		mutex_unlock(&mv->lock);
> > +
> > +		return 0;
> > +
> > +	case MSHV_DEV_VFIO_GROUP_DEL:
> > +		if (get_user(fd, argp))
> > +			return -EFAULT;
> > +
> > +		f = fdget(fd);
> > +		if (!f.file)
> > +			return -EBADF;
> 
> Can we move these both checks above switch statement and do fdput
> accordingly under both case statement accordingly?

Fair point. This can be done, albeit at the cost of having a rather
different code structure.

I was waiting to see if we should somehow merge this with KVM's
implementation so the code was deliberately kept close. If there is no
further comment I can of course make the change you suggested.

> 
> > +
> > +		ret = -ENOENT;
> > +
> > +		mutex_lock(&mv->lock);
> > +
> > +		list_for_each_entry(mvg, &mv->group_list, node) {
> > +			if (!mshv_vfio_external_group_match_file(mvg->vfio_group,
> > +								 f.file))
> > +				continue;
> > +
> > +			list_del(&mvg->node);
> > +			mshv_vfio_group_put_external_user(mvg->vfio_group);
> > +			kfree(mvg);
> > +			ret = 0;
> > +			break;
> > +		}
> > +
> > +		mutex_unlock(&mv->lock);
> > +
> > +		fdput(f);
> > +
> > +		return ret;
> > +	}
> > +
> > +	return -ENXIO;
> > +}
> > +
> > +static int mshv_vfio_set_attr(struct mshv_device *dev,
> > +			      struct mshv_device_attr *attr)
> > +{
> > +	switch (attr->group) {
> > +	case MSHV_DEV_VFIO_GROUP:
> > +		return mshv_vfio_set_group(dev, attr->attr, attr->addr);
> > +	}
> > +
> > +	return -ENXIO;
> > +}
> > +
> > +static int mshv_vfio_has_attr(struct mshv_device *dev,
> > +			      struct mshv_device_attr *attr)
> > +{
> > +	switch (attr->group) {
> > +	case MSHV_DEV_VFIO_GROUP:
> > +		switch (attr->attr) {
> > +		case MSHV_DEV_VFIO_GROUP_ADD:
> > +		case MSHV_DEV_VFIO_GROUP_DEL:
> > +			return 0;
> > +		}
> > +
> > +		break;
> 
> do we need this break statement ? If not, lets remove it.

Will do.

Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2021-08-10 10:52 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-09 11:43 [RFC v1 0/8] MSHV: add PV-IOMMU driver Wei Liu
2021-07-09 11:43 ` [RFC v1 1/8] x86/hyperv: export hv_build_pci_dev_id Wei Liu
2021-07-09 11:43 ` [RFC v1 2/8] asm-generic/hyperv: add device domain definitions Wei Liu
2021-07-09 11:43 ` [RFC v1 3/8] intel/vt-d: make DMAR table parsing code more flexible Wei Liu
2021-07-09 12:56   ` Robin Murphy
2021-07-09 13:42     ` Wei Liu
2021-07-09 11:43 ` [RFC v1 4/8] intel/vt-d: export intel_iommu_get_resv_regions Wei Liu
2021-07-09 14:17   ` Lu Baolu
2021-07-09 14:21     ` Wei Liu
2021-07-09 11:43 ` [RFC v1 5/8] mshv: add paravirtualized IOMMU support Wei Liu
2021-08-03 18:40   ` Praveen Kumar
2021-08-03 21:47     ` Wei Liu
2021-08-04  6:43       ` Praveen Kumar
2021-08-10 10:46         ` Wei Liu
2021-07-09 11:43 ` [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU Wei Liu
2021-07-09 12:46   ` Robin Murphy
2021-07-09 13:34     ` Wei Liu
2021-08-03 18:50   ` Praveen Kumar
2021-08-03 21:56     ` Wei Liu
2021-08-04  7:03       ` Praveen Kumar
2021-08-10 10:04         ` Wei Liu
2021-07-09 11:43 ` [RFC v1 7/8] mshv: implement in-kernel device framework Wei Liu
2021-07-09 13:02   ` Matthew Wilcox
2021-07-09 13:50     ` Wei Liu
2021-07-09 15:32       ` Matthew Wilcox
2021-07-09 16:27         ` Wei Liu
2021-07-09 16:38           ` Matthew Wilcox
2021-07-09 19:14             ` Wei Liu
2021-07-09 19:48               ` Matthew Wilcox
2021-07-09 20:11                 ` Wei Liu
2021-08-03 19:12   ` Praveen Kumar
2021-08-03 22:04     ` Wei Liu
2021-07-09 11:43 ` [RFC v1 8/8] mshv: add vfio bridge device Wei Liu
2021-08-03 19:27   ` Praveen Kumar
2021-08-10 10:52     ` Wei Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).