linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible
@ 2023-04-05 18:01 ankita
  2023-04-05 18:01 ` [PATCH v3 1/6] kvm: determine memory type from VMA ankita
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: ankita @ 2023-04-05 18:01 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, naoya.horiguchi, maz, oliver.upton
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

From: Ankit Agrawal <ankita@nvidia.com>

NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
for the on-chip GPU that is the logical OS representation of the
internal propritary cache coherent interconnect.

This representation has a number of limitations compared to a real PCI
device, in particular, it does not model the coherent GPU memory
aperture as a PCI config space BAR, and PCI doesn't know anything
about cacheable memory types.

Provide a VFIO PCI variant driver that adapts the unique PCI
representation into a more standard PCI representation facing
userspace. The GPU memory aperture is obtained from ACPI, according to
the FW specification, and exported to userspace as the VFIO_REGION
that covers the first PCI BAR. qemu will naturally generate a PCI
device in the VM where the cacheable aperture is reported in BAR1.

Since this memory region is actually cache coherent with the CPU, the
VFIO variant driver will mmap it into VMA using a cacheable mapping.

As this is the first time an ARM environment has placed cacheable
non-struct page backed memory (eg from remap_pfn_range) into a KVM
page table, fix a bug in ARM KVM where it does not copy the cacheable
memory attributes from non-struct page backed PTEs to ensure the guest
also gets a cacheable mapping.

Finally, the cacheable memory can participate in memory failure
handling. ECC failures on this memory will trigger the normal ARM
mechanism to get into memory-failure.c. Since this memory is not
backed by struct page create a mechanism to route the memory-failure's
physical address to the VMA owner so that a SIGBUS can be generated
toward the correct process. This works with the existing KVM/qemu
handling for memory failure reporting toward a guest.

This goes along with a qemu series to provides the necessary
implementation of the Grace Hopper Superchip firmware specification so
that the guest operating system can see the correct ACPI modeling for
the coherent GPU device.
https://github.com/qemu/qemu/compare/master...ankita-nv:qemu:dev-ankit/cohmem-0330

Applied and tested over v6.3-rc4.

Ankit Agrawal (6):
  kvm: determine memory type from VMA
  vfio/nvgpu: expose GPU device memory as BAR1
  mm: handle poisoning of pfn without struct pages
  mm: Add poison error check in fixup_user_fault() for mapped PFN
  mm: Change ghes code to allow poison of non-struct PFN
  vfio/nvgpu: register device memory for poison handling

 MAINTAINERS                          |   6 +
 arch/arm64/include/asm/kvm_pgtable.h |   8 +-
 arch/arm64/include/asm/memory.h      |   6 +-
 arch/arm64/kvm/hyp/pgtable.c         |  16 +-
 arch/arm64/kvm/mmu.c                 |  27 +-
 drivers/acpi/apei/ghes.c             |  12 +-
 drivers/vfio/pci/Kconfig             |   2 +
 drivers/vfio/pci/Makefile            |   2 +
 drivers/vfio/pci/nvgpu/Kconfig       |  10 +
 drivers/vfio/pci/nvgpu/Makefile      |   3 +
 drivers/vfio/pci/nvgpu/main.c        | 359 +++++++++++++++++++++++++++
 include/linux/memory-failure.h       |  22 ++
 include/linux/mm.h                   |   1 +
 include/ras/ras_event.h              |   1 +
 mm/gup.c                             |   2 +-
 mm/memory-failure.c                  | 148 +++++++++--
 virt/kvm/kvm_main.c                  |   6 +
 17 files changed, 586 insertions(+), 45 deletions(-)
 create mode 100644 drivers/vfio/pci/nvgpu/Kconfig
 create mode 100644 drivers/vfio/pci/nvgpu/Makefile
 create mode 100644 drivers/vfio/pci/nvgpu/main.c
 create mode 100644 include/linux/memory-failure.h

-- 
2.17.1



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 1/6] kvm: determine memory type from VMA
  2023-04-05 18:01 [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible ankita
@ 2023-04-05 18:01 ` ankita
  2023-04-12 12:43   ` Marc Zyngier
  2023-04-05 18:01 ` [PATCH v3 2/6] vfio/nvgpu: expose GPU device memory as BAR1 ankita
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: ankita @ 2023-04-05 18:01 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, naoya.horiguchi, maz, oliver.upton
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

From: Ankit Agrawal <ankita@nvidia.com>

Each VM stores the requires pgprots for its mappings in the
vma->pgprot. Based on this we can determine the desired MT_DEVICE_*
for the VMA directly, and do not have to guess based on heuristics
based on pfn_is_map_memory().

There are the following kinds of pgprot available to userspace and their
corresponding type:
pgprot_noncached -> MT_DEVICE_nGnRnE
pgprot_writecombine -> MT_NORMAL_NC
pgprot_device -> MT_DEVICE_nGnRE
pgprot_tagged -> MT_NORMAL_TAGGED

Decode the relevant MT_* types in use and translate them into the
corresponding KVM_PGTABLEPROT_*:

 - MT_DEVICE_nGnRE -> KVM_PGTABLE_PROT_DEVICE_nGnRE (device)
 - MT_DEVICE_nGnRnE -> KVM_PGTABLE_PROT_DEVICE_nGnRnE (noncached)
 - MT_NORMAL/_TAGGED/_NC -> 0

The selection of 0 for the S2 KVM_PGTABLE_PROT_DEVICE_nGnRnE is based
on [2].

Also worth noting is the result of the stage-1 and stage-2. Ref [3]
If FWB not set, then the combination is the one that is more restrictive.
The sequence from lowest restriction to the highest:
DEVICE_nGnRnE -> DEVICE_nGnRE -> NORMAL/_TAGGED/_NC
If FWB is set, then stage-2 mapping type overrides the stage-1 [1].

This solves a problem where KVM cannot preserve the MT_NORMAL memory
type for non-struct page backed memory into the S2 mapping. Instead
the VMA creator determines the MT type and the S2 will follow it.

[1] https://developer.arm.com/documentation/102376/0100/Combining-Stage-1-and-Stage-2-attributes
[2] ARMv8 reference manual: https://developer.arm.com/documentation/ddi0487/gb/ Section D5.5.3, Table D5-38
[3] ARMv8 reference manual: https://developer.arm.com/documentation/ddi0487/gb/ Table G5-20 on page G5-6330

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  8 +++++---
 arch/arm64/include/asm/memory.h      |  6 ++++--
 arch/arm64/kvm/hyp/pgtable.c         | 16 +++++++++++-----
 arch/arm64/kvm/mmu.c                 | 27 ++++++++++++++++++++++-----
 4 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 4cd6762bda80..d3166b6e6329 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -150,7 +150,8 @@ enum kvm_pgtable_stage2_flags {
  * @KVM_PGTABLE_PROT_X:		Execute permission.
  * @KVM_PGTABLE_PROT_W:		Write permission.
  * @KVM_PGTABLE_PROT_R:		Read permission.
- * @KVM_PGTABLE_PROT_DEVICE:	Device attributes.
+ * @KVM_PGTABLE_PROT_DEVICE_nGnRE:	Device nGnRE attributes.
+ * @KVM_PGTABLE_PROT_DEVICE_nGnRnE:	Device nGnRnE attributes.
  * @KVM_PGTABLE_PROT_SW0:	Software bit 0.
  * @KVM_PGTABLE_PROT_SW1:	Software bit 1.
  * @KVM_PGTABLE_PROT_SW2:	Software bit 2.
@@ -161,7 +162,8 @@ enum kvm_pgtable_prot {
 	KVM_PGTABLE_PROT_W			= BIT(1),
 	KVM_PGTABLE_PROT_R			= BIT(2),
 
-	KVM_PGTABLE_PROT_DEVICE			= BIT(3),
+	KVM_PGTABLE_PROT_DEVICE_nGnRE		= BIT(3),
+	KVM_PGTABLE_PROT_DEVICE_nGnRnE		= BIT(4),
 
 	KVM_PGTABLE_PROT_SW0			= BIT(55),
 	KVM_PGTABLE_PROT_SW1			= BIT(56),
@@ -178,7 +180,7 @@ enum kvm_pgtable_prot {
 #define PAGE_HYP		KVM_PGTABLE_PROT_RW
 #define PAGE_HYP_EXEC		(KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_X)
 #define PAGE_HYP_RO		(KVM_PGTABLE_PROT_R)
-#define PAGE_HYP_DEVICE		(PAGE_HYP | KVM_PGTABLE_PROT_DEVICE)
+#define PAGE_HYP_DEVICE		(PAGE_HYP | KVM_PGTABLE_PROT_DEVICE_nGnRE)
 
 typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
 					   enum kvm_pgtable_prot prot);
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 78e5163836a0..4ebbc4b1ba4d 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -147,14 +147,16 @@
  * Memory types for Stage-2 translation
  */
 #define MT_S2_NORMAL		0xf
+#define MT_S2_DEVICE_nGnRnE 0x0
 #define MT_S2_DEVICE_nGnRE	0x1
 
 /*
  * Memory types for Stage-2 translation when ID_AA64MMFR2_EL1.FWB is 0001
  * Stage-2 enforces Normal-WB and Device-nGnRE
  */
-#define MT_S2_FWB_NORMAL	6
-#define MT_S2_FWB_DEVICE_nGnRE	1
+#define MT_S2_FWB_NORMAL	0x6
+#define MT_S2_FWB_DEVICE_nGnRnE 0x0
+#define MT_S2_FWB_DEVICE_nGnRE	0x1
 
 #ifdef CONFIG_ARM64_4K_PAGES
 #define IOREMAP_MAX_ORDER	(PUD_SHIFT)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3d61bd3e591d..7a8238b41590 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -355,7 +355,7 @@ struct hyp_map_data {
 
 static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
 {
-	bool device = prot & KVM_PGTABLE_PROT_DEVICE;
+	bool device = prot & KVM_PGTABLE_PROT_DEVICE_nGnRE;
 	u32 mtype = device ? MT_DEVICE_nGnRE : MT_NORMAL;
 	kvm_pte_t attr = FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX, mtype);
 	u32 sh = KVM_PTE_LEAF_ATTR_LO_S1_SH_IS;
@@ -636,14 +636,20 @@ static bool stage2_has_fwb(struct kvm_pgtable *pgt)
 static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot prot,
 				kvm_pte_t *ptep)
 {
-	bool device = prot & KVM_PGTABLE_PROT_DEVICE;
-	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) :
-			    KVM_S2_MEMATTR(pgt, NORMAL);
 	u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;
+	kvm_pte_t attr;
+
+	if (prot & KVM_PGTABLE_PROT_DEVICE_nGnRE)
+		attr = KVM_S2_MEMATTR(pgt, DEVICE_nGnRE);
+	else if (prot & KVM_PGTABLE_PROT_DEVICE_nGnRnE)
+		attr = KVM_S2_MEMATTR(pgt, DEVICE_nGnRnE);
+	else
+		attr = KVM_S2_MEMATTR(pgt, NORMAL);
 
 	if (!(prot & KVM_PGTABLE_PROT_X))
 		attr |= KVM_PTE_LEAF_ATTR_HI_S2_XN;
-	else if (device)
+	else if (prot & KVM_PGTABLE_PROT_DEVICE_nGnRE ||
+		 prot & KVM_PGTABLE_PROT_DEVICE_nGnRnE)
 		return -EINVAL;
 
 	if (prot & KVM_PGTABLE_PROT_R)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7113587222ff..8d63aa951c33 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -897,7 +897,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 	int ret = 0;
 	struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
 	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
-	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
+	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE_nGnRE |
 				     KVM_PGTABLE_PROT_R |
 				     (writable ? KVM_PGTABLE_PROT_W : 0);
 
@@ -1186,6 +1186,15 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_MTE_ALLOWED;
 }
 
+/*
+ * Determine the memory region cacheability from VMA's pgprot. This
+ * is used to set the stage 2 PTEs.
+ */
+static unsigned long mapping_type(pgprot_t page_prot)
+{
+	return ((pgprot_val(page_prot) & PTE_ATTRINDX_MASK) >> 2);
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  unsigned long fault_status)
@@ -1368,10 +1377,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (exec_fault)
 		prot |= KVM_PGTABLE_PROT_X;
 
-	if (device)
-		prot |= KVM_PGTABLE_PROT_DEVICE;
-	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
-		prot |= KVM_PGTABLE_PROT_X;
+	switch (mapping_type(vma->vm_page_prot)) {
+	case MT_DEVICE_nGnRE:
+		prot |= KVM_PGTABLE_PROT_DEVICE_nGnRE;
+		break;
+	case MT_DEVICE_nGnRnE:
+		prot |= KVM_PGTABLE_PROT_DEVICE_nGnRnE;
+		break;
+		/* MT_NORMAL/_TAGGED/_NC */
+	default:
+		if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
+			prot |= KVM_PGTABLE_PROT_X;
+	}
 
 	/*
 	 * Under the premise of getting a FSC_PERM fault, we just need to relax
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 2/6] vfio/nvgpu: expose GPU device memory as BAR1
  2023-04-05 18:01 [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible ankita
  2023-04-05 18:01 ` [PATCH v3 1/6] kvm: determine memory type from VMA ankita
@ 2023-04-05 18:01 ` ankita
  2023-04-05 21:07   ` kernel test robot
  2023-04-05 18:01 ` [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages ankita
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: ankita @ 2023-04-05 18:01 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, naoya.horiguchi, maz, oliver.upton
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

From: Ankit Agrawal <ankita@nvidia.com>

The NVIDIA Grace Hopper superchip does not model the coherent GPU memory
aperture as a PCI config space BAR.

Introduce an in-tree VFIO PCI variant module (nvgpu-vfio-pci) to expose
the GPU memory as BAR1 to the userspace. The GPU memory size and physical
address are obtained from ACPI using device_property_read_u64() and
exported to userspace as the VFIO_REGION. QEMU will naturally generate a
PCI device in the VM where the cachable aperture is reported in BAR1.

QEMU can fetch the region information and perform mapping on it. The
subsequent mmap call is handled by mmap() function pointer for the
nvgpu-vfio-pci module and mapping to the GPU memory is established
using the remap_pfn_range() API.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 MAINTAINERS                     |   6 +
 drivers/vfio/pci/Kconfig        |   2 +
 drivers/vfio/pci/Makefile       |   2 +
 drivers/vfio/pci/nvgpu/Kconfig  |  10 ++
 drivers/vfio/pci/nvgpu/Makefile |   3 +
 drivers/vfio/pci/nvgpu/main.c   | 255 ++++++++++++++++++++++++++++++++
 6 files changed, 278 insertions(+)
 create mode 100644 drivers/vfio/pci/nvgpu/Kconfig
 create mode 100644 drivers/vfio/pci/nvgpu/Makefile
 create mode 100644 drivers/vfio/pci/nvgpu/main.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1dc8bd26b6cf..6b48756c30d3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -21954,6 +21954,12 @@ L:	kvm@vger.kernel.org
 S:	Maintained
 F:	drivers/vfio/pci/mlx5/
 
+VFIO NVIDIA PCI DRIVER
+M:	Ankit Agrawal <ankita@nvidia.com>
+L:	kvm@vger.kernel.org
+S:	Maintained
+F:	drivers/vfio/pci/nvgpu/
+
 VGA_SWITCHEROO
 R:	Lukas Wunner <lukas@wunner.de>
 S:	Maintained
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index f9d0c908e738..ade18b0ffb7b 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -59,4 +59,6 @@ source "drivers/vfio/pci/mlx5/Kconfig"
 
 source "drivers/vfio/pci/hisilicon/Kconfig"
 
+source "drivers/vfio/pci/nvgpu/Kconfig"
+
 endif
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 24c524224da5..0c93d452d0da 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -11,3 +11,5 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
 
 obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
+
+obj-$(CONFIG_NVGPU_VFIO_PCI) += nvgpu/
diff --git a/drivers/vfio/pci/nvgpu/Kconfig b/drivers/vfio/pci/nvgpu/Kconfig
new file mode 100644
index 000000000000..066f764f7c5f
--- /dev/null
+++ b/drivers/vfio/pci/nvgpu/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config NVGPU_VFIO_PCI
+	tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
+	depends on ARM64 || (COMPILE_TEST && 64BIT)
+	select VFIO_PCI_CORE
+	help
+	  VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
+	  required to assign the GPU device to a VM using KVM/qemu/etc.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/nvgpu/Makefile b/drivers/vfio/pci/nvgpu/Makefile
new file mode 100644
index 000000000000..00fd3a078218
--- /dev/null
+++ b/drivers/vfio/pci/nvgpu/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_NVGPU_VFIO_PCI) += nvgpu-vfio-pci.o
+nvgpu-vfio-pci-y := main.o
diff --git a/drivers/vfio/pci/nvgpu/main.c b/drivers/vfio/pci/nvgpu/main.c
new file mode 100644
index 000000000000..2dd8cc6e0145
--- /dev/null
+++ b/drivers/vfio/pci/nvgpu/main.c
@@ -0,0 +1,255 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/pci.h>
+#include <linux/vfio_pci_core.h>
+
+#define DUMMY_PFN \
+	(((nvdev->mem_prop.hpa + nvdev->mem_prop.mem_length) >> PAGE_SHIFT) - 1)
+
+struct dev_mem_properties {
+	uint64_t hpa;
+	uint64_t mem_length;
+	int bar1_start_offset;
+};
+
+struct nvgpu_vfio_pci_core_device {
+	struct vfio_pci_core_device core_device;
+	struct dev_mem_properties mem_prop;
+};
+
+static int vfio_get_bar1_start_offset(struct vfio_pci_core_device *vdev)
+{
+	u8 val = 0;
+
+	pci_read_config_byte(vdev->pdev, 0x10, &val);
+	/*
+	 * The BAR1 start offset in the PCI config space depends on the BAR0size.
+	 * Check if the BAR0 is 64b and return the approproiate BAR1 offset.
+	 */
+	if (val & PCI_BASE_ADDRESS_MEM_TYPE_64)
+		return VFIO_PCI_BAR2_REGION_INDEX;
+
+	return VFIO_PCI_BAR1_REGION_INDEX;
+}
+
+static int nvgpu_vfio_pci_open_device(struct vfio_device *core_vdev)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
+		core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
+	struct vfio_pci_core_device *vdev =
+		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+	int ret;
+
+	ret = vfio_pci_core_enable(vdev);
+	if (ret)
+		return ret;
+
+	vfio_pci_core_finish_enable(vdev);
+
+	nvdev->mem_prop.bar1_start_offset = vfio_get_bar1_start_offset(vdev);
+
+	return ret;
+}
+
+int nvgpu_vfio_pci_mmap(struct vfio_device *core_vdev,
+			struct vm_area_struct *vma)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
+		core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
+
+	unsigned long start_pfn;
+	unsigned int index;
+	u64 req_len, pgoff;
+	int ret = 0;
+
+	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+	if (index != nvdev->mem_prop.bar1_start_offset)
+		return vfio_pci_core_mmap(core_vdev, vma);
+
+	/*
+	 * Request to mmap the BAR1. Map to the CPU accessible memory on the
+	 * GPU using the memory information gathered from the system ACPI
+	 * tables.
+	 */
+	start_pfn = nvdev->mem_prop.hpa >> PAGE_SHIFT;
+	req_len = vma->vm_end - vma->vm_start;
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	if (pgoff >= (nvdev->mem_prop.mem_length >> PAGE_SHIFT))
+		return -EINVAL;
+
+	/*
+	 * Perform a PFN map to the memory. The device BAR1 is backed by the
+	 * GPU memory now. Check that the mapping does not overflow out of
+	 * the GPU memory size.
+	 */
+	ret = remap_pfn_range(vma, vma->vm_start, start_pfn + pgoff,
+			      min(req_len, nvdev->mem_prop.mem_length - pgoff),
+			      vma->vm_page_prot);
+	if (ret)
+		return ret;
+
+	vma->vm_pgoff = start_pfn + pgoff;
+
+	return 0;
+}
+
+long nvgpu_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
+			  unsigned long arg)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
+		core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
+
+	unsigned long minsz = offsetofend(struct vfio_region_info, offset);
+	struct vfio_region_info info;
+
+	switch (cmd) {
+	case VFIO_DEVICE_GET_REGION_INFO:
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		if (info.index == nvdev->mem_prop.bar1_start_offset) {
+			/*
+			 * Request to determine the BAR1 region information. Send the
+			 * GPU memory information.
+			 */
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = nvdev->mem_prop.mem_length;
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE |
+				     VFIO_REGION_INFO_FLAG_MMAP;
+			return copy_to_user((void __user *)arg, &info, minsz) ?
+				       -EFAULT : 0;
+		}
+
+		if (info.index == nvdev->mem_prop.bar1_start_offset + 1) {
+			/*
+			 * The BAR1 region is 64b. Ignore this access.
+			 */
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = 0;
+			info.flags = 0;
+			return copy_to_user((void __user *)arg, &info, minsz) ?
+				-EFAULT : 0;
+		}
+
+		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
+
+	default:
+		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
+	}
+}
+
+static const struct vfio_device_ops nvgpu_vfio_pci_ops = {
+	.name = "nvgpu-vfio-pci",
+	.init = vfio_pci_core_init_dev,
+	.release = vfio_pci_core_release_dev,
+	.open_device = nvgpu_vfio_pci_open_device,
+	.close_device = vfio_pci_core_close_device,
+	.ioctl = nvgpu_vfio_pci_ioctl,
+	.read = vfio_pci_core_read,
+	.write = vfio_pci_core_write,
+	.mmap = nvgpu_vfio_pci_mmap,
+	.request = vfio_pci_core_request,
+	.match = vfio_pci_core_match,
+	.bind_iommufd = vfio_iommufd_physical_bind,
+	.unbind_iommufd = vfio_iommufd_physical_unbind,
+	.attach_ioas = vfio_iommufd_physical_attach_ioas,
+};
+
+static struct nvgpu_vfio_pci_core_device *nvgpu_drvdata(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+
+	return container_of(core_device, struct nvgpu_vfio_pci_core_device,
+			    core_device);
+}
+
+static int
+nvgpu_vfio_pci_fetch_memory_property(struct pci_dev *pdev,
+				     struct nvgpu_vfio_pci_core_device *nvdev)
+{
+	int ret = 0;
+
+	/*
+	 * The memory information is present in the system ACPI tables as DSD
+	 * properties nvidia,gpu-mem-base-pa and nvidia,gpu-mem-size.
+	 */
+	ret = device_property_read_u64(&(pdev->dev), "nvidia,gpu-mem-base-pa",
+				       &(nvdev->mem_prop.hpa));
+	if (ret)
+		return ret;
+
+	ret = device_property_read_u64(&(pdev->dev), "nvidia,gpu-mem-size",
+				       &(nvdev->mem_prop.mem_length));
+	return ret;
+}
+
+static int nvgpu_vfio_pci_probe(struct pci_dev *pdev,
+				const struct pci_device_id *id)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev;
+	int ret;
+
+	nvdev = vfio_alloc_device(nvgpu_vfio_pci_core_device, core_device.vdev,
+				  &pdev->dev, &nvgpu_vfio_pci_ops);
+	if (IS_ERR(nvdev))
+		return PTR_ERR(nvdev);
+
+	dev_set_drvdata(&pdev->dev, nvdev);
+
+	ret = nvgpu_vfio_pci_fetch_memory_property(pdev, nvdev);
+	if (ret)
+		goto out_put_vdev;
+
+	ret = vfio_pci_core_register_device(&nvdev->core_device);
+	if (ret)
+		goto out_put_vdev;
+
+	return ret;
+
+out_put_vdev:
+	vfio_put_device(&nvdev->core_device.vdev);
+	return ret;
+}
+
+static void nvgpu_vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev = nvgpu_drvdata(pdev);
+	struct vfio_pci_core_device *vdev = &nvdev->core_device;
+
+	vfio_pci_core_unregister_device(vdev);
+	vfio_put_device(&vdev->vdev);
+}
+
+static const struct pci_device_id nvgpu_vfio_pci_table[] = {
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2342) },
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2343) },
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2345) },
+	{}
+};
+
+MODULE_DEVICE_TABLE(pci, nvgpu_vfio_pci_table);
+
+static struct pci_driver nvgpu_vfio_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = nvgpu_vfio_pci_table,
+	.probe = nvgpu_vfio_pci_probe,
+	.remove = nvgpu_vfio_pci_remove,
+	.err_handler = &vfio_pci_core_err_handlers,
+	.driver_managed_dma = true,
+};
+
+module_pci_driver(nvgpu_vfio_pci_driver);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
+MODULE_AUTHOR("Aniket Agashe <aniketa@nvidia.com>");
+MODULE_DESCRIPTION(
+	"VFIO NVGPU PF - User Level driver for NVIDIA devices with CPU coherently accessible device memory");
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages
  2023-04-05 18:01 [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible ankita
  2023-04-05 18:01 ` [PATCH v3 1/6] kvm: determine memory type from VMA ankita
  2023-04-05 18:01 ` [PATCH v3 2/6] vfio/nvgpu: expose GPU device memory as BAR1 ankita
@ 2023-04-05 18:01 ` ankita
  2023-04-05 21:07   ` kernel test robot
  2023-05-09  9:51   ` HORIGUCHI NAOYA(堀口 直也)
  2023-04-05 18:01 ` [PATCH v3 4/6] mm: Add poison error check in fixup_user_fault() for mapped PFN ankita
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: ankita @ 2023-04-05 18:01 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, naoya.horiguchi, maz, oliver.upton
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

From: Ankit Agrawal <ankita@nvidia.com>

The kernel MM does not currently handle ECC errors / poison on a memory
region that is not backed by struct pages. In this series, mapping request
from QEMU to the device memory is executed using remap_pfn_range().
Hence added a new mechanism to handle memory failure on such memory.

Make kernel MM expose a function to allow modules managing the device
memory to register a failure function and the address space that is
associated with the device memory. MM maintains this information as
interval tree. The registered memory failure function is used by MM to
notify the module of the PFN, so that the module may take any required
action. The module for example may use the information to track the
poisoned pages.

In this implementation, kernel MM follows the following sequence (mostly)
similar to the memory_failure() handler for struct page backed memory:
1. memory_failure() is triggered on reception of a poison error. An
absence of struct page is detected and consequently memory_failure_pfn
is executed.
2. memory_failure_pfn() call the newly introduced failure handler exposed
by the module managing the poisoned memory to notify it of the problematic
PFN.
3. memory_failure_pfn() unmaps the stage-2 mapping to the PFN.
4. memory_failure_pfn() collects the processes mapped to the PFN.
5. memory_failure_pfn() sends SIGBUS (BUS_MCEERR_AO) to all the processes
mapping the faulty PFN using kill_procs().
6. An access to the faulty PFN by an operation in VM at a later point of
time is trapped and user_mem_abort() is called.
7. user_mem_abort() calls __gfn_to_pfn_memslot() on the PFN, and the
following execution path is followed: __gfn_to_pfn_memslot() ->
hva_to_pfn() -> hva_to_pfn_remapped() -> fixup_user_fault() ->
handle_mm_fault() -> handle_pte_fault() -> do_fault(). do_fault() is
expected to return VM_FAULT_HWPOISON on the PFN (it currently does not
and is fixed as part of another patch in the series).
8. __gfn_to_pfn_memslot() then returns KVM_PFN_ERR_HWPOISON, which cause
the poison with SIGBUS (BUS_MCEERR_AR) to be sent to the QEMU process
through kvm_send_hwpoison_signal().

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 include/linux/memory-failure.h |  22 +++++
 include/linux/mm.h             |   1 +
 include/ras/ras_event.h        |   1 +
 mm/memory-failure.c            | 148 +++++++++++++++++++++++++++++----
 4 files changed, 154 insertions(+), 18 deletions(-)
 create mode 100644 include/linux/memory-failure.h

diff --git a/include/linux/memory-failure.h b/include/linux/memory-failure.h
new file mode 100644
index 000000000000..9a579960972a
--- /dev/null
+++ b/include/linux/memory-failure.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_FAILURE_H
+#define _LINUX_MEMORY_FAILURE_H
+
+#include <linux/interval_tree.h>
+
+struct pfn_address_space;
+
+struct pfn_address_space_ops {
+	void (*failure)(struct pfn_address_space *pfn_space, unsigned long pfn);
+};
+
+struct pfn_address_space {
+	struct interval_tree_node node;
+	const struct pfn_address_space_ops *ops;
+	struct address_space *mapping;
+};
+
+int register_pfn_address_space(struct pfn_address_space *pfn_space);
+void unregister_pfn_address_space(struct pfn_address_space *pfn_space);
+
+#endif /* _LINUX_MEMORY_FAILURE_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f79667824eb..e3ef52d3d45a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3530,6 +3530,7 @@ enum mf_action_page_type {
 	MF_MSG_BUDDY,
 	MF_MSG_DAX,
 	MF_MSG_UNSPLIT_THP,
+	MF_MSG_PFN,
 	MF_MSG_UNKNOWN,
 };
 
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index cbd3ddd7c33d..5c62a4d17172 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -373,6 +373,7 @@ TRACE_EVENT(aer_event,
 	EM ( MF_MSG_BUDDY, "free buddy page" )				\
 	EM ( MF_MSG_DAX, "dax page" )					\
 	EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" )			\
+	EM ( MF_MSG_PFN, "non struct page pfn" )					\
 	EMe ( MF_MSG_UNKNOWN, "unknown page" )
 
 /*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fae9baf3be16..2c1a2ec42f7b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -38,6 +38,7 @@
 
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/memory-failure.h>
 #include <linux/page-flags.h>
 #include <linux/kernel-page-flags.h>
 #include <linux/sched/signal.h>
@@ -62,6 +63,7 @@
 #include <linux/page-isolation.h>
 #include <linux/pagewalk.h>
 #include <linux/shmem_fs.h>
+#include <linux/pfn_t.h>
 #include "swap.h"
 #include "internal.h"
 #include "ras/ras_event.h"
@@ -122,6 +124,10 @@ const struct attribute_group memory_failure_attr_group = {
 	.attrs = memory_failure_attr,
 };
 
+static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
+
+static DEFINE_MUTEX(pfn_space_lock);
+
 /*
  * Return values:
  *   1:   the page is dissolved (if needed) and taken off from buddy,
@@ -399,15 +405,14 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
  * Schedule a process for later kill.
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
  *
- * Note: @fsdax_pgoff is used only when @p is a fsdax page and a
- * filesystem with a memory failure handler has claimed the
- * memory_failure event. In all other cases, page->index and
- * page->mapping are sufficient for mapping the page back to its
+ * Notice: @pgoff is used either when @p is a fsdax page or a PFN is not
+ * backed by struct page and a filesystem with a memory failure handler
+ * has claimed the memory_failure event. In all other cases, page->index
+ * and page->mapping are sufficient for mapping the page back to its
  * corresponding user virtual address.
  */
-static void add_to_kill(struct task_struct *tsk, struct page *p,
-			pgoff_t fsdax_pgoff, struct vm_area_struct *vma,
-			struct list_head *to_kill)
+static void add_to_kill(struct task_struct *tsk, struct page *p, pgoff_t pgoff,
+			struct vm_area_struct *vma, struct list_head *to_kill)
 {
 	struct to_kill *tk;
 
@@ -417,13 +422,20 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 		return;
 	}
 
-	tk->addr = page_address_in_vma(p, vma);
-	if (is_zone_device_page(p)) {
-		if (fsdax_pgoff != FSDAX_INVALID_PGOFF)
-			tk->addr = vma_pgoff_address(fsdax_pgoff, 1, vma);
+	if (vma->vm_flags | PFN_MAP) {
+		tk->addr =
+			vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+		tk->size_shift = PAGE_SHIFT;
+	} else if (is_zone_device_page(p)) {
+		if (pgoff != FSDAX_INVALID_PGOFF)
+			tk->addr = vma_pgoff_address(pgoff, 1, vma);
+		else
+			tk->addr = page_address_in_vma(p, vma);
 		tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
-	} else
+	} else {
+		tk->addr = page_address_in_vma(p, vma);
 		tk->size_shift = page_shift(compound_head(p));
+	}
 
 	/*
 	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
@@ -617,13 +629,12 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 	i_mmap_unlock_read(mapping);
 }
 
-#ifdef CONFIG_FS_DAX
 /*
  * Collect processes when the error hit a fsdax page.
  */
-static void collect_procs_fsdax(struct page *page,
-		struct address_space *mapping, pgoff_t pgoff,
-		struct list_head *to_kill)
+static void collect_procs_pgoff(struct page *page,
+				struct address_space *mapping, pgoff_t pgoff,
+				struct list_head *to_kill)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -643,7 +654,6 @@ static void collect_procs_fsdax(struct page *page,
 	read_unlock(&tasklist_lock);
 	i_mmap_unlock_read(mapping);
 }
-#endif /* CONFIG_FS_DAX */
 
 /*
  * Collect the processes who have the corrupted page mapped to kill.
@@ -835,6 +845,7 @@ static const char * const action_page_types[] = {
 	[MF_MSG_BUDDY]			= "free buddy page",
 	[MF_MSG_DAX]			= "dax page",
 	[MF_MSG_UNSPLIT_THP]		= "unsplit thp",
+	[MF_MSG_PFN]			= "non struct page pfn",
 	[MF_MSG_UNKNOWN]		= "unknown page",
 };
 
@@ -1745,7 +1756,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 
 		SetPageHWPoison(page);
 
-		collect_procs_fsdax(page, mapping, index, &to_kill);
+		collect_procs_pgoff(page, mapping, index, &to_kill);
 		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
 				index, mf_flags);
 unlock:
@@ -2052,6 +2063,99 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	return rc;
 }
 
+/**
+ * register_pfn_address_space - Register PA region for poison notification.
+ * @pfn_space: structure containing region range and callback function on
+ *             poison detection.
+ *
+ * This function is called by a kernel module to register a PA region and
+ * a callback function with the kernel. On detection of poison, the
+ * kernel code will go through all registered regions and call the
+ * appropriate callback function associated with the range. The kernel
+ * module is responsible for tracking the poisoned pages.
+ *
+ * Return: 0 if successfully registered,
+ *         -EBUSY if the region is already registered
+ */
+int register_pfn_address_space(struct pfn_address_space *pfn_space)
+{
+	if (!request_mem_region(pfn_space->node.start << PAGE_SHIFT,
+		(pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT, ""))
+		return -EBUSY;
+
+	mutex_lock(&pfn_space_lock);
+	interval_tree_insert(&pfn_space->node, &pfn_space_itree);
+	mutex_unlock(&pfn_space_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(register_pfn_address_space);
+
+/**
+ * unregister_pfn_address_space - Unregister a PA region from poison
+ * notification.
+ * @pfn_space: structure containing region range to be unregistered.
+ *
+ * This function is called by a kernel module to unregister the PA region
+ * from the kernel from poison tracking.
+ */
+void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
+{
+	mutex_lock(&pfn_space_lock);
+	interval_tree_remove(&pfn_space->node, &pfn_space_itree);
+	mutex_unlock(&pfn_space_lock);
+	release_mem_region(pfn_space->node.start << PAGE_SHIFT,
+		(pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT);
+}
+EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
+
+static int memory_failure_pfn(unsigned long pfn, int flags)
+{
+	struct interval_tree_node *node;
+	int rc = -EBUSY;
+	LIST_HEAD(tokill);
+
+	mutex_lock(&pfn_space_lock);
+	/*
+	 * Modules registers with MM the address space mapping to the device memory they
+	 * manage. Iterate to identify exactly which address space has mapped to this
+	 * failing PFN.
+	 */
+	for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
+	     node = interval_tree_iter_next(node, pfn, pfn)) {
+		struct pfn_address_space *pfn_space =
+			container_of(node, struct pfn_address_space, node);
+		rc = 0;
+
+		/*
+		 * Modules managing the device memory needs to be conveyed about the
+		 * memory failure so that the poisoned PFN can be tracked.
+		 */
+		pfn_space->ops->failure(pfn_space, pfn);
+
+		collect_procs_pgoff(NULL, pfn_space->mapping, pfn, &tokill);
+
+		unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT,
+				    PAGE_SIZE, 0);
+	}
+	mutex_unlock(&pfn_space_lock);
+
+	/*
+	 * Unlike System-RAM there is no possibility to swap in a different
+	 * physical page at a given virtual address, so all userspace
+	 * consumption of direct PFN memory necessitates SIGBUS (i.e.
+	 * MF_MUST_KILL)
+	 */
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+	kill_procs(&tokill, true, false, pfn, flags);
+
+	pr_err("%#lx: recovery action for %s: %s\n",
+			pfn, action_page_types[MF_MSG_PFN],
+			action_name[rc ? MF_FAILED : MF_RECOVERED]);
+
+	return rc;
+}
+
 static DEFINE_MUTEX(mf_mutex);
 
 /**
@@ -2093,6 +2197,11 @@ int memory_failure(unsigned long pfn, int flags)
 	if (!(flags & MF_SW_SIMULATED))
 		hw_memory_failure = true;
 
+	if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {
+		res = memory_failure_pfn(pfn, flags);
+		goto unlock_mutex;
+	}
+
 	p = pfn_to_online_page(pfn);
 	if (!p) {
 		res = arch_memory_failure(pfn, flags);
@@ -2106,6 +2215,9 @@ int memory_failure(unsigned long pfn, int flags)
 								 pgmap);
 				goto unlock_mutex;
 			}
+
+			res = memory_failure_pfn(pfn, flags);
+			goto unlock_mutex;
 		}
 		pr_err("%#lx: memory outside kernel control\n", pfn);
 		res = -ENXIO;
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 4/6] mm: Add poison error check in fixup_user_fault() for mapped PFN
  2023-04-05 18:01 [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible ankita
                   ` (2 preceding siblings ...)
  2023-04-05 18:01 ` [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages ankita
@ 2023-04-05 18:01 ` ankita
  2023-04-05 18:01 ` [PATCH v3 5/6] mm: Change ghes code to allow poison of non-struct PFN ankita
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: ankita @ 2023-04-05 18:01 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, naoya.horiguchi, maz, oliver.upton
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

From: Ankit Agrawal <ankita@nvidia.com>

The fixup_user_fault() currently does not expect a VM_FAULT_HWPOISON
and hence does not check for it while calling vm_fault_to_errno(). Since
we now have a new code path which can trigger such case, change
fixup_user_fault to look for VM_FAULT_HWPOISON.

Also make hva_to_pfn_remapped check for -EHWPOISON and communicate the
poison fault up to the user_mem_abort().

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 mm/gup.c            | 2 +-
 virt/kvm/kvm_main.c | 6 ++++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index eab18ba045db..507a96e91bad 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1290,7 +1290,7 @@ int fixup_user_fault(struct mm_struct *mm,
 	}
 
 	if (ret & VM_FAULT_ERROR) {
-		int err = vm_fault_to_errno(ret, 0);
+		int err = vm_fault_to_errno(ret, FOLL_HWPOISON);
 
 		if (err)
 			return err;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d255964ec331..09b6973e679d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2688,6 +2688,12 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
 		r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
 		if (r == -EAGAIN)
 			goto retry;
+
+		if (r == -EHWPOISON) {
+			pfn = KVM_PFN_ERR_HWPOISON;
+			goto exit;
+		}
+
 		if (r < 0)
 			pfn = KVM_PFN_ERR_FAULT;
 	} else {
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 5/6] mm: Change ghes code to allow poison of non-struct PFN
  2023-04-05 18:01 [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible ankita
                   ` (3 preceding siblings ...)
  2023-04-05 18:01 ` [PATCH v3 4/6] mm: Add poison error check in fixup_user_fault() for mapped PFN ankita
@ 2023-04-05 18:01 ` ankita
  2023-04-05 18:01 ` [PATCH v3 6/6] vfio/nvgpu: register device memory for poison handling ankita
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: ankita @ 2023-04-05 18:01 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, naoya.horiguchi, maz, oliver.upton
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

From: Ankit Agrawal <ankita@nvidia.com>

The GHES code allows calling of memory_failure() on the PFNs that pass the
pfn_valid() check. This contract is broken for the remapped PFNs which
fails the check and ghes_do_memory_failure() returns without triggering
memory_failure().

Update code to allow memory_failure() call on PFNs failing pfn_valid().

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/acpi/apei/ghes.c | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 34ad071a64e9..2ab7fec8127d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -459,20 +459,10 @@ static void ghes_kick_task_work(struct callback_head *head)
 
 static bool ghes_do_memory_failure(u64 physical_addr, int flags)
 {
-	unsigned long pfn;
-
 	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
 		return false;
 
-	pfn = PHYS_PFN(physical_addr);
-	if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
-		pr_warn_ratelimited(FW_WARN GHES_PFX
-		"Invalid address in generic error data: %#llx\n",
-		physical_addr);
-		return false;
-	}
-
-	memory_failure_queue(pfn, flags);
+	memory_failure_queue(PHYS_PFN(physical_addr), flags);
 	return true;
 }
 
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 6/6] vfio/nvgpu: register device memory for poison handling
  2023-04-05 18:01 [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible ankita
                   ` (4 preceding siblings ...)
  2023-04-05 18:01 ` [PATCH v3 5/6] mm: Change ghes code to allow poison of non-struct PFN ankita
@ 2023-04-05 18:01 ` ankita
  2023-04-05 20:24   ` Zhi Wang
                     ` (2 more replies)
  2023-04-06 12:07 ` [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible David Hildenbrand
  2023-04-12 12:28 ` Marc Zyngier
  7 siblings, 3 replies; 31+ messages in thread
From: ankita @ 2023-04-05 18:01 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, naoya.horiguchi, maz, oliver.upton
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

From: Ankit Agrawal <ankita@nvidia.com>

The nvgpu-vfio-pci module maps QEMU VMA to device memory through
remap_pfn_range(). The new mechanism to handle poison on memory not backed
by struct page is leveraged here.

nvgpu-vfio-pci defines a function pfn_memory_failure() to get the ECC PFN
from the MM. The function is registered with kernel MM along with the
address space and PFN range through register_pfn_address_space().

Track poisoned PFN in the nvgpu-vfio-pci module as bitmap with a bit per
PFN. The PFN is communicated by the kernel MM to the module through the
failure function, which sets the appropriate bit in the bitmap.

Register a VMA fault ops for the module. It returns VM_FAULT_HWPOISON
in case the bit for the PFN is set in the bitmap.

Clear bitmap on reset to reflect the clean state of the device memory
after reset.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/nvgpu/main.c | 116 ++++++++++++++++++++++++++++++++--
 1 file changed, 110 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/nvgpu/main.c b/drivers/vfio/pci/nvgpu/main.c
index 2dd8cc6e0145..8ccd3fe33a0f 100644
--- a/drivers/vfio/pci/nvgpu/main.c
+++ b/drivers/vfio/pci/nvgpu/main.c
@@ -5,6 +5,8 @@
 
 #include <linux/pci.h>
 #include <linux/vfio_pci_core.h>
+#include <linux/bitmap.h>
+#include <linux/memory-failure.h>
 
 #define DUMMY_PFN \
 	(((nvdev->mem_prop.hpa + nvdev->mem_prop.mem_length) >> PAGE_SHIFT) - 1)
@@ -12,12 +14,78 @@
 struct dev_mem_properties {
 	uint64_t hpa;
 	uint64_t mem_length;
+	unsigned long *pfn_bitmap;
 	int bar1_start_offset;
 };
 
 struct nvgpu_vfio_pci_core_device {
 	struct vfio_pci_core_device core_device;
 	struct dev_mem_properties mem_prop;
+	struct pfn_address_space pfn_address_space;
+};
+
+void nvgpu_vfio_pci_pfn_memory_failure(struct pfn_address_space *pfn_space,
+				       unsigned long pfn)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
+		pfn_space, struct nvgpu_vfio_pci_core_device, pfn_address_space);
+
+	/*
+	 * MM has called to notify a poisoned page. Track that in the bitmap.
+	 */
+	__set_bit(pfn - (pfn_space->node.start), nvdev->mem_prop.pfn_bitmap);
+}
+
+struct pfn_address_space_ops nvgpu_vfio_pci_pas_ops = {
+	.failure = nvgpu_vfio_pci_pfn_memory_failure,
+};
+
+static int
+nvgpu_vfio_pci_register_pfn_range(struct nvgpu_vfio_pci_core_device *nvdev,
+				  struct vm_area_struct *vma)
+{
+	unsigned long nr_pages;
+	int ret = 0;
+
+	nr_pages = nvdev->mem_prop.mem_length >> PAGE_SHIFT;
+
+	nvdev->pfn_address_space.node.start = vma->vm_pgoff;
+	nvdev->pfn_address_space.node.last = vma->vm_pgoff + nr_pages - 1;
+	nvdev->pfn_address_space.ops = &nvgpu_vfio_pci_pas_ops;
+	nvdev->pfn_address_space.mapping = vma->vm_file->f_mapping;
+
+	ret = register_pfn_address_space(&(nvdev->pfn_address_space));
+
+	return ret;
+}
+
+static vm_fault_t nvgpu_vfio_pci_fault(struct vm_fault *vmf)
+{
+	unsigned long mem_offset = vmf->pgoff - vmf->vma->vm_pgoff;
+	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
+		vmf->vma->vm_file->private_data,
+		struct nvgpu_vfio_pci_core_device, core_device.vdev);
+	int ret;
+
+	/*
+	 * Check if the page is poisoned.
+	 */
+	if (mem_offset < (nvdev->mem_prop.mem_length >> PAGE_SHIFT) &&
+		test_bit(mem_offset, nvdev->mem_prop.pfn_bitmap))
+		return VM_FAULT_HWPOISON;
+
+	ret = remap_pfn_range(vmf->vma,
+			vmf->vma->vm_start + (mem_offset << PAGE_SHIFT),
+			DUMMY_PFN, PAGE_SIZE,
+			vmf->vma->vm_page_prot);
+	if (ret)
+		return VM_FAULT_ERROR;
+
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct nvgpu_vfio_pci_mmap_ops = {
+	.fault = nvgpu_vfio_pci_fault,
 };
 
 static int vfio_get_bar1_start_offset(struct vfio_pci_core_device *vdev)
@@ -26,8 +94,9 @@ static int vfio_get_bar1_start_offset(struct vfio_pci_core_device *vdev)
 
 	pci_read_config_byte(vdev->pdev, 0x10, &val);
 	/*
-	 * The BAR1 start offset in the PCI config space depends on the BAR0size.
-	 * Check if the BAR0 is 64b and return the approproiate BAR1 offset.
+	 * The BAR1 start offset in the PCI config space depends on the BAR0
+	 * size. Check if the BAR0 is 64b and return the approproiate BAR1
+	 * offset.
 	 */
 	if (val & PCI_BASE_ADDRESS_MEM_TYPE_64)
 		return VFIO_PCI_BAR2_REGION_INDEX;
@@ -54,6 +123,16 @@ static int nvgpu_vfio_pci_open_device(struct vfio_device *core_vdev)
 	return ret;
 }
 
+void nvgpu_vfio_pci_close_device(struct vfio_device *core_vdev)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
+		core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
+
+	unregister_pfn_address_space(&(nvdev->pfn_address_space));
+
+	vfio_pci_core_close_device(core_vdev);
+}
+
 int nvgpu_vfio_pci_mmap(struct vfio_device *core_vdev,
 			struct vm_area_struct *vma)
 {
@@ -93,8 +172,11 @@ int nvgpu_vfio_pci_mmap(struct vfio_device *core_vdev,
 		return ret;
 
 	vma->vm_pgoff = start_pfn + pgoff;
+	vma->vm_ops = &nvgpu_vfio_pci_mmap_ops;
 
-	return 0;
+	ret = nvgpu_vfio_pci_register_pfn_range(nvdev, vma);
+
+	return ret;
 }
 
 long nvgpu_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
@@ -140,7 +222,14 @@ long nvgpu_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		}
 
 		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
-
+	case VFIO_DEVICE_RESET:
+		/*
+		 * Resetting the GPU clears up the poisoned page. Reset the
+		 * poisoned page bitmap.
+		 */
+		memset(nvdev->mem_prop.pfn_bitmap, 0,
+		       nvdev->mem_prop.mem_length >> (PAGE_SHIFT + 3));
+		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
 	default:
 		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
 	}
@@ -151,7 +240,7 @@ static const struct vfio_device_ops nvgpu_vfio_pci_ops = {
 	.init = vfio_pci_core_init_dev,
 	.release = vfio_pci_core_release_dev,
 	.open_device = nvgpu_vfio_pci_open_device,
-	.close_device = vfio_pci_core_close_device,
+	.close_device = nvgpu_vfio_pci_close_device,
 	.ioctl = nvgpu_vfio_pci_ioctl,
 	.read = vfio_pci_core_read,
 	.write = vfio_pci_core_write,
@@ -188,7 +277,20 @@ nvgpu_vfio_pci_fetch_memory_property(struct pci_dev *pdev,
 
 	ret = device_property_read_u64(&(pdev->dev), "nvidia,gpu-mem-size",
 				       &(nvdev->mem_prop.mem_length));
-	return ret;
+	if (ret)
+		return ret;
+
+	/*
+	 * A bitmap is maintained to teack the pages that are poisoned. Each
+	 * page is represented by a bit. Allocation size in bytes is
+	 * determined by shifting the device memory size by PAGE_SHIFT to
+	 * determine the number of pages; and further shifted by 3 as each
+	 * byte could track 8 pages.
+	 */
+	nvdev->mem_prop.pfn_bitmap
+		= vzalloc(nvdev->mem_prop.mem_length >> (PAGE_SHIFT + 3));
+
+	return 0;
 }
 
 static int nvgpu_vfio_pci_probe(struct pci_dev *pdev,
@@ -224,6 +326,8 @@ static void nvgpu_vfio_pci_remove(struct pci_dev *pdev)
 	struct nvgpu_vfio_pci_core_device *nvdev = nvgpu_drvdata(pdev);
 	struct vfio_pci_core_device *vdev = &nvdev->core_device;
 
+	vfree(nvdev->mem_prop.pfn_bitmap);
+
 	vfio_pci_core_unregister_device(vdev);
 	vfio_put_device(&vdev->vdev);
 }
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/6] vfio/nvgpu: register device memory for poison handling
  2023-04-05 18:01 ` [PATCH v3 6/6] vfio/nvgpu: register device memory for poison handling ankita
@ 2023-04-05 20:24   ` Zhi Wang
  2023-04-05 21:50   ` kernel test robot
  2023-05-24  9:53   ` Dan Carpenter
  2 siblings, 0 replies; 31+ messages in thread
From: Zhi Wang @ 2023-04-05 20:24 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, naoya.horiguchi, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

On Wed, 5 Apr 2023 11:01:34 -0700
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The nvgpu-vfio-pci module maps QEMU VMA to device memory through
> remap_pfn_range(). The new mechanism to handle poison on memory not backed
> by struct page is leveraged here.
> 
> nvgpu-vfio-pci defines a function pfn_memory_failure() to get the ECC PFN
> from the MM. The function is registered with kernel MM along with the
> address space and PFN range through register_pfn_address_space().
> 
> Track poisoned PFN in the nvgpu-vfio-pci module as bitmap with a bit per
> PFN. The PFN is communicated by the kernel MM to the module through the
> failure function, which sets the appropriate bit in the bitmap.
> 
> Register a VMA fault ops for the module. It returns VM_FAULT_HWPOISON
> in case the bit for the PFN is set in the bitmap.
> 
> Clear bitmap on reset to reflect the clean state of the device memory
> after reset.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  drivers/vfio/pci/nvgpu/main.c | 116 ++++++++++++++++++++++++++++++++--
>  1 file changed, 110 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vfio/pci/nvgpu/main.c b/drivers/vfio/pci/nvgpu/main.c
> index 2dd8cc6e0145..8ccd3fe33a0f 100644
> --- a/drivers/vfio/pci/nvgpu/main.c
> +++ b/drivers/vfio/pci/nvgpu/main.c
> @@ -5,6 +5,8 @@
>  
>  #include <linux/pci.h>
>  #include <linux/vfio_pci_core.h>
> +#include <linux/bitmap.h>
> +#include <linux/memory-failure.h>
>  
>  #define DUMMY_PFN \
>  	(((nvdev->mem_prop.hpa + nvdev->mem_prop.mem_length) >> PAGE_SHIFT) - 1)
> @@ -12,12 +14,78 @@
>  struct dev_mem_properties {
>  	uint64_t hpa;
>  	uint64_t mem_length;
> +	unsigned long *pfn_bitmap;
>  	int bar1_start_offset;
>  };
>  
>  struct nvgpu_vfio_pci_core_device {
>  	struct vfio_pci_core_device core_device;
>  	struct dev_mem_properties mem_prop;
> +	struct pfn_address_space pfn_address_space;
> +};
> +
> +void nvgpu_vfio_pci_pfn_memory_failure(struct pfn_address_space *pfn_space,
> +				       unsigned long pfn)
> +{
> +	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
> +		pfn_space, struct nvgpu_vfio_pci_core_device, pfn_address_space);
> +
> +	/*
> +	 * MM has called to notify a poisoned page. Track that in the bitmap.
> +	 */
> +	__set_bit(pfn - (pfn_space->node.start), nvdev->mem_prop.pfn_bitmap);
> +}
> +
> +struct pfn_address_space_ops nvgpu_vfio_pci_pas_ops = {
> +	.failure = nvgpu_vfio_pci_pfn_memory_failure,
> +};
> +
> +static int
> +nvgpu_vfio_pci_register_pfn_range(struct nvgpu_vfio_pci_core_device *nvdev,
> +				  struct vm_area_struct *vma)
> +{
> +	unsigned long nr_pages;
> +	int ret = 0;
> +
> +	nr_pages = nvdev->mem_prop.mem_length >> PAGE_SHIFT;
> +
> +	nvdev->pfn_address_space.node.start = vma->vm_pgoff;
> +	nvdev->pfn_address_space.node.last = vma->vm_pgoff + nr_pages - 1;
> +	nvdev->pfn_address_space.ops = &nvgpu_vfio_pci_pas_ops;
> +	nvdev->pfn_address_space.mapping = vma->vm_file->f_mapping;
> +
> +	ret = register_pfn_address_space(&(nvdev->pfn_address_space));
> +
> +	return ret;
> +}
> +
> +static vm_fault_t nvgpu_vfio_pci_fault(struct vm_fault *vmf)
> +{
> +	unsigned long mem_offset = vmf->pgoff - vmf->vma->vm_pgoff;
> +	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
> +		vmf->vma->vm_file->private_data,
> +		struct nvgpu_vfio_pci_core_device, core_device.vdev);
> +	int ret;
> +
> +	/*
> +	 * Check if the page is poisoned.
> +	 */
> +	if (mem_offset < (nvdev->mem_prop.mem_length >> PAGE_SHIFT) &&
> +		test_bit(mem_offset, nvdev->mem_prop.pfn_bitmap))
> +		return VM_FAULT_HWPOISON;
> +
> +	ret = remap_pfn_range(vmf->vma,
> +			vmf->vma->vm_start + (mem_offset << PAGE_SHIFT),
> +			DUMMY_PFN, PAGE_SIZE,
> +			vmf->vma->vm_page_prot);
> +	if (ret)
> +		return VM_FAULT_ERROR;
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct nvgpu_vfio_pci_mmap_ops = {
> +	.fault = nvgpu_vfio_pci_fault,
>  };
>  
>  static int vfio_get_bar1_start_offset(struct vfio_pci_core_device *vdev)
> @@ -26,8 +94,9 @@ static int vfio_get_bar1_start_offset(struct vfio_pci_core_device *vdev)
>  
>  	pci_read_config_byte(vdev->pdev, 0x10, &val);
>  	/*
> -	 * The BAR1 start offset in the PCI config space depends on the BAR0size.
> -	 * Check if the BAR0 is 64b and return the approproiate BAR1 offset.
> +	 * The BAR1 start offset in the PCI config space depends on the BAR0
> +	 * size. Check if the BAR0 is 64b and return the approproiate BAR1
> +	 * offset.
>  	 */
>  	if (val & PCI_BASE_ADDRESS_MEM_TYPE_64)
>  		return VFIO_PCI_BAR2_REGION_INDEX;
> @@ -54,6 +123,16 @@ static int nvgpu_vfio_pci_open_device(struct vfio_device *core_vdev)
>  	return ret;
>  }
>  
> +void nvgpu_vfio_pci_close_device(struct vfio_device *core_vdev)
> +{
> +	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
> +		core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
> +
> +	unregister_pfn_address_space(&(nvdev->pfn_address_space));
> +
> +	vfio_pci_core_close_device(core_vdev);
> +}
> +
>  int nvgpu_vfio_pci_mmap(struct vfio_device *core_vdev,
>  			struct vm_area_struct *vma)
>  {
> @@ -93,8 +172,11 @@ int nvgpu_vfio_pci_mmap(struct vfio_device *core_vdev,
>  		return ret;
>  
>  	vma->vm_pgoff = start_pfn + pgoff;
> +	vma->vm_ops = &nvgpu_vfio_pci_mmap_ops;
>  
> -	return 0;
> +	ret = nvgpu_vfio_pci_register_pfn_range(nvdev, vma);
> +
> +	return ret;
>  }
>  
>  long nvgpu_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> @@ -140,7 +222,14 @@ long nvgpu_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  		}
>  
>  		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> -
> +	case VFIO_DEVICE_RESET:
> +		/*
> +		 * Resetting the GPU clears up the poisoned page. Reset the
> +		 * poisoned page bitmap.
> +		 */
> +		memset(nvdev->mem_prop.pfn_bitmap, 0,
> +		       nvdev->mem_prop.mem_length >> (PAGE_SHIFT + 3));
> +		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
>  	default:
>  		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
>  	}
> @@ -151,7 +240,7 @@ static const struct vfio_device_ops nvgpu_vfio_pci_ops = {
>  	.init = vfio_pci_core_init_dev,
>  	.release = vfio_pci_core_release_dev,
>  	.open_device = nvgpu_vfio_pci_open_device,
> -	.close_device = vfio_pci_core_close_device,
> +	.close_device = nvgpu_vfio_pci_close_device,
>  	.ioctl = nvgpu_vfio_pci_ioctl,
>  	.read = vfio_pci_core_read,
>  	.write = vfio_pci_core_write,
> @@ -188,7 +277,20 @@ nvgpu_vfio_pci_fetch_memory_property(struct pci_dev *pdev,
>  
>  	ret = device_property_read_u64(&(pdev->dev), "nvidia,gpu-mem-size",
>  				       &(nvdev->mem_prop.mem_length));
> -	return ret;
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * A bitmap is maintained to teack the pages that are poisoned. Each
                                       ^track?
> +	 * page is represented by a bit. Allocation size in bytes is
> +	 * determined by shifting the device memory size by PAGE_SHIFT to
> +	 * determine the number of pages; and further shifted by 3 as each
> +	 * byte could track 8 pages.
> +	 */
> +	nvdev->mem_prop.pfn_bitmap
> +		= vzalloc(nvdev->mem_prop.mem_length >> (PAGE_SHIFT + 3));
> +
> +	return 0;
>  }
>  
>  static int nvgpu_vfio_pci_probe(struct pci_dev *pdev,
> @@ -224,6 +326,8 @@ static void nvgpu_vfio_pci_remove(struct pci_dev *pdev)
>  	struct nvgpu_vfio_pci_core_device *nvdev = nvgpu_drvdata(pdev);
>  	struct vfio_pci_core_device *vdev = &nvdev->core_device;
>  
> +	vfree(nvdev->mem_prop.pfn_bitmap);
> +
>  	vfio_pci_core_unregister_device(vdev);
>  	vfio_put_device(&vdev->vdev);
>  }



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 2/6] vfio/nvgpu: expose GPU device memory as BAR1
  2023-04-05 18:01 ` [PATCH v3 2/6] vfio/nvgpu: expose GPU device memory as BAR1 ankita
@ 2023-04-05 21:07   ` kernel test robot
  0 siblings, 0 replies; 31+ messages in thread
From: kernel test robot @ 2023-04-05 21:07 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, naoya.horiguchi, maz, oliver.upton
  Cc: oe-kbuild-all, aniketa, cjia, kwankhede, targupta, vsethi,
	acurrid, apopple, jhubbard, danw, kvm, linux-kernel,
	linux-arm-kernel, linux-mm

Hi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on awilliam-vfio/for-linus]
[also build test WARNING on kvmarm/next akpm-mm/mm-everything linus/master v6.3-rc5 next-20230405]
[cannot apply to awilliam-vfio/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/ankita-nvidia-com/kvm-determine-memory-type-from-VMA/20230406-020404
base:   https://github.com/awilliam/linux-vfio.git for-linus
patch link:    https://lore.kernel.org/r/20230405180134.16932-3-ankita%40nvidia.com
patch subject: [PATCH v3 2/6] vfio/nvgpu: expose GPU device memory as BAR1
config: sparc-allyesconfig (https://download.01.org/0day-ci/archive/20230406/202304060424.MtQM4udq-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/09ea30fcd2fb02d13a38cab4bf3d903f902408f4
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review ankita-nvidia-com/kvm-determine-memory-type-from-VMA/20230406-020404
        git checkout 09ea30fcd2fb02d13a38cab4bf3d903f902408f4
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sparc olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sparc SHELL=/bin/bash drivers/vfio/pci/nvgpu/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304060424.MtQM4udq-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/vfio/pci/nvgpu/main.c:57:5: warning: no previous prototype for 'nvgpu_vfio_pci_mmap' [-Wmissing-prototypes]
      57 | int nvgpu_vfio_pci_mmap(struct vfio_device *core_vdev,
         |     ^~~~~~~~~~~~~~~~~~~
>> drivers/vfio/pci/nvgpu/main.c:100:6: warning: no previous prototype for 'nvgpu_vfio_pci_ioctl' [-Wmissing-prototypes]
     100 | long nvgpu_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
         |      ^~~~~~~~~~~~~~~~~~~~


vim +/nvgpu_vfio_pci_mmap +57 drivers/vfio/pci/nvgpu/main.c

    56	
  > 57	int nvgpu_vfio_pci_mmap(struct vfio_device *core_vdev,
    58				struct vm_area_struct *vma)
    59	{
    60		struct nvgpu_vfio_pci_core_device *nvdev = container_of(
    61			core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
    62	
    63		unsigned long start_pfn;
    64		unsigned int index;
    65		u64 req_len, pgoff;
    66		int ret = 0;
    67	
    68		index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
    69		if (index != nvdev->mem_prop.bar1_start_offset)
    70			return vfio_pci_core_mmap(core_vdev, vma);
    71	
    72		/*
    73		 * Request to mmap the BAR1. Map to the CPU accessible memory on the
    74		 * GPU using the memory information gathered from the system ACPI
    75		 * tables.
    76		 */
    77		start_pfn = nvdev->mem_prop.hpa >> PAGE_SHIFT;
    78		req_len = vma->vm_end - vma->vm_start;
    79		pgoff = vma->vm_pgoff &
    80			((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
    81		if (pgoff >= (nvdev->mem_prop.mem_length >> PAGE_SHIFT))
    82			return -EINVAL;
    83	
    84		/*
    85		 * Perform a PFN map to the memory. The device BAR1 is backed by the
    86		 * GPU memory now. Check that the mapping does not overflow out of
    87		 * the GPU memory size.
    88		 */
    89		ret = remap_pfn_range(vma, vma->vm_start, start_pfn + pgoff,
    90				      min(req_len, nvdev->mem_prop.mem_length - pgoff),
    91				      vma->vm_page_prot);
    92		if (ret)
    93			return ret;
    94	
    95		vma->vm_pgoff = start_pfn + pgoff;
    96	
    97		return 0;
    98	}
    99	
 > 100	long nvgpu_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
   101				  unsigned long arg)
   102	{
   103		struct nvgpu_vfio_pci_core_device *nvdev = container_of(
   104			core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
   105	
   106		unsigned long minsz = offsetofend(struct vfio_region_info, offset);
   107		struct vfio_region_info info;
   108	
   109		switch (cmd) {
   110		case VFIO_DEVICE_GET_REGION_INFO:
   111			if (copy_from_user(&info, (void __user *)arg, minsz))
   112				return -EFAULT;
   113	
   114			if (info.argsz < minsz)
   115				return -EINVAL;
   116	
   117			if (info.index == nvdev->mem_prop.bar1_start_offset) {
   118				/*
   119				 * Request to determine the BAR1 region information. Send the
   120				 * GPU memory information.
   121				 */
   122				info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
   123				info.size = nvdev->mem_prop.mem_length;
   124				info.flags = VFIO_REGION_INFO_FLAG_READ |
   125					     VFIO_REGION_INFO_FLAG_WRITE |
   126					     VFIO_REGION_INFO_FLAG_MMAP;
   127				return copy_to_user((void __user *)arg, &info, minsz) ?
   128					       -EFAULT : 0;
   129			}
   130	
   131			if (info.index == nvdev->mem_prop.bar1_start_offset + 1) {
   132				/*
   133				 * The BAR1 region is 64b. Ignore this access.
   134				 */
   135				info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
   136				info.size = 0;
   137				info.flags = 0;
   138				return copy_to_user((void __user *)arg, &info, minsz) ?
   139					-EFAULT : 0;
   140			}
   141	
   142			return vfio_pci_core_ioctl(core_vdev, cmd, arg);
   143	
   144		default:
   145			return vfio_pci_core_ioctl(core_vdev, cmd, arg);
   146		}
   147	}
   148	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages
  2023-04-05 18:01 ` [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages ankita
@ 2023-04-05 21:07   ` kernel test robot
  2023-05-09  9:51   ` HORIGUCHI NAOYA(堀口 直也)
  1 sibling, 0 replies; 31+ messages in thread
From: kernel test robot @ 2023-04-05 21:07 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, naoya.horiguchi, maz, oliver.upton
  Cc: oe-kbuild-all, aniketa, cjia, kwankhede, targupta, vsethi,
	acurrid, apopple, jhubbard, danw, kvm, linux-kernel,
	linux-arm-kernel, linux-mm

Hi,

kernel test robot noticed the following build errors:

[auto build test ERROR on awilliam-vfio/for-linus]
[also build test ERROR on kvmarm/next akpm-mm/mm-everything linus/master v6.3-rc5]
[cannot apply to awilliam-vfio/next next-20230405]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/ankita-nvidia-com/kvm-determine-memory-type-from-VMA/20230406-020404
base:   https://github.com/awilliam/linux-vfio.git for-linus
patch link:    https://lore.kernel.org/r/20230405180134.16932-4-ankita%40nvidia.com
patch subject: [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages
config: x86_64-randconfig-a015-20230403 (https://download.01.org/0day-ci/archive/20230406/202304060452.tpNrPK39-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/25466c8c2fa22d39a08721a24f0cf3bc3059417b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review ankita-nvidia-com/kvm-determine-memory-type-from-VMA/20230406-020404
        git checkout 25466c8c2fa22d39a08721a24f0cf3bc3059417b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 olddefconfig
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304060452.tpNrPK39-lkp@intel.com/

All errors (new ones prefixed by >>):

   ld: vmlinux.o: in function `memory_failure_pfn':
>> mm/memory-failure.c:2124: undefined reference to `interval_tree_iter_first'
>> ld: mm/memory-failure.c:2125: undefined reference to `interval_tree_iter_next'
   ld: vmlinux.o: in function `register_pfn_address_space':
>> mm/memory-failure.c:2087: undefined reference to `interval_tree_insert'
   ld: vmlinux.o: in function `unregister_pfn_address_space':
>> mm/memory-failure.c:2105: undefined reference to `interval_tree_remove'


vim +2124 mm/memory-failure.c

  2065	
  2066	/**
  2067	 * register_pfn_address_space - Register PA region for poison notification.
  2068	 * @pfn_space: structure containing region range and callback function on
  2069	 *             poison detection.
  2070	 *
  2071	 * This function is called by a kernel module to register a PA region and
  2072	 * a callback function with the kernel. On detection of poison, the
  2073	 * kernel code will go through all registered regions and call the
  2074	 * appropriate callback function associated with the range. The kernel
  2075	 * module is responsible for tracking the poisoned pages.
  2076	 *
  2077	 * Return: 0 if successfully registered,
  2078	 *         -EBUSY if the region is already registered
  2079	 */
  2080	int register_pfn_address_space(struct pfn_address_space *pfn_space)
  2081	{
  2082		if (!request_mem_region(pfn_space->node.start << PAGE_SHIFT,
  2083			(pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT, ""))
  2084			return -EBUSY;
  2085	
  2086		mutex_lock(&pfn_space_lock);
> 2087		interval_tree_insert(&pfn_space->node, &pfn_space_itree);
  2088		mutex_unlock(&pfn_space_lock);
  2089	
  2090		return 0;
  2091	}
  2092	EXPORT_SYMBOL_GPL(register_pfn_address_space);
  2093	
  2094	/**
  2095	 * unregister_pfn_address_space - Unregister a PA region from poison
  2096	 * notification.
  2097	 * @pfn_space: structure containing region range to be unregistered.
  2098	 *
  2099	 * This function is called by a kernel module to unregister the PA region
  2100	 * from the kernel from poison tracking.
  2101	 */
  2102	void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
  2103	{
  2104		mutex_lock(&pfn_space_lock);
> 2105		interval_tree_remove(&pfn_space->node, &pfn_space_itree);
  2106		mutex_unlock(&pfn_space_lock);
  2107		release_mem_region(pfn_space->node.start << PAGE_SHIFT,
  2108			(pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT);
  2109	}
  2110	EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
  2111	
  2112	static int memory_failure_pfn(unsigned long pfn, int flags)
  2113	{
  2114		struct interval_tree_node *node;
  2115		int rc = -EBUSY;
  2116		LIST_HEAD(tokill);
  2117	
  2118		mutex_lock(&pfn_space_lock);
  2119		/*
  2120		 * Modules registers with MM the address space mapping to the device memory they
  2121		 * manage. Iterate to identify exactly which address space has mapped to this
  2122		 * failing PFN.
  2123		 */
> 2124		for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
> 2125		     node = interval_tree_iter_next(node, pfn, pfn)) {
  2126			struct pfn_address_space *pfn_space =
  2127				container_of(node, struct pfn_address_space, node);
  2128			rc = 0;
  2129	
  2130			/*
  2131			 * Modules managing the device memory needs to be conveyed about the
  2132			 * memory failure so that the poisoned PFN can be tracked.
  2133			 */
  2134			pfn_space->ops->failure(pfn_space, pfn);
  2135	
  2136			collect_procs_pgoff(NULL, pfn_space->mapping, pfn, &tokill);
  2137	
  2138			unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT,
  2139					    PAGE_SIZE, 0);
  2140		}
  2141		mutex_unlock(&pfn_space_lock);
  2142	
  2143		/*
  2144		 * Unlike System-RAM there is no possibility to swap in a different
  2145		 * physical page at a given virtual address, so all userspace
  2146		 * consumption of direct PFN memory necessitates SIGBUS (i.e.
  2147		 * MF_MUST_KILL)
  2148		 */
  2149		flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
  2150		kill_procs(&tokill, true, false, pfn, flags);
  2151	
  2152		pr_err("%#lx: recovery action for %s: %s\n",
  2153				pfn, action_page_types[MF_MSG_PFN],
  2154				action_name[rc ? MF_FAILED : MF_RECOVERED]);
  2155	
  2156		return rc;
  2157	}
  2158	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/6] vfio/nvgpu: register device memory for poison handling
  2023-04-05 18:01 ` [PATCH v3 6/6] vfio/nvgpu: register device memory for poison handling ankita
  2023-04-05 20:24   ` Zhi Wang
@ 2023-04-05 21:50   ` kernel test robot
  2023-05-24  9:53   ` Dan Carpenter
  2 siblings, 0 replies; 31+ messages in thread
From: kernel test robot @ 2023-04-05 21:50 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, naoya.horiguchi, maz, oliver.upton
  Cc: oe-kbuild-all, aniketa, cjia, kwankhede, targupta, vsethi,
	acurrid, apopple, jhubbard, danw, kvm, linux-kernel,
	linux-arm-kernel, linux-mm

Hi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on awilliam-vfio/for-linus]
[also build test WARNING on kvmarm/next akpm-mm/mm-everything linus/master v6.3-rc5]
[cannot apply to awilliam-vfio/next next-20230405]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/ankita-nvidia-com/kvm-determine-memory-type-from-VMA/20230406-020404
base:   https://github.com/awilliam/linux-vfio.git for-linus
patch link:    https://lore.kernel.org/r/20230405180134.16932-7-ankita%40nvidia.com
patch subject: [PATCH v3 6/6] vfio/nvgpu: register device memory for poison handling
config: sparc-allyesconfig (https://download.01.org/0day-ci/archive/20230406/202304060554.C67nAJzr-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/f2eb3c3417670adc1615fc629e6363d50c0623b4
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review ankita-nvidia-com/kvm-determine-memory-type-from-VMA/20230406-020404
        git checkout f2eb3c3417670adc1615fc629e6363d50c0623b4
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sparc olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sparc SHELL=/bin/bash drivers/vfio/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304060554.C67nAJzr-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/vfio/pci/nvgpu/main.c:27:6: warning: no previous prototype for 'nvgpu_vfio_pci_pfn_memory_failure' [-Wmissing-prototypes]
      27 | void nvgpu_vfio_pci_pfn_memory_failure(struct pfn_address_space *pfn_space,
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> drivers/vfio/pci/nvgpu/main.c:126:6: warning: no previous prototype for 'nvgpu_vfio_pci_close_device' [-Wmissing-prototypes]
     126 | void nvgpu_vfio_pci_close_device(struct vfio_device *core_vdev)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/vfio/pci/nvgpu/main.c:136:5: warning: no previous prototype for 'nvgpu_vfio_pci_mmap' [-Wmissing-prototypes]
     136 | int nvgpu_vfio_pci_mmap(struct vfio_device *core_vdev,
         |     ^~~~~~~~~~~~~~~~~~~
   drivers/vfio/pci/nvgpu/main.c:182:6: warning: no previous prototype for 'nvgpu_vfio_pci_ioctl' [-Wmissing-prototypes]
     182 | long nvgpu_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
         |      ^~~~~~~~~~~~~~~~~~~~


vim +/nvgpu_vfio_pci_pfn_memory_failure +27 drivers/vfio/pci/nvgpu/main.c

    26	
  > 27	void nvgpu_vfio_pci_pfn_memory_failure(struct pfn_address_space *pfn_space,
    28					       unsigned long pfn)
    29	{
    30		struct nvgpu_vfio_pci_core_device *nvdev = container_of(
    31			pfn_space, struct nvgpu_vfio_pci_core_device, pfn_address_space);
    32	
    33		/*
    34		 * MM has called to notify a poisoned page. Track that in the bitmap.
    35		 */
    36		__set_bit(pfn - (pfn_space->node.start), nvdev->mem_prop.pfn_bitmap);
    37	}
    38	
    39	struct pfn_address_space_ops nvgpu_vfio_pci_pas_ops = {
    40		.failure = nvgpu_vfio_pci_pfn_memory_failure,
    41	};
    42	
    43	static int
    44	nvgpu_vfio_pci_register_pfn_range(struct nvgpu_vfio_pci_core_device *nvdev,
    45					  struct vm_area_struct *vma)
    46	{
    47		unsigned long nr_pages;
    48		int ret = 0;
    49	
    50		nr_pages = nvdev->mem_prop.mem_length >> PAGE_SHIFT;
    51	
    52		nvdev->pfn_address_space.node.start = vma->vm_pgoff;
    53		nvdev->pfn_address_space.node.last = vma->vm_pgoff + nr_pages - 1;
    54		nvdev->pfn_address_space.ops = &nvgpu_vfio_pci_pas_ops;
    55		nvdev->pfn_address_space.mapping = vma->vm_file->f_mapping;
    56	
    57		ret = register_pfn_address_space(&(nvdev->pfn_address_space));
    58	
    59		return ret;
    60	}
    61	
    62	static vm_fault_t nvgpu_vfio_pci_fault(struct vm_fault *vmf)
    63	{
    64		unsigned long mem_offset = vmf->pgoff - vmf->vma->vm_pgoff;
    65		struct nvgpu_vfio_pci_core_device *nvdev = container_of(
    66			vmf->vma->vm_file->private_data,
    67			struct nvgpu_vfio_pci_core_device, core_device.vdev);
    68		int ret;
    69	
    70		/*
    71		 * Check if the page is poisoned.
    72		 */
    73		if (mem_offset < (nvdev->mem_prop.mem_length >> PAGE_SHIFT) &&
    74			test_bit(mem_offset, nvdev->mem_prop.pfn_bitmap))
    75			return VM_FAULT_HWPOISON;
    76	
    77		ret = remap_pfn_range(vmf->vma,
    78				vmf->vma->vm_start + (mem_offset << PAGE_SHIFT),
    79				DUMMY_PFN, PAGE_SIZE,
    80				vmf->vma->vm_page_prot);
    81		if (ret)
    82			return VM_FAULT_ERROR;
    83	
    84		return VM_FAULT_NOPAGE;
    85	}
    86	
    87	static const struct vm_operations_struct nvgpu_vfio_pci_mmap_ops = {
    88		.fault = nvgpu_vfio_pci_fault,
    89	};
    90	
    91	static int vfio_get_bar1_start_offset(struct vfio_pci_core_device *vdev)
    92	{
    93		u8 val = 0;
    94	
    95		pci_read_config_byte(vdev->pdev, 0x10, &val);
    96		/*
    97		 * The BAR1 start offset in the PCI config space depends on the BAR0
    98		 * size. Check if the BAR0 is 64b and return the approproiate BAR1
    99		 * offset.
   100		 */
   101		if (val & PCI_BASE_ADDRESS_MEM_TYPE_64)
   102			return VFIO_PCI_BAR2_REGION_INDEX;
   103	
   104		return VFIO_PCI_BAR1_REGION_INDEX;
   105	}
   106	
   107	static int nvgpu_vfio_pci_open_device(struct vfio_device *core_vdev)
   108	{
   109		struct nvgpu_vfio_pci_core_device *nvdev = container_of(
   110			core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
   111		struct vfio_pci_core_device *vdev =
   112			container_of(core_vdev, struct vfio_pci_core_device, vdev);
   113		int ret;
   114	
   115		ret = vfio_pci_core_enable(vdev);
   116		if (ret)
   117			return ret;
   118	
   119		vfio_pci_core_finish_enable(vdev);
   120	
   121		nvdev->mem_prop.bar1_start_offset = vfio_get_bar1_start_offset(vdev);
   122	
   123		return ret;
   124	}
   125	
 > 126	void nvgpu_vfio_pci_close_device(struct vfio_device *core_vdev)
   127	{
   128		struct nvgpu_vfio_pci_core_device *nvdev = container_of(
   129			core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
   130	
   131		unregister_pfn_address_space(&(nvdev->pfn_address_space));
   132	
   133		vfio_pci_core_close_device(core_vdev);
   134	}
   135	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible
  2023-04-05 18:01 [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible ankita
                   ` (5 preceding siblings ...)
  2023-04-05 18:01 ` [PATCH v3 6/6] vfio/nvgpu: register device memory for poison handling ankita
@ 2023-04-06 12:07 ` David Hildenbrand
  2023-04-12  8:43   ` Ankit Agrawal
  2023-04-12 12:28 ` Marc Zyngier
  7 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2023-04-06 12:07 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, naoya.horiguchi, maz, oliver.upton
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

[...]

> This goes along with a qemu series to provides the necessary
> implementation of the Grace Hopper Superchip firmware specification so
> that the guest operating system can see the correct ACPI modeling for
> the coherent GPU device.
> https://github.com/qemu/qemu/compare/master...ankita-nv:qemu:dev-ankit/cohmem-0330
> 
> Applied and tested over v6.3-rc4.
> 

I briefly skimmed over the series, the patch subject prefixes are a bit 
misleading IMHO and could be improved:

> Ankit Agrawal (6):
>    kvm: determine memory type from VMA

this is arch64 specific kvm (kvm/aarch64: ?)

>    vfio/nvgpu: expose GPU device memory as BAR1
>    mm: handle poisoning of pfn without struct pages

mm/memory-failure:

>    mm: Add poison error check in fixup_user_fault() for mapped PFN

That's both MM and core-KVM, maybe worth splitting up.

>    mm: Change ghes code to allow poison of non-struct PFN

That's  drivers/acpi/apei code, not core-mm code.

>    vfio/nvgpu: register device memory for poison handling

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible
  2023-04-06 12:07 ` [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible David Hildenbrand
@ 2023-04-12  8:43   ` Ankit Agrawal
  2023-04-12  9:48     ` Marc Zyngier
  0 siblings, 1 reply; 31+ messages in thread
From: Ankit Agrawal @ 2023-04-12  8:43 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe, alex.williamson,
	naoya.horiguchi, maz, oliver.upton
  Cc: Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
	Dan Williams, kvm, linux-kernel, linux-arm-kernel, linux-mm

Thanks David, response inline.

[...]

> I briefly skimmed over the series, the patch subject prefixes are a bit
> misleading IMHO and could be improved:

Understood. Will fix that in the next iteration.


>> Ankit Agrawal (6):
>>    kvm: determine memory type from VMA

> this is arch64 specific kvm (kvm/aarch64: ?)
Right. I'll change the prefix to kvm/aarch64


>>>    vfio/nvgpu: expose GPU device memory as BAR1
>>    mm: handle poisoning of pfn without struct pages
>mm/memory-failure:
Will change the prefix.

>>    mm: Add poison error check in fixup_user_fault() for mapped PFN
> That's both MM and core-KVM, maybe worth splitting up.
Ack, will do.

>>    mm: Change ghes code to allow poison of non-struct PFN
> That's  drivers/acpi/apei code, not core-mm code.
Ack.

>>    vfio/nvgpu: register device memory for poison handling

Thanks,
Ankit Agrawal

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible
  2023-04-12  8:43   ` Ankit Agrawal
@ 2023-04-12  9:48     ` Marc Zyngier
  0 siblings, 0 replies; 31+ messages in thread
From: Marc Zyngier @ 2023-04-12  9:48 UTC (permalink / raw)
  To: Ankit Agrawal
  Cc: David Hildenbrand, Jason Gunthorpe, alex.williamson,
	naoya.horiguchi, oliver.upton, Aniket Agashe, Neo Jia,
	Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
	Dan Williams, kvm, linux-kernel, linux-arm-kernel, linux-mm

On Wed, 12 Apr 2023 09:43:56 +0100,
Ankit Agrawal <ankita@nvidia.com> wrote:
> 
> Thanks David, response inline.
> 
> [...]
> 
> > I briefly skimmed over the series, the patch subject prefixes are a bit
> > misleading IMHO and could be improved:
> 
> Understood. Will fix that in the next iteration.
> 
> 
> >> Ankit Agrawal (6):
> >>    kvm: determine memory type from VMA
> 
> > this is arch64 specific kvm (kvm/aarch64: ?)
> Right. I'll change the prefix to kvm/aarch64

Please look at the git log: the idiomatic prefix is
"KVM: arm64: Something starting with a capital letter"

AArch64 is almost never used anywhere in the arm64 tree.

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible
  2023-04-05 18:01 [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible ankita
                   ` (6 preceding siblings ...)
  2023-04-06 12:07 ` [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible David Hildenbrand
@ 2023-04-12 12:28 ` Marc Zyngier
  2023-04-12 12:53   ` Jason Gunthorpe
  7 siblings, 1 reply; 31+ messages in thread
From: Marc Zyngier @ 2023-04-12 12:28 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, naoya.horiguchi, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

On Wed, 05 Apr 2023 19:01:28 +0100,
<ankita@nvidia.com> wrote:
> 
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal propritary cache coherent interconnect.
> 
> This representation has a number of limitations compared to a real PCI
> device, in particular, it does not model the coherent GPU memory
> aperture as a PCI config space BAR, and PCI doesn't know anything
> about cacheable memory types.
> 
> Provide a VFIO PCI variant driver that adapts the unique PCI
> representation into a more standard PCI representation facing
> userspace. The GPU memory aperture is obtained from ACPI, according to
> the FW specification, and exported to userspace as the VFIO_REGION
> that covers the first PCI BAR. qemu will naturally generate a PCI
> device in the VM where the cacheable aperture is reported in BAR1.
> 
> Since this memory region is actually cache coherent with the CPU, the
> VFIO variant driver will mmap it into VMA using a cacheable mapping.
> 
> As this is the first time an ARM environment has placed cacheable
> non-struct page backed memory (eg from remap_pfn_range) into a KVM
> page table, fix a bug in ARM KVM where it does not copy the cacheable
> memory attributes from non-struct page backed PTEs to ensure the guest
> also gets a cacheable mapping.

This is not a bug, but a conscious design decision. As you pointed out
above, nothing needed this until now, and a device mapping is the only
safe thing to do as we know exactly *nothing* about the memory that
gets mapped.

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 1/6] kvm: determine memory type from VMA
  2023-04-05 18:01 ` [PATCH v3 1/6] kvm: determine memory type from VMA ankita
@ 2023-04-12 12:43   ` Marc Zyngier
  2023-04-12 13:01     ` Jason Gunthorpe
  0 siblings, 1 reply; 31+ messages in thread
From: Marc Zyngier @ 2023-04-12 12:43 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, naoya.horiguchi, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

On Wed, 05 Apr 2023 19:01:29 +0100,
<ankita@nvidia.com> wrote:
> 
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Each VM stores the requires pgprots for its mappings in the

The VM stores *nothing*. Did you mean VMA? And even then, I can't
parse this.

> vma->pgprot. Based on this we can determine the desired MT_DEVICE_*
> for the VMA directly, and do not have to guess based on heuristics
> based on pfn_is_map_memory().
> 
> There are the following kinds of pgprot available to userspace and their
> corresponding type:
> pgprot_noncached -> MT_DEVICE_nGnRnE
> pgprot_writecombine -> MT_NORMAL_NC
> pgprot_device -> MT_DEVICE_nGnRE
> pgprot_tagged -> MT_NORMAL_TAGGED
> 
> Decode the relevant MT_* types in use and translate them into the
> corresponding KVM_PGTABLEPROT_*:
> 
>  - MT_DEVICE_nGnRE -> KVM_PGTABLE_PROT_DEVICE_nGnRE (device)
>  - MT_DEVICE_nGnRnE -> KVM_PGTABLE_PROT_DEVICE_nGnRnE (noncached)
>  - MT_NORMAL/_TAGGED/_NC -> 0
> 
> The selection of 0 for the S2 KVM_PGTABLE_PROT_DEVICE_nGnRnE is based
> on [2].
> 
> Also worth noting is the result of the stage-1 and stage-2. Ref [3]
> If FWB not set, then the combination is the one that is more restrictive.
> The sequence from lowest restriction to the highest:
> DEVICE_nGnRnE -> DEVICE_nGnRE -> NORMAL/_TAGGED/_NC

Sorry, but I can't parse this either. If you mean that the strongest
memory type wins, then I agree. But what does it bring to the
discussion?

> If FWB is set, then stage-2 mapping type overrides the stage-1 [1].
> 
> This solves a problem where KVM cannot preserve the MT_NORMAL memory
> type for non-struct page backed memory into the S2 mapping. Instead
> the VMA creator determines the MT type and the S2 will follow it.

What makes it safe? How does VFIO ensures that the memory type used is
correct in all circumstances? This has to hold for *ANY* device, not
just your favourite toy of the day. Nothing in this patch is horribly
wrong, but the above question must be answered before we can consider
any of this.

> 
> [1] https://developer.arm.com/documentation/102376/0100/Combining-Stage-1-and-Stage-2-attributes
> [2] ARMv8 reference manual: https://developer.arm.com/documentation/ddi0487/gb/ Section D5.5.3, Table D5-38
> [3] ARMv8 reference manual: https://developer.arm.com/documentation/ddi0487/gb/ Table G5-20 on page G5-6330

Please quote references that are current at the time of posting.
Revision DDI0487I.a is the most recent version, so use that.

>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  8 +++++---
>  arch/arm64/include/asm/memory.h      |  6 ++++--
>  arch/arm64/kvm/hyp/pgtable.c         | 16 +++++++++++-----
>  arch/arm64/kvm/mmu.c                 | 27 ++++++++++++++++++++++-----
>  4 files changed, 42 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 4cd6762bda80..d3166b6e6329 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -150,7 +150,8 @@ enum kvm_pgtable_stage2_flags {
>   * @KVM_PGTABLE_PROT_X:		Execute permission.
>   * @KVM_PGTABLE_PROT_W:		Write permission.
>   * @KVM_PGTABLE_PROT_R:		Read permission.
> - * @KVM_PGTABLE_PROT_DEVICE:	Device attributes.
> + * @KVM_PGTABLE_PROT_DEVICE_nGnRE:	Device nGnRE attributes.
> + * @KVM_PGTABLE_PROT_DEVICE_nGnRnE:	Device nGnRnE attributes.
>   * @KVM_PGTABLE_PROT_SW0:	Software bit 0.
>   * @KVM_PGTABLE_PROT_SW1:	Software bit 1.
>   * @KVM_PGTABLE_PROT_SW2:	Software bit 2.
> @@ -161,7 +162,8 @@ enum kvm_pgtable_prot {
>  	KVM_PGTABLE_PROT_W			= BIT(1),
>  	KVM_PGTABLE_PROT_R			= BIT(2),
>  
> -	KVM_PGTABLE_PROT_DEVICE			= BIT(3),
> +	KVM_PGTABLE_PROT_DEVICE_nGnRE		= BIT(3),
> +	KVM_PGTABLE_PROT_DEVICE_nGnRnE		= BIT(4),
>  
>  	KVM_PGTABLE_PROT_SW0			= BIT(55),
>  	KVM_PGTABLE_PROT_SW1			= BIT(56),
> @@ -178,7 +180,7 @@ enum kvm_pgtable_prot {
>  #define PAGE_HYP		KVM_PGTABLE_PROT_RW
>  #define PAGE_HYP_EXEC		(KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_X)
>  #define PAGE_HYP_RO		(KVM_PGTABLE_PROT_R)
> -#define PAGE_HYP_DEVICE		(PAGE_HYP | KVM_PGTABLE_PROT_DEVICE)
> +#define PAGE_HYP_DEVICE		(PAGE_HYP | KVM_PGTABLE_PROT_DEVICE_nGnRE)
>  
>  typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
>  					   enum kvm_pgtable_prot prot);
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index 78e5163836a0..4ebbc4b1ba4d 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -147,14 +147,16 @@
>   * Memory types for Stage-2 translation
>   */
>  #define MT_S2_NORMAL		0xf
> +#define MT_S2_DEVICE_nGnRnE 0x0
>  #define MT_S2_DEVICE_nGnRE	0x1
>  
>  /*
>   * Memory types for Stage-2 translation when ID_AA64MMFR2_EL1.FWB is 0001
>   * Stage-2 enforces Normal-WB and Device-nGnRE
>   */
> -#define MT_S2_FWB_NORMAL	6
> -#define MT_S2_FWB_DEVICE_nGnRE	1
> +#define MT_S2_FWB_NORMAL	0x6
> +#define MT_S2_FWB_DEVICE_nGnRnE 0x0
> +#define MT_S2_FWB_DEVICE_nGnRE	0x1

Why the repainting of perfectly valid constants? What does the 0x
prefix bring? Does the comment above need updating?

>  
>  #ifdef CONFIG_ARM64_4K_PAGES
>  #define IOREMAP_MAX_ORDER	(PUD_SHIFT)
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 3d61bd3e591d..7a8238b41590 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -355,7 +355,7 @@ struct hyp_map_data {
>  
>  static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
>  {
> -	bool device = prot & KVM_PGTABLE_PROT_DEVICE;
> +	bool device = prot & KVM_PGTABLE_PROT_DEVICE_nGnRE;
>  	u32 mtype = device ? MT_DEVICE_nGnRE : MT_NORMAL;
>  	kvm_pte_t attr = FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX, mtype);
>  	u32 sh = KVM_PTE_LEAF_ATTR_LO_S1_SH_IS;
> @@ -636,14 +636,20 @@ static bool stage2_has_fwb(struct kvm_pgtable *pgt)
>  static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot prot,
>  				kvm_pte_t *ptep)
>  {
> -	bool device = prot & KVM_PGTABLE_PROT_DEVICE;
> -	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) :
> -			    KVM_S2_MEMATTR(pgt, NORMAL);
>  	u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;
> +	kvm_pte_t attr;
> +
> +	if (prot & KVM_PGTABLE_PROT_DEVICE_nGnRE)
> +		attr = KVM_S2_MEMATTR(pgt, DEVICE_nGnRE);
> +	else if (prot & KVM_PGTABLE_PROT_DEVICE_nGnRnE)
> +		attr = KVM_S2_MEMATTR(pgt, DEVICE_nGnRnE);
> +	else
> +		attr = KVM_S2_MEMATTR(pgt, NORMAL);
>  
>  	if (!(prot & KVM_PGTABLE_PROT_X))
>  		attr |= KVM_PTE_LEAF_ATTR_HI_S2_XN;
> -	else if (device)
> +	else if (prot & KVM_PGTABLE_PROT_DEVICE_nGnRE ||
> +		 prot & KVM_PGTABLE_PROT_DEVICE_nGnRnE)

Why don't you keep 'device', which is concise and readable, and simply
change the way it is assigned?

>  		return -EINVAL;
>  
>  	if (prot & KVM_PGTABLE_PROT_R)
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 7113587222ff..8d63aa951c33 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -897,7 +897,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  	int ret = 0;
>  	struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
>  	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> -	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> +	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE_nGnRE |
>  				     KVM_PGTABLE_PROT_R |
>  				     (writable ? KVM_PGTABLE_PROT_W : 0);
>  
> @@ -1186,6 +1186,15 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_MTE_ALLOWED;
>  }
>  
> +/*
> + * Determine the memory region cacheability from VMA's pgprot. This
> + * is used to set the stage 2 PTEs.
> + */
> +static unsigned long mapping_type(pgprot_t page_prot)
> +{
> +	return ((pgprot_val(page_prot) & PTE_ATTRINDX_MASK) >> 2);

Please use FIELD_GET() for field extraction.

> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
> @@ -1368,10 +1377,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (exec_fault)
>  		prot |= KVM_PGTABLE_PROT_X;
>  
> -	if (device)
> -		prot |= KVM_PGTABLE_PROT_DEVICE;
> -	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> -		prot |= KVM_PGTABLE_PROT_X;
> +	switch (mapping_type(vma->vm_page_prot)) {
> +	case MT_DEVICE_nGnRE:
> +		prot |= KVM_PGTABLE_PROT_DEVICE_nGnRE;
> +		break;
> +	case MT_DEVICE_nGnRnE:
> +		prot |= KVM_PGTABLE_PROT_DEVICE_nGnRnE;
> +		break;

Please keep the 'device' guard. It makes it easy to find the code
paths that are relevant to this case.

> +		/* MT_NORMAL/_TAGGED/_NC */
> +	default:
> +		if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> +			prot |= KVM_PGTABLE_PROT_X;
> +	}
>  
>  	/*
>  	 * Under the premise of getting a FSC_PERM fault, we just need to relax

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible
  2023-04-12 12:28 ` Marc Zyngier
@ 2023-04-12 12:53   ` Jason Gunthorpe
  2023-04-13  9:52     ` Marc Zyngier
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2023-04-12 12:53 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: ankita, alex.williamson, naoya.horiguchi, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

On Wed, Apr 12, 2023 at 01:28:08PM +0100, Marc Zyngier wrote:
> On Wed, 05 Apr 2023 19:01:28 +0100,
> <ankita@nvidia.com> wrote:
> > 
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> > for the on-chip GPU that is the logical OS representation of the
> > internal propritary cache coherent interconnect.
> > 
> > This representation has a number of limitations compared to a real PCI
> > device, in particular, it does not model the coherent GPU memory
> > aperture as a PCI config space BAR, and PCI doesn't know anything
> > about cacheable memory types.
> > 
> > Provide a VFIO PCI variant driver that adapts the unique PCI
> > representation into a more standard PCI representation facing
> > userspace. The GPU memory aperture is obtained from ACPI, according to
> > the FW specification, and exported to userspace as the VFIO_REGION
> > that covers the first PCI BAR. qemu will naturally generate a PCI
> > device in the VM where the cacheable aperture is reported in BAR1.
> > 
> > Since this memory region is actually cache coherent with the CPU, the
> > VFIO variant driver will mmap it into VMA using a cacheable mapping.
> > 
> > As this is the first time an ARM environment has placed cacheable
> > non-struct page backed memory (eg from remap_pfn_range) into a KVM
> > page table, fix a bug in ARM KVM where it does not copy the cacheable
> > memory attributes from non-struct page backed PTEs to ensure the guest
> > also gets a cacheable mapping.
> 
> This is not a bug, but a conscious design decision. As you pointed out
> above, nothing needed this until now, and a device mapping is the only
> safe thing to do as we know exactly *nothing* about the memory that
> gets mapped.

IMHO, from the mm perspective, the bug is using pfn_is_map_memory() to
determine the cachability or device memory status of a PFN in a
VMA. That is not what that API is for.

The cachability should be determined by the pgprot bits in the VMA.

VM_IO is the flag that says the VMA maps memory with side-effects.

I understand in ARM KVM it is not allowed for the VM and host to have
different cachability, so mis-detecting host cachable memory and
making it forced non-cachable in the VM is not a safe thing to do?

Jason


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 1/6] kvm: determine memory type from VMA
  2023-04-12 12:43   ` Marc Zyngier
@ 2023-04-12 13:01     ` Jason Gunthorpe
  2023-05-31 11:35       ` Catalin Marinas
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2023-04-12 13:01 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: ankita, alex.williamson, naoya.horiguchi, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

On Wed, Apr 12, 2023 at 01:43:26PM +0100, Marc Zyngier wrote:

> What makes it safe? How does VFIO ensures that the memory type used is
> correct in all circumstances? This has to hold for *ANY* device, not
> just your favourite toy of the day. Nothing in this patch is horribly
> wrong, but the above question must be answered before we can consider
> any of this.

In VFIO we now have the concept of "variant drivers" which work with
specific PCI IDs. The variant drivers can inject device specific
knowledge into VFIO. In this series the driver injects the cachable
pgprot when it creates some of the VMAs because it knows the PCI IDs
it supports, parses the ACPI description, and knows for sure that the
memory it puts in the cachable VMA is linked with a cache coherent
interconnect.

The generic vfio-pci path is not changed, so 'any device' is not
relevant here.

Jason


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible
  2023-04-12 12:53   ` Jason Gunthorpe
@ 2023-04-13  9:52     ` Marc Zyngier
  2023-04-13 13:19       ` Jason Gunthorpe
  0 siblings, 1 reply; 31+ messages in thread
From: Marc Zyngier @ 2023-04-13  9:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: ankita, alex.williamson, naoya.horiguchi, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

On Wed, 12 Apr 2023 13:53:07 +0100,
Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> On Wed, Apr 12, 2023 at 01:28:08PM +0100, Marc Zyngier wrote:
> > On Wed, 05 Apr 2023 19:01:28 +0100,
> > <ankita@nvidia.com> wrote:
> > > 
> > > From: Ankit Agrawal <ankita@nvidia.com>
> > > 
> > > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> > > for the on-chip GPU that is the logical OS representation of the
> > > internal propritary cache coherent interconnect.
> > > 
> > > This representation has a number of limitations compared to a real PCI
> > > device, in particular, it does not model the coherent GPU memory
> > > aperture as a PCI config space BAR, and PCI doesn't know anything
> > > about cacheable memory types.
> > > 
> > > Provide a VFIO PCI variant driver that adapts the unique PCI
> > > representation into a more standard PCI representation facing
> > > userspace. The GPU memory aperture is obtained from ACPI, according to
> > > the FW specification, and exported to userspace as the VFIO_REGION
> > > that covers the first PCI BAR. qemu will naturally generate a PCI
> > > device in the VM where the cacheable aperture is reported in BAR1.
> > > 
> > > Since this memory region is actually cache coherent with the CPU, the
> > > VFIO variant driver will mmap it into VMA using a cacheable mapping.
> > > 
> > > As this is the first time an ARM environment has placed cacheable
> > > non-struct page backed memory (eg from remap_pfn_range) into a KVM
> > > page table, fix a bug in ARM KVM where it does not copy the cacheable
> > > memory attributes from non-struct page backed PTEs to ensure the guest
> > > also gets a cacheable mapping.
> > 
> > This is not a bug, but a conscious design decision. As you pointed out
> > above, nothing needed this until now, and a device mapping is the only
> > safe thing to do as we know exactly *nothing* about the memory that
> > gets mapped.
> 
> IMHO, from the mm perspective, the bug is using pfn_is_map_memory() to
> determine the cachability or device memory status of a PFN in a
> VMA. That is not what that API is for.

It is the right API for what KVM/arm64 has been designed for. RAM gets
a normal memory mapping, and everything else gets device. That may not
suit your *new* use case, but that doesn't make it broken.

> 
> The cachability should be determined by the pgprot bits in the VMA.
> 
> VM_IO is the flag that says the VMA maps memory with side-effects.
> 
> I understand in ARM KVM it is not allowed for the VM and host to have
> different cachability, so mis-detecting host cachable memory and
> making it forced non-cachable in the VM is not a safe thing to do?

Only if you insist on not losing coherency between the two aliases
used at the same time (something that would seem pretty improbable).
And said coherency can be restored by using CMOs, as documented in
B2.8.

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible
  2023-04-13  9:52     ` Marc Zyngier
@ 2023-04-13 13:19       ` Jason Gunthorpe
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2023-04-13 13:19 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: ankita, alex.williamson, naoya.horiguchi, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

On Thu, Apr 13, 2023 at 10:52:10AM +0100, Marc Zyngier wrote:

> > IMHO, from the mm perspective, the bug is using pfn_is_map_memory() to
> > determine the cachability or device memory status of a PFN in a
> > VMA. That is not what that API is for.
> 
> It is the right API for what KVM/arm64 has been designed for. RAM gets
> a normal memory mapping, and everything else gets device. 

The MM has a pretty flexible definition of "RAM" these days. For
instance, I don't think pfn_is_map_memory() works correctly for all
the cases we can do now with devm_memremap_pages().

> That may not suit your *new* use case, but that doesn't make it
> broken.

I've now spent alot of time working on improving VFIO and the related
ecosystem. I would to get to a point where we have a consistent VFIO
experience on all the platforms.

Currently, real NIC and GPU HW with wide VFIO deployments on x86 do
not work fully correctly on KVM/arm64. write-combining in the VM is
the big problem for existing HW, and this new CXL-like stuff has
problems with cachability.

I don't really care what we call it, as long as we can agree that VFIO
devices not working fully in VMs is a problem that should be fixed.

> Only if you insist on not losing coherency between the two aliases
> used at the same time (something that would seem pretty improbable).

This is VFIO so there is DMA involved. My understanding has been that
the SMMU is allowed to pull data out of the cache. So if the
hypervisor cachable side has pulled a line into cache and the VM
uncached side dirtied the physical memory, it is allowed that SMMU
will read stale cache data? Thus the VM will experience data
corruption on its DMAs.

With VFIO live migration I expect the hypervisor qemu side to be
actively reading from the cachable memory while the VM is running to
migrate it, so it does not seem improbable.

Jason


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages
  2023-04-05 18:01 ` [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages ankita
  2023-04-05 21:07   ` kernel test robot
@ 2023-05-09  9:51   ` HORIGUCHI NAOYA(堀口 直也)
  2023-05-15 11:18     ` Ankit Agrawal
  1 sibling, 1 reply; 31+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2023-05-09  9:51 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, maz, oliver.upton, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw,
	kvm, linux-kernel, linux-arm-kernel, linux-mm


On Wed, Apr 05, 2023 at 11:01:31AM -0700, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The kernel MM does not currently handle ECC errors / poison on a memory
> region that is not backed by struct pages. In this series, mapping request
> from QEMU to the device memory is executed using remap_pfn_range().
> Hence added a new mechanism to handle memory failure on such memory.
> 
> Make kernel MM expose a function to allow modules managing the device
> memory to register a failure function and the address space that is
> associated with the device memory. MM maintains this information as
> interval tree. The registered memory failure function is used by MM to
> notify the module of the PFN, so that the module may take any required
> action. The module for example may use the information to track the
> poisoned pages.
> 
> In this implementation, kernel MM follows the following sequence (mostly)
> similar to the memory_failure() handler for struct page backed memory:
> 1. memory_failure() is triggered on reception of a poison error. An
> absence of struct page is detected and consequently memory_failure_pfn
> is executed.
> 2. memory_failure_pfn() call the newly introduced failure handler exposed
> by the module managing the poisoned memory to notify it of the problematic
> PFN.
> 3. memory_failure_pfn() unmaps the stage-2 mapping to the PFN.
> 4. memory_failure_pfn() collects the processes mapped to the PFN.
> 5. memory_failure_pfn() sends SIGBUS (BUS_MCEERR_AO) to all the processes
> mapping the faulty PFN using kill_procs().
> 6. An access to the faulty PFN by an operation in VM at a later point of
> time is trapped and user_mem_abort() is called.
> 7. user_mem_abort() calls __gfn_to_pfn_memslot() on the PFN, and the
> following execution path is followed: __gfn_to_pfn_memslot() ->
> hva_to_pfn() -> hva_to_pfn_remapped() -> fixup_user_fault() ->
> handle_mm_fault() -> handle_pte_fault() -> do_fault(). do_fault() is
> expected to return VM_FAULT_HWPOISON on the PFN (it currently does not
> and is fixed as part of another patch in the series).
> 8. __gfn_to_pfn_memslot() then returns KVM_PFN_ERR_HWPOISON, which cause
> the poison with SIGBUS (BUS_MCEERR_AR) to be sent to the QEMU process
> through kvm_send_hwpoison_signal().
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  include/linux/memory-failure.h |  22 +++++
>  include/linux/mm.h             |   1 +
>  include/ras/ras_event.h        |   1 +
>  mm/memory-failure.c            | 148 +++++++++++++++++++++++++++++----
>  4 files changed, 154 insertions(+), 18 deletions(-)
>  create mode 100644 include/linux/memory-failure.h
> 
> diff --git a/include/linux/memory-failure.h b/include/linux/memory-failure.h
> new file mode 100644
> index 000000000000..9a579960972a
> --- /dev/null
> +++ b/include/linux/memory-failure.h
> @@ -0,0 +1,22 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_FAILURE_H
> +#define _LINUX_MEMORY_FAILURE_H
> +
> +#include <linux/interval_tree.h>
> +
> +struct pfn_address_space;
> +
> +struct pfn_address_space_ops {
> +	void (*failure)(struct pfn_address_space *pfn_space, unsigned long pfn);
> +};
> +
> +struct pfn_address_space {
> +	struct interval_tree_node node;
> +	const struct pfn_address_space_ops *ops;
> +	struct address_space *mapping;
> +};
> +
> +int register_pfn_address_space(struct pfn_address_space *pfn_space);
> +void unregister_pfn_address_space(struct pfn_address_space *pfn_space);
> +
> +#endif /* _LINUX_MEMORY_FAILURE_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1f79667824eb..e3ef52d3d45a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3530,6 +3530,7 @@ enum mf_action_page_type {
>  	MF_MSG_BUDDY,
>  	MF_MSG_DAX,
>  	MF_MSG_UNSPLIT_THP,
> +	MF_MSG_PFN,

Personally, the keyword "PFN" looks to me a little too generic, so I prefer
"PFNMAP" or "PFN_MAP" because memory_failure() is anyway called with pfn
regardless of being backed by struct page.

>  	MF_MSG_UNKNOWN,
>  };
>  
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index cbd3ddd7c33d..5c62a4d17172 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -373,6 +373,7 @@ TRACE_EVENT(aer_event,
>  	EM ( MF_MSG_BUDDY, "free buddy page" )				\
>  	EM ( MF_MSG_DAX, "dax page" )					\
>  	EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" )			\
> +	EM ( MF_MSG_PFN, "non struct page pfn" )					\
>  	EMe ( MF_MSG_UNKNOWN, "unknown page" )
>  
>  /*
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fae9baf3be16..2c1a2ec42f7b 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -38,6 +38,7 @@
>  
>  #include <linux/kernel.h>
>  #include <linux/mm.h>
> +#include <linux/memory-failure.h>
>  #include <linux/page-flags.h>
>  #include <linux/kernel-page-flags.h>
>  #include <linux/sched/signal.h>
> @@ -62,6 +63,7 @@
>  #include <linux/page-isolation.h>
>  #include <linux/pagewalk.h>
>  #include <linux/shmem_fs.h>
> +#include <linux/pfn_t.h>
>  #include "swap.h"
>  #include "internal.h"
>  #include "ras/ras_event.h"
> @@ -122,6 +124,10 @@ const struct attribute_group memory_failure_attr_group = {
>  	.attrs = memory_failure_attr,
>  };
>  
> +static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
> +
> +static DEFINE_MUTEX(pfn_space_lock);
> +
>  /*
>   * Return values:
>   *   1:   the page is dissolved (if needed) and taken off from buddy,
> @@ -399,15 +405,14 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
>   * Schedule a process for later kill.
>   * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
>   *
> - * Note: @fsdax_pgoff is used only when @p is a fsdax page and a
> - * filesystem with a memory failure handler has claimed the
> - * memory_failure event. In all other cases, page->index and
> - * page->mapping are sufficient for mapping the page back to its
> + * Notice: @pgoff is used either when @p is a fsdax page or a PFN is not
> + * backed by struct page and a filesystem with a memory failure handler
> + * has claimed the memory_failure event. In all other cases, page->index

You add a new case using @pgoff, and now we have 3 such cases, so could you
update the comment to itemize them (which makes it easier to read and update)?

> + * and page->mapping are sufficient for mapping the page back to its
>   * corresponding user virtual address.
>   */
> -static void add_to_kill(struct task_struct *tsk, struct page *p,
> -			pgoff_t fsdax_pgoff, struct vm_area_struct *vma,
> -			struct list_head *to_kill)
> +static void add_to_kill(struct task_struct *tsk, struct page *p, pgoff_t pgoff,
> +			struct vm_area_struct *vma, struct list_head *to_kill)
>  {
>  	struct to_kill *tk;
>  
> @@ -417,13 +422,20 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
>  		return;
>  	}
>  
> -	tk->addr = page_address_in_vma(p, vma);
> -	if (is_zone_device_page(p)) {
> -		if (fsdax_pgoff != FSDAX_INVALID_PGOFF)
> -			tk->addr = vma_pgoff_address(fsdax_pgoff, 1, vma);
> +	if (vma->vm_flags | PFN_MAP) {
> +		tk->addr =
> +			vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +		tk->size_shift = PAGE_SHIFT;
> +	} else if (is_zone_device_page(p)) {
> +		if (pgoff != FSDAX_INVALID_PGOFF)
> +			tk->addr = vma_pgoff_address(pgoff, 1, vma);
> +		else
> +			tk->addr = page_address_in_vma(p, vma);
>  		tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
> -	} else
> +	} else {
> +		tk->addr = page_address_in_vma(p, vma);
>  		tk->size_shift = page_shift(compound_head(p));
> +	}
>  
>  	/*
>  	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
> @@ -617,13 +629,12 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>  	i_mmap_unlock_read(mapping);
>  }
>  
> -#ifdef CONFIG_FS_DAX

It seems that your new driver is built only in limited configuration/architecture,
so loosening the condition instead of simply removing might be better.

>  /*
>   * Collect processes when the error hit a fsdax page.
>   */
> -static void collect_procs_fsdax(struct page *page,
> -		struct address_space *mapping, pgoff_t pgoff,
> -		struct list_head *to_kill)
> +static void collect_procs_pgoff(struct page *page,
> +				struct address_space *mapping, pgoff_t pgoff,
> +				struct list_head *to_kill)
>  {
>  	struct vm_area_struct *vma;
>  	struct task_struct *tsk;
> @@ -643,7 +654,6 @@ static void collect_procs_fsdax(struct page *page,
>  	read_unlock(&tasklist_lock);
>  	i_mmap_unlock_read(mapping);
>  }
> -#endif /* CONFIG_FS_DAX */
>  
>  /*
>   * Collect the processes who have the corrupted page mapped to kill.
> @@ -835,6 +845,7 @@ static const char * const action_page_types[] = {
>  	[MF_MSG_BUDDY]			= "free buddy page",
>  	[MF_MSG_DAX]			= "dax page",
>  	[MF_MSG_UNSPLIT_THP]		= "unsplit thp",
> +	[MF_MSG_PFN]			= "non struct page pfn",
>  	[MF_MSG_UNKNOWN]		= "unknown page",
>  };
>  
> @@ -1745,7 +1756,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>  
>  		SetPageHWPoison(page);
>  
> -		collect_procs_fsdax(page, mapping, index, &to_kill);
> +		collect_procs_pgoff(page, mapping, index, &to_kill);
>  		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>  				index, mf_flags);
>  unlock:
> @@ -2052,6 +2063,99 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  	return rc;
>  }
>  
> +/**
> + * register_pfn_address_space - Register PA region for poison notification.
> + * @pfn_space: structure containing region range and callback function on
> + *             poison detection.
> + *
> + * This function is called by a kernel module to register a PA region and
> + * a callback function with the kernel. On detection of poison, the
> + * kernel code will go through all registered regions and call the
> + * appropriate callback function associated with the range. The kernel
> + * module is responsible for tracking the poisoned pages.
> + *
> + * Return: 0 if successfully registered,
> + *         -EBUSY if the region is already registered
> + */
> +int register_pfn_address_space(struct pfn_address_space *pfn_space)
> +{
> +	if (!request_mem_region(pfn_space->node.start << PAGE_SHIFT,
> +		(pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT, ""))
> +		return -EBUSY;
> +
> +	mutex_lock(&pfn_space_lock);
> +	interval_tree_insert(&pfn_space->node, &pfn_space_itree);
> +	mutex_unlock(&pfn_space_lock);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(register_pfn_address_space);

register_pfn_address_space and unregister_pfn_address_space are not compiled if
CONFIG_MEMORY_FAILURE is not set, so maybe your new driver might need to depend
on this config.

> +
> +/**
> + * unregister_pfn_address_space - Unregister a PA region from poison
> + * notification.
> + * @pfn_space: structure containing region range to be unregistered.
> + *
> + * This function is called by a kernel module to unregister the PA region
> + * from the kernel from poison tracking.
> + */
> +void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
> +{
> +	mutex_lock(&pfn_space_lock);
> +	interval_tree_remove(&pfn_space->node, &pfn_space_itree);
> +	mutex_unlock(&pfn_space_lock);
> +	release_mem_region(pfn_space->node.start << PAGE_SHIFT,
> +		(pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT);
> +}
> +EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
> +
> +static int memory_failure_pfn(unsigned long pfn, int flags)
> +{
> +	struct interval_tree_node *node;
> +	int rc = -EBUSY;
> +	LIST_HEAD(tokill);
> +
> +	mutex_lock(&pfn_space_lock);
> +	/*
> +	 * Modules registers with MM the address space mapping to the device memory they
> +	 * manage. Iterate to identify exactly which address space has mapped to this
> +	 * failing PFN.
> +	 */
> +	for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
> +	     node = interval_tree_iter_next(node, pfn, pfn)) {
> +		struct pfn_address_space *pfn_space =
> +			container_of(node, struct pfn_address_space, node);
> +		rc = 0;
> +
> +		/*
> +		 * Modules managing the device memory needs to be conveyed about the
> +		 * memory failure so that the poisoned PFN can be tracked.
> +		 */
> +		pfn_space->ops->failure(pfn_space, pfn);
> +
> +		collect_procs_pgoff(NULL, pfn_space->mapping, pfn, &tokill);
> +
> +		unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT,
> +				    PAGE_SIZE, 0);
> +	}
> +	mutex_unlock(&pfn_space_lock);

If rc == 0 at this point, the given pfn seems to be outside the GPU memory,
so that should be considered as "Invalid address" case whose check is removed
by patch 5/6.  So it might be better to sperate the case from "do handling
for non struct page pfn" case.

> +
> +	/*
> +	 * Unlike System-RAM there is no possibility to swap in a different
> +	 * physical page at a given virtual address, so all userspace
> +	 * consumption of direct PFN memory necessitates SIGBUS (i.e.
> +	 * MF_MUST_KILL)
> +	 */
> +	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> +	kill_procs(&tokill, true, false, pfn, flags);
> +
> +	pr_err("%#lx: recovery action for %s: %s\n",
> +			pfn, action_page_types[MF_MSG_PFN],
> +			action_name[rc ? MF_FAILED : MF_RECOVERED]);

Could you call action_result() to print out the summary line.
It has some other common things like accounting and tracepoint.

> +
> +	return rc;
> +}
> +
>  static DEFINE_MUTEX(mf_mutex);
>  
>  /**
> @@ -2093,6 +2197,11 @@ int memory_failure(unsigned long pfn, int flags)
>  	if (!(flags & MF_SW_SIMULATED))
>  		hw_memory_failure = true;
>  
> +	if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {
> +		res = memory_failure_pfn(pfn, flags);
> +		goto unlock_mutex;
> +	}

This might break exisiting corner case about DAX device, so maybe this should
be checked after confirming that pfn_to_online_page returns NULL.

> +
>  	p = pfn_to_online_page(pfn);
>  	if (!p) {
>  		res = arch_memory_failure(pfn, flags);
> @@ -2106,6 +2215,9 @@ int memory_failure(unsigned long pfn, int flags)
>  								 pgmap);
>  				goto unlock_mutex;
>  			}
> +
> +			res = memory_failure_pfn(pfn, flags);
> +			goto unlock_mutex;

This path is chosen when pfn_valid returns true, which cannot happen for GPU
memory's case?

Thanks,
Naoya Horiguchi

>  		}
>  		pr_err("%#lx: memory outside kernel control\n", pfn);
>  		res = -ENXIO;
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages
  2023-05-09  9:51   ` HORIGUCHI NAOYA(堀口 直也)
@ 2023-05-15 11:18     ` Ankit Agrawal
  2023-05-23  5:43       ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 31+ messages in thread
From: Ankit Agrawal @ 2023-05-15 11:18 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Jason Gunthorpe, alex.williamson, maz, oliver.upton,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
	Dan Williams, kvm, linux-kernel, linux-arm-kernel, linux-mm

Thanks, Naoya for reviewing the patch. Comments inline.

> -----Original Message-----
> From: HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com>
> Sent: Tuesday, May 9, 2023 2:51 AM
> To: Ankit Agrawal <ankita@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; alex.williamson@redhat.com;
> maz@kernel.org; oliver.upton@linux.dev; Aniket Agashe
> <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Vikram Sethi <vsethi@nvidia.com>; Andy Currid <acurrid@nvidia.com>;
> Alistair Popple <apopple@nvidia.com>; John Hubbard
> <jhubbard@nvidia.com>; Dan Williams <danw@nvidia.com>;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; linux-mm@kvack.org
> Subject: Re: [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages
> 
> External email: Use caution opening links or attachments
> 
> 
> On Wed, Apr 05, 2023 at 11:01:31AM -0700, ankita@nvidia.com wrote:
> > From: Ankit Agrawal <ankita@nvidia.com>
> >
> > The kernel MM does not currently handle ECC errors / poison on a
> > memory region that is not backed by struct pages. In this series,
> > mapping request from QEMU to the device memory is executed using
> remap_pfn_range().
> > Hence added a new mechanism to handle memory failure on such memory.
> >
> > Make kernel MM expose a function to allow modules managing the device
> > memory to register a failure function and the address space that is
> > associated with the device memory. MM maintains this information as
> > interval tree. The registered memory failure function is used by MM to
> > notify the module of the PFN, so that the module may take any required
> > action. The module for example may use the information to track the
> > poisoned pages.
> >
> > In this implementation, kernel MM follows the following sequence
> > (mostly) similar to the memory_failure() handler for struct page backed
> memory:
> > 1. memory_failure() is triggered on reception of a poison error. An
> > absence of struct page is detected and consequently memory_failure_pfn
> > is executed.
> > 2. memory_failure_pfn() call the newly introduced failure handler
> > exposed by the module managing the poisoned memory to notify it of the
> > problematic PFN.
> > 3. memory_failure_pfn() unmaps the stage-2 mapping to the PFN.
> > 4. memory_failure_pfn() collects the processes mapped to the PFN.
> > 5. memory_failure_pfn() sends SIGBUS (BUS_MCEERR_AO) to all the
> > processes mapping the faulty PFN using kill_procs().
> > 6. An access to the faulty PFN by an operation in VM at a later point
> > of time is trapped and user_mem_abort() is called.
> > 7. user_mem_abort() calls __gfn_to_pfn_memslot() on the PFN, and the
> > following execution path is followed: __gfn_to_pfn_memslot() ->
> > hva_to_pfn() -> hva_to_pfn_remapped() -> fixup_user_fault() ->
> > handle_mm_fault() -> handle_pte_fault() -> do_fault(). do_fault() is
> > expected to return VM_FAULT_HWPOISON on the PFN (it currently does not
> > and is fixed as part of another patch in the series).
> > 8. __gfn_to_pfn_memslot() then returns KVM_PFN_ERR_HWPOISON,
> which
> > cause the poison with SIGBUS (BUS_MCEERR_AR) to be sent to the QEMU
> > process through kvm_send_hwpoison_signal().
> >
> > Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> > ---
> >  include/linux/memory-failure.h |  22 +++++
> >  include/linux/mm.h             |   1 +
> >  include/ras/ras_event.h        |   1 +
> >  mm/memory-failure.c            | 148 +++++++++++++++++++++++++++++----
> >  4 files changed, 154 insertions(+), 18 deletions(-)  create mode
> > 100644 include/linux/memory-failure.h
> >
> > diff --git a/include/linux/memory-failure.h
> > b/include/linux/memory-failure.h new file mode 100644 index
> > 000000000000..9a579960972a
> > --- /dev/null
> > +++ b/include/linux/memory-failure.h
> > @@ -0,0 +1,22 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */ #ifndef
> > +_LINUX_MEMORY_FAILURE_H #define _LINUX_MEMORY_FAILURE_H
> > +
> > +#include <linux/interval_tree.h>
> > +
> > +struct pfn_address_space;
> > +
> > +struct pfn_address_space_ops {
> > +     void (*failure)(struct pfn_address_space *pfn_space, unsigned
> > +long pfn); };
> > +
> > +struct pfn_address_space {
> > +     struct interval_tree_node node;
> > +     const struct pfn_address_space_ops *ops;
> > +     struct address_space *mapping;
> > +};
> > +
> > +int register_pfn_address_space(struct pfn_address_space *pfn_space);
> > +void unregister_pfn_address_space(struct pfn_address_space
> > +*pfn_space);
> > +
> > +#endif /* _LINUX_MEMORY_FAILURE_H */
> > diff --git a/include/linux/mm.h b/include/linux/mm.h index
> > 1f79667824eb..e3ef52d3d45a 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3530,6 +3530,7 @@ enum mf_action_page_type {
> >       MF_MSG_BUDDY,
> >       MF_MSG_DAX,
> >       MF_MSG_UNSPLIT_THP,
> > +     MF_MSG_PFN,
> 
> Personally, the keyword "PFN" looks to me a little too generic, so I prefer
> "PFNMAP" or "PFN_MAP" because memory_failure() is anyway called with
> pfn regardless of being backed by struct page.
> 

Makes sense. Will change to PFNMAP.

> >       MF_MSG_UNKNOWN,
> >  };
> >
> > diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h index
> > cbd3ddd7c33d..5c62a4d17172 100644
> > --- a/include/ras/ras_event.h
> > +++ b/include/ras/ras_event.h
> > @@ -373,6 +373,7 @@ TRACE_EVENT(aer_event,
> >       EM ( MF_MSG_BUDDY, "free buddy page" )                          \
> >       EM ( MF_MSG_DAX, "dax page" )                                   \
> >       EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" )                        \
> > +     EM ( MF_MSG_PFN, "non struct page pfn" )                                        \
> >       EMe ( MF_MSG_UNKNOWN, "unknown page" )
> >
> >  /*
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c index
> > fae9baf3be16..2c1a2ec42f7b 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -38,6 +38,7 @@
> >
> >  #include <linux/kernel.h>
> >  #include <linux/mm.h>
> > +#include <linux/memory-failure.h>
> >  #include <linux/page-flags.h>
> >  #include <linux/kernel-page-flags.h>
> >  #include <linux/sched/signal.h>
> > @@ -62,6 +63,7 @@
> >  #include <linux/page-isolation.h>
> >  #include <linux/pagewalk.h>
> >  #include <linux/shmem_fs.h>
> > +#include <linux/pfn_t.h>
> >  #include "swap.h"
> >  #include "internal.h"
> >  #include "ras/ras_event.h"
> > @@ -122,6 +124,10 @@ const struct attribute_group
> memory_failure_attr_group = {
> >       .attrs = memory_failure_attr,
> >  };
> >
> > +static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
> > +
> > +static DEFINE_MUTEX(pfn_space_lock);
> > +
> >  /*
> >   * Return values:
> >   *   1:   the page is dissolved (if needed) and taken off from buddy,
> > @@ -399,15 +405,14 @@ static unsigned long
> dev_pagemap_mapping_shift(struct vm_area_struct *vma,
> >   * Schedule a process for later kill.
> >   * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
> >   *
> > - * Note: @fsdax_pgoff is used only when @p is a fsdax page and a
> > - * filesystem with a memory failure handler has claimed the
> > - * memory_failure event. In all other cases, page->index and
> > - * page->mapping are sufficient for mapping the page back to its
> > + * Notice: @pgoff is used either when @p is a fsdax page or a PFN is
> > + not
> > + * backed by struct page and a filesystem with a memory failure
> > + handler
> > + * has claimed the memory_failure event. In all other cases,
> > + page->index
> 
> You add a new case using @pgoff, and now we have 3 such cases, so could
> you update the comment to itemize them (which makes it easier to read and
> update)?
> 

Sure, will do in the next version.

> > + * and page->mapping are sufficient for mapping the page back to its
> >   * corresponding user virtual address.
> >   */
> > -static void add_to_kill(struct task_struct *tsk, struct page *p,
> > -                     pgoff_t fsdax_pgoff, struct vm_area_struct *vma,
> > -                     struct list_head *to_kill)
> > +static void add_to_kill(struct task_struct *tsk, struct page *p, pgoff_t pgoff,
> > +                     struct vm_area_struct *vma, struct list_head
> > +*to_kill)
> >  {
> >       struct to_kill *tk;
> >
> > @@ -417,13 +422,20 @@ static void add_to_kill(struct task_struct *tsk,
> struct page *p,
> >               return;
> >       }
> >
> > -     tk->addr = page_address_in_vma(p, vma);
> > -     if (is_zone_device_page(p)) {
> > -             if (fsdax_pgoff != FSDAX_INVALID_PGOFF)
> > -                     tk->addr = vma_pgoff_address(fsdax_pgoff, 1, vma);
> > +     if (vma->vm_flags | PFN_MAP) {
> > +             tk->addr =
> > +                     vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > +             tk->size_shift = PAGE_SHIFT;
> > +     } else if (is_zone_device_page(p)) {
> > +             if (pgoff != FSDAX_INVALID_PGOFF)
> > +                     tk->addr = vma_pgoff_address(pgoff, 1, vma);
> > +             else
> > +                     tk->addr = page_address_in_vma(p, vma);
> >               tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
> > -     } else
> > +     } else {
> > +             tk->addr = page_address_in_vma(p, vma);
> >               tk->size_shift = page_shift(compound_head(p));
> > +     }
> >
> >       /*
> >        * Send SIGKILL if "tk->addr == -EFAULT". Also, as @@ -617,13
> > +629,12 @@ static void collect_procs_file(struct page *page, struct list_head
> *to_kill,
> >       i_mmap_unlock_read(mapping);
> >  }
> >
> > -#ifdef CONFIG_FS_DAX
> 
> It seems that your new driver is built only in limited
> configuration/architecture, so loosening the condition instead of simply
> removing might be better.
> 
> >  /*
> >   * Collect processes when the error hit a fsdax page.
> >   */
> > -static void collect_procs_fsdax(struct page *page,
> > -             struct address_space *mapping, pgoff_t pgoff,
> > -             struct list_head *to_kill)
> > +static void collect_procs_pgoff(struct page *page,
> > +                             struct address_space *mapping, pgoff_t pgoff,
> > +                             struct list_head *to_kill)
> >  {
> >       struct vm_area_struct *vma;
> >       struct task_struct *tsk;
> > @@ -643,7 +654,6 @@ static void collect_procs_fsdax(struct page *page,
> >       read_unlock(&tasklist_lock);
> >       i_mmap_unlock_read(mapping);
> >  }
> > -#endif /* CONFIG_FS_DAX */
> >
> >  /*
> >   * Collect the processes who have the corrupted page mapped to kill.
> > @@ -835,6 +845,7 @@ static const char * const action_page_types[] = {
> >       [MF_MSG_BUDDY]                  = "free buddy page",
> >       [MF_MSG_DAX]                    = "dax page",
> >       [MF_MSG_UNSPLIT_THP]            = "unsplit thp",
> > +     [MF_MSG_PFN]                    = "non struct page pfn",
> >       [MF_MSG_UNKNOWN]                = "unknown page",
> >  };
> >
> > @@ -1745,7 +1756,7 @@ int mf_dax_kill_procs(struct address_space
> > *mapping, pgoff_t index,
> >
> >               SetPageHWPoison(page);
> >
> > -             collect_procs_fsdax(page, mapping, index, &to_kill);
> > +             collect_procs_pgoff(page, mapping, index, &to_kill);
> >               unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> >                               index, mf_flags);
> >  unlock:
> > @@ -2052,6 +2063,99 @@ static int
> memory_failure_dev_pagemap(unsigned long pfn, int flags,
> >       return rc;
> >  }
> >
> > +/**
> > + * register_pfn_address_space - Register PA region for poison notification.
> > + * @pfn_space: structure containing region range and callback function on
> > + *             poison detection.
> > + *
> > + * This function is called by a kernel module to register a PA region
> > +and
> > + * a callback function with the kernel. On detection of poison, the
> > + * kernel code will go through all registered regions and call the
> > + * appropriate callback function associated with the range. The
> > +kernel
> > + * module is responsible for tracking the poisoned pages.
> > + *
> > + * Return: 0 if successfully registered,
> > + *         -EBUSY if the region is already registered
> > + */
> > +int register_pfn_address_space(struct pfn_address_space *pfn_space) {
> > +     if (!request_mem_region(pfn_space->node.start << PAGE_SHIFT,
> > +             (pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT,
> ""))
> > +             return -EBUSY;
> > +
> > +     mutex_lock(&pfn_space_lock);
> > +     interval_tree_insert(&pfn_space->node, &pfn_space_itree);
> > +     mutex_unlock(&pfn_space_lock);
> > +
> > +     return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(register_pfn_address_space);
> 
> register_pfn_address_space and unregister_pfn_address_space are not
> compiled if CONFIG_MEMORY_FAILURE is not set, so maybe your new driver
> might need to depend on this config.
> 

Yeah, that's right. Shall I put parts of code in the vfio-pci variant driver
related to poison handling as part of #ifdef CONFIG_MEMORY_FAILURE.
Or would you rather prefer I make NVGPU_VFIO_PCI config dependent
on CONFIG_MEMORY_FAILURE?

> > +
> > +/**
> > + * unregister_pfn_address_space - Unregister a PA region from poison
> > + * notification.
> > + * @pfn_space: structure containing region range to be unregistered.
> > + *
> > + * This function is called by a kernel module to unregister the PA
> > +region
> > + * from the kernel from poison tracking.
> > + */
> > +void unregister_pfn_address_space(struct pfn_address_space
> > +*pfn_space) {
> > +     mutex_lock(&pfn_space_lock);
> > +     interval_tree_remove(&pfn_space->node, &pfn_space_itree);
> > +     mutex_unlock(&pfn_space_lock);
> > +     release_mem_region(pfn_space->node.start << PAGE_SHIFT,
> > +             (pfn_space->node.last - pfn_space->node.start + 1) <<
> > +PAGE_SHIFT); } EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
> > +
> > +static int memory_failure_pfn(unsigned long pfn, int flags) {
> > +     struct interval_tree_node *node;
> > +     int rc = -EBUSY;
> > +     LIST_HEAD(tokill);
> > +
> > +     mutex_lock(&pfn_space_lock);
> > +     /*
> > +      * Modules registers with MM the address space mapping to the device
> memory they
> > +      * manage. Iterate to identify exactly which address space has mapped to
> this
> > +      * failing PFN.
> > +      */
> > +     for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
> > +          node = interval_tree_iter_next(node, pfn, pfn)) {
> > +             struct pfn_address_space *pfn_space =
> > +                     container_of(node, struct pfn_address_space, node);
> > +             rc = 0;
> > +
> > +             /*
> > +              * Modules managing the device memory needs to be conveyed
> about the
> > +              * memory failure so that the poisoned PFN can be tracked.
> > +              */
> > +             pfn_space->ops->failure(pfn_space, pfn);
> > +
> > +             collect_procs_pgoff(NULL, pfn_space->mapping, pfn,
> > + &tokill);
> > +
> > +             unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT,
> > +                                 PAGE_SIZE, 0);
> > +     }
> > +     mutex_unlock(&pfn_space_lock);
> 
> If rc == 0 at this point, the given pfn seems to be outside the GPU memory, so
> that should be considered as "Invalid address" case whose check is removed
> by patch 5/6.  So it might be better to sperate the case from "do handling for
> non struct page pfn" case.
> 

Sorry did you mean rc !=0 here? But yeah, you are right that I should add the
case for check in case a region with the desired PFN isn't found above.

> > +
> > +     /*
> > +      * Unlike System-RAM there is no possibility to swap in a different
> > +      * physical page at a given virtual address, so all userspace
> > +      * consumption of direct PFN memory necessitates SIGBUS (i.e.
> > +      * MF_MUST_KILL)
> > +      */
> > +     flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > +     kill_procs(&tokill, true, false, pfn, flags);
> > +
> > +     pr_err("%#lx: recovery action for %s: %s\n",
> > +                     pfn, action_page_types[MF_MSG_PFN],
> > +                     action_name[rc ? MF_FAILED : MF_RECOVERED]);
> 
> Could you call action_result() to print out the summary line.
> It has some other common things like accounting and tracepoint.
> 

Ack.

> > +
> > +     return rc;
> > +}
> > +
> >  static DEFINE_MUTEX(mf_mutex);
> >
> >  /**
> > @@ -2093,6 +2197,11 @@ int memory_failure(unsigned long pfn, int flags)
> >       if (!(flags & MF_SW_SIMULATED))
> >               hw_memory_failure = true;
> >
> > +     if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {
> > +             res = memory_failure_pfn(pfn, flags);
> > +             goto unlock_mutex;
> > +     }
> 
> This might break exisiting corner case about DAX device, so maybe this should
> be checked after confirming that pfn_to_online_page returns NULL.
> 

Sorry, it isn't clear to me which corner case could break here. Can the pfn_valid()
be false on a DAX  device page?

> > +
> >       p = pfn_to_online_page(pfn);
> >       if (!p) {
> >               res = arch_memory_failure(pfn, flags); @@ -2106,6
> > +2215,9 @@ int memory_failure(unsigned long pfn, int flags)
> >                                                                pgmap);
> >                               goto unlock_mutex;
> >                       }
> > +
> > +                     res = memory_failure_pfn(pfn, flags);
> > +                     goto unlock_mutex;
> 
> This path is chosen when pfn_valid returns true, which cannot happen for
> GPU memory's case?
> 

Good catch. That needs to be removed.

> Thanks,
> Naoya Horiguchi
> 
> >               }
> >               pr_err("%#lx: memory outside kernel control\n", pfn);
> >               res = -ENXIO;
> > --
> > 2.17.1

On a separate note, would you rather prefer that I separate out the poison
handling parts (i.e. 3/6 and 5/6) into a separate series? Or should I just keep the
whole series together?

Thanks
Ankit Agrawal



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages
  2023-05-15 11:18     ` Ankit Agrawal
@ 2023-05-23  5:43       ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 31+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2023-05-23  5:43 UTC (permalink / raw)
  To: Ankit Agrawal
  Cc: Jason Gunthorpe, alex.williamson, maz, oliver.upton,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
	Dan Williams, kvm, linux-kernel, linux-arm-kernel, linux-mm

On Mon, May 15, 2023 at 11:18:16AM +0000, Ankit Agrawal wrote:
> Thanks, Naoya for reviewing the patch. Comments inline.
> 
> > -----Original Message-----
> > From: HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com>
> > Sent: Tuesday, May 9, 2023 2:51 AM
> > To: Ankit Agrawal <ankita@nvidia.com>
> > Cc: Jason Gunthorpe <jgg@nvidia.com>; alex.williamson@redhat.com;
> > maz@kernel.org; oliver.upton@linux.dev; Aniket Agashe
> > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <acurrid@nvidia.com>;
> > Alistair Popple <apopple@nvidia.com>; John Hubbard
> > <jhubbard@nvidia.com>; Dan Williams <danw@nvidia.com>;
> > kvm@vger.kernel.org; linux-kernel@vger.kernel.org; linux-arm-
> > kernel@lists.infradead.org; linux-mm@kvack.org
> > Subject: Re: [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages
> > 
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Wed, Apr 05, 2023 at 11:01:31AM -0700, ankita@nvidia.com wrote:
> > > From: Ankit Agrawal <ankita@nvidia.com>
> > >
> > > The kernel MM does not currently handle ECC errors / poison on a
> > > memory region that is not backed by struct pages. In this series,
> > > mapping request from QEMU to the device memory is executed using
> > remap_pfn_range().
> > > Hence added a new mechanism to handle memory failure on such memory.
> > >
> > > Make kernel MM expose a function to allow modules managing the device
> > > memory to register a failure function and the address space that is
> > > associated with the device memory. MM maintains this information as
> > > interval tree. The registered memory failure function is used by MM to
> > > notify the module of the PFN, so that the module may take any required
> > > action. The module for example may use the information to track the
> > > poisoned pages.
> > >
> > > In this implementation, kernel MM follows the following sequence
> > > (mostly) similar to the memory_failure() handler for struct page backed
> > memory:
> > > 1. memory_failure() is triggered on reception of a poison error. An
> > > absence of struct page is detected and consequently memory_failure_pfn
> > > is executed.
> > > 2. memory_failure_pfn() call the newly introduced failure handler
> > > exposed by the module managing the poisoned memory to notify it of the
> > > problematic PFN.
> > > 3. memory_failure_pfn() unmaps the stage-2 mapping to the PFN.
> > > 4. memory_failure_pfn() collects the processes mapped to the PFN.
> > > 5. memory_failure_pfn() sends SIGBUS (BUS_MCEERR_AO) to all the
> > > processes mapping the faulty PFN using kill_procs().
> > > 6. An access to the faulty PFN by an operation in VM at a later point
> > > of time is trapped and user_mem_abort() is called.
> > > 7. user_mem_abort() calls __gfn_to_pfn_memslot() on the PFN, and the
> > > following execution path is followed: __gfn_to_pfn_memslot() ->
> > > hva_to_pfn() -> hva_to_pfn_remapped() -> fixup_user_fault() ->
> > > handle_mm_fault() -> handle_pte_fault() -> do_fault(). do_fault() is
> > > expected to return VM_FAULT_HWPOISON on the PFN (it currently does not
> > > and is fixed as part of another patch in the series).
> > > 8. __gfn_to_pfn_memslot() then returns KVM_PFN_ERR_HWPOISON,
> > which
> > > cause the poison with SIGBUS (BUS_MCEERR_AR) to be sent to the QEMU
> > > process through kvm_send_hwpoison_signal().
> > >
> > > Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> > > ---
> > >  include/linux/memory-failure.h |  22 +++++
> > >  include/linux/mm.h             |   1 +
> > >  include/ras/ras_event.h        |   1 +
> > >  mm/memory-failure.c            | 148 +++++++++++++++++++++++++++++----
> > >  4 files changed, 154 insertions(+), 18 deletions(-)  create mode
> > > 100644 include/linux/memory-failure.h
> > >
...

> > > @@ -2052,6 +2063,99 @@ static int
> > memory_failure_dev_pagemap(unsigned long pfn, int flags,
> > >       return rc;
> > >  }
> > >
> > > +/**
> > > + * register_pfn_address_space - Register PA region for poison notification.
> > > + * @pfn_space: structure containing region range and callback function on
> > > + *             poison detection.
> > > + *
> > > + * This function is called by a kernel module to register a PA region
> > > +and
> > > + * a callback function with the kernel. On detection of poison, the
> > > + * kernel code will go through all registered regions and call the
> > > + * appropriate callback function associated with the range. The
> > > +kernel
> > > + * module is responsible for tracking the poisoned pages.
> > > + *
> > > + * Return: 0 if successfully registered,
> > > + *         -EBUSY if the region is already registered
> > > + */
> > > +int register_pfn_address_space(struct pfn_address_space *pfn_space) {
> > > +     if (!request_mem_region(pfn_space->node.start << PAGE_SHIFT,
> > > +             (pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT,
> > ""))
> > > +             return -EBUSY;
> > > +
> > > +     mutex_lock(&pfn_space_lock);
> > > +     interval_tree_insert(&pfn_space->node, &pfn_space_itree);
> > > +     mutex_unlock(&pfn_space_lock);
> > > +
> > > +     return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(register_pfn_address_space);
> > 
> > register_pfn_address_space and unregister_pfn_address_space are not
> > compiled if CONFIG_MEMORY_FAILURE is not set, so maybe your new driver
> > might need to depend on this config.
> > 
> 
> Yeah, that's right. Shall I put parts of code in the vfio-pci variant driver
> related to poison handling as part of #ifdef CONFIG_MEMORY_FAILURE.
> Or would you rather prefer I make NVGPU_VFIO_PCI config dependent
> on CONFIG_MEMORY_FAILURE?

If the vfio-pci variant driver plans to provide features unrelated to poison
handling, "ifdef" option looks fine to me.  Otherwise, the second option could
be possible.  Both options look to me comparably reasonable.

> 
> > > +
> > > +/**
> > > + * unregister_pfn_address_space - Unregister a PA region from poison
> > > + * notification.
> > > + * @pfn_space: structure containing region range to be unregistered.
> > > + *
> > > + * This function is called by a kernel module to unregister the PA
> > > +region
> > > + * from the kernel from poison tracking.
> > > + */
> > > +void unregister_pfn_address_space(struct pfn_address_space
> > > +*pfn_space) {
> > > +     mutex_lock(&pfn_space_lock);
> > > +     interval_tree_remove(&pfn_space->node, &pfn_space_itree);
> > > +     mutex_unlock(&pfn_space_lock);
> > > +     release_mem_region(pfn_space->node.start << PAGE_SHIFT,
> > > +             (pfn_space->node.last - pfn_space->node.start + 1) <<
> > > +PAGE_SHIFT); } EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
> > > +
> > > +static int memory_failure_pfn(unsigned long pfn, int flags) {
> > > +     struct interval_tree_node *node;
> > > +     int rc = -EBUSY;
> > > +     LIST_HEAD(tokill);
> > > +
> > > +     mutex_lock(&pfn_space_lock);
> > > +     /*
> > > +      * Modules registers with MM the address space mapping to the device
> > memory they
> > > +      * manage. Iterate to identify exactly which address space has mapped to
> > this
> > > +      * failing PFN.
> > > +      */
> > > +     for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
> > > +          node = interval_tree_iter_next(node, pfn, pfn)) {
> > > +             struct pfn_address_space *pfn_space =
> > > +                     container_of(node, struct pfn_address_space, node);
> > > +             rc = 0;
> > > +
> > > +             /*
> > > +              * Modules managing the device memory needs to be conveyed
> > about the
> > > +              * memory failure so that the poisoned PFN can be tracked.
> > > +              */
> > > +             pfn_space->ops->failure(pfn_space, pfn);
> > > +
> > > +             collect_procs_pgoff(NULL, pfn_space->mapping, pfn,
> > > + &tokill);
> > > +
> > > +             unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT,
> > > +                                 PAGE_SIZE, 0);
> > > +     }
> > > +     mutex_unlock(&pfn_space_lock);
> > 
> > If rc == 0 at this point, the given pfn seems to be outside the GPU memory, so
> > that should be considered as "Invalid address" case whose check is removed
> > by patch 5/6.  So it might be better to sperate the case from "do handling for
> > non struct page pfn" case.
> > 
> 
> Sorry did you mean rc !=0 here? But yeah, you are right that I should add the
> case for check in case a region with the desired PFN isn't found above.
> 
> > > +
> > > +     /*
> > > +      * Unlike System-RAM there is no possibility to swap in a different
> > > +      * physical page at a given virtual address, so all userspace
> > > +      * consumption of direct PFN memory necessitates SIGBUS (i.e.
> > > +      * MF_MUST_KILL)
> > > +      */
> > > +     flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > > +     kill_procs(&tokill, true, false, pfn, flags);
> > > +
> > > +     pr_err("%#lx: recovery action for %s: %s\n",
> > > +                     pfn, action_page_types[MF_MSG_PFN],
> > > +                     action_name[rc ? MF_FAILED : MF_RECOVERED]);
> > 
> > Could you call action_result() to print out the summary line.
> > It has some other common things like accounting and tracepoint.
> > 
> 
> Ack.
> 
> > > +
> > > +     return rc;
> > > +}
> > > +
> > >  static DEFINE_MUTEX(mf_mutex);
> > >
> > >  /**
> > > @@ -2093,6 +2197,11 @@ int memory_failure(unsigned long pfn, int flags)
> > >       if (!(flags & MF_SW_SIMULATED))
> > >               hw_memory_failure = true;
> > >
> > > +     if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {
> > > +             res = memory_failure_pfn(pfn, flags);
> > > +             goto unlock_mutex;
> > > +     }
> > 
> > This might break exisiting corner case about DAX device, so maybe this should
> > be checked after confirming that pfn_to_online_page returns NULL.
> > 
> 
> Sorry, it isn't clear to me which corner case could break here. Can the pfn_valid()
> be false on a DAX  device page?

I thought that possibility, but I commented with relying on my vague/wrong memory, sorry.
pfn_valid() should always true for DAX device pages, so the above check should not break.

> 
> > > +
> > >       p = pfn_to_online_page(pfn);
> > >       if (!p) {
> > >               res = arch_memory_failure(pfn, flags); @@ -2106,6
> > > +2215,9 @@ int memory_failure(unsigned long pfn, int flags)
> > >                                                                pgmap);
> > >                               goto unlock_mutex;
> > >                       }
> > > +
> > > +                     res = memory_failure_pfn(pfn, flags);
> > > +                     goto unlock_mutex;
> > 
> > This path is chosen when pfn_valid returns true, which cannot happen for
> > GPU memory's case?
> > 
> 
> Good catch. That needs to be removed.
> 
> > Thanks,
> > Naoya Horiguchi
> > 
> > >               }
> > >               pr_err("%#lx: memory outside kernel control\n", pfn);
> > >               res = -ENXIO;
> > > --
> > > 2.17.1
> 
> On a separate note, would you rather prefer that I separate out the poison
> handling parts (i.e. 3/6 and 5/6) into a separate series? Or should I just keep the
> whole series together?

The new poison handling code is used after the vfio-pci driver is available,
so I think that they may as well be merged together.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/6] vfio/nvgpu: register device memory for poison handling
  2023-04-05 18:01 ` [PATCH v3 6/6] vfio/nvgpu: register device memory for poison handling ankita
  2023-04-05 20:24   ` Zhi Wang
  2023-04-05 21:50   ` kernel test robot
@ 2023-05-24  9:53   ` Dan Carpenter
  2 siblings, 0 replies; 31+ messages in thread
From: Dan Carpenter @ 2023-05-24  9:53 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, naoya.horiguchi, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kvm, linux-kernel, linux-arm-kernel, linux-mm

On Wed, Apr 05, 2023 at 11:01:34AM -0700, ankita@nvidia.com wrote:
> @@ -188,7 +277,20 @@ nvgpu_vfio_pci_fetch_memory_property(struct pci_dev *pdev,
>  
>  	ret = device_property_read_u64(&(pdev->dev), "nvidia,gpu-mem-size",
>  				       &(nvdev->mem_prop.mem_length));
> -	return ret;
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * A bitmap is maintained to teack the pages that are poisoned. Each
> +	 * page is represented by a bit. Allocation size in bytes is
> +	 * determined by shifting the device memory size by PAGE_SHIFT to
> +	 * determine the number of pages; and further shifted by 3 as each
> +	 * byte could track 8 pages.
> +	 */
> +	nvdev->mem_prop.pfn_bitmap
> +		= vzalloc(nvdev->mem_prop.mem_length >> (PAGE_SHIFT + 3));

This allocation needs a NULL check.

regards,
dan carpenter

> +
> +	return 0;
>  }



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 1/6] kvm: determine memory type from VMA
  2023-04-12 13:01     ` Jason Gunthorpe
@ 2023-05-31 11:35       ` Catalin Marinas
  2023-06-14 12:44         ` Jason Gunthorpe
  2023-07-14  8:10         ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 31+ messages in thread
From: Catalin Marinas @ 2023-05-31 11:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Marc Zyngier, ankita, alex.williamson, naoya.horiguchi,
	oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi,
	acurrid, apopple, jhubbard, danw, kvm, linux-kernel,
	linux-arm-kernel, linux-mm, Lorenzo Pieralisi

On Wed, Apr 12, 2023 at 10:01:33AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 12, 2023 at 01:43:26PM +0100, Marc Zyngier wrote:
> > What makes it safe? How does VFIO ensures that the memory type used is
> > correct in all circumstances? This has to hold for *ANY* device, not
> > just your favourite toy of the day. Nothing in this patch is horribly
> > wrong, but the above question must be answered before we can consider
> > any of this.
> 
> In VFIO we now have the concept of "variant drivers" which work with
> specific PCI IDs. The variant drivers can inject device specific
> knowledge into VFIO. In this series the driver injects the cachable
> pgprot when it creates some of the VMAs because it knows the PCI IDs
> it supports, parses the ACPI description, and knows for sure that the
> memory it puts in the cachable VMA is linked with a cache coherent
> interconnect.
> 
> The generic vfio-pci path is not changed, so 'any device' is not
> relevant here.

There were several off-list discussions, I'm trying to summarise my
understanding here. This series aims to relax the VFIO mapping to
cacheable and have KVM map it into the guest with the same attributes.
Somewhat related past threads also tried to relax the KVM device
pass-through mapping from Device_nGnRnE (pgprot_noncached) to Normal_NC
(pgprot_writecombine). Those were initially using the PCIe prefetchable
BAR attribute but that's not a reliable means to infer whether Normal vs
Device is safe. Anyway, I think we'd need to unify these threads and
come up with some common handling that can cater for various attributes
required by devices/drivers. Therefore replying in this thread.

Current status on arm64: vfio-pci user mapping (Qemu) of PCIe BARs is
Device_nGnRnE. KVM also maps it at Stage 2 with the same attributes. The
user mapping is fine since the VMM may not have detailed knowledge about
how to safely map the BAR. KVM doesn't have such detailed knowledge
either in the device pass-through case but for safety reasons it follows
the same attributes as the user mapping.

From a safety perspective, relaxing the KVM stage 2 mapping has some
implications:

1. It creates multiple memory aliases with different attributes.

2. Speculative loads, unaligned accesses.

3. Potentially new uncontained errors introduced by Normal NC vs Device
   mappings.

From the private discussions I had regarding point (3), Device doesn't
really make any difference and it can be even worse. For example, Normal
mappings would allow a contained error while Device would trigger an
uncontained SError. So the only safe bet here is for the platform, root
complex to behave properly when device pass-through is involved (e.g.
PCIe to ignore writes, return 0xff on reads if there are errors). That's
something Arm is working on to require in the BSA specs (base system
architecture). Presumably cloud vendors allowing device pass-through
already know their system well enough, no need for further policing in
KVM (which it can't do properly anyway).

As long as the errors are contained, point (2) becomes the
responsibility of the guest, given its detailed knowledge of the device,
using the appropriate attributes (whether x86 WC maps exactly onto arm64
Normal NC is the subject of a separate discussion; so far that's the
assumption we took in the arm64 kernel). Regarding vfio-pci, the user
mapping must remain Device_nGnRnE on arm64 to avoid unexpected
speculative loads.

This leaves us with (1) and since the vfio-pci user mapping must remain
Device, there's a potential mismatched alias if the guest maps the
corresponding BAR as Normal NC. Luckily the Arm ARM constrains the
hardware behaviour here to basically allowing the Device mapping in the
VMM to behave somewhat the Normal NC mapping in the guest. IOW, the VMM
loses the Device guarantees (non-speculation, ordering). That's not a
problem for the device since the guest already deemed such mapping to be
safe.

While I think it's less likely for the VMM to access the same BAR that
was mapped into the guest, if we want to be on the safe side from an ABI
perspective (the returned mmap() now has different memory guarantees),
we could make this an opt-in. That's pretty much the VMM stating that it
is ok with losing some of the Device guarantees for the vfio-pci
mapping. A question here is whether we want to this to be done at the
vfio-pci level, the KVM memory slot or a big knob per VMM. Or whether we
need to bother with this at all (if it's just theoretical, maybe we can
have a global opt-out).

Going back to this series, allowing Cacheable mapping in KVM has similar
implications as above. (2) and (3) would be assumed safe by the platform
vendors. Regarding (1), to avoid confusion, I would only allow it if FWB
(force write-back) is present so that KVM can enforce a cacheable
mapping at Stage 2 if the vfio variant driver also maps it as cacheable
in user space.

There were some other thoughts on only allowing different attributes in
KVM if the user counterpart does not have any mapping (e.g. fd-based
KVM memory slots). However, this won't work well with CXL-attached
memory for example where the VMM may want access to it (e.g. virtio). So
I wouldn't spend more thoughts on this.


The TL;DR summary of what I think we should do:

a) Relax the KVM Stage 2 mapping for vfio-pci to Normal NC
   (pgprot_writecombine). TBD whether we need a VMM opt-in is at the
   vfio-pci level, KVM slot or per-VMM level. Another option is to do
   this by default and have a global opt-out (cmdline). I don't think
   the latter is such a bad idea since I find it unlikely for the VMM to
   also touch the same PCIe BAR _and_ expect different memory ordering
   guarantees than the guest device driver.

b) Map the KVM stage 2 mapping as cacheable+FWB iff the VMM has a
   corresponding cacheable mapping.

Thoughts?

-- 
Catalin


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 1/6] kvm: determine memory type from VMA
  2023-05-31 11:35       ` Catalin Marinas
@ 2023-06-14 12:44         ` Jason Gunthorpe
  2023-07-14  8:10         ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2023-06-14 12:44 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, ankita, alex.williamson, naoya.horiguchi,
	oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi,
	acurrid, apopple, jhubbard, danw, kvm, linux-kernel,
	linux-arm-kernel, linux-mm, Lorenzo Pieralisi

On Wed, May 31, 2023 at 12:35:57PM +0100, Catalin Marinas wrote:

> Current status on arm64: vfio-pci user mapping (Qemu) of PCIe BARs is
> Device_nGnRnE. KVM also maps it at Stage 2 with the same attributes. The
> user mapping is fine since the VMM may not have detailed knowledge about
> how to safely map the BAR. KVM doesn't have such detailed knowledge
> either in the device pass-through case but for safety reasons it follows
> the same attributes as the user mapping.

To put a fine point on this - it means the KVM VM does not emulate the
same behaviors as a bare metal ARM system and this results in
significant performance degradation (loss of WC) or malfunction (loss
of cachable) when working with some VFIO device scenarios.

> architecture). Presumably cloud vendors allowing device pass-through
> already know their system well enough, no need for further policing in
> KVM (which it can't do properly anyway).

This also matches x86 pretty well. There is alot of platform and BIOS
magic at work to make sure that VFIO is safe in general. As this is
not discoverable by the OS we have no choice in the kernel but to
assume the operator knows what they are doing.

> While I think it's less likely for the VMM to access the same BAR that
> was mapped into the guest, if we want to be on the safe side from an ABI
> perspective (the returned mmap() now has different memory guarantees),
> we could make this an opt-in. That's pretty much the VMM stating that it
> is ok with losing some of the Device guarantees for the vfio-pci
> mapping. 

I don't know of any use case where this matters. The VMM is already
basically not allowed to touch the MMIO spaces for security
reasons. The only reason it is mmap'd into the VMM at all is because
this is the only way to get it into KVM.

So an opt-in seems like overkill to me.

> Going back to this series, allowing Cacheable mapping in KVM has similar
> implications as above. (2) and (3) would be assumed safe by the platform
> vendors. Regarding (1), to avoid confusion, I would only allow it if FWB
> (force write-back) is present so that KVM can enforce a cacheable
> mapping at Stage 2 if the vfio variant driver also maps it as cacheable
> in user space.

Yes, currently if FWB is not available this patch makes KVM crash :\
FWB needs to be enforced as well in this patch.

> There were some other thoughts on only allowing different attributes in
> KVM if the user counterpart does not have any mapping (e.g. fd-based
> KVM memory slots). However, this won't work well with CXL-attached
> memory for example where the VMM may want access to it (e.g. virtio). So
> I wouldn't spend more thoughts on this.

CXL is a different beast - struct page backed CXL memory should be
force-cachable everywhere just like normal system memory (it is normal
system memory). This is the kind of memory that might be mixed with
virtio.

However, some future CXL VFIO device memory should ideally allow the
VM full flexibility to use cachable or not to properly emulate
bare-metal. This approach where we pre-select some of the CXL memory
as cachable is kind of a half job..

I think providing a way for KVM to take in a FD instead of a VMA for
mapping memory could be interesting in general. We waste a fair amount
of PTEs creating a 40GB linear map in the CPU just for KVM. Avoiding
the alias mapping is a nice side effect.

We'd have to do something else for the SIGBUS stuff though..

> a) Relax the KVM Stage 2 mapping for vfio-pci to Normal NC
>    (pgprot_writecombine). TBD whether we need a VMM opt-in is at the

We are working on a tested patch for this, it isn't working yet for
some reason, still debugging.

> b) Map the KVM stage 2 mapping as cacheable+FWB iff the VMM has a
>    corresponding cacheable mapping.

That is this patch, with some improvements to check FWB/etc.

Thanks,
Jason


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 1/6] kvm: determine memory type from VMA
  2023-05-31 11:35       ` Catalin Marinas
  2023-06-14 12:44         ` Jason Gunthorpe
@ 2023-07-14  8:10         ` Benjamin Herrenschmidt
  2023-07-16 15:09           ` Catalin Marinas
  1 sibling, 1 reply; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2023-07-14  8:10 UTC (permalink / raw)
  To: Catalin Marinas, Jason Gunthorpe
  Cc: Marc Zyngier, ankita, alex.williamson, naoya.horiguchi,
	oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi,
	acurrid, apopple, jhubbard, danw, kvm, linux-kernel,
	linux-arm-kernel, linux-mm, Lorenzo Pieralisi, Clint Sbisa,
	osamaabb

On Wed, 2023-05-31 at 12:35 +0100, Catalin Marinas wrote:
> There were several off-list discussions, I'm trying to summarise my
> understanding here. This series aims to relax the VFIO mapping to
> cacheable and have KVM map it into the guest with the same attributes.
> Somewhat related past threads also tried to relax the KVM device
> pass-through mapping from Device_nGnRnE (pgprot_noncached) to Normal_NC
> (pgprot_writecombine). Those were initially using the PCIe prefetchable
> BAR attribute but that's not a reliable means to infer whether Normal vs
> Device is safe. Anyway, I think we'd need to unify these threads and
> come up with some common handling that can cater for various attributes
> required by devices/drivers. Therefore replying in this thread.

So picking up on this as I was just trying to start a separate
discussion
on the subject for write combine :-)

In this case, not so much for KVM as much as for VFIO to userspace
though.

The rough idea is that the "userspace driver" (ie DPDK or equivalent)
for the device is the one to "know" wether a BAR or portion of a BAR
can/should be mapped write-combine, and is expected to also "know"
what to do to enforce ordering when necessary.

So the userspace component needs to be responsible for selecting the
mapping, the same way using the PCI sysfs resource files today allows
to do that by selecting the _wc variant.

I posted a separate message that Lorenzo CCed back to some of you, but
let's recap here to keep the discussion localized.

I don't know how much of this makes sense for KVM, but I think what we
really want is for userspace to be able to specify some "attributes"
(which we can initially limit to writecombine, full cachability
probably requires a device specific kernel driver providing adequate
authority, separate discussion in any case), for all or a portion of a
BAR mapping.

The easy way is an ioctl to affect the attributes of the next mmap but
it's a rather gross interface.

A better approach (still requires some coordination but not nearly as
bad) would be to have an ioctl to create "subregions", ie, dynamically
add new "struct vfio_pci_region" (using the existing dynamic index
API), which are children of existing regions (including real BARs) and
provide different attributes, which mmap can then honor.

This is particularly suited for the case (which used to exist, I don't
know if it still does) where the buffer that wants write combining
reside in the same BAR as registers that otherwise don't.

A simpler compromise if that latter case is deemed irrelevant would be
an ioctl to selectively set a region index (including BARs) to be WC
prior to mmap.

I don't know if that fits in the ideas you have for KVM, I think it
could by having the userspace component require mappings using a
"special" attribute which we could define as being the most relaxed
allowed to pass to a VM, which can then be architecture defined. The
guest can then enforce specifics. Does this make sense ?

Cheers
Ben.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 1/6] kvm: determine memory type from VMA
  2023-07-14  8:10         ` Benjamin Herrenschmidt
@ 2023-07-16 15:09           ` Catalin Marinas
  2023-07-16 22:30             ` Jason Gunthorpe
  0 siblings, 1 reply; 31+ messages in thread
From: Catalin Marinas @ 2023-07-16 15:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jason Gunthorpe, Marc Zyngier, ankita, alex.williamson,
	naoya.horiguchi, oliver.upton, aniketa, cjia, kwankhede,
	targupta, vsethi, acurrid, apopple, jhubbard, danw, kvm,
	linux-kernel, linux-arm-kernel, linux-mm, Lorenzo Pieralisi,
	Clint Sbisa, osamaabb

Hi Ben,

On Fri, Jul 14, 2023 at 06:10:39PM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2023-05-31 at 12:35 +0100, Catalin Marinas wrote:
> > There were several off-list discussions, I'm trying to summarise my
> > understanding here. This series aims to relax the VFIO mapping to
> > cacheable and have KVM map it into the guest with the same attributes.
> > Somewhat related past threads also tried to relax the KVM device
> > pass-through mapping from Device_nGnRnE (pgprot_noncached) to Normal_NC
> > (pgprot_writecombine). Those were initially using the PCIe prefetchable
> > BAR attribute but that's not a reliable means to infer whether Normal vs
> > Device is safe. Anyway, I think we'd need to unify these threads and
> > come up with some common handling that can cater for various attributes
> > required by devices/drivers. Therefore replying in this thread.
> 
> So picking up on this as I was just trying to start a separate
> discussion on the subject for write combine :-)

Basically this thread started as a fix/improvement for KVM by mimicking
the VFIO user mapping attributes at the guest but the conclusion we came
to is that the VFIO PCIe driver cannot reliably tell when WC is
possible.

> In this case, not so much for KVM as much as for VFIO to userspace
> though.
> 
> The rough idea is that the "userspace driver" (ie DPDK or equivalent)
> for the device is the one to "know" wether a BAR or portion of a BAR
> can/should be mapped write-combine, and is expected to also "know"
> what to do to enforce ordering when necessary.

I agree in principle. On the KVM side we concluded that it's the guest
driver that knows the attributes, so the hypervisor should not restrict
them. In the DPDK case, it would be the user driver that knows the
device it is mapping and the required attributes.

In terms of security for arm64 at least, Device vs Normal NC (or nc vs
wc in Linux terminology) doesn't make much difference with the former
occasionally being worse. The kernel would probably trust the DPDK code
if it allows direct device access.

> So the userspace component needs to be responsible for selecting the
> mapping, the same way using the PCI sysfs resource files today allows
> to do that by selecting the _wc variant.

I guess the sysfs interface is just trying to work around the VFIO
limitations.

> I don't know how much of this makes sense for KVM, but I think what we
> really want is for userspace to be able to specify some "attributes"
> (which we can initially limit to writecombine, full cachability
> probably requires a device specific kernel driver providing adequate
> authority, separate discussion in any case), for all or a portion of a
> BAR mapping.

For KVM, at least the WC case, user-space doesn't need to be involved as
it normally should not access the same BAR concurrently with the guest.
But at some point, for CXL-attached memory for example, it may need to
be able to map it as cacheable so that it has the same attributes as the
guest.

> The easy way is an ioctl to affect the attributes of the next mmap but
> it's a rather gross interface.
> 
> A better approach (still requires some coordination but not nearly as
> bad) would be to have an ioctl to create "subregions", ie, dynamically
> add new "struct vfio_pci_region" (using the existing dynamic index
> API), which are children of existing regions (including real BARs) and
> provide different attributes, which mmap can then honor.
> 
> This is particularly suited for the case (which used to exist, I don't
> know if it still does) where the buffer that wants write combining
> reside in the same BAR as registers that otherwise don't.

IIUC that's still the case for some devices (I think Jason mentioned
some Mellanox cards).

> A simpler compromise if that latter case is deemed irrelevant would be
> an ioctl to selectively set a region index (including BARs) to be WC
> prior to mmap.
> 
> I don't know if that fits in the ideas you have for KVM, I think it
> could by having the userspace component require mappings using a
> "special" attribute which we could define as being the most relaxed
> allowed to pass to a VM, which can then be architecture defined. The
> guest can then enforce specifics. Does this make sense ?

I think this interface would help KVM when we'll need a cacheable
mapping. For WC, we are ok without any VFIO changes.

-- 
Catalin


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 1/6] kvm: determine memory type from VMA
  2023-07-16 15:09           ` Catalin Marinas
@ 2023-07-16 22:30             ` Jason Gunthorpe
  2023-07-17 18:35               ` Alex Williamson
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2023-07-16 22:30 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Benjamin Herrenschmidt, Marc Zyngier, ankita, alex.williamson,
	naoya.horiguchi, oliver.upton, aniketa, cjia, kwankhede,
	targupta, vsethi, acurrid, apopple, jhubbard, danw, kvm,
	linux-kernel, linux-arm-kernel, linux-mm, Lorenzo Pieralisi,
	Clint Sbisa, osamaabb

On Sun, Jul 16, 2023 at 08:09:02AM -0700, Catalin Marinas wrote:

> In terms of security for arm64 at least, Device vs Normal NC (or nc vs
> wc in Linux terminology) doesn't make much difference with the former
> occasionally being worse. The kernel would probably trust the DPDK code
> if it allows direct device access.

RDMA and DRM already allow device drivers to map WC to userspace on
demand, we expect the platform to support this.

> > So the userspace component needs to be responsible for selecting the
> > mapping, the same way using the PCI sysfs resource files today allows
> > to do that by selecting the _wc variant.
> 
> I guess the sysfs interface is just trying to work around the VFIO
> limitations.

I think just nobody has ever asked for VFIO WC support. The main
non-VM user is DPDK and none of the NIC drivers have wanted this (DPDK
applications areis more of throughput than latency focused typically)

> > This is particularly suited for the case (which used to exist, I don't
> > know if it still does) where the buffer that wants write combining
> > reside in the same BAR as registers that otherwise don't.
> 
> IIUC that's still the case for some devices (I think Jason mentioned
> some Mellanox cards).

Right, VFIO will have to allow it page-by-page

> I think this interface would help KVM when we'll need a cacheable
> mapping. For WC, we are ok without any VFIO changes.

Yes, it may be interesting to map cachable CXL memory as NORMAL_NC
into userspace for similar reasons.

Jason


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 1/6] kvm: determine memory type from VMA
  2023-07-16 22:30             ` Jason Gunthorpe
@ 2023-07-17 18:35               ` Alex Williamson
  2023-07-25  6:18                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 31+ messages in thread
From: Alex Williamson @ 2023-07-17 18:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Catalin Marinas, Benjamin Herrenschmidt, Marc Zyngier, ankita,
	naoya.horiguchi, oliver.upton, aniketa, cjia, kwankhede,
	targupta, vsethi, acurrid, apopple, jhubbard, danw, kvm,
	linux-kernel, linux-arm-kernel, linux-mm, Lorenzo Pieralisi,
	Clint Sbisa, osamaabb

On Sun, 16 Jul 2023 19:30:23 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Sun, Jul 16, 2023 at 08:09:02AM -0700, Catalin Marinas wrote:
> 
> > In terms of security for arm64 at least, Device vs Normal NC (or nc vs
> > wc in Linux terminology) doesn't make much difference with the former
> > occasionally being worse. The kernel would probably trust the DPDK code
> > if it allows direct device access.  
> 
> RDMA and DRM already allow device drivers to map WC to userspace on
> demand, we expect the platform to support this.
> 
> > > So the userspace component needs to be responsible for selecting the
> > > mapping, the same way using the PCI sysfs resource files today allows
> > > to do that by selecting the _wc variant.  
> > 
> > I guess the sysfs interface is just trying to work around the VFIO
> > limitations.  
> 
> I think just nobody has ever asked for VFIO WC support. The main
> non-VM user is DPDK and none of the NIC drivers have wanted this (DPDK
> applications areis more of throughput than latency focused typically)

Yes, QEMU can't know whether the device or driver want a WC BAR
mapping, so we've left it for KVM manipulation relative to VM use
cases.  Nobody has followed through with a complete proposal to enable
it otherwise for direct userspace driver access, but I don't think
there's opposition to providing such a thing.  Thanks,

Alex

> > > This is particularly suited for the case (which used to exist, I don't
> > > know if it still does) where the buffer that wants write combining
> > > reside in the same BAR as registers that otherwise don't.  
> > 
> > IIUC that's still the case for some devices (I think Jason mentioned
> > some Mellanox cards).  
> 
> Right, VFIO will have to allow it page-by-page
> 
> > I think this interface would help KVM when we'll need a cacheable
> > mapping. For WC, we are ok without any VFIO changes.  
> 
> Yes, it may be interesting to map cachable CXL memory as NORMAL_NC
> into userspace for similar reasons.
> 
> Jason
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 1/6] kvm: determine memory type from VMA
  2023-07-17 18:35               ` Alex Williamson
@ 2023-07-25  6:18                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2023-07-25  6:18 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Catalin Marinas, Marc Zyngier, ankita, naoya.horiguchi,
	oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi,
	acurrid, apopple, jhubbard, danw, kvm, linux-kernel,
	linux-arm-kernel, linux-mm, Lorenzo Pieralisi, Clint Sbisa,
	osamaabb

On Mon, 2023-07-17 at 12:35 -0600, Alex Williamson wrote:
> On Sun, 16 Jul 2023 19:30:23 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Sun, Jul 16, 2023 at 08:09:02AM -0700, Catalin Marinas wrote:
> > 
> > > In terms of security for arm64 at least, Device vs Normal NC (or nc vs
> > > wc in Linux terminology) doesn't make much difference with the former
> > > occasionally being worse. The kernel would probably trust the DPDK code
> > > if it allows direct device access.  
> > 
> > RDMA and DRM already allow device drivers to map WC to userspace on
> > demand, we expect the platform to support this.
> > 
> > > > So the userspace component needs to be responsible for selecting the
> > > > mapping, the same way using the PCI sysfs resource files today allows
> > > > to do that by selecting the _wc variant.  
> > > 
> > > I guess the sysfs interface is just trying to work around the VFIO
> > > limitations.  
> > 
> > I think just nobody has ever asked for VFIO WC support. The main
> > non-VM user is DPDK and none of the NIC drivers have wanted this (DPDK
> > applications areis more of throughput than latency focused typically)
> 
> Yes, QEMU can't know whether the device or driver want a WC BAR
> mapping, so we've left it for KVM manipulation relative to VM use
> cases.  Nobody has followed through with a complete proposal to enable
> it otherwise for direct userspace driver access, but I don't think
> there's opposition to providing such a thing.  Thanks,

Ok, this is really backburner work for me but I'll try to cook up a POC
patch in the near (hopefully) future along the lines of the subregions
I proposed and we can discuss from there.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2023-07-25  6:19 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-05 18:01 [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible ankita
2023-04-05 18:01 ` [PATCH v3 1/6] kvm: determine memory type from VMA ankita
2023-04-12 12:43   ` Marc Zyngier
2023-04-12 13:01     ` Jason Gunthorpe
2023-05-31 11:35       ` Catalin Marinas
2023-06-14 12:44         ` Jason Gunthorpe
2023-07-14  8:10         ` Benjamin Herrenschmidt
2023-07-16 15:09           ` Catalin Marinas
2023-07-16 22:30             ` Jason Gunthorpe
2023-07-17 18:35               ` Alex Williamson
2023-07-25  6:18                 ` Benjamin Herrenschmidt
2023-04-05 18:01 ` [PATCH v3 2/6] vfio/nvgpu: expose GPU device memory as BAR1 ankita
2023-04-05 21:07   ` kernel test robot
2023-04-05 18:01 ` [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages ankita
2023-04-05 21:07   ` kernel test robot
2023-05-09  9:51   ` HORIGUCHI NAOYA(堀口 直也)
2023-05-15 11:18     ` Ankit Agrawal
2023-05-23  5:43       ` HORIGUCHI NAOYA(堀口 直也)
2023-04-05 18:01 ` [PATCH v3 4/6] mm: Add poison error check in fixup_user_fault() for mapped PFN ankita
2023-04-05 18:01 ` [PATCH v3 5/6] mm: Change ghes code to allow poison of non-struct PFN ankita
2023-04-05 18:01 ` [PATCH v3 6/6] vfio/nvgpu: register device memory for poison handling ankita
2023-04-05 20:24   ` Zhi Wang
2023-04-05 21:50   ` kernel test robot
2023-05-24  9:53   ` Dan Carpenter
2023-04-06 12:07 ` [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible David Hildenbrand
2023-04-12  8:43   ` Ankit Agrawal
2023-04-12  9:48     ` Marc Zyngier
2023-04-12 12:28 ` Marc Zyngier
2023-04-12 12:53   ` Jason Gunthorpe
2023-04-13  9:52     ` Marc Zyngier
2023-04-13 13:19       ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).