All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd
@ 2023-01-25 22:40 Peter Xu
  2023-01-25 22:40 ` [PATCH 1/3] linux-headers: Update to v6.1 Peter Xu
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Peter Xu @ 2023-01-25 22:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, peterx, Dr . David Alan Gilbert,
	Juan Quintela

The new /dev/userfaultfd handle is superior to the system call with a
better permission control and also works for a restricted seccomp
environment.

The new device was only introduced in v6.1 so we need a header update.

Please have a look, thanks.

Peter Xu (3):
  linux-headers: Update to v6.1
  util/userfaultfd: Add uffd_open()
  util/userfaultfd: Support /dev/userfaultfd

 include/qemu/userfaultfd.h                    |   1 +
 include/standard-headers/drm/drm_fourcc.h     |  34 ++++-
 include/standard-headers/linux/ethtool.h      |  63 +++++++-
 include/standard-headers/linux/fuse.h         |   6 +-
 .../linux/input-event-codes.h                 |   1 +
 include/standard-headers/linux/virtio_blk.h   |  19 +++
 linux-headers/asm-generic/hugetlb_encode.h    |  26 ++--
 linux-headers/asm-generic/mman-common.h       |   2 +
 linux-headers/asm-mips/mman.h                 |   2 +
 linux-headers/asm-riscv/kvm.h                 |   4 +
 linux-headers/linux/kvm.h                     |   1 +
 linux-headers/linux/psci.h                    |  14 ++
 linux-headers/linux/userfaultfd.h             |   4 +
 linux-headers/linux/vfio.h                    | 142 ++++++++++++++++++
 migration/postcopy-ram.c                      |  11 +-
 tests/qtest/migration-test.c                  |   3 +-
 util/trace-events                             |   1 +
 util/userfaultfd.c                            |  49 +++++-
 18 files changed, 354 insertions(+), 29 deletions(-)

-- 
2.37.3



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/3] linux-headers: Update to v6.1
  2023-01-25 22:40 [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu
@ 2023-01-25 22:40 ` Peter Xu
  2023-01-25 22:40 ` [PATCH 2/3] util/userfaultfd: Add uffd_open() Peter Xu
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Peter Xu @ 2023-01-25 22:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, peterx, Dr . David Alan Gilbert,
	Juan Quintela

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/standard-headers/drm/drm_fourcc.h     |  34 ++++-
 include/standard-headers/linux/ethtool.h      |  63 +++++++-
 include/standard-headers/linux/fuse.h         |   6 +-
 .../linux/input-event-codes.h                 |   1 +
 include/standard-headers/linux/virtio_blk.h   |  19 +++
 linux-headers/asm-generic/hugetlb_encode.h    |  26 ++--
 linux-headers/asm-generic/mman-common.h       |   2 +
 linux-headers/asm-mips/mman.h                 |   2 +
 linux-headers/asm-riscv/kvm.h                 |   4 +
 linux-headers/linux/kvm.h                     |   1 +
 linux-headers/linux/psci.h                    |  14 ++
 linux-headers/linux/userfaultfd.h             |   4 +
 linux-headers/linux/vfio.h                    | 142 ++++++++++++++++++
 13 files changed, 298 insertions(+), 20 deletions(-)

diff --git a/include/standard-headers/drm/drm_fourcc.h b/include/standard-headers/drm/drm_fourcc.h
index 48b620cbef..b868488f93 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -98,18 +98,42 @@ extern "C" {
 #define DRM_FORMAT_INVALID	0
 
 /* color index */
+#define DRM_FORMAT_C1		fourcc_code('C', '1', ' ', ' ') /* [7:0] C0:C1:C2:C3:C4:C5:C6:C7 1:1:1:1:1:1:1:1 eight pixels/byte */
+#define DRM_FORMAT_C2		fourcc_code('C', '2', ' ', ' ') /* [7:0] C0:C1:C2:C3 2:2:2:2 four pixels/byte */
+#define DRM_FORMAT_C4		fourcc_code('C', '4', ' ', ' ') /* [7:0] C0:C1 4:4 two pixels/byte */
 #define DRM_FORMAT_C8		fourcc_code('C', '8', ' ', ' ') /* [7:0] C */
 
-/* 8 bpp Red */
+/* 1 bpp Darkness (inverse relationship between channel value and brightness) */
+#define DRM_FORMAT_D1		fourcc_code('D', '1', ' ', ' ') /* [7:0] D0:D1:D2:D3:D4:D5:D6:D7 1:1:1:1:1:1:1:1 eight pixels/byte */
+
+/* 2 bpp Darkness (inverse relationship between channel value and brightness) */
+#define DRM_FORMAT_D2		fourcc_code('D', '2', ' ', ' ') /* [7:0] D0:D1:D2:D3 2:2:2:2 four pixels/byte */
+
+/* 4 bpp Darkness (inverse relationship between channel value and brightness) */
+#define DRM_FORMAT_D4		fourcc_code('D', '4', ' ', ' ') /* [7:0] D0:D1 4:4 two pixels/byte */
+
+/* 8 bpp Darkness (inverse relationship between channel value and brightness) */
+#define DRM_FORMAT_D8		fourcc_code('D', '8', ' ', ' ') /* [7:0] D */
+
+/* 1 bpp Red (direct relationship between channel value and brightness) */
+#define DRM_FORMAT_R1		fourcc_code('R', '1', ' ', ' ') /* [7:0] R0:R1:R2:R3:R4:R5:R6:R7 1:1:1:1:1:1:1:1 eight pixels/byte */
+
+/* 2 bpp Red (direct relationship between channel value and brightness) */
+#define DRM_FORMAT_R2		fourcc_code('R', '2', ' ', ' ') /* [7:0] R0:R1:R2:R3 2:2:2:2 four pixels/byte */
+
+/* 4 bpp Red (direct relationship between channel value and brightness) */
+#define DRM_FORMAT_R4		fourcc_code('R', '4', ' ', ' ') /* [7:0] R0:R1 4:4 two pixels/byte */
+
+/* 8 bpp Red (direct relationship between channel value and brightness) */
 #define DRM_FORMAT_R8		fourcc_code('R', '8', ' ', ' ') /* [7:0] R */
 
-/* 10 bpp Red */
+/* 10 bpp Red (direct relationship between channel value and brightness) */
 #define DRM_FORMAT_R10		fourcc_code('R', '1', '0', ' ') /* [15:0] x:R 6:10 little endian */
 
-/* 12 bpp Red */
+/* 12 bpp Red (direct relationship between channel value and brightness) */
 #define DRM_FORMAT_R12		fourcc_code('R', '1', '2', ' ') /* [15:0] x:R 4:12 little endian */
 
-/* 16 bpp Red */
+/* 16 bpp Red (direct relationship between channel value and brightness) */
 #define DRM_FORMAT_R16		fourcc_code('R', '1', '6', ' ') /* [15:0] R little endian */
 
 /* 16 bpp RG */
@@ -204,7 +228,9 @@ extern "C" {
 #define DRM_FORMAT_VYUY		fourcc_code('V', 'Y', 'U', 'Y') /* [31:0] Y1:Cb0:Y0:Cr0 8:8:8:8 little endian */
 
 #define DRM_FORMAT_AYUV		fourcc_code('A', 'Y', 'U', 'V') /* [31:0] A:Y:Cb:Cr 8:8:8:8 little endian */
+#define DRM_FORMAT_AVUY8888	fourcc_code('A', 'V', 'U', 'Y') /* [31:0] A:Cr:Cb:Y 8:8:8:8 little endian */
 #define DRM_FORMAT_XYUV8888	fourcc_code('X', 'Y', 'U', 'V') /* [31:0] X:Y:Cb:Cr 8:8:8:8 little endian */
+#define DRM_FORMAT_XVUY8888	fourcc_code('X', 'V', 'U', 'Y') /* [31:0] X:Cr:Cb:Y 8:8:8:8 little endian */
 #define DRM_FORMAT_VUY888	fourcc_code('V', 'U', '2', '4') /* [23:0] Cr:Cb:Y 8:8:8 little endian */
 #define DRM_FORMAT_VUY101010	fourcc_code('V', 'U', '3', '0') /* Y followed by U then V, 10:10:10. Non-linear modifier only */
 
diff --git a/include/standard-headers/linux/ethtool.h b/include/standard-headers/linux/ethtool.h
index 4537da20cc..1dc56cdc0a 100644
--- a/include/standard-headers/linux/ethtool.h
+++ b/include/standard-headers/linux/ethtool.h
@@ -736,6 +736,51 @@ enum ethtool_module_power_mode {
 	ETHTOOL_MODULE_POWER_MODE_HIGH,
 };
 
+/**
+ * enum ethtool_podl_pse_admin_state - operational state of the PoDL PSE
+ *	functions. IEEE 802.3-2018 30.15.1.1.2 aPoDLPSEAdminState
+ * @ETHTOOL_PODL_PSE_ADMIN_STATE_UNKNOWN: state of PoDL PSE functions are
+ * 	unknown
+ * @ETHTOOL_PODL_PSE_ADMIN_STATE_DISABLED: PoDL PSE functions are disabled
+ * @ETHTOOL_PODL_PSE_ADMIN_STATE_ENABLED: PoDL PSE functions are enabled
+ */
+enum ethtool_podl_pse_admin_state {
+	ETHTOOL_PODL_PSE_ADMIN_STATE_UNKNOWN = 1,
+	ETHTOOL_PODL_PSE_ADMIN_STATE_DISABLED,
+	ETHTOOL_PODL_PSE_ADMIN_STATE_ENABLED,
+};
+
+/**
+ * enum ethtool_podl_pse_pw_d_status - power detection status of the PoDL PSE.
+ *	IEEE 802.3-2018 30.15.1.1.3 aPoDLPSEPowerDetectionStatus:
+ * @ETHTOOL_PODL_PSE_PW_D_STATUS_UNKNOWN: PoDL PSE
+ * @ETHTOOL_PODL_PSE_PW_D_STATUS_DISABLED: "The enumeration “disabled” is
+ *	asserted true when the PoDL PSE state diagram variable mr_pse_enable is
+ *	false"
+ * @ETHTOOL_PODL_PSE_PW_D_STATUS_SEARCHING: "The enumeration “searching” is
+ *	asserted true when either of the PSE state diagram variables
+ *	pi_detecting or pi_classifying is true."
+ * @ETHTOOL_PODL_PSE_PW_D_STATUS_DELIVERING: "The enumeration “deliveringPower”
+ *	is asserted true when the PoDL PSE state diagram variable pi_powered is
+ *	true."
+ * @ETHTOOL_PODL_PSE_PW_D_STATUS_SLEEP: "The enumeration “sleep” is asserted
+ *	true when the PoDL PSE state diagram variable pi_sleeping is true."
+ * @ETHTOOL_PODL_PSE_PW_D_STATUS_IDLE: "The enumeration “idle” is asserted true
+ *	when the logical combination of the PoDL PSE state diagram variables
+ *	pi_prebiased*!pi_sleeping is true."
+ * @ETHTOOL_PODL_PSE_PW_D_STATUS_ERROR: "The enumeration “error” is asserted
+ *	true when the PoDL PSE state diagram variable overload_held is true."
+ */
+enum ethtool_podl_pse_pw_d_status {
+	ETHTOOL_PODL_PSE_PW_D_STATUS_UNKNOWN = 1,
+	ETHTOOL_PODL_PSE_PW_D_STATUS_DISABLED,
+	ETHTOOL_PODL_PSE_PW_D_STATUS_SEARCHING,
+	ETHTOOL_PODL_PSE_PW_D_STATUS_DELIVERING,
+	ETHTOOL_PODL_PSE_PW_D_STATUS_SLEEP,
+	ETHTOOL_PODL_PSE_PW_D_STATUS_IDLE,
+	ETHTOOL_PODL_PSE_PW_D_STATUS_ERROR,
+};
+
 /**
  * struct ethtool_gstrings - string set for data tagging
  * @cmd: Command number = %ETHTOOL_GSTRINGS
@@ -1840,6 +1885,20 @@ static inline int ethtool_validate_duplex(uint8_t duplex)
 #define MASTER_SLAVE_STATE_SLAVE		3
 #define MASTER_SLAVE_STATE_ERR			4
 
+/* These are used to throttle the rate of data on the phy interface when the
+ * native speed of the interface is higher than the link speed. These should
+ * not be used for phy interfaces which natively support multiple speeds (e.g.
+ * MII or SGMII).
+ */
+/* No rate matching performed. */
+#define RATE_MATCH_NONE		0
+/* The phy sends pause frames to throttle the MAC. */
+#define RATE_MATCH_PAUSE	1
+/* The phy asserts CRS to prevent the MAC from transmitting. */
+#define RATE_MATCH_CRS		2
+/* The MAC is programmed with a sufficiently-large IPG. */
+#define RATE_MATCH_OPEN_LOOP	3
+
 /* Which connector port. */
 #define PORT_TP			0x00
 #define PORT_AUI		0x01
@@ -2033,8 +2092,8 @@ enum ethtool_reset_flags {
  *	reported consistently by PHYLIB.  Read-only.
  * @master_slave_cfg: Master/slave port mode.
  * @master_slave_state: Master/slave port state.
+ * @rate_matching: Rate adaptation performed by the PHY
  * @reserved: Reserved for future use; see the note on reserved space.
- * @reserved1: Reserved for future use; see the note on reserved space.
  * @link_mode_masks: Variable length bitmaps.
  *
  * If autonegotiation is disabled, the speed and @duplex represent the
@@ -2085,7 +2144,7 @@ struct ethtool_link_settings {
 	uint8_t	transceiver;
 	uint8_t	master_slave_cfg;
 	uint8_t	master_slave_state;
-	uint8_t	reserved1[1];
+	uint8_t	rate_matching;
 	uint32_t	reserved[7];
 	uint32_t	link_mode_masks[];
 	/* layout of link_mode_masks fields:
diff --git a/include/standard-headers/linux/fuse.h b/include/standard-headers/linux/fuse.h
index bda06258be..713d259768 100644
--- a/include/standard-headers/linux/fuse.h
+++ b/include/standard-headers/linux/fuse.h
@@ -194,6 +194,9 @@
  *  - add FUSE_SECURITY_CTX init flag
  *  - add security context to create, mkdir, symlink, and mknod requests
  *  - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
+ *
+ *  7.37
+ *  - add FUSE_TMPFILE
  */
 
 #ifndef _LINUX_FUSE_H
@@ -225,7 +228,7 @@
 #define FUSE_KERNEL_VERSION 7
 
 /** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 36
+#define FUSE_KERNEL_MINOR_VERSION 37
 
 /** The node ID of the root inode */
 #define FUSE_ROOT_ID 1
@@ -533,6 +536,7 @@ enum fuse_opcode {
 	FUSE_SETUPMAPPING	= 48,
 	FUSE_REMOVEMAPPING	= 49,
 	FUSE_SYNCFS		= 50,
+	FUSE_TMPFILE		= 51,
 
 	/* CUSE specific operations */
 	CUSE_INIT		= 4096,
diff --git a/include/standard-headers/linux/input-event-codes.h b/include/standard-headers/linux/input-event-codes.h
index 50790aee5a..815f7a1dff 100644
--- a/include/standard-headers/linux/input-event-codes.h
+++ b/include/standard-headers/linux/input-event-codes.h
@@ -862,6 +862,7 @@
 #define ABS_TOOL_WIDTH		0x1c
 
 #define ABS_VOLUME		0x20
+#define ABS_PROFILE		0x21
 
 #define ABS_MISC		0x28
 
diff --git a/include/standard-headers/linux/virtio_blk.h b/include/standard-headers/linux/virtio_blk.h
index 2dcc90826a..e81715cd70 100644
--- a/include/standard-headers/linux/virtio_blk.h
+++ b/include/standard-headers/linux/virtio_blk.h
@@ -40,6 +40,7 @@
 #define VIRTIO_BLK_F_MQ		12	/* support more than one vq */
 #define VIRTIO_BLK_F_DISCARD	13	/* DISCARD is supported */
 #define VIRTIO_BLK_F_WRITE_ZEROES	14	/* WRITE ZEROES is supported */
+#define VIRTIO_BLK_F_SECURE_ERASE	16 /* Secure Erase is supported */
 
 /* Legacy feature bits */
 #ifndef VIRTIO_BLK_NO_LEGACY
@@ -119,6 +120,21 @@ struct virtio_blk_config {
 	uint8_t write_zeroes_may_unmap;
 
 	uint8_t unused1[3];
+
+	/* the next 3 entries are guarded by VIRTIO_BLK_F_SECURE_ERASE */
+	/*
+	 * The maximum secure erase sectors (in 512-byte sectors) for
+	 * one segment.
+	 */
+	__virtio32 max_secure_erase_sectors;
+	/*
+	 * The maximum number of secure erase segments in a
+	 * secure erase command.
+	 */
+	__virtio32 max_secure_erase_seg;
+	/* Secure erase commands must be aligned to this number of sectors. */
+	__virtio32 secure_erase_sector_alignment;
+
 } QEMU_PACKED;
 
 /*
@@ -153,6 +169,9 @@ struct virtio_blk_config {
 /* Write zeroes command */
 #define VIRTIO_BLK_T_WRITE_ZEROES	13
 
+/* Secure erase command */
+#define VIRTIO_BLK_T_SECURE_ERASE	14
+
 #ifndef VIRTIO_BLK_NO_LEGACY
 /* Barrier before this op. */
 #define VIRTIO_BLK_T_BARRIER	0x80000000
diff --git a/linux-headers/asm-generic/hugetlb_encode.h b/linux-headers/asm-generic/hugetlb_encode.h
index 4f3d5aaa11..de687009bf 100644
--- a/linux-headers/asm-generic/hugetlb_encode.h
+++ b/linux-headers/asm-generic/hugetlb_encode.h
@@ -20,18 +20,18 @@
 #define HUGETLB_FLAG_ENCODE_SHIFT	26
 #define HUGETLB_FLAG_ENCODE_MASK	0x3f
 
-#define HUGETLB_FLAG_ENCODE_16KB	(14 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_64KB	(16 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_512KB	(19 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_1MB		(20 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_2MB		(21 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_8MB		(23 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_16MB	(24 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_32MB	(25 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_256MB	(28 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_512MB	(29 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_1GB		(30 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_2GB		(31 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_16GB	(34 << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_16KB	(14U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_64KB	(16U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_512KB	(19U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_1MB		(20U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_2MB		(21U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_8MB		(23U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_16MB	(24U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_32MB	(25U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_256MB	(28U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_512MB	(29U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_1GB		(30U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_2GB		(31U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_16GB	(34U << HUGETLB_FLAG_ENCODE_SHIFT)
 
 #endif /* _ASM_GENERIC_HUGETLB_ENCODE_H_ */
diff --git a/linux-headers/asm-generic/mman-common.h b/linux-headers/asm-generic/mman-common.h
index 6c1aa92a92..6ce1f1ceb4 100644
--- a/linux-headers/asm-generic/mman-common.h
+++ b/linux-headers/asm-generic/mman-common.h
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/linux-headers/asm-mips/mman.h b/linux-headers/asm-mips/mman.h
index 1be428663c..c6e1fc77c9 100644
--- a/linux-headers/asm-mips/mman.h
+++ b/linux-headers/asm-mips/mman.h
@@ -103,6 +103,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/linux-headers/asm-riscv/kvm.h b/linux-headers/asm-riscv/kvm.h
index 7351417afd..8985ff234c 100644
--- a/linux-headers/asm-riscv/kvm.h
+++ b/linux-headers/asm-riscv/kvm.h
@@ -48,6 +48,7 @@ struct kvm_sregs {
 /* CONFIG registers for KVM_GET_ONE_REG and KVM_SET_ONE_REG */
 struct kvm_riscv_config {
 	unsigned long isa;
+	unsigned long zicbom_block_size;
 };
 
 /* CORE registers for KVM_GET_ONE_REG and KVM_SET_ONE_REG */
@@ -98,6 +99,9 @@ enum KVM_RISCV_ISA_EXT_ID {
 	KVM_RISCV_ISA_EXT_M,
 	KVM_RISCV_ISA_EXT_SVPBMT,
 	KVM_RISCV_ISA_EXT_SSTC,
+	KVM_RISCV_ISA_EXT_SVINVAL,
+	KVM_RISCV_ISA_EXT_ZIHINTPAUSE,
+	KVM_RISCV_ISA_EXT_ZICBOM,
 	KVM_RISCV_ISA_EXT_MAX,
 };
 
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index ebdafa576d..b2783c5202 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -1175,6 +1175,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_VM_DISABLE_NX_HUGE_PAGES 220
 #define KVM_CAP_S390_ZPCI_OP 221
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
+#define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/linux-headers/linux/psci.h b/linux-headers/linux/psci.h
index 213b2a0f70..e60dfd8907 100644
--- a/linux-headers/linux/psci.h
+++ b/linux-headers/linux/psci.h
@@ -48,12 +48,26 @@
 #define PSCI_0_2_FN64_MIGRATE_INFO_UP_CPU	PSCI_0_2_FN64(7)
 
 #define PSCI_1_0_FN_PSCI_FEATURES		PSCI_0_2_FN(10)
+#define PSCI_1_0_FN_CPU_FREEZE			PSCI_0_2_FN(11)
+#define PSCI_1_0_FN_CPU_DEFAULT_SUSPEND		PSCI_0_2_FN(12)
+#define PSCI_1_0_FN_NODE_HW_STATE		PSCI_0_2_FN(13)
 #define PSCI_1_0_FN_SYSTEM_SUSPEND		PSCI_0_2_FN(14)
 #define PSCI_1_0_FN_SET_SUSPEND_MODE		PSCI_0_2_FN(15)
+#define PSCI_1_0_FN_STAT_RESIDENCY		PSCI_0_2_FN(16)
+#define PSCI_1_0_FN_STAT_COUNT			PSCI_0_2_FN(17)
+
 #define PSCI_1_1_FN_SYSTEM_RESET2		PSCI_0_2_FN(18)
+#define PSCI_1_1_FN_MEM_PROTECT			PSCI_0_2_FN(19)
+#define PSCI_1_1_FN_MEM_PROTECT_CHECK_RANGE	PSCI_0_2_FN(19)
 
+#define PSCI_1_0_FN64_CPU_DEFAULT_SUSPEND	PSCI_0_2_FN64(12)
+#define PSCI_1_0_FN64_NODE_HW_STATE		PSCI_0_2_FN64(13)
 #define PSCI_1_0_FN64_SYSTEM_SUSPEND		PSCI_0_2_FN64(14)
+#define PSCI_1_0_FN64_STAT_RESIDENCY		PSCI_0_2_FN64(16)
+#define PSCI_1_0_FN64_STAT_COUNT		PSCI_0_2_FN64(17)
+
 #define PSCI_1_1_FN64_SYSTEM_RESET2		PSCI_0_2_FN64(18)
+#define PSCI_1_1_FN64_MEM_PROTECT_CHECK_RANGE	PSCI_0_2_FN64(19)
 
 /* PSCI v0.2 power state encoding for CPU_SUSPEND function */
 #define PSCI_0_2_POWER_STATE_ID_MASK		0xffff
diff --git a/linux-headers/linux/userfaultfd.h b/linux-headers/linux/userfaultfd.h
index a3a377cd44..ba5d0df52f 100644
--- a/linux-headers/linux/userfaultfd.h
+++ b/linux-headers/linux/userfaultfd.h
@@ -12,6 +12,10 @@
 
 #include <linux/types.h>
 
+/* ioctls for /dev/userfaultfd */
+#define USERFAULTFD_IOC 0xAA
+#define USERFAULTFD_IOC_NEW _IO(USERFAULTFD_IOC, 0x00)
+
 /*
  * If the UFFDIO_API is upgraded someday, the UFFDIO_UNREGISTER and
  * UFFDIO_WAKE ioctls should be defined as _IOW and not as _IOR.  In
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index ede44b5572..bee7e42198 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -986,6 +986,148 @@ enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
+ * state with the platform-based power management.  Device use of lower power
+ * states depends on factors managed by the runtime power management core,
+ * including system level support and coordinating support among dependent
+ * devices.  Enabling device low power entry does not guarantee lower power
+ * usage by the device, nor is a mechanism provided through this feature to
+ * know the current power state of the device.  If any device access happens
+ * (either from the host or through the vfio uAPI) when the device is in the
+ * low power state, then the host will move the device out of the low power
+ * state as necessary prior to the access.  Once the access is completed, the
+ * device may re-enter the low power state.  For single shot low power support
+ * with wake-up notification, see
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
+ * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
+ * calling LOW_POWER_EXIT.
+ */
+#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
+
+/*
+ * This device feature has the same behavior as
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
+ * provides an eventfd for wake-up notification.  When the device moves out of
+ * the low power state for the wake-up, the host will not allow the device to
+ * re-enter a low power state without a subsequent user call to one of the low
+ * power entry device feature IOCTLs.  Access to mmap'd device regions is
+ * disabled on LOW_POWER_ENTRY_WITH_WAKEUP and may only be resumed after the
+ * low power exit.  The low power exit can happen either through LOW_POWER_EXIT
+ * or through any other access (where the wake-up notification has been
+ * generated).  The access to mmap'd device regions will not trigger low power
+ * exit.
+ *
+ * The notification through the provided eventfd will be generated only when
+ * the device has entered and is resumed from a low power state after
+ * calling this device feature IOCTL.  A device that has not entered low power
+ * state, as managed through the runtime power management core, will not
+ * generate a notification through the provided eventfd on access.  Calling the
+ * LOW_POWER_EXIT feature is optional in the case where notification has been
+ * signaled on the provided eventfd that a resume from low power has occurred.
+ */
+struct vfio_device_low_power_entry_with_wakeup {
+	__s32 wakeup_eventfd;
+	__u32 reserved;
+};
+
+#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states as
+ * previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
+ * This device feature IOCTL may itself generate a wakeup eventfd notification
+ * in the latter case if the device had previously entered a low power state.
+ */
+#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET start/stop device DMA logging.
+ * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device supports
+ * DMA logging.
+ *
+ * DMA logging allows a device to internally record what DMAs the device is
+ * initiating and report them back to userspace. It is part of the VFIO
+ * migration infrastructure that allows implementing dirty page tracking
+ * during the pre copy phase of live migration. Only DMA WRITEs are logged,
+ * and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
+ *
+ * When DMA logging is started a range of IOVAs to monitor is provided and the
+ * device can optimize its logging to cover only the IOVA range given. Each
+ * DMA that the device initiates inside the range will be logged by the device
+ * for later retrieval.
+ *
+ * page_size is an input that hints what tracking granularity the device
+ * should try to achieve. If the device cannot do the hinted page size then
+ * it's the driver choice which page size to pick based on its support.
+ * On output the device will return the page size it selected.
+ *
+ * ranges is a pointer to an array of
+ * struct vfio_device_feature_dma_logging_range.
+ *
+ * The core kernel code guarantees to support by minimum num_ranges that fit
+ * into a single kernel page. User space can try higher values but should give
+ * up if the above can't be achieved as of some driver limitations.
+ *
+ * A single call to start device DMA logging can be issued and a matching stop
+ * should follow at the end. Another start is not allowed in the meantime.
+ */
+struct vfio_device_feature_dma_logging_control {
+	__aligned_u64 page_size;
+	__u32 num_ranges;
+	__u32 __reserved;
+	__aligned_u64 ranges;
+};
+
+struct vfio_device_feature_dma_logging_range {
+	__aligned_u64 iova;
+	__aligned_u64 length;
+};
+
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 6
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started
+ * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START
+ */
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 7
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA log
+ *
+ * Query the device's DMA log for written pages within the given IOVA range.
+ * During querying the log is cleared for the IOVA range.
+ *
+ * bitmap is a pointer to an array of u64s that will hold the output bitmap
+ * with 1 bit reporting a page_size unit of IOVA. The mapping of IOVA to bits
+ * is given by:
+ *  bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))
+ *
+ * The input page_size can be any power of two value and does not have to
+ * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START. The driver
+ * will format its internal logging to match the reporting page size, possibly
+ * by replicating bits if the internal page size is lower than requested.
+ *
+ * The LOGGING_REPORT will only set bits in the bitmap and never clear or
+ * perform any initialization of the user provided bitmap.
+ *
+ * If any error is returned userspace should assume that the dirty log is
+ * corrupted. Error recovery is to consider all memory dirty and try to
+ * restart the dirty tracking, or to abort/restart the whole migration.
+ *
+ * If DMA logging is not enabled, an error will be returned.
+ *
+ */
+struct vfio_device_feature_dma_logging_report {
+	__aligned_u64 iova;
+	__aligned_u64 length;
+	__aligned_u64 page_size;
+	__aligned_u64 bitmap;
+};
+
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 8
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/3] util/userfaultfd: Add uffd_open()
  2023-01-25 22:40 [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu
  2023-01-25 22:40 ` [PATCH 1/3] linux-headers: Update to v6.1 Peter Xu
@ 2023-01-25 22:40 ` Peter Xu
  2023-01-25 23:04   ` Philippe Mathieu-Daudé
  2023-01-25 22:40 ` [PATCH 3/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu
  2023-01-26 14:13 ` [PATCH 0/3] " Michal Prívozník
  3 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2023-01-25 22:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, peterx, Dr . David Alan Gilbert,
	Juan Quintela

Add a helper to create the uffd handle.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/qemu/userfaultfd.h   |  1 +
 migration/postcopy-ram.c     | 11 +++++------
 tests/qtest/migration-test.c |  3 ++-
 util/userfaultfd.c           | 13 +++++++++++--
 4 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/include/qemu/userfaultfd.h b/include/qemu/userfaultfd.h
index 6b74f92792..a19a05d5f7 100644
--- a/include/qemu/userfaultfd.h
+++ b/include/qemu/userfaultfd.h
@@ -17,6 +17,7 @@
 #include "exec/hwaddr.h"
 #include <linux/userfaultfd.h>
 
+int uffd_open(int flags);
 int uffd_query_features(uint64_t *features);
 int uffd_create_fd(uint64_t features, bool non_blocking);
 void uffd_close_fd(int uffd_fd);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index b9a37ef255..0c55df0e52 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -37,6 +37,7 @@
 #include "qemu-file.h"
 #include "yank_functions.h"
 #include "tls.h"
+#include "qemu/userfaultfd.h"
 
 /* Arbitrary limit on size of each discard command,
  * keeps them around ~200 bytes
@@ -226,11 +227,9 @@ static bool receive_ufd_features(uint64_t *features)
     int ufd;
     bool ret = true;
 
-    /* if we are here __NR_userfaultfd should exists */
-    ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
+    ufd = uffd_open(O_CLOEXEC);
     if (ufd == -1) {
-        error_report("%s: syscall __NR_userfaultfd failed: %s", __func__,
-                     strerror(errno));
+        error_report("%s: uffd_open() failed: %s", __func__, strerror(errno));
         return false;
     }
 
@@ -375,7 +374,7 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
         goto out;
     }
 
-    ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
+    ufd = uffd_open(O_CLOEXEC);
     if (ufd == -1) {
         error_report("%s: userfaultfd not available: %s", __func__,
                      strerror(errno));
@@ -1160,7 +1159,7 @@ static int postcopy_temp_pages_setup(MigrationIncomingState *mis)
 int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
 {
     /* Open the fd for the kernel to give us userfaults */
-    mis->userfault_fd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+    mis->userfault_fd = uffd_open(O_CLOEXEC | O_NONBLOCK);
     if (mis->userfault_fd == -1) {
         error_report("%s: Failed to open userfault fd: %s", __func__,
                      strerror(errno));
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 1dd32c9506..7a5d1922dd 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -62,13 +62,14 @@ static bool uffd_feature_thread_id;
 #include <sys/eventfd.h>
 #include <sys/ioctl.h>
 #include <linux/userfaultfd.h>
+#include "qemu/userfaultfd.h"
 
 static bool ufd_version_check(void)
 {
     struct uffdio_api api_struct;
     uint64_t ioctl_mask;
 
-    int ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
+    int ufd = uffd_open(O_CLOEXEC);
 
     if (ufd == -1) {
         g_test_message("Skipping test: userfaultfd not available");
diff --git a/util/userfaultfd.c b/util/userfaultfd.c
index f1cd6af2b1..9845a2ec81 100644
--- a/util/userfaultfd.c
+++ b/util/userfaultfd.c
@@ -19,6 +19,15 @@
 #include <sys/syscall.h>
 #include <sys/ioctl.h>
 
+int uffd_open(int flags)
+{
+#if defined(__linux__) && defined(__NR_userfaultfd)
+    return syscall(__NR_userfaultfd, flags);
+#else
+    return -EINVAL;
+#endif
+}
+
 /**
  * uffd_query_features: query UFFD features
  *
@@ -32,7 +41,7 @@ int uffd_query_features(uint64_t *features)
     struct uffdio_api api_struct = { 0 };
     int ret = -1;
 
-    uffd_fd = syscall(__NR_userfaultfd, O_CLOEXEC);
+    uffd_fd = uffd_open(O_CLOEXEC);
     if (uffd_fd < 0) {
         trace_uffd_query_features_nosys(errno);
         return -1;
@@ -69,7 +78,7 @@ int uffd_create_fd(uint64_t features, bool non_blocking)
     uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);
 
     flags = O_CLOEXEC | (non_blocking ? O_NONBLOCK : 0);
-    uffd_fd = syscall(__NR_userfaultfd, flags);
+    uffd_fd = uffd_open(flags);
     if (uffd_fd < 0) {
         trace_uffd_create_fd_nosys(errno);
         return -1;
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-25 22:40 [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu
  2023-01-25 22:40 ` [PATCH 1/3] linux-headers: Update to v6.1 Peter Xu
  2023-01-25 22:40 ` [PATCH 2/3] util/userfaultfd: Add uffd_open() Peter Xu
@ 2023-01-25 22:40 ` Peter Xu
  2023-01-25 23:08   ` Philippe Mathieu-Daudé
  2023-01-26  9:02   ` Daniel P. Berrangé
  2023-01-26 14:13 ` [PATCH 0/3] " Michal Prívozník
  3 siblings, 2 replies; 23+ messages in thread
From: Peter Xu @ 2023-01-25 22:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, peterx, Dr . David Alan Gilbert,
	Juan Quintela

Teach QEMU to use /dev/userfaultfd when it existed and fallback to the
system call if either it's not there or doesn't have enough permission.

Firstly, as long as the app has permission to access /dev/userfaultfd, it
always have the ability to trap kernel faults which QEMU mostly wants.
Meanwhile, in some context (e.g. containers) the userfaultfd syscall can be
forbidden, so it can be the major way to use postcopy in a restricted
environment with strict seccomp setup.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 util/trace-events  |  1 +
 util/userfaultfd.c | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/util/trace-events b/util/trace-events
index c8f53d7d9f..16f78d8fe5 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -93,6 +93,7 @@ qemu_vfio_region_info(const char *desc, uint64_t region_ofs, uint64_t region_siz
 qemu_vfio_pci_map_bar(int index, uint64_t region_ofs, uint64_t region_size, int ofs, void *host) "map region bar#%d addr 0x%"PRIx64" size 0x%"PRIx64" ofs 0x%x host %p"
 
 #userfaultfd.c
+uffd_detect_open_mode(int mode) "%d"
 uffd_query_features_nosys(int err) "errno: %i"
 uffd_query_features_api_failed(int err) "errno: %i"
 uffd_create_fd_nosys(int err) "errno: %i"
diff --git a/util/userfaultfd.c b/util/userfaultfd.c
index 9845a2ec81..360ecf8084 100644
--- a/util/userfaultfd.c
+++ b/util/userfaultfd.c
@@ -18,10 +18,46 @@
 #include <poll.h>
 #include <sys/syscall.h>
 #include <sys/ioctl.h>
+#include <fcntl.h>
+
+typedef enum {
+    UFFD_UNINITIALIZED = 0,
+    UFFD_USE_DEV_PATH,
+    UFFD_USE_SYSCALL,
+} uffd_open_mode;
+
+static uffd_open_mode open_mode;
+static int uffd_dev;
+
+static uffd_open_mode uffd_detect_open_mode(void)
+{
+    if (open_mode == UFFD_UNINITIALIZED) {
+        /*
+         * Make /dev/userfaultfd the default approach because it has better
+         * permission controls, meanwhile allows kernel faults without any
+         * privilege requirement (e.g. SYS_CAP_PTRACE).
+         */
+        uffd_dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
+        if (uffd_dev >= 0) {
+            open_mode = UFFD_USE_DEV_PATH;
+        } else {
+            /* Fallback to the system call */
+            open_mode = UFFD_USE_SYSCALL;
+        }
+        trace_uffd_detect_open_mode(open_mode);
+    }
+
+    return open_mode;
+}
 
 int uffd_open(int flags)
 {
 #if defined(__linux__) && defined(__NR_userfaultfd)
+    if (uffd_detect_open_mode() == UFFD_USE_DEV_PATH) {
+        assert(uffd_dev >= 0);
+        return ioctl(uffd_dev, USERFAULTFD_IOC_NEW, flags);
+    }
+
     return syscall(__NR_userfaultfd, flags);
 #else
     return -EINVAL;
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] util/userfaultfd: Add uffd_open()
  2023-01-25 22:40 ` [PATCH 2/3] util/userfaultfd: Add uffd_open() Peter Xu
@ 2023-01-25 23:04   ` Philippe Mathieu-Daudé
  2023-01-26 15:58     ` Peter Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-01-25 23:04 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Leonardo Bras Soares Passos, Dr . David Alan Gilbert, Juan Quintela

On 25/1/23 23:40, Peter Xu wrote:
> Add a helper to create the uffd handle.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   include/qemu/userfaultfd.h   |  1 +
>   migration/postcopy-ram.c     | 11 +++++------
>   tests/qtest/migration-test.c |  3 ++-
>   util/userfaultfd.c           | 13 +++++++++++--
>   4 files changed, 19 insertions(+), 9 deletions(-)
> 
> diff --git a/include/qemu/userfaultfd.h b/include/qemu/userfaultfd.h
> index 6b74f92792..a19a05d5f7 100644
> --- a/include/qemu/userfaultfd.h
> +++ b/include/qemu/userfaultfd.h
> @@ -17,6 +17,7 @@
>   #include "exec/hwaddr.h"
>   #include <linux/userfaultfd.h>
>   
> +int uffd_open(int flags);

Preferably documenting what this function returns:
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-25 22:40 ` [PATCH 3/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu
@ 2023-01-25 23:08   ` Philippe Mathieu-Daudé
  2023-01-26 17:33     ` Peter Xu
  2023-01-26  9:02   ` Daniel P. Berrangé
  1 sibling, 1 reply; 23+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-01-25 23:08 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Leonardo Bras Soares Passos, Dr . David Alan Gilbert, Juan Quintela

On 25/1/23 23:40, Peter Xu wrote:
> Teach QEMU to use /dev/userfaultfd when it existed and fallback to the
> system call if either it's not there or doesn't have enough permission.
> 
> Firstly, as long as the app has permission to access /dev/userfaultfd, it
> always have the ability to trap kernel faults which QEMU mostly wants.
> Meanwhile, in some context (e.g. containers) the userfaultfd syscall can be
> forbidden, so it can be the major way to use postcopy in a restricted
> environment with strict seccomp setup.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   util/trace-events  |  1 +
>   util/userfaultfd.c | 36 ++++++++++++++++++++++++++++++++++++
>   2 files changed, 37 insertions(+)
> 
> diff --git a/util/trace-events b/util/trace-events
> index c8f53d7d9f..16f78d8fe5 100644
> --- a/util/trace-events
> +++ b/util/trace-events
> @@ -93,6 +93,7 @@ qemu_vfio_region_info(const char *desc, uint64_t region_ofs, uint64_t region_siz
>   qemu_vfio_pci_map_bar(int index, uint64_t region_ofs, uint64_t region_size, int ofs, void *host) "map region bar#%d addr 0x%"PRIx64" size 0x%"PRIx64" ofs 0x%x host %p"
>   
>   #userfaultfd.c
> +uffd_detect_open_mode(int mode) "%d"
>   uffd_query_features_nosys(int err) "errno: %i"
>   uffd_query_features_api_failed(int err) "errno: %i"
>   uffd_create_fd_nosys(int err) "errno: %i"
> diff --git a/util/userfaultfd.c b/util/userfaultfd.c
> index 9845a2ec81..360ecf8084 100644
> --- a/util/userfaultfd.c
> +++ b/util/userfaultfd.c
> @@ -18,10 +18,46 @@
>   #include <poll.h>
>   #include <sys/syscall.h>
>   #include <sys/ioctl.h>
> +#include <fcntl.h>
> +
> +typedef enum {
> +    UFFD_UNINITIALIZED = 0,
> +    UFFD_USE_DEV_PATH,
> +    UFFD_USE_SYSCALL,
> +} uffd_open_mode;
> +
> +static uffd_open_mode open_mode;

'open_mode' could be reduced to uffd_detect_open_mode()'s
scope.

> +static int uffd_dev;
> +
> +static uffd_open_mode uffd_detect_open_mode(void)
> +{
> +    if (open_mode == UFFD_UNINITIALIZED) {
> +        /*
> +         * Make /dev/userfaultfd the default approach because it has better
> +         * permission controls, meanwhile allows kernel faults without any
> +         * privilege requirement (e.g. SYS_CAP_PTRACE).
> +         */
> +        uffd_dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
> +        if (uffd_dev >= 0) {
> +            open_mode = UFFD_USE_DEV_PATH;
> +        } else {
> +            /* Fallback to the system call */
> +            open_mode = UFFD_USE_SYSCALL;
> +        }
> +        trace_uffd_detect_open_mode(open_mode);
> +    }
> +
> +    return open_mode;

If 'open_mode' isn't relevant, this function could return uffd_dev/-1 
instead. Not really an improvement :)

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>

> +}
>   
>   int uffd_open(int flags)
>   {
>   #if defined(__linux__) && defined(__NR_userfaultfd)
> +    if (uffd_detect_open_mode() == UFFD_USE_DEV_PATH) {
> +        assert(uffd_dev >= 0);
> +        return ioctl(uffd_dev, USERFAULTFD_IOC_NEW, flags);
> +    }
> +
>       return syscall(__NR_userfaultfd, flags);
>   #else
>       return -EINVAL;



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-25 22:40 ` [PATCH 3/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu
  2023-01-25 23:08   ` Philippe Mathieu-Daudé
@ 2023-01-26  9:02   ` Daniel P. Berrangé
  2023-01-26  9:05     ` Daniel P. Berrangé
  1 sibling, 1 reply; 23+ messages in thread
From: Daniel P. Berrangé @ 2023-01-26  9:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, Dr . David Alan Gilbert,
	Juan Quintela

On Wed, Jan 25, 2023 at 05:40:16PM -0500, Peter Xu wrote:
> Teach QEMU to use /dev/userfaultfd when it existed and fallback to the
> system call if either it's not there or doesn't have enough permission.
> 
> Firstly, as long as the app has permission to access /dev/userfaultfd, it
> always have the ability to trap kernel faults which QEMU mostly wants.
> Meanwhile, in some context (e.g. containers) the userfaultfd syscall can be
> forbidden, so it can be the major way to use postcopy in a restricted
> environment with strict seccomp setup.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  util/trace-events  |  1 +
>  util/userfaultfd.c | 36 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 37 insertions(+)
> 
> diff --git a/util/trace-events b/util/trace-events
> index c8f53d7d9f..16f78d8fe5 100644
> --- a/util/trace-events
> +++ b/util/trace-events
> @@ -93,6 +93,7 @@ qemu_vfio_region_info(const char *desc, uint64_t region_ofs, uint64_t region_siz
>  qemu_vfio_pci_map_bar(int index, uint64_t region_ofs, uint64_t region_size, int ofs, void *host) "map region bar#%d addr 0x%"PRIx64" size 0x%"PRIx64" ofs 0x%x host %p"
>  
>  #userfaultfd.c
> +uffd_detect_open_mode(int mode) "%d"
>  uffd_query_features_nosys(int err) "errno: %i"
>  uffd_query_features_api_failed(int err) "errno: %i"
>  uffd_create_fd_nosys(int err) "errno: %i"
> diff --git a/util/userfaultfd.c b/util/userfaultfd.c
> index 9845a2ec81..360ecf8084 100644
> --- a/util/userfaultfd.c
> +++ b/util/userfaultfd.c
> @@ -18,10 +18,46 @@
>  #include <poll.h>
>  #include <sys/syscall.h>
>  #include <sys/ioctl.h>
> +#include <fcntl.h>
> +
> +typedef enum {
> +    UFFD_UNINITIALIZED = 0,
> +    UFFD_USE_DEV_PATH,
> +    UFFD_USE_SYSCALL,
> +} uffd_open_mode;
> +
> +static uffd_open_mode open_mode;
> +static int uffd_dev;
> +
> +static uffd_open_mode uffd_detect_open_mode(void)
> +{
> +    if (open_mode == UFFD_UNINITIALIZED) {
> +        /*
> +         * Make /dev/userfaultfd the default approach because it has better
> +         * permission controls, meanwhile allows kernel faults without any
> +         * privilege requirement (e.g. SYS_CAP_PTRACE).
> +         */
> +        uffd_dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);

qemu_open(), otherwise FD passing from the mgmt app won't work.

> +        if (uffd_dev >= 0) {
> +            open_mode = UFFD_USE_DEV_PATH;
> +        } else {
> +            /* Fallback to the system call */
> +            open_mode = UFFD_USE_SYSCALL;
> +        }
> +        trace_uffd_detect_open_mode(open_mode);
> +    }
> +
> +    return open_mode;
> +}

This leaves the /dev/userfaultfd FD open forever once it has been used
once. Is this really needed ? IIUC, the place where we call this is
not going to be impacted if we open + close it every time we need to
create a new FD, and it'll simplify this code right down.

>  
>  int uffd_open(int flags)
>  {
>  #if defined(__linux__) && defined(__NR_userfaultfd)
> +    if (uffd_detect_open_mode() == UFFD_USE_DEV_PATH) {
> +        assert(uffd_dev >= 0);
> +        return ioctl(uffd_dev, USERFAULTFD_IOC_NEW, flags);
> +    }
> +
>      return syscall(__NR_userfaultfd, flags);
>  #else
>      return -EINVAL;
> -- 
> 2.37.3
> 
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-26  9:02   ` Daniel P. Berrangé
@ 2023-01-26  9:05     ` Daniel P. Berrangé
  2023-01-26 20:03       ` Peter Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel P. Berrangé @ 2023-01-26  9:05 UTC (permalink / raw)
  To: Peter Xu, qemu-devel, Leonardo Bras Soares Passos,
	Dr . David Alan Gilbert, Juan Quintela

On Thu, Jan 26, 2023 at 09:02:09AM +0000, Daniel P. Berrangé wrote:
> On Wed, Jan 25, 2023 at 05:40:16PM -0500, Peter Xu wrote:
> > Teach QEMU to use /dev/userfaultfd when it existed and fallback to the
> > system call if either it's not there or doesn't have enough permission.
> > 
> > Firstly, as long as the app has permission to access /dev/userfaultfd, it
> > always have the ability to trap kernel faults which QEMU mostly wants.
> > Meanwhile, in some context (e.g. containers) the userfaultfd syscall can be
> > forbidden, so it can be the major way to use postcopy in a restricted
> > environment with strict seccomp setup.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  util/trace-events  |  1 +
> >  util/userfaultfd.c | 36 ++++++++++++++++++++++++++++++++++++
> >  2 files changed, 37 insertions(+)
> > 
> > diff --git a/util/trace-events b/util/trace-events
> > index c8f53d7d9f..16f78d8fe5 100644
> > --- a/util/trace-events
> > +++ b/util/trace-events
> > @@ -93,6 +93,7 @@ qemu_vfio_region_info(const char *desc, uint64_t region_ofs, uint64_t region_siz
> >  qemu_vfio_pci_map_bar(int index, uint64_t region_ofs, uint64_t region_size, int ofs, void *host) "map region bar#%d addr 0x%"PRIx64" size 0x%"PRIx64" ofs 0x%x host %p"
> >  
> >  #userfaultfd.c
> > +uffd_detect_open_mode(int mode) "%d"
> >  uffd_query_features_nosys(int err) "errno: %i"
> >  uffd_query_features_api_failed(int err) "errno: %i"
> >  uffd_create_fd_nosys(int err) "errno: %i"
> > diff --git a/util/userfaultfd.c b/util/userfaultfd.c
> > index 9845a2ec81..360ecf8084 100644
> > --- a/util/userfaultfd.c
> > +++ b/util/userfaultfd.c
> > @@ -18,10 +18,46 @@
> >  #include <poll.h>
> >  #include <sys/syscall.h>
> >  #include <sys/ioctl.h>
> > +#include <fcntl.h>
> > +
> > +typedef enum {
> > +    UFFD_UNINITIALIZED = 0,
> > +    UFFD_USE_DEV_PATH,
> > +    UFFD_USE_SYSCALL,
> > +} uffd_open_mode;
> > +
> > +static uffd_open_mode open_mode;
> > +static int uffd_dev;
> > +
> > +static uffd_open_mode uffd_detect_open_mode(void)
> > +{
> > +    if (open_mode == UFFD_UNINITIALIZED) {
> > +        /*
> > +         * Make /dev/userfaultfd the default approach because it has better
> > +         * permission controls, meanwhile allows kernel faults without any
> > +         * privilege requirement (e.g. SYS_CAP_PTRACE).
> > +         */
> > +        uffd_dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
> 
> qemu_open(), otherwise FD passing from the mgmt app won't work.
> 
> > +        if (uffd_dev >= 0) {
> > +            open_mode = UFFD_USE_DEV_PATH;
> > +        } else {
> > +            /* Fallback to the system call */
> > +            open_mode = UFFD_USE_SYSCALL;
> > +        }
> > +        trace_uffd_detect_open_mode(open_mode);
> > +    }
> > +
> > +    return open_mode;
> > +}
> 
> This leaves the /dev/userfaultfd FD open forever once it has been used
> once. Is this really needed ? IIUC, the place where we call this is
> not going to be impacted if we open + close it every time we need to
> create a new FD, and it'll simplify this code right down.

Having said that, if we want to support passing the FD in from the
mgmt app, we need to keep it open persistently.

> 
> >  
> >  int uffd_open(int flags)
> >  {
> >  #if defined(__linux__) && defined(__NR_userfaultfd)
> > +    if (uffd_detect_open_mode() == UFFD_USE_DEV_PATH) {
> > +        assert(uffd_dev >= 0);
> > +        return ioctl(uffd_dev, USERFAULTFD_IOC_NEW, flags);
> > +    }
> > +
> >      return syscall(__NR_userfaultfd, flags);
> >  #else
> >      return -EINVAL;
> > -- 
> > 2.37.3
> > 
> > 
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-25 22:40 [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu
                   ` (2 preceding siblings ...)
  2023-01-25 22:40 ` [PATCH 3/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu
@ 2023-01-26 14:13 ` Michal Prívozník
  2023-01-26 14:15   ` Dr. David Alan Gilbert
  3 siblings, 1 reply; 23+ messages in thread
From: Michal Prívozník @ 2023-01-26 14:13 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Leonardo Bras Soares Passos, Dr . David Alan Gilbert, Juan Quintela

On 1/25/23 23:40, Peter Xu wrote:
> The new /dev/userfaultfd handle is superior to the system call with a
> better permission control and also works for a restricted seccomp
> environment.
> 
> The new device was only introduced in v6.1 so we need a header update.
> 
> Please have a look, thanks.

I was wondering whether it would make sense/be possible for mgmt app
(libvirt) to pass FD for /dev/userfaultfd instead of QEMU opening it
itself. But looking into the code, libvirt would need to do that when
spawning QEMU because that's when QEMU itself initializes internal state
and queries userfaultfd caps.

Michal



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-26 14:13 ` [PATCH 0/3] " Michal Prívozník
@ 2023-01-26 14:15   ` Dr. David Alan Gilbert
  2023-01-26 15:25     ` Peter Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-26 14:15 UTC (permalink / raw)
  To: Michal Prívozník
  Cc: Peter Xu, qemu-devel, Leonardo Bras Soares Passos, Juan Quintela

* Michal Prívozník (mprivozn@redhat.com) wrote:
> On 1/25/23 23:40, Peter Xu wrote:
> > The new /dev/userfaultfd handle is superior to the system call with a
> > better permission control and also works for a restricted seccomp
> > environment.
> > 
> > The new device was only introduced in v6.1 so we need a header update.
> > 
> > Please have a look, thanks.
> 
> I was wondering whether it would make sense/be possible for mgmt app
> (libvirt) to pass FD for /dev/userfaultfd instead of QEMU opening it
> itself. But looking into the code, libvirt would need to do that when
> spawning QEMU because that's when QEMU itself initializes internal state
> and queries userfaultfd caps.

You also have to be careful about what the userfaultfd semantics are; I
can't remember them - but if you open it in one process and pass it to
another process, which processes address space are you trying to
monitor?

Dave

> Michal
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-26 14:15   ` Dr. David Alan Gilbert
@ 2023-01-26 15:25     ` Peter Xu
  2023-01-26 15:29       ` Michal Prívozník
  2023-01-26 15:59       ` Daniel P. Berrangé
  0 siblings, 2 replies; 23+ messages in thread
From: Peter Xu @ 2023-01-26 15:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Michal Prívozník, qemu-devel,
	Leonardo Bras Soares Passos, Juan Quintela, Daniel P. Berrange

On Thu, Jan 26, 2023 at 02:15:11PM +0000, Dr. David Alan Gilbert wrote:
> * Michal Prívozník (mprivozn@redhat.com) wrote:
> > On 1/25/23 23:40, Peter Xu wrote:
> > > The new /dev/userfaultfd handle is superior to the system call with a
> > > better permission control and also works for a restricted seccomp
> > > environment.
> > > 
> > > The new device was only introduced in v6.1 so we need a header update.
> > > 
> > > Please have a look, thanks.
> > 
> > I was wondering whether it would make sense/be possible for mgmt app
> > (libvirt) to pass FD for /dev/userfaultfd instead of QEMU opening it
> > itself. But looking into the code, libvirt would need to do that when
> > spawning QEMU because that's when QEMU itself initializes internal state
> > and queries userfaultfd caps.
> 
> You also have to be careful about what the userfaultfd semantics are; I
> can't remember them - but if you open it in one process and pass it to
> another process, which processes address space are you trying to
> monitor?

Yes it's a problem.  The kernel always fetches the current mm_struct* which
represents the current context of virtual address space when creating the
uffd handle (for either the syscall or the ioctl() approach).

It works only if Libvirt will invoke QEMU as a thread and they'll share the
same address space.

Why libvirt would like to do so?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-26 15:25     ` Peter Xu
@ 2023-01-26 15:29       ` Michal Prívozník
  2023-01-26 15:49         ` Peter Xu
  2023-01-26 15:59       ` Daniel P. Berrangé
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Prívozník @ 2023-01-26 15:29 UTC (permalink / raw)
  To: Peter Xu, Dr. David Alan Gilbert
  Cc: qemu-devel, Leonardo Bras Soares Passos, Juan Quintela,
	Daniel P. Berrange

On 1/26/23 16:25, Peter Xu wrote:
> On Thu, Jan 26, 2023 at 02:15:11PM +0000, Dr. David Alan Gilbert wrote:
>> * Michal Prívozník (mprivozn@redhat.com) wrote:
>>> On 1/25/23 23:40, Peter Xu wrote:
>>>> The new /dev/userfaultfd handle is superior to the system call with a
>>>> better permission control and also works for a restricted seccomp
>>>> environment.
>>>>
>>>> The new device was only introduced in v6.1 so we need a header update.
>>>>
>>>> Please have a look, thanks.
>>>
>>> I was wondering whether it would make sense/be possible for mgmt app
>>> (libvirt) to pass FD for /dev/userfaultfd instead of QEMU opening it
>>> itself. But looking into the code, libvirt would need to do that when
>>> spawning QEMU because that's when QEMU itself initializes internal state
>>> and queries userfaultfd caps.
>>
>> You also have to be careful about what the userfaultfd semantics are; I
>> can't remember them - but if you open it in one process and pass it to
>> another process, which processes address space are you trying to
>> monitor?
> 
> Yes it's a problem.  The kernel always fetches the current mm_struct* which
> represents the current context of virtual address space when creating the
> uffd handle (for either the syscall or the ioctl() approach).

Ah, I did not realize that.

> 
> It works only if Libvirt will invoke QEMU as a thread and they'll share the
> same address space.
> 
> Why libvirt would like to do so?

Well, we tend to pass files as FD more and more, because it allows us to
give access to "privileged" files to unprivileged process. What I did
not realize is that userfaultfd is different, not yet another file.

Michal



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-26 15:29       ` Michal Prívozník
@ 2023-01-26 15:49         ` Peter Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Xu @ 2023-01-26 15:49 UTC (permalink / raw)
  To: Michal Prívozník
  Cc: Dr. David Alan Gilbert, qemu-devel, Leonardo Bras Soares Passos,
	Juan Quintela, Daniel P. Berrange

On Thu, Jan 26, 2023 at 04:29:10PM +0100, Michal Prívozník wrote:
> On 1/26/23 16:25, Peter Xu wrote:
> > On Thu, Jan 26, 2023 at 02:15:11PM +0000, Dr. David Alan Gilbert wrote:
> >> * Michal Prívozník (mprivozn@redhat.com) wrote:
> >>> On 1/25/23 23:40, Peter Xu wrote:
> >>>> The new /dev/userfaultfd handle is superior to the system call with a
> >>>> better permission control and also works for a restricted seccomp
> >>>> environment.
> >>>>
> >>>> The new device was only introduced in v6.1 so we need a header update.
> >>>>
> >>>> Please have a look, thanks.
> >>>
> >>> I was wondering whether it would make sense/be possible for mgmt app
> >>> (libvirt) to pass FD for /dev/userfaultfd instead of QEMU opening it
> >>> itself. But looking into the code, libvirt would need to do that when
> >>> spawning QEMU because that's when QEMU itself initializes internal state
> >>> and queries userfaultfd caps.
> >>
> >> You also have to be careful about what the userfaultfd semantics are; I
> >> can't remember them - but if you open it in one process and pass it to
> >> another process, which processes address space are you trying to
> >> monitor?
> > 
> > Yes it's a problem.  The kernel always fetches the current mm_struct* which
> > represents the current context of virtual address space when creating the
> > uffd handle (for either the syscall or the ioctl() approach).
> 
> Ah, I did not realize that.
> 
> > 
> > It works only if Libvirt will invoke QEMU as a thread and they'll share the
> > same address space.
> > 
> > Why libvirt would like to do so?
> 
> Well, we tend to pass files as FD more and more, because it allows us to
> give access to "privileged" files to unprivileged process. What I did
> not realize is that userfaultfd is different, not yet another file.

I see.  Yes uffd is special comparing to most of the other fds, IMHO
majorly because it's a resource not being public but closely bound to the
process context of the mm.

There used to have proposals that grant permission to open uffd handle for
other processes, but the security implication was still not fully clear and
that discussion discontinued.

Then the question is whether there is still any scenario that QEMU may not
have privilege to either /dev/userfaultfd or using the syscall.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] util/userfaultfd: Add uffd_open()
  2023-01-25 23:04   ` Philippe Mathieu-Daudé
@ 2023-01-26 15:58     ` Peter Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Xu @ 2023-01-26 15:58 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: qemu-devel, Leonardo Bras Soares Passos, Dr . David Alan Gilbert,
	Juan Quintela

On Thu, Jan 26, 2023 at 12:04:35AM +0100, Philippe Mathieu-Daudé wrote:
> On 25/1/23 23:40, Peter Xu wrote:
> > Add a helper to create the uffd handle.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   include/qemu/userfaultfd.h   |  1 +
> >   migration/postcopy-ram.c     | 11 +++++------
> >   tests/qtest/migration-test.c |  3 ++-
> >   util/userfaultfd.c           | 13 +++++++++++--
> >   4 files changed, 19 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/qemu/userfaultfd.h b/include/qemu/userfaultfd.h
> > index 6b74f92792..a19a05d5f7 100644
> > --- a/include/qemu/userfaultfd.h
> > +++ b/include/qemu/userfaultfd.h
> > @@ -17,6 +17,7 @@
> >   #include "exec/hwaddr.h"
> >   #include <linux/userfaultfd.h>
> > +int uffd_open(int flags);
> 
> Preferably documenting what this function returns:

Sure thing.

> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-26 15:25     ` Peter Xu
  2023-01-26 15:29       ` Michal Prívozník
@ 2023-01-26 15:59       ` Daniel P. Berrangé
  2023-01-26 17:26         ` Peter Xu
  1 sibling, 1 reply; 23+ messages in thread
From: Daniel P. Berrangé @ 2023-01-26 15:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: Dr. David Alan Gilbert, Michal Prívozník, qemu-devel,
	Leonardo Bras Soares Passos, Juan Quintela

On Thu, Jan 26, 2023 at 10:25:05AM -0500, Peter Xu wrote:
> On Thu, Jan 26, 2023 at 02:15:11PM +0000, Dr. David Alan Gilbert wrote:
> > * Michal Prívozník (mprivozn@redhat.com) wrote:
> > > On 1/25/23 23:40, Peter Xu wrote:
> > > > The new /dev/userfaultfd handle is superior to the system call with a
> > > > better permission control and also works for a restricted seccomp
> > > > environment.
> > > > 
> > > > The new device was only introduced in v6.1 so we need a header update.
> > > > 
> > > > Please have a look, thanks.
> > > 
> > > I was wondering whether it would make sense/be possible for mgmt app
> > > (libvirt) to pass FD for /dev/userfaultfd instead of QEMU opening it
> > > itself. But looking into the code, libvirt would need to do that when
> > > spawning QEMU because that's when QEMU itself initializes internal state
> > > and queries userfaultfd caps.
> > 
> > You also have to be careful about what the userfaultfd semantics are; I
> > can't remember them - but if you open it in one process and pass it to
> > another process, which processes address space are you trying to
> > monitor?
> 
> Yes it's a problem.  The kernel always fetches the current mm_struct* which
> represents the current context of virtual address space when creating the
> uffd handle (for either the syscall or the ioctl() approach).

At what point does the process address space get associated ? When
the /dev/userfaultfd is opened, or only when ioctl(USERFAULTFD_IOC_NEW)
is called ?  If it is the former, then we have no choice, QEMU must open
it. if it is the latter, then libvirt can open /dev/userfaultfd, pass
it to QEMU which can then do the ioctl(USERFAULTFD_IOC_NEW).

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-26 15:59       ` Daniel P. Berrangé
@ 2023-01-26 17:26         ` Peter Xu
  2023-01-31 19:48           ` Peter Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2023-01-26 17:26 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Dr. David Alan Gilbert, Michal Prívozník, qemu-devel,
	Leonardo Bras Soares Passos, Juan Quintela

On Thu, Jan 26, 2023 at 03:59:33PM +0000, Daniel P. Berrangé wrote:
> On Thu, Jan 26, 2023 at 10:25:05AM -0500, Peter Xu wrote:
> > On Thu, Jan 26, 2023 at 02:15:11PM +0000, Dr. David Alan Gilbert wrote:
> > > * Michal Prívozník (mprivozn@redhat.com) wrote:
> > > > On 1/25/23 23:40, Peter Xu wrote:
> > > > > The new /dev/userfaultfd handle is superior to the system call with a
> > > > > better permission control and also works for a restricted seccomp
> > > > > environment.
> > > > > 
> > > > > The new device was only introduced in v6.1 so we need a header update.
> > > > > 
> > > > > Please have a look, thanks.
> > > > 
> > > > I was wondering whether it would make sense/be possible for mgmt app
> > > > (libvirt) to pass FD for /dev/userfaultfd instead of QEMU opening it
> > > > itself. But looking into the code, libvirt would need to do that when
> > > > spawning QEMU because that's when QEMU itself initializes internal state
> > > > and queries userfaultfd caps.
> > > 
> > > You also have to be careful about what the userfaultfd semantics are; I
> > > can't remember them - but if you open it in one process and pass it to
> > > another process, which processes address space are you trying to
> > > monitor?
> > 
> > Yes it's a problem.  The kernel always fetches the current mm_struct* which
> > represents the current context of virtual address space when creating the
> > uffd handle (for either the syscall or the ioctl() approach).
> 
> At what point does the process address space get associated ? When
> the /dev/userfaultfd is opened, or only when ioctl(USERFAULTFD_IOC_NEW)
> is called ?  If it is the former, then we have no choice, QEMU must open
> it. if it is the latter, then libvirt can open /dev/userfaultfd, pass
> it to QEMU which can then do the ioctl(USERFAULTFD_IOC_NEW).

Good point.. It should be the latter, so should be doable.

What should be the best interface for QEMU to detect the fd passing over to
it?  IIUC qemu_open() requires the name to be /dev/fdset/*, but there's no
existing cmdline that QEMU can know which fd number to fetch from fdset to
be used as the /dev/userfaultfd descriptor.

monitor_get_fd() seems more proper, where we can define an unique string so
Libvirt can preset the descriptor with the same string attached to it, then
I can opt-in monitor_get_fd() before trying to open() or doing the syscall.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-25 23:08   ` Philippe Mathieu-Daudé
@ 2023-01-26 17:33     ` Peter Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Xu @ 2023-01-26 17:33 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: qemu-devel, Leonardo Bras Soares Passos, Dr . David Alan Gilbert,
	Juan Quintela

On Thu, Jan 26, 2023 at 12:08:33AM +0100, Philippe Mathieu-Daudé wrote:
> On 25/1/23 23:40, Peter Xu wrote:
> > Teach QEMU to use /dev/userfaultfd when it existed and fallback to the
> > system call if either it's not there or doesn't have enough permission.
> > 
> > Firstly, as long as the app has permission to access /dev/userfaultfd, it
> > always have the ability to trap kernel faults which QEMU mostly wants.
> > Meanwhile, in some context (e.g. containers) the userfaultfd syscall can be
> > forbidden, so it can be the major way to use postcopy in a restricted
> > environment with strict seccomp setup.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   util/trace-events  |  1 +
> >   util/userfaultfd.c | 36 ++++++++++++++++++++++++++++++++++++
> >   2 files changed, 37 insertions(+)
> > 
> > diff --git a/util/trace-events b/util/trace-events
> > index c8f53d7d9f..16f78d8fe5 100644
> > --- a/util/trace-events
> > +++ b/util/trace-events
> > @@ -93,6 +93,7 @@ qemu_vfio_region_info(const char *desc, uint64_t region_ofs, uint64_t region_siz
> >   qemu_vfio_pci_map_bar(int index, uint64_t region_ofs, uint64_t region_size, int ofs, void *host) "map region bar#%d addr 0x%"PRIx64" size 0x%"PRIx64" ofs 0x%x host %p"
> >   #userfaultfd.c
> > +uffd_detect_open_mode(int mode) "%d"
> >   uffd_query_features_nosys(int err) "errno: %i"
> >   uffd_query_features_api_failed(int err) "errno: %i"
> >   uffd_create_fd_nosys(int err) "errno: %i"
> > diff --git a/util/userfaultfd.c b/util/userfaultfd.c
> > index 9845a2ec81..360ecf8084 100644
> > --- a/util/userfaultfd.c
> > +++ b/util/userfaultfd.c
> > @@ -18,10 +18,46 @@
> >   #include <poll.h>
> >   #include <sys/syscall.h>
> >   #include <sys/ioctl.h>
> > +#include <fcntl.h>
> > +
> > +typedef enum {
> > +    UFFD_UNINITIALIZED = 0,
> > +    UFFD_USE_DEV_PATH,
> > +    UFFD_USE_SYSCALL,
> > +} uffd_open_mode;
> > +
> > +static uffd_open_mode open_mode;
> 
> 'open_mode' could be reduced to uffd_detect_open_mode()'s
> scope.

Yes, will do.

> 
> > +static int uffd_dev;
> > +
> > +static uffd_open_mode uffd_detect_open_mode(void)
> > +{
> > +    if (open_mode == UFFD_UNINITIALIZED) {
> > +        /*
> > +         * Make /dev/userfaultfd the default approach because it has better
> > +         * permission controls, meanwhile allows kernel faults without any
> > +         * privilege requirement (e.g. SYS_CAP_PTRACE).
> > +         */
> > +        uffd_dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
> > +        if (uffd_dev >= 0) {
> > +            open_mode = UFFD_USE_DEV_PATH;
> > +        } else {
> > +            /* Fallback to the system call */
> > +            open_mode = UFFD_USE_SYSCALL;
> > +        }
> > +        trace_uffd_detect_open_mode(open_mode);
> > +    }
> > +
> > +    return open_mode;
> 
> If 'open_mode' isn't relevant, this function could return uffd_dev/-1
> instead. Not really an improvement :)

Logically I think the two variables can be squashed into one.  I kept that
for clearance just to easily identify e.g. uffd_dev is not chosen to be
used, or uffd_dev open failed.

> 
> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>

Thanks, Phil.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-26  9:05     ` Daniel P. Berrangé
@ 2023-01-26 20:03       ` Peter Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Xu @ 2023-01-26 20:03 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, Leonardo Bras Soares Passos, Dr . David Alan Gilbert,
	Juan Quintela

On Thu, Jan 26, 2023 at 09:05:01AM +0000, Daniel P. Berrangé wrote:
> On Thu, Jan 26, 2023 at 09:02:09AM +0000, Daniel P. Berrangé wrote:
> > On Wed, Jan 25, 2023 at 05:40:16PM -0500, Peter Xu wrote:
> > > Teach QEMU to use /dev/userfaultfd when it existed and fallback to the
> > > system call if either it's not there or doesn't have enough permission.
> > > 
> > > Firstly, as long as the app has permission to access /dev/userfaultfd, it
> > > always have the ability to trap kernel faults which QEMU mostly wants.
> > > Meanwhile, in some context (e.g. containers) the userfaultfd syscall can be
> > > forbidden, so it can be the major way to use postcopy in a restricted
> > > environment with strict seccomp setup.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  util/trace-events  |  1 +
> > >  util/userfaultfd.c | 36 ++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 37 insertions(+)
> > > 
> > > diff --git a/util/trace-events b/util/trace-events
> > > index c8f53d7d9f..16f78d8fe5 100644
> > > --- a/util/trace-events
> > > +++ b/util/trace-events
> > > @@ -93,6 +93,7 @@ qemu_vfio_region_info(const char *desc, uint64_t region_ofs, uint64_t region_siz
> > >  qemu_vfio_pci_map_bar(int index, uint64_t region_ofs, uint64_t region_size, int ofs, void *host) "map region bar#%d addr 0x%"PRIx64" size 0x%"PRIx64" ofs 0x%x host %p"
> > >  
> > >  #userfaultfd.c
> > > +uffd_detect_open_mode(int mode) "%d"
> > >  uffd_query_features_nosys(int err) "errno: %i"
> > >  uffd_query_features_api_failed(int err) "errno: %i"
> > >  uffd_create_fd_nosys(int err) "errno: %i"
> > > diff --git a/util/userfaultfd.c b/util/userfaultfd.c
> > > index 9845a2ec81..360ecf8084 100644
> > > --- a/util/userfaultfd.c
> > > +++ b/util/userfaultfd.c
> > > @@ -18,10 +18,46 @@
> > >  #include <poll.h>
> > >  #include <sys/syscall.h>
> > >  #include <sys/ioctl.h>
> > > +#include <fcntl.h>
> > > +
> > > +typedef enum {
> > > +    UFFD_UNINITIALIZED = 0,
> > > +    UFFD_USE_DEV_PATH,
> > > +    UFFD_USE_SYSCALL,
> > > +} uffd_open_mode;
> > > +
> > > +static uffd_open_mode open_mode;
> > > +static int uffd_dev;
> > > +
> > > +static uffd_open_mode uffd_detect_open_mode(void)
> > > +{
> > > +    if (open_mode == UFFD_UNINITIALIZED) {
> > > +        /*
> > > +         * Make /dev/userfaultfd the default approach because it has better
> > > +         * permission controls, meanwhile allows kernel faults without any
> > > +         * privilege requirement (e.g. SYS_CAP_PTRACE).
> > > +         */
> > > +        uffd_dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
> > 
> > qemu_open(), otherwise FD passing from the mgmt app won't work.

[I've followed this up in the other thread on interfacing libvirt, so will
 skip here] 

> > 
> > > +        if (uffd_dev >= 0) {
> > > +            open_mode = UFFD_USE_DEV_PATH;
> > > +        } else {
> > > +            /* Fallback to the system call */
> > > +            open_mode = UFFD_USE_SYSCALL;
> > > +        }
> > > +        trace_uffd_detect_open_mode(open_mode);
> > > +    }
> > > +
> > > +    return open_mode;
> > > +}
> > 
> > This leaves the /dev/userfaultfd FD open forever once it has been used
> > once. Is this really needed ? IIUC, the place where we call this is
> > not going to be impacted if we open + close it every time we need to
> > create a new FD, and it'll simplify this code right down.
> 
> Having said that, if we want to support passing the FD in from the
> mgmt app, we need to keep it open persistently.

Right, since the plan is to further support libvirt, I'll keep it as is.

Meanwhile, right now QEMU detects uffd features by creating an uffd and
quickly close it, it's also efficient to keep it when it's firstly opened.
IIRC for each postcopy procedure we'll open uffd at least three times
during different phases.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-26 17:26         ` Peter Xu
@ 2023-01-31 19:48           ` Peter Xu
  2023-01-31 20:06             ` Daniel P. Berrangé
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2023-01-31 19:48 UTC (permalink / raw)
  To: Daniel P. Berrangé, Michal Prívozník
  Cc: Dr. David Alan Gilbert, Michal Prívozník, qemu-devel,
	Leonardo Bras Soares Passos, Juan Quintela

On Thu, Jan 26, 2023 at 12:26:45PM -0500, Peter Xu wrote:
> On Thu, Jan 26, 2023 at 03:59:33PM +0000, Daniel P. Berrangé wrote:
> > On Thu, Jan 26, 2023 at 10:25:05AM -0500, Peter Xu wrote:
> > > On Thu, Jan 26, 2023 at 02:15:11PM +0000, Dr. David Alan Gilbert wrote:
> > > > * Michal Prívozník (mprivozn@redhat.com) wrote:
> > > > > On 1/25/23 23:40, Peter Xu wrote:
> > > > > > The new /dev/userfaultfd handle is superior to the system call with a
> > > > > > better permission control and also works for a restricted seccomp
> > > > > > environment.
> > > > > > 
> > > > > > The new device was only introduced in v6.1 so we need a header update.
> > > > > > 
> > > > > > Please have a look, thanks.
> > > > > 
> > > > > I was wondering whether it would make sense/be possible for mgmt app
> > > > > (libvirt) to pass FD for /dev/userfaultfd instead of QEMU opening it
> > > > > itself. But looking into the code, libvirt would need to do that when
> > > > > spawning QEMU because that's when QEMU itself initializes internal state
> > > > > and queries userfaultfd caps.
> > > > 
> > > > You also have to be careful about what the userfaultfd semantics are; I
> > > > can't remember them - but if you open it in one process and pass it to
> > > > another process, which processes address space are you trying to
> > > > monitor?
> > > 
> > > Yes it's a problem.  The kernel always fetches the current mm_struct* which
> > > represents the current context of virtual address space when creating the
> > > uffd handle (for either the syscall or the ioctl() approach).
> > 
> > At what point does the process address space get associated ? When
> > the /dev/userfaultfd is opened, or only when ioctl(USERFAULTFD_IOC_NEW)
> > is called ?  If it is the former, then we have no choice, QEMU must open
> > it. if it is the latter, then libvirt can open /dev/userfaultfd, pass
> > it to QEMU which can then do the ioctl(USERFAULTFD_IOC_NEW).
> 
> Good point.. It should be the latter, so should be doable.
> 
> What should be the best interface for QEMU to detect the fd passing over to
> it?  IIUC qemu_open() requires the name to be /dev/fdset/*, but there's no
> existing cmdline that QEMU can know which fd number to fetch from fdset to
> be used as the /dev/userfaultfd descriptor.
> 
> monitor_get_fd() seems more proper, where we can define an unique string so
> Libvirt can preset the descriptor with the same string attached to it, then
> I can opt-in monitor_get_fd() before trying to open() or doing the syscall.

Daniel/Michal, any input here from Libvirt side?

I just noticed that monitor_get_fd() is bound to a specific monitor, then
it seems not clear which one is from Libvirt.  If to use qemu_open() and
add-fd I think we need another QEMU cmdline to set the fd path, iiuc.

I can also leave that for later if opening /dev/userfaultfd is already
resolving the immediate problem in containers.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-31 19:48           ` Peter Xu
@ 2023-01-31 20:06             ` Daniel P. Berrangé
  2023-01-31 21:01               ` Peter Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel P. Berrangé @ 2023-01-31 20:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: Michal Prívozník, Dr. David Alan Gilbert, qemu-devel,
	Leonardo Bras Soares Passos, Juan Quintela

On Tue, Jan 31, 2023 at 02:48:54PM -0500, Peter Xu wrote:
> On Thu, Jan 26, 2023 at 12:26:45PM -0500, Peter Xu wrote:
> > On Thu, Jan 26, 2023 at 03:59:33PM +0000, Daniel P. Berrangé wrote:
> > > On Thu, Jan 26, 2023 at 10:25:05AM -0500, Peter Xu wrote:
> > > > On Thu, Jan 26, 2023 at 02:15:11PM +0000, Dr. David Alan Gilbert wrote:
> > > > > * Michal Prívozník (mprivozn@redhat.com) wrote:
> > > > > > On 1/25/23 23:40, Peter Xu wrote:
> > > > > > > The new /dev/userfaultfd handle is superior to the system call with a
> > > > > > > better permission control and also works for a restricted seccomp
> > > > > > > environment.
> > > > > > > 
> > > > > > > The new device was only introduced in v6.1 so we need a header update.
> > > > > > > 
> > > > > > > Please have a look, thanks.
> > > > > > 
> > > > > > I was wondering whether it would make sense/be possible for mgmt app
> > > > > > (libvirt) to pass FD for /dev/userfaultfd instead of QEMU opening it
> > > > > > itself. But looking into the code, libvirt would need to do that when
> > > > > > spawning QEMU because that's when QEMU itself initializes internal state
> > > > > > and queries userfaultfd caps.
> > > > > 
> > > > > You also have to be careful about what the userfaultfd semantics are; I
> > > > > can't remember them - but if you open it in one process and pass it to
> > > > > another process, which processes address space are you trying to
> > > > > monitor?
> > > > 
> > > > Yes it's a problem.  The kernel always fetches the current mm_struct* which
> > > > represents the current context of virtual address space when creating the
> > > > uffd handle (for either the syscall or the ioctl() approach).
> > > 
> > > At what point does the process address space get associated ? When
> > > the /dev/userfaultfd is opened, or only when ioctl(USERFAULTFD_IOC_NEW)
> > > is called ?  If it is the former, then we have no choice, QEMU must open
> > > it. if it is the latter, then libvirt can open /dev/userfaultfd, pass
> > > it to QEMU which can then do the ioctl(USERFAULTFD_IOC_NEW).
> > 
> > Good point.. It should be the latter, so should be doable.
> > 
> > What should be the best interface for QEMU to detect the fd passing over to
> > it?  IIUC qemu_open() requires the name to be /dev/fdset/*, but there's no
> > existing cmdline that QEMU can know which fd number to fetch from fdset to
> > be used as the /dev/userfaultfd descriptor.
> > 
> > monitor_get_fd() seems more proper, where we can define an unique string so
> > Libvirt can preset the descriptor with the same string attached to it, then
> > I can opt-in monitor_get_fd() before trying to open() or doing the syscall.
> 
> Daniel/Michal, any input here from Libvirt side?
> 
> I just noticed that monitor_get_fd() is bound to a specific monitor, then
> it seems not clear which one is from Libvirt.  If to use qemu_open() and
> add-fd I think we need another QEMU cmdline to set the fd path, iiuc.
> 
> I can also leave that for later if opening /dev/userfaultfd is already
> resolving the immediate problem in containers.

I don't have any great ideas really. If we assume the /dev/userfaultfd
is accessible to QEMU we can ignore it.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-31 20:06             ` Daniel P. Berrangé
@ 2023-01-31 21:01               ` Peter Xu
  2023-02-01  7:55                 ` Michal Prívozník
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2023-01-31 21:01 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Michal Prívozník, Dr. David Alan Gilbert, qemu-devel,
	Leonardo Bras Soares Passos, Juan Quintela

On Tue, Jan 31, 2023 at 08:06:55PM +0000, Daniel P. Berrangé wrote:
> On Tue, Jan 31, 2023 at 02:48:54PM -0500, Peter Xu wrote:
> > On Thu, Jan 26, 2023 at 12:26:45PM -0500, Peter Xu wrote:
> > > On Thu, Jan 26, 2023 at 03:59:33PM +0000, Daniel P. Berrangé wrote:
> > > > On Thu, Jan 26, 2023 at 10:25:05AM -0500, Peter Xu wrote:
> > > > > On Thu, Jan 26, 2023 at 02:15:11PM +0000, Dr. David Alan Gilbert wrote:
> > > > > > * Michal Prívozník (mprivozn@redhat.com) wrote:
> > > > > > > On 1/25/23 23:40, Peter Xu wrote:
> > > > > > > > The new /dev/userfaultfd handle is superior to the system call with a
> > > > > > > > better permission control and also works for a restricted seccomp
> > > > > > > > environment.
> > > > > > > > 
> > > > > > > > The new device was only introduced in v6.1 so we need a header update.
> > > > > > > > 
> > > > > > > > Please have a look, thanks.
> > > > > > > 
> > > > > > > I was wondering whether it would make sense/be possible for mgmt app
> > > > > > > (libvirt) to pass FD for /dev/userfaultfd instead of QEMU opening it
> > > > > > > itself. But looking into the code, libvirt would need to do that when
> > > > > > > spawning QEMU because that's when QEMU itself initializes internal state
> > > > > > > and queries userfaultfd caps.
> > > > > > 
> > > > > > You also have to be careful about what the userfaultfd semantics are; I
> > > > > > can't remember them - but if you open it in one process and pass it to
> > > > > > another process, which processes address space are you trying to
> > > > > > monitor?
> > > > > 
> > > > > Yes it's a problem.  The kernel always fetches the current mm_struct* which
> > > > > represents the current context of virtual address space when creating the
> > > > > uffd handle (for either the syscall or the ioctl() approach).
> > > > 
> > > > At what point does the process address space get associated ? When
> > > > the /dev/userfaultfd is opened, or only when ioctl(USERFAULTFD_IOC_NEW)
> > > > is called ?  If it is the former, then we have no choice, QEMU must open
> > > > it. if it is the latter, then libvirt can open /dev/userfaultfd, pass
> > > > it to QEMU which can then do the ioctl(USERFAULTFD_IOC_NEW).
> > > 
> > > Good point.. It should be the latter, so should be doable.
> > > 
> > > What should be the best interface for QEMU to detect the fd passing over to
> > > it?  IIUC qemu_open() requires the name to be /dev/fdset/*, but there's no
> > > existing cmdline that QEMU can know which fd number to fetch from fdset to
> > > be used as the /dev/userfaultfd descriptor.
> > > 
> > > monitor_get_fd() seems more proper, where we can define an unique string so
> > > Libvirt can preset the descriptor with the same string attached to it, then
> > > I can opt-in monitor_get_fd() before trying to open() or doing the syscall.
> > 
> > Daniel/Michal, any input here from Libvirt side?
> > 
> > I just noticed that monitor_get_fd() is bound to a specific monitor, then
> > it seems not clear which one is from Libvirt.  If to use qemu_open() and
> > add-fd I think we need another QEMU cmdline to set the fd path, iiuc.
> > 
> > I can also leave that for later if opening /dev/userfaultfd is already
> > resolving the immediate problem in containers.
> 
> I don't have any great ideas really. If we assume the /dev/userfaultfd
> is accessible to QEMU we can ignore it.

It's my understanding that QEMU process will be invoked by the user or
group that has access to /dev/userfaultfd, probably in the same context as
what Libvirt specified. So hopefully everything will work out naturally
already.

There's one thing I'm unsure on introducing a new qemu cmdline option - I
can't remember where I get this memory, but - IIRC Paolo suggested at some
point to reduce or forbid introducing new options to QEMU.

To remedy that, we can also add a migration parameter which will point to
/dev/userfaultfd (which can be set to "/dev/fdsets/N" by Libvirt in QMP in
QEMU's early stage), considering that so far most of the uffd features are
used by migration submodule, IMHO it's fine to do so.

Said that, I think we can always work on top of this series if that'll be
useful to libvirt some day; the change should be trivial.  So I can keep
this series simple.

I'll wait 1-2 more days to see whether Michal has anything to comment.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd
  2023-01-31 21:01               ` Peter Xu
@ 2023-02-01  7:55                 ` Michal Prívozník
  2023-02-01 14:58                   ` Peter Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Prívozník @ 2023-02-01  7:55 UTC (permalink / raw)
  To: Peter Xu, Daniel P. Berrangé
  Cc: Dr. David Alan Gilbert, qemu-devel, Leonardo Bras Soares Passos,
	Juan Quintela

On 1/31/23 22:01, Peter Xu wrote:
> I'll wait 1-2 more days to see whether Michal has anything to comment.

Yeah, we can go with your patches and leave FD passing for future work.
It's orthogonal after all.

In the end we can have (in the order of precedence):

1) new cmd line argument, say:

   qemu-system-x86_64 -userfaultfd fd=5 # where FD 5 is passed by
libvirt when exec()-ing qemu, just like other FDs, e.g. -chardev
socket,fd=XXX

2) your patches, where qemu opens /dev/userfaultfd

3) current behavior, userfaultfd syscall


Michal



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd
  2023-02-01  7:55                 ` Michal Prívozník
@ 2023-02-01 14:58                   ` Peter Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Xu @ 2023-02-01 14:58 UTC (permalink / raw)
  To: Michal Prívozník
  Cc: Daniel P. Berrangé,
	Dr. David Alan Gilbert, qemu-devel, Leonardo Bras Soares Passos,
	Juan Quintela

On Wed, Feb 01, 2023 at 08:55:01AM +0100, Michal Prívozník wrote:
> On 1/31/23 22:01, Peter Xu wrote:
> > I'll wait 1-2 more days to see whether Michal has anything to comment.
> 
> Yeah, we can go with your patches and leave FD passing for future work.
> It's orthogonal after all.
> 
> In the end we can have (in the order of precedence):
> 
> 1) new cmd line argument, say:
> 
>    qemu-system-x86_64 -userfaultfd fd=5 # where FD 5 is passed by
> libvirt when exec()-ing qemu, just like other FDs, e.g. -chardev
> socket,fd=XXX
> 
> 2) your patches, where qemu opens /dev/userfaultfd
> 
> 3) current behavior, userfaultfd syscall

Sounds good.  Thanks.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-02-01 14:58 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-25 22:40 [PATCH 0/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu
2023-01-25 22:40 ` [PATCH 1/3] linux-headers: Update to v6.1 Peter Xu
2023-01-25 22:40 ` [PATCH 2/3] util/userfaultfd: Add uffd_open() Peter Xu
2023-01-25 23:04   ` Philippe Mathieu-Daudé
2023-01-26 15:58     ` Peter Xu
2023-01-25 22:40 ` [PATCH 3/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu
2023-01-25 23:08   ` Philippe Mathieu-Daudé
2023-01-26 17:33     ` Peter Xu
2023-01-26  9:02   ` Daniel P. Berrangé
2023-01-26  9:05     ` Daniel P. Berrangé
2023-01-26 20:03       ` Peter Xu
2023-01-26 14:13 ` [PATCH 0/3] " Michal Prívozník
2023-01-26 14:15   ` Dr. David Alan Gilbert
2023-01-26 15:25     ` Peter Xu
2023-01-26 15:29       ` Michal Prívozník
2023-01-26 15:49         ` Peter Xu
2023-01-26 15:59       ` Daniel P. Berrangé
2023-01-26 17:26         ` Peter Xu
2023-01-31 19:48           ` Peter Xu
2023-01-31 20:06             ` Daniel P. Berrangé
2023-01-31 21:01               ` Peter Xu
2023-02-01  7:55                 ` Michal Prívozník
2023-02-01 14:58                   ` Peter Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.