* [PATCH RFC 00/21] migration: Support hugetlb doublemaps
@ 2023-01-17 22:08 Peter Xu
  2023-01-17 22:08 ` [PATCH RFC 01/21] update linux headers Peter Xu
                   ` (20 more replies)
  0 siblings, 21 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

Based-on: <20221213213850.1481858-1-peterx@redhat.com>
  [PATCH 0/5] migration: Fix disorder of channel creations

Trees for reference:
  https://github.com/xzpeter/linux/releases/tag/doublemap-v0.1
  https://github.com/xzpeter/qemu/releases/tag/doublemap-v0.1

This is an RFC series intended for early discussion only, not for merging.

This patchset allows postcopy to work better with huge pages by migrating
them in small page sizes.  It relies on a kernel feature called "hugetlb
HGM" (high-granularity mapping), currently proposed on the Linux kernel
mailing list by James Houghton; the latest version is v1:

https://lore.kernel.org/r/20230105101844.1893104-1-jthoughton@google.com

[PS: The kernel v1 patchset may need a few fixups to make QEMU work; they
 are all contained in the doublemap-v0.1 tagged tree linked above]

The kernel series is still under review upstream, so the API is not yet
stable.

I kept the old name "doublemap" in this QEMU patchset to refer to HGM.
With it, huge pages can be mapped at granularities smaller than the huge
page itself.  That can drastically reduce page fault latencies during
postcopy when the guest memory is hugepage-backed, and it makes postcopy
genuinely usable with huge pages.  In initial test results, the average
page request latency dropped from ~1sec to ~250us for 1G-backed memory.
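
(As a back-of-the-envelope check, assuming the test setup used something
like a 10Gbps migration link: moving a whole 1GiB page costs roughly
1GiB / 1.25GB/s ~= 0.86s of wire time alone, which lines up with the
~1sec figure, while a 4KB page takes only microseconds on the wire, so
the ~250us is dominated by fault handling and round-trip overhead rather
than transfer time.)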

UFFDIO_COPY doesn't support mapping huge pages in small sizes, so one
major part of this series introduces the use of UFFDIO_CONTINUE to resolve
page faults for hugetlb mappings.  A rough sketch of the intended flow is
included below.
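
The sketch is illustrative only: MADV_SPLIT follows the proposed HGM
kernel API (not yet stable), MADV_COLLAPSE needs a recent kernel, the
helper name is made up, and error handling is omitted:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/userfaultfd.h>

    /*
     * Resolve faults on one huge page at small-page granularity.  `uffd'
     * is a userfaultfd registered on the hugetlb range with
     * UFFDIO_REGISTER_MODE_MINOR; `psize' is the host small page size.
     */
    static void doublemap_sketch(int uffd, char *hpage, size_t hpage_size,
                                 size_t psize)
    {
        /* Allow the hugetlb range to be mapped at small-page size. */
        madvise(hpage, hpage_size, MADV_SPLIT);

        for (size_t off = 0; off < hpage_size; off += psize) {
            /*
             * After the small page content has been written through the
             * second (alias) mapping, install it into the faulting
             * mapping and wake up the faulting thread.
             */
            struct uffdio_continue cont = {
                .range = {
                    .start = (uint64_t)(uintptr_t)(hpage + off),
                    .len = psize,
                },
                .mode = 0,
            };
            ioctl(uffd, UFFDIO_CONTINUE, &cont);
        }

        /* Once postcopy completes, restore the huge mapping. */
        madvise(hpage, hpage_size, MADV_COLLAPSE);
    }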

Sampled page latency histograms for an 18G guest with/without doublemap
(preempt=on, single-thread busy-spin workload over the 18G map):

Before:

@delay_us:
[64, 128)              3 |@                                                   |
[128, 256)            84 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)            10 |@@@@@@                                              |
[512, 1K)              1 |                                                    |
[1K, 2K)               0 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)               0 |                                                    |
[8K, 16K)              0 |                                                    |
[16K, 32K)             0 |                                                    |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)              17 |@@@@@@@@@@                                          |
[2M, 4M)              21 |@@@@@@@@@@@@@                                       |
[4M, 8M)               8 |@@@@                                                |
[8M, 16M)              4 |@@                                                  |

After:

@delay_us:
[16, 32)               6 |                                                    |
[32, 64)               6 |                                                    |
[64, 128)           3117 |@@                                                  |
[128, 256)         70815 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)         30460 |@@@@@@@@@@@@@@@@@@@@@@                              |
[512, 1K)           1135 |                                                    |
[1K, 2K)              34 |                                                    |
[2K, 4K)              42 |                                                    |
[4K, 8K)             126 |                                                    |
[8K, 16K)             91 |                                                    |
[16K, 32K)             0 |                                                    |
[32K, 64K)             1 |                                                    |

Any early comments are welcome.  Thanks.

Peter Xu (21):
  update linux headers
  util: Include osdep.h first in util/mmap-alloc.c
  physmem: Add qemu_ram_is_hugetlb()
  madvise: Include linux/mman.h under linux-headers/
  madvise: Add QEMU_MADV_SPLIT
  madvise: Add QEMU_MADV_COLLAPSE
  ramblock: Cache file offset for file-backed ramblocks
  ramblock: Cache the length to do file mmap() on ramblocks
  ramblock: Add RAM_READONLY
  ramblock: Add ramblock_file_map()
  migration: Add hugetlb-doublemap cap
  migration: Introduce page size for-migration-only
  migration: Add migration_ram_pagesize_largest()
  migration: Map hugetlbfs ramblocks twice, and pre-allocate
  migration: Teach qemu about minor faults and doublemap
  migration: Enable doublemap with MADV_SPLIT
  migration: Rework ram discard logic for hugetlb double-map
  migration: Allow postcopy_register_shared_ufd() to fail
  migration: Add postcopy_mark_received()
  migration: Handle page faults using UFFDIO_CONTINUE
  migration: Collapse huge pages again after postcopy finished

 backends/hostmem-file.c                       |   3 +-
 hw/virtio/vhost-user.c                        |   9 +-
 include/exec/cpu-common.h                     |   3 +-
 include/exec/memory.h                         |   4 +-
 include/exec/ram_addr.h                       |   6 +-
 include/exec/ramblock.h                       |  14 +
 include/qemu/madvise.h                        |  18 ++
 include/standard-headers/drm/drm_fourcc.h     |  63 +++-
 include/standard-headers/linux/ethtool.h      |  81 ++++-
 include/standard-headers/linux/fuse.h         |  20 +-
 .../linux/input-event-codes.h                 |   4 +
 include/standard-headers/linux/pci_regs.h     |   2 +
 include/standard-headers/linux/virtio_blk.h   |  19 ++
 include/standard-headers/linux/virtio_bt.h    |   8 +
 include/standard-headers/linux/virtio_net.h   |   4 +
 linux-headers/asm-arm64/kvm.h                 |   1 +
 linux-headers/asm-generic/hugetlb_encode.h    |  26 +-
 linux-headers/asm-generic/mman-common.h       |   4 +
 linux-headers/asm-mips/mman.h                 |   4 +
 linux-headers/asm-riscv/kvm.h                 |   7 +
 linux-headers/asm-x86/kvm.h                   |  11 +-
 linux-headers/linux/kvm.h                     |  32 +-
 linux-headers/linux/psci.h                    |  14 +
 linux-headers/linux/userfaultfd.h             |   4 +
 linux-headers/linux/vfio.h                    | 278 +++++++++++++++++-
 migration/migration.c                         |  56 +++-
 migration/migration.h                         |   1 +
 migration/postcopy-ram.c                      | 228 +++++++++++---
 migration/postcopy-ram.h                      |   5 +-
 migration/ram.c                               | 165 ++++++++++-
 migration/ram.h                               |   2 +
 migration/trace-events                        |   6 +-
 qapi/migration.json                           |   7 +-
 softmmu/memory.c                              |   8 +-
 softmmu/physmem.c                             |  92 ++++--
 util/mmap-alloc.c                             |   2 +-
 36 files changed, 1051 insertions(+), 160 deletions(-)

-- 
2.37.3




* [PATCH RFC 01/21] update linux headers
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
@ 2023-01-17 22:08 ` Peter Xu
  2023-01-17 22:08 ` [PATCH RFC 02/21] util: Include osdep.h first in util/mmap-alloc.c Peter Xu
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/standard-headers/drm/drm_fourcc.h     |  63 +++-
 include/standard-headers/linux/ethtool.h      |  81 ++++-
 include/standard-headers/linux/fuse.h         |  20 +-
 .../linux/input-event-codes.h                 |   4 +
 include/standard-headers/linux/pci_regs.h     |   2 +
 include/standard-headers/linux/virtio_blk.h   |  19 ++
 include/standard-headers/linux/virtio_bt.h    |   8 +
 include/standard-headers/linux/virtio_net.h   |   4 +
 linux-headers/asm-arm64/kvm.h                 |   1 +
 linux-headers/asm-generic/hugetlb_encode.h    |  26 +-
 linux-headers/asm-generic/mman-common.h       |   4 +
 linux-headers/asm-mips/mman.h                 |   4 +
 linux-headers/asm-riscv/kvm.h                 |   7 +
 linux-headers/asm-x86/kvm.h                   |  11 +-
 linux-headers/linux/kvm.h                     |  32 +-
 linux-headers/linux/psci.h                    |  14 +
 linux-headers/linux/userfaultfd.h             |   4 +
 linux-headers/linux/vfio.h                    | 278 +++++++++++++++++-
 18 files changed, 526 insertions(+), 56 deletions(-)

diff --git a/include/standard-headers/drm/drm_fourcc.h b/include/standard-headers/drm/drm_fourcc.h
index 48b620cbef..69cab17b38 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -98,18 +98,42 @@ extern "C" {
 #define DRM_FORMAT_INVALID	0
 
 /* color index */
+#define DRM_FORMAT_C1		fourcc_code('C', '1', ' ', ' ') /* [7:0] C0:C1:C2:C3:C4:C5:C6:C7 1:1:1:1:1:1:1:1 eight pixels/byte */
+#define DRM_FORMAT_C2		fourcc_code('C', '2', ' ', ' ') /* [7:0] C0:C1:C2:C3 2:2:2:2 four pixels/byte */
+#define DRM_FORMAT_C4		fourcc_code('C', '4', ' ', ' ') /* [7:0] C0:C1 4:4 two pixels/byte */
 #define DRM_FORMAT_C8		fourcc_code('C', '8', ' ', ' ') /* [7:0] C */
 
-/* 8 bpp Red */
+/* 1 bpp Darkness (inverse relationship between channel value and brightness) */
+#define DRM_FORMAT_D1		fourcc_code('D', '1', ' ', ' ') /* [7:0] D0:D1:D2:D3:D4:D5:D6:D7 1:1:1:1:1:1:1:1 eight pixels/byte */
+
+/* 2 bpp Darkness (inverse relationship between channel value and brightness) */
+#define DRM_FORMAT_D2		fourcc_code('D', '2', ' ', ' ') /* [7:0] D0:D1:D2:D3 2:2:2:2 four pixels/byte */
+
+/* 4 bpp Darkness (inverse relationship between channel value and brightness) */
+#define DRM_FORMAT_D4		fourcc_code('D', '4', ' ', ' ') /* [7:0] D0:D1 4:4 two pixels/byte */
+
+/* 8 bpp Darkness (inverse relationship between channel value and brightness) */
+#define DRM_FORMAT_D8		fourcc_code('D', '8', ' ', ' ') /* [7:0] D */
+
+/* 1 bpp Red (direct relationship between channel value and brightness) */
+#define DRM_FORMAT_R1		fourcc_code('R', '1', ' ', ' ') /* [7:0] R0:R1:R2:R3:R4:R5:R6:R7 1:1:1:1:1:1:1:1 eight pixels/byte */
+
+/* 2 bpp Red (direct relationship between channel value and brightness) */
+#define DRM_FORMAT_R2		fourcc_code('R', '2', ' ', ' ') /* [7:0] R0:R1:R2:R3 2:2:2:2 four pixels/byte */
+
+/* 4 bpp Red (direct relationship between channel value and brightness) */
+#define DRM_FORMAT_R4		fourcc_code('R', '4', ' ', ' ') /* [7:0] R0:R1 4:4 two pixels/byte */
+
+/* 8 bpp Red (direct relationship between channel value and brightness) */
 #define DRM_FORMAT_R8		fourcc_code('R', '8', ' ', ' ') /* [7:0] R */
 
-/* 10 bpp Red */
+/* 10 bpp Red (direct relationship between channel value and brightness) */
 #define DRM_FORMAT_R10		fourcc_code('R', '1', '0', ' ') /* [15:0] x:R 6:10 little endian */
 
-/* 12 bpp Red */
+/* 12 bpp Red (direct relationship between channel value and brightness) */
 #define DRM_FORMAT_R12		fourcc_code('R', '1', '2', ' ') /* [15:0] x:R 4:12 little endian */
 
-/* 16 bpp Red */
+/* 16 bpp Red (direct relationship between channel value and brightness) */
 #define DRM_FORMAT_R16		fourcc_code('R', '1', '6', ' ') /* [15:0] R little endian */
 
 /* 16 bpp RG */
@@ -204,7 +228,9 @@ extern "C" {
 #define DRM_FORMAT_VYUY		fourcc_code('V', 'Y', 'U', 'Y') /* [31:0] Y1:Cb0:Y0:Cr0 8:8:8:8 little endian */
 
 #define DRM_FORMAT_AYUV		fourcc_code('A', 'Y', 'U', 'V') /* [31:0] A:Y:Cb:Cr 8:8:8:8 little endian */
+#define DRM_FORMAT_AVUY8888	fourcc_code('A', 'V', 'U', 'Y') /* [31:0] A:Cr:Cb:Y 8:8:8:8 little endian */
 #define DRM_FORMAT_XYUV8888	fourcc_code('X', 'Y', 'U', 'V') /* [31:0] X:Y:Cb:Cr 8:8:8:8 little endian */
+#define DRM_FORMAT_XVUY8888	fourcc_code('X', 'V', 'U', 'Y') /* [31:0] X:Cr:Cb:Y 8:8:8:8 little endian */
 #define DRM_FORMAT_VUY888	fourcc_code('V', 'U', '2', '4') /* [23:0] Cr:Cb:Y 8:8:8 little endian */
 #define DRM_FORMAT_VUY101010	fourcc_code('V', 'U', '3', '0') /* Y followed by U then V, 10:10:10. Non-linear modifier only */
 
@@ -717,6 +743,35 @@ extern "C" {
  */
 #define DRM_FORMAT_MOD_VIVANTE_SPLIT_SUPER_TILED fourcc_mod_code(VIVANTE, 4)
 
+/*
+ * Vivante TS (tile-status) buffer modifiers. They can be combined with all of
+ * the color buffer tiling modifiers defined above. When TS is present it's a
+ * separate buffer containing the clear/compression status of each tile. The
+ * modifiers are defined as VIVANTE_MOD_TS_c_s, where c is the color buffer
+ * tile size in bytes covered by one entry in the status buffer and s is the
+ * number of status bits per entry.
+ * We reserve the top 8 bits of the Vivante modifier space for tile status
+ * clear/compression modifiers, as future cores might add some more TS layout
+ * variations.
+ */
+#define VIVANTE_MOD_TS_64_4               (1ULL << 48)
+#define VIVANTE_MOD_TS_64_2               (2ULL << 48)
+#define VIVANTE_MOD_TS_128_4              (3ULL << 48)
+#define VIVANTE_MOD_TS_256_4              (4ULL << 48)
+#define VIVANTE_MOD_TS_MASK               (0xfULL << 48)
+
+/*
+ * Vivante compression modifiers. Those depend on a TS modifier being present
+ * as the TS bits get reinterpreted as compression tags instead of simple
+ * clear markers when compression is enabled.
+ */
+#define VIVANTE_MOD_COMP_DEC400           (1ULL << 52)
+#define VIVANTE_MOD_COMP_MASK             (0xfULL << 52)
+
+/* Masking out the extension bits will yield the base modifier. */
+#define VIVANTE_MOD_EXT_MASK              (VIVANTE_MOD_TS_MASK | \
+                                           VIVANTE_MOD_COMP_MASK)
+
 /* NVIDIA frame buffer modifiers */
 
 /*
diff --git a/include/standard-headers/linux/ethtool.h b/include/standard-headers/linux/ethtool.h
index 4537da20cc..87176ab075 100644
--- a/include/standard-headers/linux/ethtool.h
+++ b/include/standard-headers/linux/ethtool.h
@@ -159,8 +159,10 @@ static inline uint32_t ethtool_cmd_speed(const struct ethtool_cmd *ep)
  *	in its bus driver structure (e.g. pci_driver::name).  Must
  *	not be an empty string.
  * @version: Driver version string; may be an empty string
- * @fw_version: Firmware version string; may be an empty string
- * @erom_version: Expansion ROM version string; may be an empty string
+ * @fw_version: Firmware version string; driver defined; may be an
+ *	empty string
+ * @erom_version: Expansion ROM version string; driver defined; may be
+ *	an empty string
  * @bus_info: Device bus address.  This should match the dev_name()
  *	string for the underlying bus device, if there is one.  May be
  *	an empty string.
@@ -179,10 +181,6 @@ static inline uint32_t ethtool_cmd_speed(const struct ethtool_cmd *ep)
  *
  * Users can use the %ETHTOOL_GSSET_INFO command to get the number of
  * strings in any string set (from Linux 2.6.34).
- *
- * Drivers should set at most @driver, @version, @fw_version and
- * @bus_info in their get_drvinfo() implementation.  The ethtool
- * core fills in the other fields using other driver operations.
  */
 struct ethtool_drvinfo {
 	uint32_t	cmd;
@@ -736,6 +734,51 @@ enum ethtool_module_power_mode {
 	ETHTOOL_MODULE_POWER_MODE_HIGH,
 };
 
+/**
+ * enum ethtool_podl_pse_admin_state - operational state of the PoDL PSE
+ *	functions. IEEE 802.3-2018 30.15.1.1.2 aPoDLPSEAdminState
+ * @ETHTOOL_PODL_PSE_ADMIN_STATE_UNKNOWN: state of PoDL PSE functions are
+ * 	unknown
+ * @ETHTOOL_PODL_PSE_ADMIN_STATE_DISABLED: PoDL PSE functions are disabled
+ * @ETHTOOL_PODL_PSE_ADMIN_STATE_ENABLED: PoDL PSE functions are enabled
+ */
+enum ethtool_podl_pse_admin_state {
+	ETHTOOL_PODL_PSE_ADMIN_STATE_UNKNOWN = 1,
+	ETHTOOL_PODL_PSE_ADMIN_STATE_DISABLED,
+	ETHTOOL_PODL_PSE_ADMIN_STATE_ENABLED,
+};
+
+/**
+ * enum ethtool_podl_pse_pw_d_status - power detection status of the PoDL PSE.
+ *	IEEE 802.3-2018 30.15.1.1.3 aPoDLPSEPowerDetectionStatus:
+ * @ETHTOOL_PODL_PSE_PW_D_STATUS_UNKNOWN: PoDL PSE
+ * @ETHTOOL_PODL_PSE_PW_D_STATUS_DISABLED: "The enumeration “disabled” is
+ *	asserted true when the PoDL PSE state diagram variable mr_pse_enable is
+ *	false"
+ * @ETHTOOL_PODL_PSE_PW_D_STATUS_SEARCHING: "The enumeration “searching” is
+ *	asserted true when either of the PSE state diagram variables
+ *	pi_detecting or pi_classifying is true."
+ * @ETHTOOL_PODL_PSE_PW_D_STATUS_DELIVERING: "The enumeration “deliveringPower”
+ *	is asserted true when the PoDL PSE state diagram variable pi_powered is
+ *	true."
+ * @ETHTOOL_PODL_PSE_PW_D_STATUS_SLEEP: "The enumeration “sleep” is asserted
+ *	true when the PoDL PSE state diagram variable pi_sleeping is true."
+ * @ETHTOOL_PODL_PSE_PW_D_STATUS_IDLE: "The enumeration “idle” is asserted true
+ *	when the logical combination of the PoDL PSE state diagram variables
+ *	pi_prebiased*!pi_sleeping is true."
+ * @ETHTOOL_PODL_PSE_PW_D_STATUS_ERROR: "The enumeration “error” is asserted
+ *	true when the PoDL PSE state diagram variable overload_held is true."
+ */
+enum ethtool_podl_pse_pw_d_status {
+	ETHTOOL_PODL_PSE_PW_D_STATUS_UNKNOWN = 1,
+	ETHTOOL_PODL_PSE_PW_D_STATUS_DISABLED,
+	ETHTOOL_PODL_PSE_PW_D_STATUS_SEARCHING,
+	ETHTOOL_PODL_PSE_PW_D_STATUS_DELIVERING,
+	ETHTOOL_PODL_PSE_PW_D_STATUS_SLEEP,
+	ETHTOOL_PODL_PSE_PW_D_STATUS_IDLE,
+	ETHTOOL_PODL_PSE_PW_D_STATUS_ERROR,
+};
+
 /**
  * struct ethtool_gstrings - string set for data tagging
  * @cmd: Command number = %ETHTOOL_GSTRINGS
@@ -1692,6 +1735,13 @@ enum ethtool_link_mode_bit_indices {
 	ETHTOOL_LINK_MODE_100baseFX_Half_BIT		 = 90,
 	ETHTOOL_LINK_MODE_100baseFX_Full_BIT		 = 91,
 	ETHTOOL_LINK_MODE_10baseT1L_Full_BIT		 = 92,
+	ETHTOOL_LINK_MODE_800000baseCR8_Full_BIT	 = 93,
+	ETHTOOL_LINK_MODE_800000baseKR8_Full_BIT	 = 94,
+	ETHTOOL_LINK_MODE_800000baseDR8_Full_BIT	 = 95,
+	ETHTOOL_LINK_MODE_800000baseDR8_2_Full_BIT	 = 96,
+	ETHTOOL_LINK_MODE_800000baseSR8_Full_BIT	 = 97,
+	ETHTOOL_LINK_MODE_800000baseVR8_Full_BIT	 = 98,
+
 	/* must be last entry */
 	__ETHTOOL_LINK_MODE_MASK_NBITS
 };
@@ -1803,6 +1853,7 @@ enum ethtool_link_mode_bit_indices {
 #define SPEED_100000		100000
 #define SPEED_200000		200000
 #define SPEED_400000		400000
+#define SPEED_800000		800000
 
 #define SPEED_UNKNOWN		-1
 
@@ -1840,6 +1891,20 @@ static inline int ethtool_validate_duplex(uint8_t duplex)
 #define MASTER_SLAVE_STATE_SLAVE		3
 #define MASTER_SLAVE_STATE_ERR			4
 
+/* These are used to throttle the rate of data on the phy interface when the
+ * native speed of the interface is higher than the link speed. These should
+ * not be used for phy interfaces which natively support multiple speeds (e.g.
+ * MII or SGMII).
+ */
+/* No rate matching performed. */
+#define RATE_MATCH_NONE		0
+/* The phy sends pause frames to throttle the MAC. */
+#define RATE_MATCH_PAUSE	1
+/* The phy asserts CRS to prevent the MAC from transmitting. */
+#define RATE_MATCH_CRS		2
+/* The MAC is programmed with a sufficiently-large IPG. */
+#define RATE_MATCH_OPEN_LOOP	3
+
 /* Which connector port. */
 #define PORT_TP			0x00
 #define PORT_AUI		0x01
@@ -2033,8 +2098,8 @@ enum ethtool_reset_flags {
  *	reported consistently by PHYLIB.  Read-only.
  * @master_slave_cfg: Master/slave port mode.
  * @master_slave_state: Master/slave port state.
+ * @rate_matching: Rate adaptation performed by the PHY
  * @reserved: Reserved for future use; see the note on reserved space.
- * @reserved1: Reserved for future use; see the note on reserved space.
  * @link_mode_masks: Variable length bitmaps.
  *
  * If autonegotiation is disabled, the speed and @duplex represent the
@@ -2085,7 +2150,7 @@ struct ethtool_link_settings {
 	uint8_t	transceiver;
 	uint8_t	master_slave_cfg;
 	uint8_t	master_slave_state;
-	uint8_t	reserved1[1];
+	uint8_t	rate_matching;
 	uint32_t	reserved[7];
 	uint32_t	link_mode_masks[];
 	/* layout of link_mode_masks fields:
diff --git a/include/standard-headers/linux/fuse.h b/include/standard-headers/linux/fuse.h
index bda06258be..a1af78d989 100644
--- a/include/standard-headers/linux/fuse.h
+++ b/include/standard-headers/linux/fuse.h
@@ -194,6 +194,13 @@
  *  - add FUSE_SECURITY_CTX init flag
  *  - add security context to create, mkdir, symlink, and mknod requests
  *  - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
+ *
+ *  7.37
+ *  - add FUSE_TMPFILE
+ *
+ *  7.38
+ *  - add FUSE_EXPIRE_ONLY flag to fuse_notify_inval_entry
+ *  - add FOPEN_PARALLEL_DIRECT_WRITES
  */
 
 #ifndef _LINUX_FUSE_H
@@ -225,7 +232,7 @@
 #define FUSE_KERNEL_VERSION 7
 
 /** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 36
+#define FUSE_KERNEL_MINOR_VERSION 38
 
 /** The node ID of the root inode */
 #define FUSE_ROOT_ID 1
@@ -297,6 +304,7 @@ struct fuse_file_lock {
  * FOPEN_CACHE_DIR: allow caching this directory
  * FOPEN_STREAM: the file is stream-like (no file position at all)
  * FOPEN_NOFLUSH: don't flush data cache on close (unless FUSE_WRITEBACK_CACHE)
+ * FOPEN_PARALLEL_DIRECT_WRITES: Allow concurrent direct writes on the same inode
  */
 #define FOPEN_DIRECT_IO		(1 << 0)
 #define FOPEN_KEEP_CACHE	(1 << 1)
@@ -304,6 +312,7 @@ struct fuse_file_lock {
 #define FOPEN_CACHE_DIR		(1 << 3)
 #define FOPEN_STREAM		(1 << 4)
 #define FOPEN_NOFLUSH		(1 << 5)
+#define FOPEN_PARALLEL_DIRECT_WRITES	(1 << 6)
 
 /**
  * INIT request/reply flags
@@ -484,6 +493,12 @@ struct fuse_file_lock {
  */
 #define FUSE_SETXATTR_ACL_KILL_SGID	(1 << 0)
 
+/**
+ * notify_inval_entry flags
+ * FUSE_EXPIRE_ONLY
+ */
+#define FUSE_EXPIRE_ONLY		(1 << 0)
+
 enum fuse_opcode {
 	FUSE_LOOKUP		= 1,
 	FUSE_FORGET		= 2,  /* no reply */
@@ -533,6 +548,7 @@ enum fuse_opcode {
 	FUSE_SETUPMAPPING	= 48,
 	FUSE_REMOVEMAPPING	= 49,
 	FUSE_SYNCFS		= 50,
+	FUSE_TMPFILE		= 51,
 
 	/* CUSE specific operations */
 	CUSE_INIT		= 4096,
@@ -911,7 +927,7 @@ struct fuse_notify_inval_inode_out {
 struct fuse_notify_inval_entry_out {
 	uint64_t	parent;
 	uint32_t	namelen;
-	uint32_t	padding;
+	uint32_t	flags;
 };
 
 struct fuse_notify_delete_out {
diff --git a/include/standard-headers/linux/input-event-codes.h b/include/standard-headers/linux/input-event-codes.h
index 50790aee5a..f6bab08540 100644
--- a/include/standard-headers/linux/input-event-codes.h
+++ b/include/standard-headers/linux/input-event-codes.h
@@ -614,6 +614,9 @@
 #define KEY_KBD_LAYOUT_NEXT	0x248	/* AC Next Keyboard Layout Select */
 #define KEY_EMOJI_PICKER	0x249	/* Show/hide emoji picker (HUTRR101) */
 #define KEY_DICTATE		0x24a	/* Start or Stop Voice Dictation Session (HUTRR99) */
+#define KEY_CAMERA_ACCESS_ENABLE	0x24b	/* Enables programmatic access to camera devices. (HUTRR72) */
+#define KEY_CAMERA_ACCESS_DISABLE	0x24c	/* Disables programmatic access to camera devices. (HUTRR72) */
+#define KEY_CAMERA_ACCESS_TOGGLE	0x24d	/* Toggles the current state of the camera access control. (HUTRR72) */
 
 #define KEY_BRIGHTNESS_MIN		0x250	/* Set Brightness to Minimum */
 #define KEY_BRIGHTNESS_MAX		0x251	/* Set Brightness to Maximum */
@@ -862,6 +865,7 @@
 #define ABS_TOOL_WIDTH		0x1c
 
 #define ABS_VOLUME		0x20
+#define ABS_PROFILE		0x21
 
 #define ABS_MISC		0x28
 
diff --git a/include/standard-headers/linux/pci_regs.h b/include/standard-headers/linux/pci_regs.h
index 57b8e2ffb1..85ab127881 100644
--- a/include/standard-headers/linux/pci_regs.h
+++ b/include/standard-headers/linux/pci_regs.h
@@ -1058,6 +1058,7 @@
 /* Precision Time Measurement */
 #define PCI_PTM_CAP			0x04	    /* PTM Capability */
 #define  PCI_PTM_CAP_REQ		0x00000001  /* Requester capable */
+#define  PCI_PTM_CAP_RES		0x00000002  /* Responder capable */
 #define  PCI_PTM_CAP_ROOT		0x00000004  /* Root capable */
 #define  PCI_PTM_GRANULARITY_MASK	0x0000FF00  /* Clock granularity */
 #define PCI_PTM_CTRL			0x08	    /* PTM Control */
@@ -1119,6 +1120,7 @@
 #define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */
 #define PCI_DOE_WRITE		0x10    /* DOE Write Data Mailbox Register */
 #define PCI_DOE_READ		0x14    /* DOE Read Data Mailbox Register */
+#define PCI_DOE_CAP_SIZEOF	0x18	/* Size of DOE register block */
 
 /* DOE Data Object - note not actually registers */
 #define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
diff --git a/include/standard-headers/linux/virtio_blk.h b/include/standard-headers/linux/virtio_blk.h
index 2dcc90826a..e81715cd70 100644
--- a/include/standard-headers/linux/virtio_blk.h
+++ b/include/standard-headers/linux/virtio_blk.h
@@ -40,6 +40,7 @@
 #define VIRTIO_BLK_F_MQ		12	/* support more than one vq */
 #define VIRTIO_BLK_F_DISCARD	13	/* DISCARD is supported */
 #define VIRTIO_BLK_F_WRITE_ZEROES	14	/* WRITE ZEROES is supported */
+#define VIRTIO_BLK_F_SECURE_ERASE	16 /* Secure Erase is supported */
 
 /* Legacy feature bits */
 #ifndef VIRTIO_BLK_NO_LEGACY
@@ -119,6 +120,21 @@ struct virtio_blk_config {
 	uint8_t write_zeroes_may_unmap;
 
 	uint8_t unused1[3];
+
+	/* the next 3 entries are guarded by VIRTIO_BLK_F_SECURE_ERASE */
+	/*
+	 * The maximum secure erase sectors (in 512-byte sectors) for
+	 * one segment.
+	 */
+	__virtio32 max_secure_erase_sectors;
+	/*
+	 * The maximum number of secure erase segments in a
+	 * secure erase command.
+	 */
+	__virtio32 max_secure_erase_seg;
+	/* Secure erase commands must be aligned to this number of sectors. */
+	__virtio32 secure_erase_sector_alignment;
+
 } QEMU_PACKED;
 
 /*
@@ -153,6 +169,9 @@ struct virtio_blk_config {
 /* Write zeroes command */
 #define VIRTIO_BLK_T_WRITE_ZEROES	13
 
+/* Secure erase command */
+#define VIRTIO_BLK_T_SECURE_ERASE	14
+
 #ifndef VIRTIO_BLK_NO_LEGACY
 /* Barrier before this op. */
 #define VIRTIO_BLK_T_BARRIER	0x80000000
diff --git a/include/standard-headers/linux/virtio_bt.h b/include/standard-headers/linux/virtio_bt.h
index 245e1eff4b..a11ecc3f92 100644
--- a/include/standard-headers/linux/virtio_bt.h
+++ b/include/standard-headers/linux/virtio_bt.h
@@ -9,6 +9,7 @@
 #define VIRTIO_BT_F_VND_HCI	0	/* Indicates vendor command support */
 #define VIRTIO_BT_F_MSFT_EXT	1	/* Indicates MSFT vendor support */
 #define VIRTIO_BT_F_AOSP_EXT	2	/* Indicates AOSP vendor support */
+#define VIRTIO_BT_F_CONFIG_V2	3	/* Use second version configuration */
 
 enum virtio_bt_config_type {
 	VIRTIO_BT_CONFIG_TYPE_PRIMARY	= 0,
@@ -28,4 +29,11 @@ struct virtio_bt_config {
 	uint16_t msft_opcode;
 } QEMU_PACKED;
 
+struct virtio_bt_config_v2 {
+	uint8_t  type;
+	uint8_t  alignment;
+	uint16_t vendor;
+	uint16_t msft_opcode;
+};
+
 #endif /* _LINUX_VIRTIO_BT_H */
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index 42c68caf71..c0e797067a 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -57,6 +57,9 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 #define VIRTIO_NET_F_NOTF_COAL	53	/* Device supports notifications coalescing */
+#define VIRTIO_NET_F_GUEST_USO4	54	/* Guest can handle USOv4 in. */
+#define VIRTIO_NET_F_GUEST_USO6	55	/* Guest can handle USOv6 in. */
+#define VIRTIO_NET_F_HOST_USO	56	/* Host can handle USO in. */
 #define VIRTIO_NET_F_HASH_REPORT  57	/* Supports hash report */
 #define VIRTIO_NET_F_RSS	  60	/* Supports RSS RX steering */
 #define VIRTIO_NET_F_RSC_EXT	  61	/* extended coalescing info */
@@ -130,6 +133,7 @@ struct virtio_net_hdr_v1 {
 #define VIRTIO_NET_HDR_GSO_TCPV4	1	/* GSO frame, IPv4 TCP (TSO) */
 #define VIRTIO_NET_HDR_GSO_UDP		3	/* GSO frame, IPv4 UDP (UFO) */
 #define VIRTIO_NET_HDR_GSO_TCPV6	4	/* GSO frame, IPv6 TCP */
+#define VIRTIO_NET_HDR_GSO_UDP_L4	5	/* GSO frame, IPv4& IPv6 UDP (USO) */
 #define VIRTIO_NET_HDR_GSO_ECN		0x80	/* TCP has ECN set */
 	uint8_t gso_type;
 	__virtio16 hdr_len;	/* Ethernet + IP + tcp/udp hdrs */
diff --git a/linux-headers/asm-arm64/kvm.h b/linux-headers/asm-arm64/kvm.h
index 4bf2d7246e..a7cfefb3a8 100644
--- a/linux-headers/asm-arm64/kvm.h
+++ b/linux-headers/asm-arm64/kvm.h
@@ -43,6 +43,7 @@
 #define __KVM_HAVE_VCPU_EVENTS
 
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
 
 #define KVM_REG_SIZE(id)						\
 	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
diff --git a/linux-headers/asm-generic/hugetlb_encode.h b/linux-headers/asm-generic/hugetlb_encode.h
index 4f3d5aaa11..de687009bf 100644
--- a/linux-headers/asm-generic/hugetlb_encode.h
+++ b/linux-headers/asm-generic/hugetlb_encode.h
@@ -20,18 +20,18 @@
 #define HUGETLB_FLAG_ENCODE_SHIFT	26
 #define HUGETLB_FLAG_ENCODE_MASK	0x3f
 
-#define HUGETLB_FLAG_ENCODE_16KB	(14 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_64KB	(16 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_512KB	(19 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_1MB		(20 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_2MB		(21 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_8MB		(23 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_16MB	(24 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_32MB	(25 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_256MB	(28 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_512MB	(29 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_1GB		(30 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_2GB		(31 << HUGETLB_FLAG_ENCODE_SHIFT)
-#define HUGETLB_FLAG_ENCODE_16GB	(34 << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_16KB	(14U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_64KB	(16U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_512KB	(19U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_1MB		(20U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_2MB		(21U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_8MB		(23U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_16MB	(24U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_32MB	(25U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_256MB	(28U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_512MB	(29U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_1GB		(30U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_2GB		(31U << HUGETLB_FLAG_ENCODE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_16GB	(34U << HUGETLB_FLAG_ENCODE_SHIFT)
 
 #endif /* _ASM_GENERIC_HUGETLB_ENCODE_H_ */
diff --git a/linux-headers/asm-generic/mman-common.h b/linux-headers/asm-generic/mman-common.h
index 6c1aa92a92..996e8ded09 100644
--- a/linux-headers/asm-generic/mman-common.h
+++ b/linux-headers/asm-generic/mman-common.h
@@ -77,6 +77,10 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/linux-headers/asm-mips/mman.h b/linux-headers/asm-mips/mman.h
index 1be428663c..f8a74a3a09 100644
--- a/linux-headers/asm-mips/mman.h
+++ b/linux-headers/asm-mips/mman.h
@@ -103,6 +103,10 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/linux-headers/asm-riscv/kvm.h b/linux-headers/asm-riscv/kvm.h
index 7351417afd..92af6f3f05 100644
--- a/linux-headers/asm-riscv/kvm.h
+++ b/linux-headers/asm-riscv/kvm.h
@@ -48,6 +48,10 @@ struct kvm_sregs {
 /* CONFIG registers for KVM_GET_ONE_REG and KVM_SET_ONE_REG */
 struct kvm_riscv_config {
 	unsigned long isa;
+	unsigned long zicbom_block_size;
+	unsigned long mvendorid;
+	unsigned long marchid;
+	unsigned long mimpid;
 };
 
 /* CORE registers for KVM_GET_ONE_REG and KVM_SET_ONE_REG */
@@ -98,6 +102,9 @@ enum KVM_RISCV_ISA_EXT_ID {
 	KVM_RISCV_ISA_EXT_M,
 	KVM_RISCV_ISA_EXT_SVPBMT,
 	KVM_RISCV_ISA_EXT_SSTC,
+	KVM_RISCV_ISA_EXT_SVINVAL,
+	KVM_RISCV_ISA_EXT_ZIHINTPAUSE,
+	KVM_RISCV_ISA_EXT_ZICBOM,
 	KVM_RISCV_ISA_EXT_MAX,
 };
 
diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index 46de10a809..2747d2ce14 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -53,14 +53,6 @@
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
 
-struct kvm_memory_alias {
-	__u32 slot;  /* this has a different namespace than memory slots */
-	__u32 flags;
-	__u64 guest_phys_addr;
-	__u64 memory_size;
-	__u64 target_phys_addr;
-};
-
 /* for KVM_GET_IRQCHIP and KVM_SET_IRQCHIP */
 struct kvm_pic_state {
 	__u8 last_irr;	/* edge detection */
@@ -214,6 +206,8 @@ struct kvm_msr_list {
 struct kvm_msr_filter_range {
 #define KVM_MSR_FILTER_READ  (1 << 0)
 #define KVM_MSR_FILTER_WRITE (1 << 1)
+#define KVM_MSR_FILTER_RANGE_VALID_MASK (KVM_MSR_FILTER_READ | \
+					 KVM_MSR_FILTER_WRITE)
 	__u32 flags;
 	__u32 nmsrs; /* number of msrs in bitmap */
 	__u32 base;  /* MSR index the bitmap starts at */
@@ -224,6 +218,7 @@ struct kvm_msr_filter_range {
 struct kvm_msr_filter {
 #define KVM_MSR_FILTER_DEFAULT_ALLOW (0 << 0)
 #define KVM_MSR_FILTER_DEFAULT_DENY  (1 << 0)
+#define KVM_MSR_FILTER_VALID_MASK (KVM_MSR_FILTER_DEFAULT_DENY)
 	__u32 flags;
 	struct kvm_msr_filter_range ranges[KVM_MSR_FILTER_MAX_RANGES];
 };
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index ebdafa576d..30b2795d10 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -86,14 +86,6 @@ struct kvm_debug_guest {
 /* *** End of deprecated interfaces *** */
 
 
-/* for KVM_CREATE_MEMORY_REGION */
-struct kvm_memory_region {
-	__u32 slot;
-	__u32 flags;
-	__u64 guest_phys_addr;
-	__u64 memory_size; /* bytes */
-};
-
 /* for KVM_SET_USER_MEMORY_REGION */
 struct kvm_userspace_memory_region {
 	__u32 slot;
@@ -104,9 +96,9 @@ struct kvm_userspace_memory_region {
 };
 
 /*
- * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
- * other bits are reserved for kvm internal use which are defined in
- * include/linux/kvm_host.h.
+ * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
+ * userspace, other bits are reserved for kvm internal use which are defined
+ * in include/linux/kvm_host.h.
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
@@ -483,6 +475,9 @@ struct kvm_run {
 #define KVM_MSR_EXIT_REASON_INVAL	(1 << 0)
 #define KVM_MSR_EXIT_REASON_UNKNOWN	(1 << 1)
 #define KVM_MSR_EXIT_REASON_FILTER	(1 << 2)
+#define KVM_MSR_EXIT_REASON_VALID_MASK	(KVM_MSR_EXIT_REASON_INVAL   |	\
+					 KVM_MSR_EXIT_REASON_UNKNOWN |	\
+					 KVM_MSR_EXIT_REASON_FILTER)
 			__u32 reason; /* kernel -> user */
 			__u32 index; /* kernel -> user */
 			__u64 data; /* kernel <-> user */
@@ -1175,6 +1170,9 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_VM_DISABLE_NX_HUGE_PAGES 220
 #define KVM_CAP_S390_ZPCI_OP 221
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
+#define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
+#define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
+#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1264,6 +1262,7 @@ struct kvm_x86_mce {
 #define KVM_XEN_HVM_CONFIG_RUNSTATE		(1 << 3)
 #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL	(1 << 4)
 #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND		(1 << 5)
+#define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG	(1 << 6)
 
 struct kvm_xen_hvm_config {
 	__u32 flags;
@@ -1434,18 +1433,12 @@ struct kvm_vfio_spapr_tce {
 	__s32	tablefd;
 };
 
-/*
- * ioctls for VM fds
- */
-#define KVM_SET_MEMORY_REGION     _IOW(KVMIO,  0x40, struct kvm_memory_region)
 /*
  * KVM_CREATE_VCPU receives as a parameter the vcpu slot, and returns
  * a vcpu fd.
  */
 #define KVM_CREATE_VCPU           _IO(KVMIO,   0x41)
 #define KVM_GET_DIRTY_LOG         _IOW(KVMIO,  0x42, struct kvm_dirty_log)
-/* KVM_SET_MEMORY_ALIAS is obsolete: */
-#define KVM_SET_MEMORY_ALIAS      _IOW(KVMIO,  0x43, struct kvm_memory_alias)
 #define KVM_SET_NR_MMU_PAGES      _IO(KVMIO,   0x44)
 #define KVM_GET_NR_MMU_PAGES      _IO(KVMIO,   0x45)
 #define KVM_SET_USER_MEMORY_REGION _IOW(KVMIO, 0x46, \
@@ -1737,6 +1730,8 @@ enum pv_cmd_id {
 	KVM_PV_UNSHARE_ALL,
 	KVM_PV_INFO,
 	KVM_PV_DUMP,
+	KVM_PV_ASYNC_CLEANUP_PREPARE,
+	KVM_PV_ASYNC_CLEANUP_PERFORM,
 };
 
 struct kvm_pv_cmd {
@@ -1767,6 +1762,7 @@ struct kvm_xen_hvm_attr {
 	union {
 		__u8 long_mode;
 		__u8 vector;
+		__u8 runstate_update_flag;
 		struct {
 			__u64 gfn;
 		} shared_info;
@@ -1807,6 +1803,8 @@ struct kvm_xen_hvm_attr {
 /* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_EVTCHN_SEND */
 #define KVM_XEN_ATTR_TYPE_EVTCHN		0x3
 #define KVM_XEN_ATTR_TYPE_XEN_VERSION		0x4
+/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG */
+#define KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG	0x5
 
 /* Per-vCPU Xen attributes */
 #define KVM_XEN_VCPU_GET_ATTR	_IOWR(KVMIO, 0xca, struct kvm_xen_vcpu_attr)
diff --git a/linux-headers/linux/psci.h b/linux-headers/linux/psci.h
index 213b2a0f70..e60dfd8907 100644
--- a/linux-headers/linux/psci.h
+++ b/linux-headers/linux/psci.h
@@ -48,12 +48,26 @@
 #define PSCI_0_2_FN64_MIGRATE_INFO_UP_CPU	PSCI_0_2_FN64(7)
 
 #define PSCI_1_0_FN_PSCI_FEATURES		PSCI_0_2_FN(10)
+#define PSCI_1_0_FN_CPU_FREEZE			PSCI_0_2_FN(11)
+#define PSCI_1_0_FN_CPU_DEFAULT_SUSPEND		PSCI_0_2_FN(12)
+#define PSCI_1_0_FN_NODE_HW_STATE		PSCI_0_2_FN(13)
 #define PSCI_1_0_FN_SYSTEM_SUSPEND		PSCI_0_2_FN(14)
 #define PSCI_1_0_FN_SET_SUSPEND_MODE		PSCI_0_2_FN(15)
+#define PSCI_1_0_FN_STAT_RESIDENCY		PSCI_0_2_FN(16)
+#define PSCI_1_0_FN_STAT_COUNT			PSCI_0_2_FN(17)
+
 #define PSCI_1_1_FN_SYSTEM_RESET2		PSCI_0_2_FN(18)
+#define PSCI_1_1_FN_MEM_PROTECT			PSCI_0_2_FN(19)
+#define PSCI_1_1_FN_MEM_PROTECT_CHECK_RANGE	PSCI_0_2_FN(19)
 
+#define PSCI_1_0_FN64_CPU_DEFAULT_SUSPEND	PSCI_0_2_FN64(12)
+#define PSCI_1_0_FN64_NODE_HW_STATE		PSCI_0_2_FN64(13)
 #define PSCI_1_0_FN64_SYSTEM_SUSPEND		PSCI_0_2_FN64(14)
+#define PSCI_1_0_FN64_STAT_RESIDENCY		PSCI_0_2_FN64(16)
+#define PSCI_1_0_FN64_STAT_COUNT		PSCI_0_2_FN64(17)
+
 #define PSCI_1_1_FN64_SYSTEM_RESET2		PSCI_0_2_FN64(18)
+#define PSCI_1_1_FN64_MEM_PROTECT_CHECK_RANGE	PSCI_0_2_FN64(19)
 
 /* PSCI v0.2 power state encoding for CPU_SUSPEND function */
 #define PSCI_0_2_POWER_STATE_ID_MASK		0xffff
diff --git a/linux-headers/linux/userfaultfd.h b/linux-headers/linux/userfaultfd.h
index a3a377cd44..ba5d0df52f 100644
--- a/linux-headers/linux/userfaultfd.h
+++ b/linux-headers/linux/userfaultfd.h
@@ -12,6 +12,10 @@
 
 #include <linux/types.h>
 
+/* ioctls for /dev/userfaultfd */
+#define USERFAULTFD_IOC 0xAA
+#define USERFAULTFD_IOC_NEW _IO(USERFAULTFD_IOC, 0x00)
+
 /*
  * If the UFFDIO_API is upgraded someday, the UFFDIO_UNREGISTER and
  * UFFDIO_WAKE ioctls should be defined as _IOW and not as _IOR.  In
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index ede44b5572..c59692ce0b 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -819,12 +819,20 @@ struct vfio_device_feature {
  * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
  * is supported in addition to the STOP_COPY states.
  *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY means that
+ * PRE_COPY is supported in addition to the STOP_COPY states.
+ *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY
+ * means that RUNNING_P2P, PRE_COPY and PRE_COPY_P2P are supported
+ * in addition to the STOP_COPY states.
+ *
  * Other combinations of flags have behavior to be defined in the future.
  */
 struct vfio_device_feature_migration {
 	__aligned_u64 flags;
 #define VFIO_MIGRATION_STOP_COPY	(1 << 0)
 #define VFIO_MIGRATION_P2P		(1 << 1)
+#define VFIO_MIGRATION_PRE_COPY		(1 << 2)
 };
 #define VFIO_DEVICE_FEATURE_MIGRATION 1
 
@@ -875,8 +883,13 @@ struct vfio_device_feature_mig_state {
  *  RESUMING - The device is stopped and is loading a new internal state
  *  ERROR - The device has failed and must be reset
  *
- * And 1 optional state to support VFIO_MIGRATION_P2P:
+ * And optional states to support VFIO_MIGRATION_P2P:
  *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
+ * And VFIO_MIGRATION_PRE_COPY:
+ *  PRE_COPY - The device is running normally but tracking internal state
+ *             changes
+ * And VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY:
+ *  PRE_COPY_P2P - PRE_COPY, except the device cannot do peer to peer DMA
  *
  * The FSM takes actions on the arcs between FSM states. The driver implements
  * the following behavior for the FSM arcs:
@@ -908,20 +921,48 @@ struct vfio_device_feature_mig_state {
  *
  *   To abort a RESUMING session the device must be reset.
  *
+ * PRE_COPY -> RUNNING
  * RUNNING_P2P -> RUNNING
  *   While in RUNNING the device is fully operational, the device may generate
  *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
  *   and the device may advance its internal state.
  *
+ *   The PRE_COPY arc will terminate a data transfer session.
+ *
+ * PRE_COPY_P2P -> RUNNING_P2P
  * RUNNING -> RUNNING_P2P
  * STOP -> RUNNING_P2P
  *   While in RUNNING_P2P the device is partially running in the P2P quiescent
  *   state defined below.
  *
+ *   The PRE_COPY_P2P arc will terminate a data transfer session.
+ *
+ * RUNNING -> PRE_COPY
+ * RUNNING_P2P -> PRE_COPY_P2P
  * STOP -> STOP_COPY
- *   This arc begin the process of saving the device state and will return a
- *   new data_fd.
+ *   PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of states
+ *   which share a data transfer session. Moving between these states alters
+ *   what is streamed in session, but does not terminate or otherwise affect
+ *   the associated fd.
+ *
+ *   These arcs begin the process of saving the device state and will return a
+ *   new data_fd. The migration driver may perform actions such as enabling
+ *   dirty logging of device state when entering PRE_COPY or PER_COPY_P2P.
+ *
+ *   Each arc does not change the device operation, the device remains
+ *   RUNNING, P2P quiesced or in STOP. The STOP_COPY state is described below
+ *   in PRE_COPY_P2P -> STOP_COPY.
+ *
+ * PRE_COPY -> PRE_COPY_P2P
+ *   Entering PRE_COPY_P2P continues all the behaviors of PRE_COPY above.
+ *   However, while in the PRE_COPY_P2P state, the device is partially running
+ *   in the P2P quiescent state defined below, like RUNNING_P2P.
+ *
+ * PRE_COPY_P2P -> PRE_COPY
+ *   This arc allows returning the device to a full RUNNING behavior while
+ *   continuing all the behaviors of PRE_COPY.
  *
+ * PRE_COPY_P2P -> STOP_COPY
  *   While in the STOP_COPY state the device has the same behavior as STOP
  *   with the addition that the data transfers session continues to stream the
  *   migration state. End of stream on the FD indicates the entire device
@@ -939,6 +980,13 @@ struct vfio_device_feature_mig_state {
  *   device state for this arc if required to prepare the device to receive the
  *   migration data.
  *
+ * STOP_COPY -> PRE_COPY
+ * STOP_COPY -> PRE_COPY_P2P
+ *   These arcs are not permitted and return error if requested. Future
+ *   revisions of this API may define behaviors for these arcs, in this case
+ *   support will be discoverable by a new flag in
+ *   VFIO_DEVICE_FEATURE_MIGRATION.
+ *
  * any -> ERROR
  *   ERROR cannot be specified as a device state, however any transition request
  *   can be failed with an errno return and may then move the device_state into
@@ -950,7 +998,7 @@ struct vfio_device_feature_mig_state {
  * The optional peer to peer (P2P) quiescent state is intended to be a quiescent
  * state for the device for the purposes of managing multiple devices within a
  * user context where peer-to-peer DMA between devices may be active. The
- * RUNNING_P2P states must prevent the device from initiating
+ * RUNNING_P2P and PRE_COPY_P2P states must prevent the device from initiating
  * any new P2P DMA transactions. If the device can identify P2P transactions
  * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
  * driver must complete any such outstanding operations prior to completing the
@@ -963,6 +1011,8 @@ struct vfio_device_feature_mig_state {
  * above FSM arcs. As there are multiple paths through the FSM arcs the path
  * should be selected based on the following rules:
  *   - Select the shortest path.
+ *   - The path cannot have saving group states as interior arcs, only
+ *     starting/end states.
  * Refer to vfio_mig_get_next_state() for the result of the algorithm.
  *
  * The automatic transit through the FSM arcs that make up the combination
@@ -976,6 +1026,9 @@ struct vfio_device_feature_mig_state {
  * support them. The user can discover if these states are supported by using
  * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
  * avoid knowing about these optional states if the kernel driver supports them.
+ *
+ * Arcs touching PRE_COPY and PRE_COPY_P2P are removed if support for PRE_COPY
+ * is not present.
  */
 enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_ERROR = 0,
@@ -984,8 +1037,225 @@ enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_STOP_COPY = 3,
 	VFIO_DEVICE_STATE_RESUMING = 4,
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
+	VFIO_DEVICE_STATE_PRE_COPY = 6,
+	VFIO_DEVICE_STATE_PRE_COPY_P2P = 7,
+};
+
+/**
+ * VFIO_MIG_GET_PRECOPY_INFO - _IO(VFIO_TYPE, VFIO_BASE + 21)
+ *
+ * This ioctl is used on the migration data FD in the precopy phase of the
+ * migration data transfer. It returns an estimate of the current data sizes
+ * remaining to be transferred. It allows the user to judge when it is
+ * appropriate to leave PRE_COPY for STOP_COPY.
+ *
+ * This ioctl is valid only in PRE_COPY states and kernel driver should
+ * return -EINVAL from any other migration state.
+ *
+ * The vfio_precopy_info data structure returned by this ioctl provides
+ * estimates of data available from the device during the PRE_COPY states.
+ * This estimate is split into two categories, initial_bytes and
+ * dirty_bytes.
+ *
+ * The initial_bytes field indicates the amount of initial precopy
+ * data available from the device. This field should have a non-zero initial
+ * value and decrease as migration data is read from the device.
+ * It is recommended to leave PRE_COPY for STOP_COPY only after this field
+ * reaches zero. Leaving PRE_COPY earlier might make things slower.
+ *
+ * The dirty_bytes field tracks device state changes relative to data
+ * previously retrieved.  This field starts at zero and may increase as
+ * the internal device state is modified or decrease as that modified
+ * state is read from the device.
+ *
+ * Userspace may use the combination of these fields to estimate the
+ * potential data size available during the PRE_COPY phases, as well as
+ * trends relative to the rate the device is dirtying its internal
+ * state, but these fields are not required to have any bearing relative
+ * to the data size available during the STOP_COPY phase.
+ *
+ * Drivers have a lot of flexibility in when and what they transfer during the
+ * PRE_COPY phase, and how they report this from VFIO_MIG_GET_PRECOPY_INFO.
+ *
+ * During pre-copy the migration data FD has a temporary "end of stream" that is
+ * reached when both initial_bytes and dirty_byte are zero. For instance, this
+ * may indicate that the device is idle and not currently dirtying any internal
+ * state. When read() is done on this temporary end of stream the kernel driver
+ * should return ENOMSG from read(). Userspace can wait for more data (which may
+ * never come) by using poll.
+ *
+ * Once in STOP_COPY the migration data FD has a permanent end of stream
+ * signaled in the usual way by read() always returning 0 and poll always
+ * returning readable. ENOMSG may not be returned in STOP_COPY.
+ * Support for this ioctl is mandatory if a driver claims to support
+ * VFIO_MIGRATION_PRE_COPY.
+ *
+ * Return: 0 on success, -1 and errno set on failure.
+ */
+struct vfio_precopy_info {
+	__u32 argsz;
+	__u32 flags;
+	__aligned_u64 initial_bytes;
+	__aligned_u64 dirty_bytes;
+};
+
+#define VFIO_MIG_GET_PRECOPY_INFO _IO(VFIO_TYPE, VFIO_BASE + 21)
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
+ * state with the platform-based power management.  Device use of lower power
+ * states depends on factors managed by the runtime power management core,
+ * including system level support and coordinating support among dependent
+ * devices.  Enabling device low power entry does not guarantee lower power
+ * usage by the device, nor is a mechanism provided through this feature to
+ * know the current power state of the device.  If any device access happens
+ * (either from the host or through the vfio uAPI) when the device is in the
+ * low power state, then the host will move the device out of the low power
+ * state as necessary prior to the access.  Once the access is completed, the
+ * device may re-enter the low power state.  For single shot low power support
+ * with wake-up notification, see
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
+ * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
+ * calling LOW_POWER_EXIT.
+ */
+#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
+
+/*
+ * This device feature has the same behavior as
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
+ * provides an eventfd for wake-up notification.  When the device moves out of
+ * the low power state for the wake-up, the host will not allow the device to
+ * re-enter a low power state without a subsequent user call to one of the low
+ * power entry device feature IOCTLs.  Access to mmap'd device regions is
+ * disabled on LOW_POWER_ENTRY_WITH_WAKEUP and may only be resumed after the
+ * low power exit.  The low power exit can happen either through LOW_POWER_EXIT
+ * or through any other access (where the wake-up notification has been
+ * generated).  The access to mmap'd device regions will not trigger low power
+ * exit.
+ *
+ * The notification through the provided eventfd will be generated only when
+ * the device has entered and is resumed from a low power state after
+ * calling this device feature IOCTL.  A device that has not entered low power
+ * state, as managed through the runtime power management core, will not
+ * generate a notification through the provided eventfd on access.  Calling the
+ * LOW_POWER_EXIT feature is optional in the case where notification has been
+ * signaled on the provided eventfd that a resume from low power has occurred.
+ */
+struct vfio_device_low_power_entry_with_wakeup {
+	__s32 wakeup_eventfd;
+	__u32 reserved;
+};
+
+#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states as
+ * previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
+ * This device feature IOCTL may itself generate a wakeup eventfd notification
+ * in the latter case if the device had previously entered a low power state.
+ */
+#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET start/stop device DMA logging.
+ * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device supports
+ * DMA logging.
+ *
+ * DMA logging allows a device to internally record what DMAs the device is
+ * initiating and report them back to userspace. It is part of the VFIO
+ * migration infrastructure that allows implementing dirty page tracking
+ * during the pre copy phase of live migration. Only DMA WRITEs are logged,
+ * and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
+ *
+ * When DMA logging is started a range of IOVAs to monitor is provided and the
+ * device can optimize its logging to cover only the IOVA range given. Each
+ * DMA that the device initiates inside the range will be logged by the device
+ * for later retrieval.
+ *
+ * page_size is an input that hints what tracking granularity the device
+ * should try to achieve. If the device cannot do the hinted page size then
+ * it's the driver choice which page size to pick based on its support.
+ * On output the device will return the page size it selected.
+ *
+ * ranges is a pointer to an array of
+ * struct vfio_device_feature_dma_logging_range.
+ *
+ * The core kernel code guarantees to support by minimum num_ranges that fit
+ * into a single kernel page. User space can try higher values but should give
+ * up if the above can't be achieved as of some driver limitations.
+ *
+ * A single call to start device DMA logging can be issued and a matching stop
+ * should follow at the end. Another start is not allowed in the meantime.
+ */
+struct vfio_device_feature_dma_logging_control {
+	__aligned_u64 page_size;
+	__u32 num_ranges;
+	__u32 __reserved;
+	__aligned_u64 ranges;
 };
 
+struct vfio_device_feature_dma_logging_range {
+	__aligned_u64 iova;
+	__aligned_u64 length;
+};
+
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 6
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started
+ * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START
+ */
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 7
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA log
+ *
+ * Query the device's DMA log for written pages within the given IOVA range.
+ * During querying the log is cleared for the IOVA range.
+ *
+ * bitmap is a pointer to an array of u64s that will hold the output bitmap
+ * with 1 bit reporting a page_size unit of IOVA. The mapping of IOVA to bits
+ * is given by:
+ *  bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))
+ *
+ * The input page_size can be any power of two value and does not have to
+ * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START. The driver
+ * will format its internal logging to match the reporting page size, possibly
+ * by replicating bits if the internal page size is lower than requested.
+ *
+ * The LOGGING_REPORT will only set bits in the bitmap and never clear or
+ * perform any initialization of the user provided bitmap.
+ *
+ * If any error is returned userspace should assume that the dirty log is
+ * corrupted. Error recovery is to consider all memory dirty and try to
+ * restart the dirty tracking, or to abort/restart the whole migration.
+ *
+ * If DMA logging is not enabled, an error will be returned.
+ *
+ */
+struct vfio_device_feature_dma_logging_report {
+	__aligned_u64 iova;
+	__aligned_u64 length;
+	__aligned_u64 page_size;
+	__aligned_u64 bitmap;
+};
+
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 8
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_GET read back the estimated data length that will
+ * be required to complete stop copy.
+ *
+ * Note: Can be called on each device state.
+ */
+
+struct vfio_device_feature_mig_data_size {
+	__aligned_u64 stop_copy_length;
+};
+
+#define VFIO_DEVICE_FEATURE_MIG_DATA_SIZE 9
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
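
(Aside: the VFIO_MIG_GET_PRECOPY_INFO uAPI documented in the hunk above
can be exercised roughly as below; this is a sketch only, with error
handling trimmed and the helper name invented:)

    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /*
     * Poll the precopy estimates on the migration data fd to decide
     * whether to move from PRE_COPY to STOP_COPY.  Fails with EINVAL
     * outside the PRE_COPY states.
     */
    static int precopy_drained(int data_fd)
    {
        struct vfio_precopy_info info = {
            .argsz = sizeof(info),
        };

        if (ioctl(data_fd, VFIO_MIG_GET_PRECOPY_INFO, &info) < 0) {
            return -1;
        }
        /* The header recommends leaving PRE_COPY once this hits zero. */
        return info.initial_bytes == 0;
    }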
-- 
2.37.3




* [PATCH RFC 02/21] util: Include osdep.h first in util/mmap-alloc.c
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
  2023-01-17 22:08 ` [PATCH RFC 01/21] update linux headers Peter Xu
@ 2023-01-17 22:08 ` Peter Xu
  2023-01-18 12:00   ` Dr. David Alan Gilbert
                     ` (2 more replies)
  2023-01-17 22:08 ` [PATCH RFC 03/21] physmem: Add qemu_ram_is_hugetlb() Peter Xu
                   ` (18 subsequent siblings)
  20 siblings, 3 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

Without it, CONFIG_LINUX is never defined even on Linux, because
qemu/osdep.h is what pulls in the generated config-host.h where
CONFIG_LINUX lives.  As a result, linux/mman.h is never actually included.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 util/mmap-alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 5ed7d29183..040599b0e3 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -9,6 +9,7 @@
  * This work is licensed under the terms of the GNU GPL, version 2 or
  * later.  See the COPYING file in the top-level directory.
  */
+#include "qemu/osdep.h"
 
 #ifdef CONFIG_LINUX
 #include <linux/mman.h>
@@ -17,7 +18,6 @@
 #define MAP_SHARED_VALIDATE   0x0
 #endif /* CONFIG_LINUX */
 
-#include "qemu/osdep.h"
 #include "qemu/mmap-alloc.h"
 #include "qemu/host-utils.h"
 #include "qemu/cutils.h"
-- 
2.37.3




* [PATCH RFC 03/21] physmem: Add qemu_ram_is_hugetlb()
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
  2023-01-17 22:08 ` [PATCH RFC 01/21] update linux headers Peter Xu
  2023-01-17 22:08 ` [PATCH RFC 02/21] util: Include osdep.h first in util/mmap-alloc.c Peter Xu
@ 2023-01-17 22:08 ` Peter Xu
  2023-01-18 12:02   ` Dr. David Alan Gilbert
  2023-01-30  5:00   ` Juan Quintela
  2023-01-17 22:08 ` [PATCH RFC 04/21] madvise: Include linux/mman.h under linux-headers/ Peter Xu
                   ` (17 subsequent siblings)
  20 siblings, 2 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

Returns true for a hugetlbfs mapping (detected as the ramblock's page size
being larger than the host's base page size), false otherwise.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/cpu-common.h | 1 +
 softmmu/physmem.c         | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 6feaa40ca7..94452aa17f 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -95,6 +95,7 @@ void qemu_ram_unset_migratable(RAMBlock *rb);
 int qemu_ram_get_fd(RAMBlock *rb);
 
 size_t qemu_ram_pagesize(RAMBlock *block);
+bool qemu_ram_is_hugetlb(RAMBlock *rb);
 size_t qemu_ram_pagesize_largest(void);
 
 /**
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index edec095c7a..a4fb129d8f 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1798,6 +1798,11 @@ size_t qemu_ram_pagesize(RAMBlock *rb)
     return rb->page_size;
 }
 
+bool qemu_ram_is_hugetlb(RAMBlock *rb)
+{
+    return rb->page_size > qemu_real_host_page_size();
+}
+
 /* Returns the largest size of page in use */
 size_t qemu_ram_pagesize_largest(void)
 {
-- 
2.37.3




* [PATCH RFC 04/21] madvise: Include linux/mman.h under linux-headers/
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (2 preceding siblings ...)
  2023-01-17 22:08 ` [PATCH RFC 03/21] physmem: Add qemu_ram_is_hugetlb() Peter Xu
@ 2023-01-17 22:08 ` Peter Xu
  2023-01-18 12:08   ` Dr. David Alan Gilbert
  2023-01-30  5:01   ` Juan Quintela
  2023-01-17 22:08 ` [PATCH RFC 05/21] madvise: Add QEMU_MADV_SPLIT Peter Xu
                   ` (16 subsequent siblings)
  20 siblings, 2 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

This allows qemu/madvise.h to always include linux/mman.h from
linux-headers/ on Linux builds.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/qemu/madvise.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/qemu/madvise.h b/include/qemu/madvise.h
index e155f59a0d..b6fa49553f 100644
--- a/include/qemu/madvise.h
+++ b/include/qemu/madvise.h
@@ -8,6 +8,10 @@
 #ifndef QEMU_MADVISE_H
 #define QEMU_MADVISE_H
 
+#ifdef CONFIG_LINUX
+#include "linux/mman.h"
+#endif
+
 #define QEMU_MADV_INVALID -1
 
 #if defined(CONFIG_MADVISE)
-- 
2.37.3




* [PATCH RFC 05/21] madvise: Add QEMU_MADV_SPLIT
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (3 preceding siblings ...)
  2023-01-17 22:08 ` [PATCH RFC 04/21] madvise: Include linux/mman.h under linux-headers/ Peter Xu
@ 2023-01-17 22:08 ` Peter Xu
  2023-01-30  5:01   ` Juan Quintela
  2023-01-17 22:08 ` [PATCH RFC 06/21] madvise: Add QEMU_MADV_COLLAPSE Peter Xu
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

MADV_SPLIT is a new madvise() advice proposed on Linux (part of the hugetlb
HGM kernel series this work depends on).  Define QEMU_MADV_SPLIT for it.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/qemu/madvise.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/qemu/madvise.h b/include/qemu/madvise.h
index b6fa49553f..3dddd25065 100644
--- a/include/qemu/madvise.h
+++ b/include/qemu/madvise.h
@@ -63,6 +63,11 @@
 #else
 #define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
 #endif
+#ifdef MADV_SPLIT
+#define QEMU_MADV_SPLIT MADV_SPLIT
+#else
+#define QEMU_MADV_SPLIT QEMU_MADV_INVALID
+#endif
 
 #elif defined(CONFIG_POSIX_MADVISE)
 
@@ -77,6 +82,7 @@
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED
 #define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
+#define QEMU_MADV_SPLIT QEMU_MADV_INVALID
 
 #else /* no-op */
 
@@ -91,6 +97,7 @@
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
 #define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
+#define QEMU_MADV_SPLIT QEMU_MADV_INVALID
 
 #endif
 
-- 
2.37.3




* [PATCH RFC 06/21] madvise: Add QEMU_MADV_COLLAPSE
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (4 preceding siblings ...)
  2023-01-17 22:08 ` [PATCH RFC 05/21] madvise: Add QEMU_MADV_SPLIT Peter Xu
@ 2023-01-17 22:08 ` Peter Xu
  2023-01-18 18:51   ` Dr. David Alan Gilbert
  2023-01-30  5:02   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 07/21] ramblock: Cache file offset for file-backed ramblocks Peter Xu
                   ` (14 subsequent siblings)
  20 siblings, 2 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

MADV_COLLAPSE is a new madvise() advice on Linux.  Define
QEMU_MADV_COLLAPSE for it.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/qemu/madvise.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/qemu/madvise.h b/include/qemu/madvise.h
index 3dddd25065..794e5fb0a7 100644
--- a/include/qemu/madvise.h
+++ b/include/qemu/madvise.h
@@ -68,6 +68,11 @@
 #else
 #define QEMU_MADV_SPLIT QEMU_MADV_INVALID
 #endif
+#ifdef MADV_COLLAPSE
+#define QEMU_MADV_COLLAPSE MADV_COLLAPSE
+#else
+#define QEMU_MADV_COLLAPSE QEMU_MADV_INVALID
+#endif
 
 #elif defined(CONFIG_POSIX_MADVISE)
 
@@ -83,6 +88,7 @@
 #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED
 #define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
 #define QEMU_MADV_SPLIT QEMU_MADV_INVALID
+#define QEMU_MADV_COLLAPSE QEMU_MADV_INVALID
 
 #else /* no-op */
 
@@ -98,6 +104,7 @@
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
 #define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
 #define QEMU_MADV_SPLIT QEMU_MADV_INVALID
+#define QEMU_MADV_COLLAPSE QEMU_MADV_INVALID
 
 #endif
 
-- 
2.37.3
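
Together with QEMU_MADV_SPLIT from the previous patch, a plausible shape of
how the destination side would use the two advices is sketched below.  This
is an illustration of the intended flow only, not code from this series,
and it assumes the semantics of the proposed HGM kernel API (MADV_SPLIT to
allow a hugetlb range to be mapped in small pages, MADV_COLLAPSE to fold it
back into huge mappings afterwards); 'host' and 'length' are placeholders
for the ramblock's mapping:

    /* Before postcopy: allow small-page mappings on the hugetlb range */
    if (qemu_madvise(host, length, QEMU_MADV_SPLIT)) {
        error_report("MADV_SPLIT failed: %s", strerror(errno));
    }

    /* ... postcopy runs, resolving faults in small-page units ... */

    /* After migration completes: restore huge mappings */
    if (qemu_madvise(host, length, QEMU_MADV_COLLAPSE)) {
        error_report("MADV_COLLAPSE failed: %s", strerror(errno));
    }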




* [PATCH RFC 07/21] ramblock: Cache file offset for file-backed ramblocks
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (5 preceding siblings ...)
  2023-01-17 22:08 ` [PATCH RFC 06/21] madvise: Add QEMU_MADV_COLLAPSE Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-01-30  5:02   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 08/21] ramblock: Cache the length to do file mmap() on ramblocks Peter Xu
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

This value was previously only used at mmap() time, when we want to map
memory at a specific offset of the file.  To prepare for mapping the same
range again later for whatever reason, cache the offset in the ramblock so
we know how to redo the mapping.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/ramblock.h | 5 +++++
 softmmu/physmem.c       | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index adc03df59c..76cd0812c8 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -41,6 +41,11 @@ struct RAMBlock {
     QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
     int fd;
     size_t page_size;
+    /*
+     * Cache for file offset to map the ramblock.  Only used for
+     * file-backed ramblocks.
+     */
+    off_t file_offset;
     /* dirty bitmap used during migration */
     unsigned long *bmap;
     /* bitmap of already received pages in postcopy */
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index a4fb129d8f..aa1a7466e5 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1543,6 +1543,8 @@ static void *file_ram_alloc(RAMBlock *block,
     uint32_t qemu_map_flags;
     void *area;
 
+    /* Remember the offset just in case we'll need to map the range again */
+    block->file_offset = offset;
     block->page_size = qemu_fd_getpagesize(fd);
     if (block->mr->align % block->page_size) {
         error_setg(errp, "alignment 0x%" PRIx64
-- 
2.37.3




* [PATCH RFC 08/21] ramblock: Cache the length to do file mmap() on ramblocks
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (6 preceding siblings ...)
  2023-01-17 22:09 ` [PATCH RFC 07/21] ramblock: Cache file offset for file-backed ramblocks Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-01-23 18:51   ` Dr. David Alan Gilbert
  2023-01-30  5:05   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 09/21] ramblock: Add RAM_READONLY Peter Xu
                   ` (12 subsequent siblings)
  20 siblings, 2 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

We do proper page size alignment for file-backed mmap()s on ramblocks.
Even though the calculation is simple, cache the value because it'll be
used in multiple places.

While at it, drop the size parameter of file_ram_alloc() and just use
max_length, because the two are always the same for file-backed ramblocks.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/ramblock.h |  2 ++
 softmmu/physmem.c       | 14 +++++++-------
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 76cd0812c8..3f31ce1591 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -32,6 +32,8 @@ struct RAMBlock {
     ram_addr_t offset;
     ram_addr_t used_length;
     ram_addr_t max_length;
+    /* Only used for file-backed ramblocks */
+    ram_addr_t mmap_length;
     void (*resized)(const char*, uint64_t length, void *host);
     uint32_t flags;
     /* Protected by iothread lock.  */
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index aa1a7466e5..b5be02f1cb 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1533,7 +1533,6 @@ static int file_ram_open(const char *path,
 }
 
 static void *file_ram_alloc(RAMBlock *block,
-                            ram_addr_t memory,
                             int fd,
                             bool readonly,
                             bool truncate,
@@ -1563,14 +1562,14 @@ static void *file_ram_alloc(RAMBlock *block,
     }
 #endif
 
-    if (memory < block->page_size) {
+    if (block->max_length < block->page_size) {
         error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
                    "or larger than page size 0x%zx",
-                   memory, block->page_size);
+                   block->max_length, block->page_size);
         return NULL;
     }
 
-    memory = ROUND_UP(memory, block->page_size);
+    block->mmap_length = ROUND_UP(block->max_length, block->page_size);
 
     /*
      * ftruncate is not supported by hugetlbfs in older
@@ -1586,7 +1585,7 @@ static void *file_ram_alloc(RAMBlock *block,
      * those labels. Therefore, extending the non-empty backend file
      * is disabled as well.
      */
-    if (truncate && ftruncate(fd, memory)) {
+    if (truncate && ftruncate(fd, block->mmap_length)) {
         perror("ftruncate");
     }
 
@@ -1594,7 +1593,8 @@ static void *file_ram_alloc(RAMBlock *block,
     qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
     qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
     qemu_map_flags |= (block->flags & RAM_NORESERVE) ? QEMU_MAP_NORESERVE : 0;
-    area = qemu_ram_mmap(fd, memory, block->mr->align, qemu_map_flags, offset);
+    area = qemu_ram_mmap(fd, block->mmap_length, block->mr->align,
+                         qemu_map_flags, offset);
     if (area == MAP_FAILED) {
         error_setg_errno(errp, errno,
                          "unable to map backing store for guest RAM");
@@ -2100,7 +2100,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     new_block->used_length = size;
     new_block->max_length = size;
     new_block->flags = ram_flags;
-    new_block->host = file_ram_alloc(new_block, size, fd, readonly,
+    new_block->host = file_ram_alloc(new_block, fd, readonly,
                                      !file_size, offset, errp);
     if (!new_block->host) {
         g_free(new_block);
-- 
2.37.3




* [PATCH RFC 09/21] ramblock: Add RAM_READONLY
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (7 preceding siblings ...)
  2023-01-17 22:09 ` [PATCH RFC 08/21] ramblock: Cache the length to do file mmap() on ramblocks Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-01-23 19:42   ` Dr. David Alan Gilbert
  2023-01-30  5:06   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 10/21] ramblock: Add ramblock_file_map() Peter Xu
                   ` (11 subsequent siblings)
  20 siblings, 2 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

This adds RAM_READONLY, which can be set in ram_flags to mark a ramblock
as read-only.

We used to pass a separate readonly boolean along the ramblock allocation
path; now carry it together with the rest of the ramblock flags.

The main purpose of this patch is not cleanup, though: it is to cache the
mapping information of each ramblock, so that when we want to mmap() it
again for whatever reason we have all the information at hand.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 backends/hostmem-file.c |  3 ++-
 include/exec/memory.h   |  4 ++--
 include/exec/ram_addr.h |  5 ++---
 softmmu/memory.c        |  8 +++-----
 softmmu/physmem.c       | 16 +++++++---------
 5 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 25141283c4..1daf00d2da 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -56,9 +56,10 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     ram_flags = backend->share ? RAM_SHARED : 0;
     ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
     ram_flags |= fb->is_pmem ? RAM_PMEM : 0;
+    ram_flags |= fb->readonly ? RAM_READONLY : 0;
     memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name,
                                      backend->size, fb->align, ram_flags,
-                                     fb->mem_path, fb->readonly, errp);
+                                     fb->mem_path, errp);
     g_free(name);
 #endif
 }
diff --git a/include/exec/memory.h b/include/exec/memory.h
index c37ffdbcd1..006ba77ede 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -188,6 +188,8 @@ typedef struct IOMMUTLBEvent {
 /* RAM is a persistent kind memory */
 #define RAM_PMEM (1 << 5)
 
+/* RAM is read-only */
+#define RAM_READONLY (1 << 6)
 
 /*
  * UFFDIO_WRITEPROTECT is used on this RAMBlock to
@@ -1292,7 +1294,6 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
  *             RAM_NORESERVE,
  * @path: the path in which to allocate the RAM.
- * @readonly: true to open @path for reading, false for read/write.
  * @errp: pointer to Error*, to store an error if it happens.
  *
  * Note that this function does not do anything to cause the data in the
@@ -1305,7 +1306,6 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
                                       uint64_t align,
                                       uint32_t ram_flags,
                                       const char *path,
-                                      bool readonly,
                                       Error **errp);
 
 /**
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index f4fb6a2111..0bf9cfc659 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -110,7 +110,6 @@ long qemu_maxrampagesize(void);
  *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
  *              RAM_NORESERVE.
  *  @mem_path or @fd: specify the backing file or device
- *  @readonly: true to open @path for reading, false for read/write.
  *  @errp: pointer to Error*, to store an error if it happens
  *
  * Return:
@@ -119,10 +118,10 @@ long qemu_maxrampagesize(void);
  */
 RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
                                    uint32_t ram_flags, const char *mem_path,
-                                   bool readonly, Error **errp);
+                                   Error **errp);
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
                                  uint32_t ram_flags, int fd, off_t offset,
-                                 bool readonly, Error **errp);
+                                 Error **errp);
 
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
                                   MemoryRegion *mr, Error **errp);
diff --git a/softmmu/memory.c b/softmmu/memory.c
index e05332d07f..2137028773 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1601,18 +1601,16 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
                                       uint64_t align,
                                       uint32_t ram_flags,
                                       const char *path,
-                                      bool readonly,
                                       Error **errp)
 {
     Error *err = NULL;
     memory_region_init(mr, owner, name, size);
     mr->ram = true;
-    mr->readonly = readonly;
+    mr->readonly = ram_flags & RAM_READONLY;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
     mr->align = align;
-    mr->ram_block = qemu_ram_alloc_from_file(size, mr, ram_flags, path,
-                                             readonly, &err);
+    mr->ram_block = qemu_ram_alloc_from_file(size, mr, ram_flags, path, &err);
     if (err) {
         mr->size = int128_zero();
         object_unparent(OBJECT(mr));
@@ -1635,7 +1633,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
     mr->ram_block = qemu_ram_alloc_from_fd(size, mr, ram_flags, fd, offset,
-                                           false, &err);
+                                           &err);
     if (err) {
         mr->size = int128_zero();
         object_unparent(OBJECT(mr));
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index b5be02f1cb..6096eac286 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1534,7 +1534,6 @@ static int file_ram_open(const char *path,
 
 static void *file_ram_alloc(RAMBlock *block,
                             int fd,
-                            bool readonly,
                             bool truncate,
                             off_t offset,
                             Error **errp)
@@ -1589,7 +1588,7 @@ static void *file_ram_alloc(RAMBlock *block,
         perror("ftruncate");
     }
 
-    qemu_map_flags = readonly ? QEMU_MAP_READONLY : 0;
+    qemu_map_flags = (block->flags & RAM_READONLY) ? QEMU_MAP_READONLY : 0;
     qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
     qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
     qemu_map_flags |= (block->flags & RAM_NORESERVE) ? QEMU_MAP_NORESERVE : 0;
@@ -2057,7 +2056,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
 #ifdef CONFIG_POSIX
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
                                  uint32_t ram_flags, int fd, off_t offset,
-                                 bool readonly, Error **errp)
+                                 Error **errp)
 {
     RAMBlock *new_block;
     Error *local_err = NULL;
@@ -2065,7 +2064,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
 
     /* Just support these ram flags by now. */
     assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE |
-                          RAM_PROTECTED)) == 0);
+                          RAM_PROTECTED | RAM_READONLY)) == 0);
 
     if (xen_enabled()) {
         error_setg(errp, "-mem-path not supported with Xen");
@@ -2100,8 +2099,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     new_block->used_length = size;
     new_block->max_length = size;
     new_block->flags = ram_flags;
-    new_block->host = file_ram_alloc(new_block, fd, readonly,
-                                     !file_size, offset, errp);
+    new_block->host = file_ram_alloc(new_block, fd, !file_size, offset, errp);
     if (!new_block->host) {
         g_free(new_block);
         return NULL;
@@ -2120,11 +2118,11 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
 
 RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
                                    uint32_t ram_flags, const char *mem_path,
-                                   bool readonly, Error **errp)
+                                   Error **errp)
 {
     int fd;
-    bool created;
     RAMBlock *block;
+    bool created, readonly = ram_flags & RAM_READONLY;
 
     fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created,
                        errp);
@@ -2132,7 +2130,7 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
         return NULL;
     }
 
-    block = qemu_ram_alloc_from_fd(size, mr, ram_flags, fd, 0, readonly, errp);
+    block = qemu_ram_alloc_from_fd(size, mr, ram_flags, fd, 0, errp);
     if (!block) {
         if (created) {
             unlink(mem_path);
-- 
2.37.3




* [PATCH RFC 10/21] ramblock: Add ramblock_file_map()
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (8 preceding siblings ...)
  2023-01-17 22:09 ` [PATCH RFC 09/21] ramblock: Add RAM_READONLY Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-01-24 10:06   ` Dr. David Alan Gilbert
  2023-01-30  5:09   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 11/21] migration: Add hugetlb-doublemap cap Peter Xu
                   ` (10 subsequent siblings)
  20 siblings, 2 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

Add a helper to do the mmap() for a ramblock based on the cached
information.

One trivial thing to mention: we need to move the ramblock->fd setup
earlier, before the ramblock_file_map() call, because the helper needs to
reference the fd being mapped.  That should not be a problem at all,
mainly because the fd won't be freed on success, and on failure the fd
will be freed (or, to be explicit, close()d) by the caller.

Export the helper in preparation for use outside this file.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/ram_addr.h |  1 +
 softmmu/physmem.c       | 25 +++++++++++++++++--------
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 0bf9cfc659..56db25009a 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -98,6 +98,7 @@ bool ramblock_is_pmem(RAMBlock *rb);
 
 long qemu_minrampagesize(void);
 long qemu_maxrampagesize(void);
+void *ramblock_file_map(RAMBlock *block);
 
 /**
  * qemu_ram_alloc_from_file,
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 6096eac286..cdda7eaea5 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1532,17 +1532,31 @@ static int file_ram_open(const char *path,
     return fd;
 }
 
+/* Do the mmap() for a ramblock based on the information already set up */
+void *ramblock_file_map(RAMBlock *block)
+{
+    uint32_t qemu_map_flags;
+
+    qemu_map_flags = (block->flags & RAM_READONLY) ? QEMU_MAP_READONLY : 0;
+    qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
+    qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
+    qemu_map_flags |= (block->flags & RAM_NORESERVE) ? QEMU_MAP_NORESERVE : 0;
+
+    return qemu_ram_mmap(block->fd, block->mmap_length, block->mr->align,
+                         qemu_map_flags, block->file_offset);
+}
+
 static void *file_ram_alloc(RAMBlock *block,
                             int fd,
                             bool truncate,
                             off_t offset,
                             Error **errp)
 {
-    uint32_t qemu_map_flags;
     void *area;
 
     /* Remember the offset just in case we'll need to map the range again */
     block->file_offset = offset;
+    block->fd = fd;
     block->page_size = qemu_fd_getpagesize(fd);
     if (block->mr->align % block->page_size) {
         error_setg(errp, "alignment 0x%" PRIx64
@@ -1588,19 +1602,14 @@ static void *file_ram_alloc(RAMBlock *block,
         perror("ftruncate");
     }
 
-    qemu_map_flags = (block->flags & RAM_READONLY) ? QEMU_MAP_READONLY : 0;
-    qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
-    qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
-    qemu_map_flags |= (block->flags & RAM_NORESERVE) ? QEMU_MAP_NORESERVE : 0;
-    area = qemu_ram_mmap(fd, block->mmap_length, block->mr->align,
-                         qemu_map_flags, offset);
+    area = ramblock_file_map(block);
+
     if (area == MAP_FAILED) {
         error_setg_errno(errp, errno,
                          "unable to map backing store for guest RAM");
         return NULL;
     }
 
-    block->fd = fd;
     return area;
 }
 #endif
-- 
2.37.3




* [PATCH RFC 11/21] migration: Add hugetlb-doublemap cap
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (9 preceding siblings ...)
  2023-01-17 22:09 ` [PATCH RFC 10/21] ramblock: Add ramblock_file_map() Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-01-24 12:45   ` Dr. David Alan Gilbert
  2023-01-30  5:13   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 12/21] migration: Introduce page size for-migration-only Peter Xu
                   ` (9 subsequent siblings)
  20 siblings, 2 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

Add a new capability to allow mapping hugetlbfs-backed RAM in small page
sizes.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 48 ++++++++++++++++++++++++++++++++++++++++++-
 migration/migration.h |  1 +
 qapi/migration.json   |  7 ++++++-
 3 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 64f74534e2..b174f2af92 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -17,6 +17,7 @@
 #include "qemu/cutils.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "qemu/madvise.h"
 #include "migration/blocker.h"
 #include "exec.h"
 #include "fd.h"
@@ -62,6 +63,7 @@
 #include "sysemu/cpus.h"
 #include "yank_functions.h"
 #include "sysemu/qtest.h"
+#include "exec/ramblock.h"
 
 #define MAX_THROTTLE  (128 << 20)      /* Migration transfer speed throttling */
 
@@ -1363,12 +1365,47 @@ static bool migrate_caps_check(bool *cap_list,
                    "Zero copy only available for non-compressed non-TLS multifd migration");
         return false;
     }
+
+    if (cap_list[MIGRATION_CAPABILITY_HUGETLB_DOUBLEMAP]) {
+        RAMBlock *rb;
+
+        /* Check whether the platform/binary supports the new madvise()s */
+
+#if QEMU_MADV_SPLIT == QEMU_MADV_INVALID
+        error_setg(errp, "MADV_SPLIT is not supported by the QEMU binary");
+        return false;
+#endif
+
+#if QEMU_MADV_COLLAPSE == QEMU_MADV_INVALID
+        error_setg(errp, "MADV_COLLAPSE is not supported by the QEMU binary");
+        return false;
+#endif
+
+        /*
+         * Checking kernel support of MADV_SPLIT is not easy here; delay
+         * that until we have all the hugetlb mappings ready on the dest
+         * node.  Meanwhile do a best-effort check here, because doublemap
+         * requires the hugetlb ramblocks to be shared first.
+         */
+        RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
+            if (qemu_ram_is_hugetlb(rb) && !qemu_ram_is_shared(rb)) {
+                error_setg(errp, "RAMBlock '%s' needs to be shared for doublemap",
+                           rb->idstr);
+                return false;
+            }
+        }
+    }
 #else
     if (cap_list[MIGRATION_CAPABILITY_ZERO_COPY_SEND]) {
         error_setg(errp,
                    "Zero copy currently only available on Linux");
         return false;
     }
+
+    if (cap_list[MIGRATION_CAPABILITY_HUGETLB_DOUBLEMAP]) {
+        error_setg(errp, "Hugetlb doublemap is only supported on Linux");
+        return false;
+    }
 #endif
 
     if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_PREEMPT]) {
@@ -2792,6 +2829,13 @@ bool migrate_postcopy_preempt(void)
     return s->enabled_capabilities[MIGRATION_CAPABILITY_POSTCOPY_PREEMPT];
 }
 
+bool migrate_hugetlb_doublemap(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_HUGETLB_DOUBLEMAP];
+}
+
 /* migration thread support */
 /*
  * Something bad happened to the RP stream, mark an error
@@ -4472,7 +4516,9 @@ static Property migration_properties[] = {
     DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
     DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
     DEFINE_PROP_MIG_CAP("x-background-snapshot",
-            MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT),
+                        MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT),
+    DEFINE_PROP_MIG_CAP("hugetlb-doublemap",
+                        MIGRATION_CAPABILITY_HUGETLB_DOUBLEMAP),
 #ifdef CONFIG_LINUX
     DEFINE_PROP_MIG_CAP("x-zero-copy-send",
             MIGRATION_CAPABILITY_ZERO_COPY_SEND),
diff --git a/migration/migration.h b/migration/migration.h
index 5674a13876..bbd610a2d5 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -447,6 +447,7 @@ bool migrate_use_events(void);
 bool migrate_postcopy_blocktime(void);
 bool migrate_background_snapshot(void);
 bool migrate_postcopy_preempt(void);
+bool migrate_hugetlb_doublemap(void);
 
 /* Sending on the return path - generic and then for each message type */
 void migrate_send_rp_shut(MigrationIncomingState *mis,
diff --git a/qapi/migration.json b/qapi/migration.json
index 88ecf86ac8..b23516e75e 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -477,6 +477,11 @@
 #                    will be handled faster.  This is a performance feature and
 #                    should not affect the correctness of postcopy migration.
 #                    (since 7.1)
+# @hugetlb-doublemap: If enabled, the migration process will allow postcopy
+#                     to handle page faults based on small pages even if
+#                     hugetlb is used.  This will drastically reduce page
+#                     fault latencies when hugetlb is used as the guest RAM
+#                     backend. (since 7.3)
 #
 # Features:
 # @unstable: Members @x-colo and @x-ignore-shared are experimental.
@@ -492,7 +497,7 @@
            'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
            { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
            'validate-uuid', 'background-snapshot',
-           'zero-copy-send', 'postcopy-preempt'] }
+           'zero-copy-send', 'postcopy-preempt', 'hugetlb-doublemap'] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.37.3
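
For completeness: the new capability is toggled like any other migration
capability.  A standard QMP invocation would look like the following
(normal usage rather than anything specific to this series; presumably it
has to be enabled on both source and destination before postcopy starts,
like the other postcopy capabilities):

    { "execute": "migrate-set-capabilities",
      "arguments": {
        "capabilities": [
          { "capability": "hugetlb-doublemap", "state": true } ] } }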




* [PATCH RFC 12/21] migration: Introduce page size for-migration-only
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (10 preceding siblings ...)
  2023-01-17 22:09 ` [PATCH RFC 11/21] migration: Add hugetlb-doublemap cap Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-01-24 13:20   ` Dr. David Alan Gilbert
  2023-01-30  5:17   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 13/21] migration: Add migration_ram_pagesize_largest() Peter Xu
                   ` (8 subsequent siblings)
  20 siblings, 2 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

Migration may not always want to track memory in the ramblock's backing
page size: sometimes we want to track it in smaller chunks, e.g. when the
memory is doubly mapped as both huge and small pages.

In those cases we'll prefer to assume the memory is always mapped in small
pages (qemu_real_host_page_size) and handle it just like memory that was
only ever mapped small.  For example, with doublemap enabled a 1GB
hugetlb-backed ramblock will be requested and placed in host base pages
(e.g. 4KB) during postcopy rather than in 1GB units.

Do this to prepare for postcopy double-mapping of hugetlbfs.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c    |  6 ++++--
 migration/postcopy-ram.c | 16 +++++++++-------
 migration/ram.c          | 29 ++++++++++++++++++++++-------
 migration/ram.h          |  1 +
 4 files changed, 36 insertions(+), 16 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index b174f2af92..f6fe474fc3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -408,7 +408,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
 {
     uint8_t bufc[12 + 1 + 255]; /* start (8), len (4), rbname up to 256 */
     size_t msglen = 12; /* start + len */
-    size_t len = qemu_ram_pagesize(rb);
+    size_t len = migration_ram_pagesize(rb);
     enum mig_rp_message_type msg_type;
     const char *rbname;
     int rbname_len;
@@ -443,8 +443,10 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
 int migrate_send_rp_req_pages(MigrationIncomingState *mis,
                               RAMBlock *rb, ram_addr_t start, uint64_t haddr)
 {
-    void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb));
     bool received = false;
+    void *aligned;
+
+    aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, migration_ram_pagesize(rb));
 
     WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
         received = ramblock_recv_bitmap_test_byte_offset(rb, start);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 2c86bfc091..acae1dc6ae 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -694,7 +694,7 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
                          uint64_t client_addr,
                          RAMBlock *rb)
 {
-    size_t pagesize = qemu_ram_pagesize(rb);
+    size_t pagesize = migration_ram_pagesize(rb);
     struct uffdio_range range;
     int ret;
     trace_postcopy_wake_shared(client_addr, qemu_ram_get_idstr(rb));
@@ -712,7 +712,9 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
 static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
                                  ram_addr_t start, uint64_t haddr)
 {
-    void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb));
+    void *aligned;
+
+    aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, migration_ram_pagesize(rb));
 
     /*
      * Discarded pages (via RamDiscardManager) are never migrated. On unlikely
@@ -722,7 +724,7 @@ static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
      * Checking a single bit is sufficient to handle pagesize > TPS as either
      * all relevant bits are set or not.
      */
-    assert(QEMU_IS_ALIGNED(start, qemu_ram_pagesize(rb)));
+    assert(QEMU_IS_ALIGNED(start, migration_ram_pagesize(rb)));
     if (ramblock_page_is_discarded(rb, start)) {
         bool received = ramblock_recv_bitmap_test_byte_offset(rb, start);
 
@@ -740,7 +742,7 @@ static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
 int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
                                  uint64_t client_addr, uint64_t rb_offset)
 {
-    uint64_t aligned_rbo = ROUND_DOWN(rb_offset, qemu_ram_pagesize(rb));
+    uint64_t aligned_rbo = ROUND_DOWN(rb_offset, migration_ram_pagesize(rb));
     MigrationIncomingState *mis = migration_incoming_get_current();
 
     trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
@@ -1020,7 +1022,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
                 break;
             }
 
-            rb_offset = ROUND_DOWN(rb_offset, qemu_ram_pagesize(rb));
+            rb_offset = ROUND_DOWN(rb_offset, migration_ram_pagesize(rb));
             trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
                                                 qemu_ram_get_idstr(rb),
                                                 rb_offset,
@@ -1281,7 +1283,7 @@ int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset)
 int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
                         RAMBlock *rb)
 {
-    size_t pagesize = qemu_ram_pagesize(rb);
+    size_t pagesize = migration_ram_pagesize(rb);
 
     /* copy also acks to the kernel waking the stalled thread up
      * TODO: We can inhibit that ack and only do it if it was requested
@@ -1308,7 +1310,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
 int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
                              RAMBlock *rb)
 {
-    size_t pagesize = qemu_ram_pagesize(rb);
+    size_t pagesize = migration_ram_pagesize(rb);
     trace_postcopy_place_page_zero(host);
 
     /* Normal RAMBlocks can zero a page using UFFDIO_ZEROPAGE
diff --git a/migration/ram.c b/migration/ram.c
index 334309f1c6..945c6477fd 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -121,6 +121,20 @@ static struct {
     uint8_t *decoded_buf;
 } XBZRLE;
 
+/* Get the page size we should use for migration purposes. */
+size_t migration_ram_pagesize(RAMBlock *block)
+{
+    /*
+     * When hugetlb doublemap is enabled, we should always use the smallest
+     * page for migration.
+     */
+    if (migrate_hugetlb_doublemap()) {
+        return qemu_real_host_page_size();
+    }
+
+    return qemu_ram_pagesize(block);
+}
+
 static void XBZRLE_cache_lock(void)
 {
     if (migrate_use_xbzrle()) {
@@ -1049,7 +1063,7 @@ bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start)
         MemoryRegionSection section = {
             .mr = rb->mr,
             .offset_within_region = start,
-            .size = int128_make64(qemu_ram_pagesize(rb)),
+            .size = int128_make64(migration_ram_pagesize(rb)),
         };
 
         return !ram_discard_manager_is_populated(rdm, &section);
@@ -2152,7 +2166,7 @@ int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
      */
     if (postcopy_preempt_active()) {
         ram_addr_t page_start = start >> TARGET_PAGE_BITS;
-        size_t page_size = qemu_ram_pagesize(ramblock);
+        size_t page_size = migration_ram_pagesize(ramblock);
         PageSearchStatus *pss = &ram_state->pss[RAM_CHANNEL_POSTCOPY];
         int ret = 0;
 
@@ -2316,7 +2330,7 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
 static void pss_host_page_prepare(PageSearchStatus *pss)
 {
     /* How many guest pages are there in one host page? */
-    size_t guest_pfns = qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
+    size_t guest_pfns = migration_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
 
     pss->host_page_sending = true;
     pss->host_page_start = ROUND_DOWN(pss->page, guest_pfns);
@@ -2425,7 +2439,7 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
     bool page_dirty, preempt_active = postcopy_preempt_active();
     int tmppages, pages = 0;
     size_t pagesize_bits =
-        qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
+        migration_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
     unsigned long start_page = pss->page;
     int res;
 
@@ -3518,7 +3532,7 @@ static void *host_page_from_ram_block_offset(RAMBlock *block,
 {
     /* Note: Explicitly no check against offset_in_ramblock(). */
     return (void *)QEMU_ALIGN_DOWN((uintptr_t)(block->host + offset),
-                                   block->page_size);
+                                   migration_ram_pagesize(block));
 }
 
 static ram_addr_t host_page_offset_from_ram_block_offset(RAMBlock *block,
@@ -3970,7 +3984,8 @@ int ram_load_postcopy(QEMUFile *f, int channel)
                 break;
             }
             tmp_page->target_pages++;
-            matches_target_page_size = block->page_size == TARGET_PAGE_SIZE;
+            matches_target_page_size =
+                migration_ram_pagesize(block) == TARGET_PAGE_SIZE;
             /*
              * Postcopy requires that we place whole host pages atomically;
              * these may be huge pages for RAMBlocks that are backed by
@@ -4005,7 +4020,7 @@ int ram_load_postcopy(QEMUFile *f, int channel)
              * page
              */
             if (tmp_page->target_pages ==
-                (block->page_size / TARGET_PAGE_SIZE)) {
+                (migration_ram_pagesize(block) / TARGET_PAGE_SIZE)) {
                 place_needed = true;
             }
             place_source = tmp_page->tmp_huge_page;
diff --git a/migration/ram.h b/migration/ram.h
index 81cbb0947c..162b3e7cb8 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -68,6 +68,7 @@ bool ramblock_is_ignored(RAMBlock *block);
         if (!qemu_ram_is_migratable(block)) {} else
 
 int xbzrle_cache_resize(uint64_t new_size, Error **errp);
+size_t migration_ram_pagesize(RAMBlock *block);
 uint64_t ram_bytes_remaining(void);
 uint64_t ram_bytes_total(void);
 void mig_throttle_counter_reset(void);
-- 
2.37.3




* [PATCH RFC 13/21] migration: Add migration_ram_pagesize_largest()
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (11 preceding siblings ...)
  2023-01-17 22:09 ` [PATCH RFC 12/21] migration: Introduce page size for-migration-only Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-01-24 17:34   ` Dr. David Alan Gilbert
  2023-01-30  5:19   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 14/21] migration: Map hugetlbfs ramblocks twice, and pre-allocate Peter Xu
                   ` (7 subsequent siblings)
  20 siblings, 2 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

Add migration_ram_pagesize_largest() to replace the old
qemu_ram_pagesize_largest(), fetching the page sizes using
migration_ram_pagesize() so that the double-mapping effect is taken into
account during migration.

Also stop accounting the ignored ramblocks, as they won't be migrated.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/cpu-common.h |  1 -
 migration/migration.c     |  2 +-
 migration/ram.c           | 12 ++++++++++++
 migration/ram.h           |  1 +
 softmmu/physmem.c         | 13 -------------
 5 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 94452aa17f..4c394ccdfc 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -96,7 +96,6 @@ int qemu_ram_get_fd(RAMBlock *rb);
 
 size_t qemu_ram_pagesize(RAMBlock *block);
 bool qemu_ram_is_hugetlb(RAMBlock *rb);
-size_t qemu_ram_pagesize_largest(void);
 
 /**
  * cpu_address_space_init:
diff --git a/migration/migration.c b/migration/migration.c
index f6fe474fc3..7724e00c47 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -604,7 +604,7 @@ process_incoming_migration_co(void *opaque)
 
     assert(mis->from_src_file);
     mis->migration_incoming_co = qemu_coroutine_self();
-    mis->largest_page_size = qemu_ram_pagesize_largest();
+    mis->largest_page_size = migration_ram_pagesize_largest();
     postcopy_state_set(POSTCOPY_INCOMING_NONE);
     migrate_set_state(&mis->state, MIGRATION_STATUS_NONE,
                       MIGRATION_STATUS_ACTIVE);
diff --git a/migration/ram.c b/migration/ram.c
index 945c6477fd..2ebf414f5f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -135,6 +135,18 @@ size_t migration_ram_pagesize(RAMBlock *block)
     return qemu_ram_pagesize(block);
 }
 
+size_t migration_ram_pagesize_largest(void)
+{
+    RAMBlock *block;
+    size_t largest = 0;
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(block) {
+        largest = MAX(largest, migration_ram_pagesize(block));
+    }
+
+    return largest;
+}
+
 static void XBZRLE_cache_lock(void)
 {
     if (migrate_use_xbzrle()) {
diff --git a/migration/ram.h b/migration/ram.h
index 162b3e7cb8..cefe166841 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -69,6 +69,7 @@ bool ramblock_is_ignored(RAMBlock *block);
 
 int xbzrle_cache_resize(uint64_t new_size, Error **errp);
 size_t migration_ram_pagesize(RAMBlock *block);
+size_t migration_ram_pagesize_largest(void);
 uint64_t ram_bytes_remaining(void);
 uint64_t ram_bytes_total(void);
 void mig_throttle_counter_reset(void);
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index cdda7eaea5..536c204811 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1813,19 +1813,6 @@ bool qemu_ram_is_hugetlb(RAMBlock *rb)
     return rb->page_size > qemu_real_host_page_size();
 }
 
-/* Returns the largest size of page in use */
-size_t qemu_ram_pagesize_largest(void)
-{
-    RAMBlock *block;
-    size_t largest = 0;
-
-    RAMBLOCK_FOREACH(block) {
-        largest = MAX(largest, qemu_ram_pagesize(block));
-    }
-
-    return largest;
-}
-
 static int memory_try_enable_merging(void *addr, size_t len)
 {
     if (!machine_mem_merge(current_machine)) {
-- 
2.37.3




* [PATCH RFC 14/21] migration: Map hugetlbfs ramblocks twice, and pre-allocate
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (12 preceding siblings ...)
  2023-01-17 22:09 ` [PATCH RFC 13/21] migration: Add migration_ram_pagesize_largest() Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-01-25 14:25   ` Dr. David Alan Gilbert
  2023-01-30  5:24   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 15/21] migration: Teach qemu about minor faults and doublemap Peter Xu
                   ` (6 subsequent siblings)
  20 siblings, 2 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

Add a RAMBlock.host_mirror for all the hugetlbfs-backed guest memory.
It'll be used to map the same region a second time, and to service page
faults using UFFDIO_CONTINUE.

To make sure all accesses to these ranges generate minor page faults
rather than missing page faults, we need to pre-allocate the files so
that the page cache exists from the very beginning.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/ramblock.h |  7 +++++
 migration/ram.c         | 59 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 3f31ce1591..c76683c3c8 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -28,6 +28,13 @@ struct RAMBlock {
     struct rcu_head rcu;
     struct MemoryRegion *mr;
     uint8_t *host;
+    /*
+     * This is only used for hugetlbfs ramblocks where doublemap is
+     * enabled.  The pointer is managed by dest host migration code, and
+     * should be NULL when migration is finished.  On src host, it should
+     * always be NULL.
+     */
+    uint8_t *host_mirror;
     uint8_t *colo_cache; /* For colo, VM's ram cache */
     ram_addr_t offset;
     ram_addr_t used_length;
diff --git a/migration/ram.c b/migration/ram.c
index 2ebf414f5f..37d7b3553a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3879,6 +3879,57 @@ void colo_release_ram_cache(void)
     ram_state_cleanup(&ram_state);
 }
 
+static int migrate_hugetlb_doublemap_init(void)
+{
+    RAMBlock *rb;
+    void *addr;
+    int ret;
+
+    if (!migrate_hugetlb_doublemap()) {
+        return 0;
+    }
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
+        if (qemu_ram_is_hugetlb(rb)) {
+            /*
+             * First, remap the same ramblock into another range of virtual
+             * addresses, so that we can write to the pages without touching
+             * the page tables that directly map them for the guest.
+             */
+            addr = ramblock_file_map(rb);
+            if (addr == MAP_FAILED) {
+                ret = -errno;
+                error_report("%s: Duplicate mapping for hugetlb ramblock '%s' "
+                             "failed: %s", __func__, qemu_ram_get_idstr(rb),
+                             strerror(errno));
+                return ret;
+            }
+            rb->host_mirror = addr;
+
+            /*
+             * We need to make sure the range is pre-allocated with
+             * hugetlbfs pages beforehand, so that all page faults will
+             * always be trapped as MINOR faults rather than MISSING
+             * faults in userfaultfd.
+             */
+            ret = qemu_madvise(addr, rb->mmap_length, QEMU_MADV_POPULATE_WRITE);
+            if (ret) {
+                error_report("Failed to populate hugetlb ramblock '%s': "
+                             "%s", qemu_ram_get_idstr(rb), strerror(-ret));
+                return ret;
+            }
+        }
+    }
+
+    /*
+     * When we reach here, we've set up the mirror mapping for all the
+     * hugetlbfs pages.  Hence when a page fault happens, we'll be able to
+     * resolve it using UFFDIO_CONTINUE for hugetlbfs pages, while keeping
+     * UFFDIO_COPY for anonymous pages.
+     */
+    return 0;
+}
+
 /**
  * ram_load_setup: Setup RAM for migration incoming side
  *
@@ -3893,6 +3944,10 @@ static int ram_load_setup(QEMUFile *f, void *opaque)
         return -1;
     }
 
+    if (migrate_hugetlb_doublemap_init()) {
+        return -1;
+    }
+
     xbzrle_load_setup();
     ramblock_recv_map_init();
 
@@ -3913,6 +3968,10 @@ static int ram_load_cleanup(void *opaque)
     RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
         g_free(rb->receivedmap);
         rb->receivedmap = NULL;
+        if (rb->host_mirror) {
+            munmap(rb->host_mirror, rb->mmap_length);
+            rb->host_mirror = NULL;
+        }
     }
 
     return 0;
-- 
2.37.3
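
With the mirror mapping and the pre-population in place, the fault
resolution path that the following patches build can be sketched as below.
This is illustrative only: 'offset', 'from' and 'pagesize' are placeholders
for the faulting ramblock offset, the received page data and the small page
size, struct uffdio_continue comes from <linux/userfaultfd.h>, and the real
UFFDIO_CONTINUE plumbing arrives later in the series:

    struct uffdio_continue req;

    /* Fill the shared page cache via the mirror mapping; this alone does
     * not resolve the guest-side fault. */
    memcpy(rb->host_mirror + offset, from, pagesize);

    /* Now install the page into the guest mapping and wake the faulter */
    req.range.start = (uintptr_t)(rb->host + offset);
    req.range.len = pagesize;
    req.mode = 0;

    /* EEXIST just means the page was already installed, which is fine */
    if (ioctl(mis->userfault_fd, UFFDIO_CONTINUE, &req) && errno != EEXIST) {
        error_report("UFFDIO_CONTINUE failed: %s", strerror(errno));
    }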




* [PATCH RFC 15/21] migration: Teach qemu about minor faults and doublemap
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (13 preceding siblings ...)
  2023-01-17 22:09 ` [PATCH RFC 14/21] migration: Map hugetlbfs ramblocks twice, and pre-allocate Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-01-30  5:45   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 16/21] migration: Enable doublemap with MADV_SPLIT Peter Xu
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

When a ramblock is backed by hugetlbfs and the user enabled the double-map
feature, we trap the faults on these regions using minor mode.  Teach QEMU
about that.

Add a sanity check on the fault flags when receiving a uffd message: for
ranges trapped with minor faults, we should always see the MINOR flag set,
while with generic missing faults we should never see it.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 99 ++++++++++++++++++++++++++++++++--------
 migration/postcopy-ram.h |  1 +
 2 files changed, 81 insertions(+), 19 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index acae1dc6ae..86ff73c2c0 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -325,12 +325,25 @@ static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis)
 
     if (qemu_real_host_page_size() != ram_pagesize_summary()) {
         bool have_hp = false;
-        /* We've got a huge page */
+
+        /*
+         * If we're using doublemap, we need MINOR faults, otherwise we
+         * need MISSING faults (which is the default).
+         */
+        if (migrate_hugetlb_doublemap()) {
+#ifdef UFFD_FEATURE_MINOR_HUGETLBFS
+            have_hp = supported_features & UFFD_FEATURE_MINOR_HUGETLBFS;
+#endif
+        } else {
 #ifdef UFFD_FEATURE_MISSING_HUGETLBFS
-        have_hp = supported_features & UFFD_FEATURE_MISSING_HUGETLBFS;
+            have_hp = supported_features & UFFD_FEATURE_MISSING_HUGETLBFS;
 #endif
+        }
+
         if (!have_hp) {
-            error_report("Userfault on this host does not support huge pages");
+            error_report("Userfault on this host does not support huge pages "
+                         "with %s fault traps", migrate_hugetlb_doublemap() ?
+                         "MINOR" : "MISSING");
             return false;
         }
     }
@@ -669,22 +682,43 @@ static int ram_block_enable_notify(RAMBlock *rb, void *opaque)
 {
     MigrationIncomingState *mis = opaque;
     struct uffdio_register reg_struct;
+    bool minor_fault = postcopy_use_minor_fault(rb);
 
     reg_struct.range.start = (uintptr_t)qemu_ram_get_host_addr(rb);
     reg_struct.range.len = rb->postcopy_length;
+
+    /*
+     * For hugetlbfs with double-map enabled, we trap pages using minor
+     * mode, otherwise we use missing mode.  Note: we also register missing
+     * mode for doublemap, but we should never hit it.
+     */
     reg_struct.mode = UFFDIO_REGISTER_MODE_MISSING;
+    if (minor_fault) {
+        reg_struct.mode |= UFFDIO_REGISTER_MODE_MINOR;
+    }
 
     /* Now tell our userfault_fd that it's responsible for this area */
     if (ioctl(mis->userfault_fd, UFFDIO_REGISTER, &reg_struct)) {
         error_report("%s userfault register: %s", __func__, strerror(errno));
         return -1;
     }
-    if (!(reg_struct.ioctls & ((__u64)1 << _UFFDIO_COPY))) {
-        error_report("%s userfault: Region doesn't support COPY", __func__);
-        return -1;
-    }
-    if (reg_struct.ioctls & ((__u64)1 << _UFFDIO_ZEROPAGE)) {
-        qemu_ram_set_uf_zeroable(rb);
+
+    if (minor_fault) {
+        /* Using minor faults for this ramblock */
+        if (!(reg_struct.ioctls & ((__u64)1 << _UFFDIO_CONTINUE))) {
+            error_report("%s userfault: Region doesn't support CONTINUE",
+                         __func__);
+            return -1;
+        }
+    } else {
+        /* Using missing faults for this ramblock */
+        if (!(reg_struct.ioctls & ((__u64)1 << _UFFDIO_COPY))) {
+            error_report("%s userfault: Region doesn't support COPY", __func__);
+            return -1;
+        }
+        if (reg_struct.ioctls & ((__u64)1 << _UFFDIO_ZEROPAGE)) {
+            qemu_ram_set_uf_zeroable(rb);
+        }
     }
 
     return 0;
@@ -916,6 +950,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
 {
     MigrationIncomingState *mis = opaque;
     struct uffd_msg msg;
+    uint64_t address;
     int ret;
     size_t index;
     RAMBlock *rb = NULL;
@@ -945,6 +980,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
     }
 
     while (true) {
+        bool use_minor_fault, minor_flag;
         ram_addr_t rb_offset;
         int poll_result;
 
@@ -1022,22 +1058,37 @@ static void *postcopy_ram_fault_thread(void *opaque)
                 break;
             }
 
-            rb_offset = ROUND_DOWN(rb_offset, migration_ram_pagesize(rb));
-            trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
-                                                qemu_ram_get_idstr(rb),
-                                                rb_offset,
-                                                msg.arg.pagefault.feat.ptid);
-            mark_postcopy_blocktime_begin(
-                    (uintptr_t)(msg.arg.pagefault.address),
-                                msg.arg.pagefault.feat.ptid, rb);
+            address = ROUND_DOWN(msg.arg.pagefault.address,
+                                 migration_ram_pagesize(rb));
+            use_minor_fault = postcopy_use_minor_fault(rb);
+            minor_flag = !!(msg.arg.pagefault.flags &
+                            UFFD_PAGEFAULT_FLAG_MINOR);
 
+            /*
+             * Sanity check the message flags to make sure this is the
+             * one we expect to receive.  When using minor faults on this
+             * ramblock, the MINOR flag should _always_ be set; when not
+             * using minor faults, it should _never_ be set.
+             */
+            if (use_minor_fault ^ minor_flag) {
+                error_report("%s: Unexpected page fault flags (0x%"PRIx64") "
+                             "for address 0x%"PRIx64" (mode=%s)", __func__,
+                             (uint64_t)msg.arg.pagefault.flags,
+                             (uint64_t)msg.arg.pagefault.address,
+                             use_minor_fault ? "MINOR" : "MISSING");
+            }
+
+            trace_postcopy_ram_fault_thread_request(
+                address, qemu_ram_get_idstr(rb), rb_offset,
+                msg.arg.pagefault.feat.ptid);
+            mark_postcopy_blocktime_begin(
+                    (uintptr_t)(address), msg.arg.pagefault.feat.ptid, rb);
 retry:
             /*
              * Send the request to the source - we want to request one
              * of our host page sizes (which is >= TPS)
              */
-            ret = postcopy_request_page(mis, rb, rb_offset,
-                                        msg.arg.pagefault.address);
+            ret = postcopy_request_page(mis, rb, rb_offset, address);
             if (ret) {
                 /* May be network failure, try to wait for recovery */
                 postcopy_pause_fault_thread(mis);
@@ -1694,3 +1745,13 @@ void *postcopy_preempt_thread(void *opaque)
 
     return NULL;
 }
+
+/*
+ * Whether we should use MINOR faults to trap page faults.  True only
+ * when doublemap is enabled on hugetlbfs; the default is false, which
+ * means we keep using the legacy MISSING faults.
+ */
+bool postcopy_use_minor_fault(RAMBlock *rb)
+{
+    return migrate_hugetlb_doublemap() && qemu_ram_is_hugetlb(rb);
+}
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index b4867a32d5..32734d2340 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -193,5 +193,6 @@ enum PostcopyChannels {
 void postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file);
 void postcopy_preempt_setup(MigrationState *s);
 int postcopy_preempt_establish_channel(MigrationState *s);
+bool postcopy_use_minor_fault(RAMBlock *rb);
 
 #endif
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH RFC 16/21] migration: Enable doublemap with MADV_SPLIT
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (14 preceding siblings ...)
  2023-01-17 22:09 ` [PATCH RFC 15/21] migration: Teach qemu about minor faults and doublemap Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-02-01 18:59   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 17/21] migration: Rework ram discard logic for hugetlb double-map Peter Xu
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

MADV_SPLIT enables doublemap on hugetlb.  Issue it when doublemap=true is
specified for the migration.
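
The call boils down to something like the sketch below.  Note that
MADV_SPLIT comes from the (not yet merged) hugetlb HGM kernel series, so
the constant is an assumption rather than an upstream API, and addr/len
stand for the guest-mapped hugetlb range:

    /* Enable HGM (doublemap) on the hugetlb range; fails on kernels
     * without the HGM series, in which case migration must fail too */
    if (madvise(addr, len, MADV_SPLIT)) {
        /* e.g. -EINVAL on old kernels */
    }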

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 16 ++++++++++++++++
 migration/ram.c          | 18 ++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 86ff73c2c0..dbc7e54e4a 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -694,6 +694,22 @@ static int ram_block_enable_notify(RAMBlock *rb, void *opaque)
      */
     reg_struct.mode = UFFDIO_REGISTER_MODE_MISSING;
     if (minor_fault) {
+        /*
+         * MADV_SPLIT implicitly enables doublemap mode for hugetlb.  If
+         * that fails (e.g. on old kernels) we need to fail the migration.
+         *
+         * It's a bit late to fail here as we could have migrated lots of
+         * pages in precopy, but failing early would require us to allocate
+         * hugetlb pages secretly in QEMU, which is not friendly to admins
+         * and may affect the global hugetlb pool.  Considering the pool is
+         * normally limited, keep the failure late but tolerable.
+         */
+        if (qemu_madvise(qemu_ram_get_host_addr(rb), rb->postcopy_length,
+                         QEMU_MADV_SPLIT)) {
+            error_report("%s: madvise(MADV_SPLIT) failed (ret=%d) but "
+                         "required for doublemap", __func__, -errno);
+            return -1;
+        }
         reg_struct.mode |= UFFDIO_REGISTER_MODE_MINOR;
     }
 
diff --git a/migration/ram.c b/migration/ram.c
index 37d7b3553a..4d786f4b97 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3891,6 +3891,19 @@ static int migrate_hugetlb_doublemap_init(void)
 
     RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
         if (qemu_ram_is_hugetlb(rb)) {
+            /*
+             * MADV_SPLIT implicitly enables doublemap mode for hugetlb on
+             * the guest mapped ranges.  If that fails (e.g. on old
+             * kernels) we need to fail the migration.  Note, the
+             * host_mirror mapping below can be kept as hugely mapped.
+             */
+            if (qemu_madvise(qemu_ram_get_host_addr(rb), rb->mmap_length,
+                             QEMU_MADV_SPLIT)) {
+                error_report("%s: madvise(MADV_SPLIT) required for doublemap",
+                             __func__);
+                return -1;
+            }
+
             /*
              * Firstly, we remap the same ramblock into another range of
              * virtual address, so that we can write to the pages without
@@ -3898,6 +3911,11 @@ static int migrate_hugetlb_doublemap_init(void)
              */
             addr = ramblock_file_map(rb);
             if (addr == MAP_FAILED) {
+                /*
+                 * No need to undo MADV_SPLIT because this is the dest node
+                 * and we're going to bail out anyway.  Leave the cleanup to
+                 * mm exit.
+                 */
                 ret = -errno;
                error_report("%s: Duplicate mapping for hugetlb ramblock '%s' "
                              "failed: %s", __func__, qemu_ram_get_idstr(rb),
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH RFC 17/21] migration: Rework ram discard logic for hugetlb double-map
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (15 preceding siblings ...)
  2023-01-17 22:09 ` [PATCH RFC 16/21] migration: Enable doublemap with MADV_SPLIT Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-02-01 19:03   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 18/21] migration: Allow postcopy_register_shared_ufd() to fail Peter Xu
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

Hugetlb double map will make the ram discard logic different.

The whole idea is still the same: we need a bitmap sync between src/dst
before we switch to postcopy.

When discarding a range, we only zap the pgtables that used to map it for
the guest, leveraging the semantics of MADV_DONTNEED on Linux.  This
guarantees that when a guest access triggers a fault we'll receive a MINOR
fault message rather than a MISSING fault message.
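
A minimal sketch of the zap, assuming Linux semantics for shared,
file-backed hugetlb mappings (the names are illustrative only):

    /* Drop only the pgtables; the hugetlb page cache stays populated,
     * so the next guest access raises a MINOR fault, not a MISSING one */
    if (madvise(host_addr, length, MADV_DONTNEED)) {
        /* zap failed; report -errno to the caller */
    }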

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/cpu-common.h |  1 +
 migration/ram.c           | 16 +++++++++++++++-
 migration/trace-events    |  1 +
 softmmu/physmem.c         | 31 +++++++++++++++++++++++++++++++
 4 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 4c394ccdfc..09378c6ada 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -155,6 +155,7 @@ typedef int (RAMBlockIterFunc)(RAMBlock *rb, void *opaque);
 
 int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
 int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length);
+int ram_block_zap_range(RAMBlock *rb, uint64_t start, size_t length);
 
 #endif
 
diff --git a/migration/ram.c b/migration/ram.c
index 4d786f4b97..4da56d925c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2770,6 +2770,12 @@ static void postcopy_each_ram_send_discard(MigrationState *ms)
          * host-page size chunks, mark any partially dirty host-page size
          * chunks as all dirty.  In this case the host-page is the host-page
          * for the particular RAMBlock, i.e. it might be a huge page.
+         *
+         * Note: we need to do huge page truncation when double-map is
+         * enabled too, _only_ because we use MADV_DONTNEED to drop
+         * pgtables on dest QEMU, and it (at least so far...) does not
+         * support dropping part of the hugetlb pgtables.  If it can one
+         * day, we can skip this "chunk" operation as a further optimization.
          */
         postcopy_chunk_hostpages_pass(ms, block);
 
@@ -2913,7 +2919,15 @@ int ram_discard_range(const char *rbname, uint64_t start, size_t length)
                      length >> qemu_target_page_bits());
     }
 
-    return ram_block_discard_range(rb, start, length);
+    if (postcopy_use_minor_fault(rb)) {
+        /*
+         * We need to keep the page cache intact, so that future accesses
+         * to the old pages trigger MINOR faults.
+         */
+        return ram_block_zap_range(rb, start, length);
+    } else {
+        return ram_block_discard_range(rb, start, length);
+    }
 }
 
 /*
diff --git a/migration/trace-events b/migration/trace-events
index 57003edcbd..6b418a0e9e 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -92,6 +92,7 @@ migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64
 migration_bitmap_clear_dirty(char *str, uint64_t start, uint64_t size, unsigned long page) "rb %s start 0x%"PRIx64" size 0x%"PRIx64" page 0x%lx"
 migration_throttle(void) ""
 ram_discard_range(const char *rbname, uint64_t start, size_t len) "%s: start: %" PRIx64 " %zx"
+postcopy_discard_range(const char *rbname, uint64_t start, void *host, size_t len) "%s: start=%" PRIx64 " haddr=%p len=%zx"
 ram_load_loop(const char *rbname, uint64_t addr, int flags, void *host) "%s: addr: 0x%" PRIx64 " flags: 0x%x host: %p"
 ram_load_postcopy_loop(int channel, uint64_t addr, int flags) "chan=%d addr=0x%" PRIx64 " flags=0x%x"
 ram_postcopy_send_discard_bitmap(void) ""
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 536c204811..12c0bc9aee 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -3567,6 +3567,37 @@ int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
     return ret;
 }
 
+/*
+ * Zap page tables for the specified range.  Only applicable to file-backed
+ * memory.  We rely on Linux's MADV_DONTNEED behavior here to zap the
+ * pgtables; it may or may not work on other OSes, so until we know, fail
+ * them.
+ */
+int ram_block_zap_range(RAMBlock *rb, uint64_t start, size_t length)
+{
+#ifdef CONFIG_LINUX
+    uint8_t *host_addr = rb->host + start;
+    int ret;
+
+    if (rb->fd == -1) {
+        /* The zap magic only works with file-backed */
+        return -EINVAL;
+    }
+
+    ret = madvise(host_addr, length, MADV_DONTNEED);
+    if (ret) {
+        ret = -errno;
+        error_report("%s: Failed to zap ramblock start=0x%"PRIx64
+                     " addr=0x%"PRIx64" length=0x%zx", __func__,
+                     start, (uint64_t)host_addr, length);
+    }
+
+    return ret;
+#else
+    return -EINVAL;
+#endif
+}
+
 /*
  * Unmap pages of memory from start to start+length such that
  * they a) read as 0, b) Trigger whatever fault mechanism
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH RFC 18/21] migration: Allow postcopy_register_shared_ufd() to fail
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (16 preceding siblings ...)
  2023-01-17 22:09 ` [PATCH RFC 17/21] migration: Rework ram discard logic for hugetlb double-map Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-02-01 19:09   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 19/21] migration: Add postcopy_mark_received() Peter Xu
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

For now, fail double-map for vhost-user and any potential users that may
have a remote userfaultfd.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/virtio/vhost-user.c   | 9 ++++++++-
 migration/postcopy-ram.c | 9 +++++++--
 migration/postcopy-ram.h | 4 ++--
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index d9ce0501b2..00351bd67a 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1952,7 +1952,14 @@ static int vhost_user_postcopy_advise(struct vhost_dev *dev, Error **errp)
     u->postcopy_fd.handler = vhost_user_postcopy_fault_handler;
     u->postcopy_fd.waker = vhost_user_postcopy_waker;
     u->postcopy_fd.idstr = "vhost-user"; /* Need to find unique name */
-    postcopy_register_shared_ufd(&u->postcopy_fd);
+
+    ret = postcopy_register_shared_ufd(&u->postcopy_fd);
+    if (ret) {
+        error_setg(errp, "%s: Register of shared userfaultfd failed: %s",
+                   __func__, strerror(ret));
+        return ret;
+    }
+
     return 0;
 #else
     error_setg(errp, "Postcopy not supported on non-Linux systems");
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index dbc7e54e4a..0cfe5174a5 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1582,14 +1582,19 @@ PostcopyState postcopy_state_set(PostcopyState new_state)
 }
 
 /* Register a handler for external shared memory postcopy
- * called on the destination.
+ * called on the destination.  Returns 0 if success, <0 for err.
  */
-void postcopy_register_shared_ufd(struct PostCopyFD *pcfd)
+int postcopy_register_shared_ufd(struct PostCopyFD *pcfd)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
 
+    if (migrate_hugetlb_doublemap()) {
+        return -EINVAL;
+    }
+
     mis->postcopy_remote_fds = g_array_append_val(mis->postcopy_remote_fds,
                                                   *pcfd);
+    return 0;
 }
 
 /* Unregister a handler for external shared memory postcopy
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 32734d2340..94adad6fb8 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -161,9 +161,9 @@ struct PostCopyFD {
 };
 
 /* Register a userfaultfd owned by an external process for
- * shared memory.
+ * shared memory.  Returns 0 if succeeded, <0 if error.
  */
-void postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
+int postcopy_register_shared_ufd(struct PostCopyFD *pcfd);
 void postcopy_unregister_shared_ufd(struct PostCopyFD *pcfd);
 /* Call each of the shared 'waker's registered telling them of
  * availability of a block.
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH RFC 19/21] migration: Add postcopy_mark_received()
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (17 preceding siblings ...)
  2023-01-17 22:09 ` [PATCH RFC 18/21] migration: Allow postcopy_register_shared_ufd() to fail Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-02-01 19:10   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 20/21] migration: Handle page faults using UFFDIO_CONTINUE Peter Xu
  2023-01-17 22:09 ` [PATCH RFC 21/21] migration: Collapse huge pages again after postcopy finished Peter Xu
  20 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

We have some maintenance work to do after we UFFDIO_[ZERO]COPY a page,
e.g. updating the requested page list or measuring page latencies.

Move those steps into a separate function so that it can be easily reused
when we're going to support UFFDIO_CONTINUE.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 0cfe5174a5..8a2259581e 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1288,6 +1288,25 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
     return 0;
 }
 
+static void
+postcopy_mark_received(MigrationIncomingState *mis, RAMBlock *rb,
+                       void *host_addr, size_t npages)
+{
+    qemu_mutex_lock(&mis->page_request_mutex);
+    ramblock_recv_bitmap_set_range(rb, host_addr, npages);
+    /*
+     * If this page resolves a page fault for a previously recorded faulted
+     * address, take a special note to maintain the requested page list.
+     */
+    if (g_tree_lookup(mis->page_requested, host_addr)) {
+        g_tree_remove(mis->page_requested, host_addr);
+        mis->page_requested_count--;
+        trace_postcopy_page_req_del(host_addr, mis->page_requested_count);
+    }
+    qemu_mutex_unlock(&mis->page_request_mutex);
+    mark_postcopy_blocktime_end((uintptr_t)host_addr);
+}
+
 static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
                                void *from_addr, uint64_t pagesize, RAMBlock *rb)
 {
@@ -1309,20 +1328,8 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
         ret = ioctl(userfault_fd, UFFDIO_ZEROPAGE, &zero_struct);
     }
     if (!ret) {
-        qemu_mutex_lock(&mis->page_request_mutex);
-        ramblock_recv_bitmap_set_range(rb, host_addr,
-                                       pagesize / qemu_target_page_size());
-        /*
-         * If this page resolves a page fault for a previous recorded faulted
-         * address, take a special note to maintain the requested page list.
-         */
-        if (g_tree_lookup(mis->page_requested, host_addr)) {
-            g_tree_remove(mis->page_requested, host_addr);
-            mis->page_requested_count--;
-            trace_postcopy_page_req_del(host_addr, mis->page_requested_count);
-        }
-        qemu_mutex_unlock(&mis->page_request_mutex);
-        mark_postcopy_blocktime_end((uintptr_t)host_addr);
+        postcopy_mark_received(mis, rb, host_addr,
+                               pagesize / qemu_target_page_size());
     }
     return ret;
 }
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH RFC 20/21] migration: Handle page faults using UFFDIO_CONTINUE
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (18 preceding siblings ...)
  2023-01-17 22:09 ` [PATCH RFC 19/21] migration: Add postcopy_mark_received() Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-02-01 19:24   ` Juan Quintela
  2023-01-17 22:09 ` [PATCH RFC 21/21] migration: Collapse huge pages again after postcopy finished Peter Xu
  20 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

Teach QEMU to handle page faults using UFFDIO_CONTINUE for hugetlbfs
double-mapped ranges.

To copy the data, we write into the per-ramblock mirror buffer with a raw
memcpy(), then we wake the faulted threads using UFFDIO_CONTINUE, which
installs the pgtables.
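
In sketch form the resolution path looks like this, where mirror_addr,
host and uffd are placeholders and struct uffdio_continue is the Linux
userfaultfd minor-fault API:

    /* Fill the page cache through the mirror mapping; writing through
     * the registered guest mapping would trap our own thread */
    memcpy(mirror_addr, from, psize);

    struct uffdio_continue req = {
        .range = { .start = (__u64)host, .len = psize },
        .mode  = 0,
    };
    /* Install pgtables for the now-present page and wake the faulters */
    ioctl(uffd, UFFDIO_CONTINUE, &req);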

Move trace_postcopy_place_page(host) earlier so that it dumps something for
either UFFDIO_COPY or UFFDIO_CONTINUE.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 55 ++++++++++++++++++++++++++++++++++++++--
 migration/trace-events   |  4 +--
 2 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 8a2259581e..c4bd338e22 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1350,6 +1350,43 @@ int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset)
     return 0;
 }
 
+/* Returns the mirror_host addr for a specific host address in ramblock */
+static inline void *migration_ram_get_mirror_addr(RAMBlock *rb, void *host)
+{
+    return (void *)((__u64)rb->host_mirror + ((__u64)host - (__u64)rb->host));
+}
+
+static int
+qemu_uffd_continue(MigrationIncomingState *mis, RAMBlock *rb, void *host,
+                   void *from)
+{
+    void *mirror_addr = migration_ram_get_mirror_addr(rb, host);
+    /* Doublemap uses small host page size */
+    uint64_t psize = qemu_real_host_page_size();
+    struct uffdio_continue req;
+
+    /*
+     * Copy data first into the mirror host pointer; we can't directly copy
+     * data into rb->host because otherwise our thread will get trapped too.
+     */
+    memcpy(mirror_addr, from, psize);
+
+    /* Kick the faulted threads to fetch data from the page cache */
+    req.range.start = (__u64)host;
+    req.range.len = psize;
+    req.mode = 0;
+    if (ioctl(mis->userfault_fd, UFFDIO_CONTINUE, &req)) {
+        error_report("%s: UFFDIO_CONTINUE failed for start=%p"
+                     " len=0x%"PRIx64": %s", __func__, host,
+                     psize, strerror(-req.mapped));
+        return req.mapped;
+    }
+
+    postcopy_mark_received(mis, rb, host, psize / qemu_target_page_size());
+
+    return 0;
+}
+
 /*
  * Place a host page (from) at (host) atomically
  * returns 0 on success
@@ -1359,6 +1396,18 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
 {
     size_t pagesize = migration_ram_pagesize(rb);
 
+    trace_postcopy_place_page(rb->idstr, (uint8_t *)host - rb->host, host);
+
+    if (postcopy_use_minor_fault(rb)) {
+        /*
+         * If minor faults are used, we use UFFDIO_CONTINUE instead.
+         *
+         * TODO: support shared uffds (e.g. vhost-user). Currently we're
+         * skipping them.
+         */
+        return qemu_uffd_continue(mis, rb, host, from);
+    }
+
     /* copy also acks to the kernel waking the stalled thread up
      * TODO: We can inhibit that ack and only do it if it was requested
      * which would be slightly cheaper, but we'd have to be careful
@@ -1372,7 +1421,6 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
         return -e;
     }
 
-    trace_postcopy_place_page(host);
     return postcopy_notify_shared_wake(rb,
                                        qemu_ram_block_host_offset(rb, host));
 }
@@ -1385,10 +1433,13 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
                              RAMBlock *rb)
 {
     size_t pagesize = migration_ram_pagesize(rb);
-    trace_postcopy_place_page_zero(host);
+    trace_postcopy_place_page_zero(rb->idstr, (uint8_t *)host - rb->host, host);
 
     /* Normal RAMBlocks can zero a page using UFFDIO_ZEROPAGE
      * but it's not available for everything (e.g. hugetlbpages)
+     *
+     * NOTE: when hugetlb double-map is enabled, this ramblock will never
+     * have RAM_UF_ZEROPAGE set, so it always goes to postcopy_place_page().
      */
     if (qemu_ram_is_uf_zeroable(rb)) {
         if (qemu_ufd_copy_ioctl(mis, host, NULL, pagesize, rb)) {
diff --git a/migration/trace-events b/migration/trace-events
index 6b418a0e9e..7baf235d22 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -265,8 +265,8 @@ postcopy_discard_send_range(const char *ramblock, unsigned long start, unsigned
 postcopy_cleanup_range(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=0x%zx length=0x%zx"
 postcopy_init_range(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=0x%zx length=0x%zx"
 postcopy_nhp_range(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=0x%zx length=0x%zx"
-postcopy_place_page(void *host_addr) "host=%p"
-postcopy_place_page_zero(void *host_addr) "host=%p"
+postcopy_place_page(const char *id, size_t offset, void *host_addr) "id=%s offset=0x%zx host=%p"
+postcopy_place_page_zero(const char *id, size_t offset, void *host_addr) "id=%s offset=0x%zx host=%p"
 postcopy_ram_enable_notify(void) ""
 mark_postcopy_blocktime_begin(uint64_t addr, void *dd, uint32_t time, int cpu, int received) "addr: 0x%" PRIx64 ", dd: %p, time: %u, cpu: %d, already_received: %d"
 mark_postcopy_blocktime_end(uint64_t addr, void *dd, uint32_t time, int affected_cpu) "addr: 0x%" PRIx64 ", dd: %p, time: %u, affected_cpu: %d"
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH RFC 21/21] migration: Collapse huge pages again after postcopy finished
  2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
                   ` (19 preceding siblings ...)
  2023-01-17 22:09 ` [PATCH RFC 20/21] migration: Handle page faults using UFFDIO_CONTINUE Peter Xu
@ 2023-01-17 22:09 ` Peter Xu
  2023-02-01 19:49   ` Juan Quintela
  20 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2023-01-17 22:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	peterx, Dr . David Alan Gilbert

When hugetlb-doublemap is enabled, pages are migrated at small page size
during postcopy.  When the migration finishes, the pgtables need to be
rebuilt explicitly for these ranges so that huge pages are mapped again.
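
The rebuild is essentially a chunked madvise() loop like the sketch
below; hugetlb support for MADV_COLLAPSE is assumed from the HGM series
(upstream MADV_COLLAPSE only covers THP):

    /* Collapse in 1G chunks to bound the latency of each call */
    while (size) {
        unsigned long chunk = MIN(size, 1UL << 30);
        madvise((void *)addr, chunk, MADV_COLLAPSE);  /* best effort */
        addr += chunk;
        size -= chunk;
    }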

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/ram.c        | 31 +++++++++++++++++++++++++++++++
 migration/trace-events |  1 +
 2 files changed, 32 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index 4da56d925c..178739f8c3 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3986,6 +3986,31 @@ static int ram_load_setup(QEMUFile *f, void *opaque)
     return 0;
 }
 
+#define  MADV_COLLAPSE_CHUNK_SIZE  (1UL << 30) /* 1G */
+
+static void ramblock_rebuild_huge_mappings(RAMBlock *rb)
+{
+    unsigned long addr, size;
+
+    assert(qemu_ram_is_hugetlb(rb));
+
+    addr = (unsigned long)qemu_ram_get_host_addr(rb);
+    size = rb->mmap_length;
+
+    while (size) {
+        unsigned long chunk = MIN(size, MADV_COLLAPSE_CHUNK_SIZE);
+
+        if (qemu_madvise((void *)addr, chunk, QEMU_MADV_COLLAPSE)) {
+            error_report("%s: madvise(MADV_COLLAPSE) failed "
+                         "for ramblock '%s'", __func__, rb->idstr);
+        } else {
+            trace_ramblock_rebuild_huge_mappings(rb->idstr, addr, chunk);
+        }
+        addr += chunk;
+        size -= chunk;
+    }
+}
+
 static int ram_load_cleanup(void *opaque)
 {
     RAMBlock *rb;
@@ -4001,6 +4026,12 @@ static int ram_load_cleanup(void *opaque)
         g_free(rb->receivedmap);
         rb->receivedmap = NULL;
         if (rb->host_mirror) {
+            /*
+             * If host_mirror is set, this is a hugetlb ramblock with
+             * double mappings enabled.  Rebuild the huge page tables
+             * here.
+             */
+            ramblock_rebuild_huge_mappings(rb);
             munmap(rb->host_mirror, rb->mmap_length);
             rb->host_mirror = NULL;
         }
diff --git a/migration/trace-events b/migration/trace-events
index 7baf235d22..6b52bb691c 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -119,6 +119,7 @@ postcopy_preempt_hit(char *str, uint64_t offset) "ramblock %s offset 0x%"PRIx64
 postcopy_preempt_send_host_page(char *str, uint64_t offset) "ramblock %s offset 0x%"PRIx64
 postcopy_preempt_switch_channel(int channel) "%d"
 postcopy_preempt_reset_channel(void) ""
+ramblock_rebuild_huge_mappings(char *str, unsigned long start, unsigned long size) "ramblock %s start 0x%lx size 0x%lx"
 
 # multifd.c
 multifd_new_send_channel_async(uint8_t id) "channel %u"
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 02/21] util: Include osdep.h first in util/mmap-alloc.c
  2023-01-17 22:08 ` [PATCH RFC 02/21] util: Include osdep.h first in util/mmap-alloc.c Peter Xu
@ 2023-01-18 12:00   ` Dr. David Alan Gilbert
  2023-01-25  0:19   ` Philippe Mathieu-Daudé
  2023-01-30  4:57   ` Juan Quintela
  2 siblings, 0 replies; 69+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-18 12:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Without it, we never have CONFIG_LINUX defined even if on linux, so
> linux/mman.h is never really included.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  util/mmap-alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index 5ed7d29183..040599b0e3 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -9,6 +9,7 @@
>   * This work is licensed under the terms of the GNU GPL, version 2 or
>   * later.  See the COPYING file in the top-level directory.
>   */
> +#include "qemu/osdep.h"
>  
>  #ifdef CONFIG_LINUX
>  #include <linux/mman.h>
> @@ -17,7 +18,6 @@
>  #define MAP_SHARED_VALIDATE   0x0
>  #endif /* CONFIG_LINUX */
>  
> -#include "qemu/osdep.h"
>  #include "qemu/mmap-alloc.h"
>  #include "qemu/host-utils.h"
>  #include "qemu/cutils.h"
> -- 
> 2.37.3
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 03/21] physmem: Add qemu_ram_is_hugetlb()
  2023-01-17 22:08 ` [PATCH RFC 03/21] physmem: Add qemu_ram_is_hugetlb() Peter Xu
@ 2023-01-18 12:02   ` Dr. David Alan Gilbert
  2023-01-30  5:00   ` Juan Quintela
  1 sibling, 0 replies; 69+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-18 12:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Returns true for a hugetlbfs mapping, false otherwise.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Yeh OK, it feels a little delicate perhaps if anything else
ever allows large mappings.


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  include/exec/cpu-common.h | 1 +
>  softmmu/physmem.c         | 5 +++++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 6feaa40ca7..94452aa17f 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -95,6 +95,7 @@ void qemu_ram_unset_migratable(RAMBlock *rb);
>  int qemu_ram_get_fd(RAMBlock *rb);
>  
>  size_t qemu_ram_pagesize(RAMBlock *block);
> +bool qemu_ram_is_hugetlb(RAMBlock *rb);
>  size_t qemu_ram_pagesize_largest(void);
>  
>  /**
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index edec095c7a..a4fb129d8f 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -1798,6 +1798,11 @@ size_t qemu_ram_pagesize(RAMBlock *rb)
>      return rb->page_size;
>  }
>  
> +bool qemu_ram_is_hugetlb(RAMBlock *rb)
> +{
> +    return rb->page_size > qemu_real_host_page_size();
> +}
> +
>  /* Returns the largest size of page in use */
>  size_t qemu_ram_pagesize_largest(void)
>  {
> -- 
> 2.37.3
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 04/21] madvise: Include linux/mman.h under linux-headers/
  2023-01-17 22:08 ` [PATCH RFC 04/21] madvise: Include linux/mman.h under linux-headers/ Peter Xu
@ 2023-01-18 12:08   ` Dr. David Alan Gilbert
  2023-01-30  5:01   ` Juan Quintela
  1 sibling, 0 replies; 69+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-18 12:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> This will allow qemu/madvise.h to always include linux/mman.h under the
> linux-headers/.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  include/qemu/madvise.h | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/include/qemu/madvise.h b/include/qemu/madvise.h
> index e155f59a0d..b6fa49553f 100644
> --- a/include/qemu/madvise.h
> +++ b/include/qemu/madvise.h
> @@ -8,6 +8,10 @@
>  #ifndef QEMU_MADVISE_H
>  #define QEMU_MADVISE_H
>  
> +#ifdef CONFIG_LINUX
> +#include "linux/mman.h"
> +#endif
> +
>  #define QEMU_MADV_INVALID -1
>  
>  #if defined(CONFIG_MADVISE)
> -- 
> 2.37.3
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 06/21] madvise: Add QEMU_MADV_COLLAPSE
  2023-01-17 22:08 ` [PATCH RFC 06/21] madvise: Add QEMU_MADV_COLLAPSE Peter Xu
@ 2023-01-18 18:51   ` Dr. David Alan Gilbert
  2023-01-18 20:21     ` Peter Xu
  2023-01-30  5:02   ` Juan Quintela
  1 sibling, 1 reply; 69+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-18 18:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> MADV_COLLAPSE is a new madvise() on Linux.  Define it.

I'd probably have merged this with the MADV_SPLIT one since they go
together; but also, it would be good in the commit message
for Qemu to include either the definition or a pointer to the kernel
definition of them.

Dave

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/qemu/madvise.h | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/include/qemu/madvise.h b/include/qemu/madvise.h
> index 3dddd25065..794e5fb0a7 100644
> --- a/include/qemu/madvise.h
> +++ b/include/qemu/madvise.h
> @@ -68,6 +68,11 @@
>  #else
>  #define QEMU_MADV_SPLIT QEMU_MADV_INVALID
>  #endif
> +#ifdef MADV_COLLAPSE
> +#define QEMU_MADV_COLLAPSE MADV_COLLAPSE
> +#else
> +#define QEMU_MADV_COLLAPSE QEMU_MADV_INVALID
> +#endif
>  
>  #elif defined(CONFIG_POSIX_MADVISE)
>  
> @@ -83,6 +88,7 @@
>  #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED
>  #define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
>  #define QEMU_MADV_SPLIT QEMU_MADV_INVALID
> +#define QEMU_MADV_COLLAPSE QEMU_MADV_INVALID
>  
>  #else /* no-op */
>  
> @@ -98,6 +104,7 @@
>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>  #define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
>  #define QEMU_MADV_SPLIT QEMU_MADV_INVALID
> +#define QEMU_MADV_COLLAPSE QEMU_MADV_INVALID
>  
>  #endif
>  
> -- 
> 2.37.3
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 06/21] madvise: Add QEMU_MADV_COLLAPSE
  2023-01-18 18:51   ` Dr. David Alan Gilbert
@ 2023-01-18 20:21     ` Peter Xu
  0 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-18 20:21 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

On Wed, Jan 18, 2023 at 06:51:07PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > MADV_COLLAPSE is a new madvise() on Linux.  Define it.
> 
> I'd probably have merged this with the MADV_SPLIT one since they go
> together; but also, it would be good in the commit message
> for Qemu to include either the definition or a pointer to the kernel
> definiton of them.

Will do.

I don't have good links for them yet because neither of them is in the
upstream man-pages project.  Even the THP version of the MADV_COLLAPSE man
page was only added to the man-pages repository in Nov 2022, so most of the
websites that host man pages won't even have MADV_COLLAPSE.

For now I'll add some more paragraphs trying to explain everything, and
I'll also link to madvise(2) where both of them will be discussed in the
future.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 08/21] ramblock: Cache the length to do file mmap() on ramblocks
  2023-01-17 22:09 ` [PATCH RFC 08/21] ramblock: Cache the length to do file mmap() on ramblocks Peter Xu
@ 2023-01-23 18:51   ` Dr. David Alan Gilbert
  2023-01-24 20:28     ` Peter Xu
  2023-01-30  5:05   ` Juan Quintela
  1 sibling, 1 reply; 69+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-23 18:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> We do proper page size alignment for file backed mmap()s for ramblocks.
> Even if it's as simple as that, cache the value because it'll be used in
> multiple places.
> 
> While at it, drop size for file_ram_alloc() and just use max_length because
> that's always true for file-backed ramblocks.

Having a length previously called 'memory' was a bit odd!

> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  include/exec/ramblock.h |  2 ++
>  softmmu/physmem.c       | 14 +++++++-------
>  2 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> index 76cd0812c8..3f31ce1591 100644
> --- a/include/exec/ramblock.h
> +++ b/include/exec/ramblock.h
> @@ -32,6 +32,8 @@ struct RAMBlock {
>      ram_addr_t offset;
>      ram_addr_t used_length;
>      ram_addr_t max_length;
> +    /* Only used for file-backed ramblocks */
> +    ram_addr_t mmap_length;
>      void (*resized)(const char*, uint64_t length, void *host);
>      uint32_t flags;
>      /* Protected by iothread lock.  */
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index aa1a7466e5..b5be02f1cb 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -1533,7 +1533,6 @@ static int file_ram_open(const char *path,
>  }
>  
>  static void *file_ram_alloc(RAMBlock *block,
> -                            ram_addr_t memory,
>                              int fd,
>                              bool readonly,
>                              bool truncate,
> @@ -1563,14 +1562,14 @@ static void *file_ram_alloc(RAMBlock *block,
>      }
>  #endif
>  
> -    if (memory < block->page_size) {
> +    if (block->max_length < block->page_size) {
>          error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
>                     "or larger than page size 0x%zx",
> -                   memory, block->page_size);
> +                   block->max_length, block->page_size);
>          return NULL;
>      }
>  
> -    memory = ROUND_UP(memory, block->page_size);
> +    block->mmap_length = ROUND_UP(block->max_length, block->page_size);
>  
>      /*
>       * ftruncate is not supported by hugetlbfs in older
> @@ -1586,7 +1585,7 @@ static void *file_ram_alloc(RAMBlock *block,
>       * those labels. Therefore, extending the non-empty backend file
>       * is disabled as well.
>       */
> -    if (truncate && ftruncate(fd, memory)) {
> +    if (truncate && ftruncate(fd, block->mmap_length)) {
>          perror("ftruncate");
>      }
>  
> @@ -1594,7 +1593,8 @@ static void *file_ram_alloc(RAMBlock *block,
>      qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
>      qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
>      qemu_map_flags |= (block->flags & RAM_NORESERVE) ? QEMU_MAP_NORESERVE : 0;
> -    area = qemu_ram_mmap(fd, memory, block->mr->align, qemu_map_flags, offset);
> +    area = qemu_ram_mmap(fd, block->mmap_length, block->mr->align,
> +                         qemu_map_flags, offset);
>      if (area == MAP_FAILED) {
>          error_setg_errno(errp, errno,
>                           "unable to map backing store for guest RAM");
> @@ -2100,7 +2100,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>      new_block->used_length = size;
>      new_block->max_length = size;
>      new_block->flags = ram_flags;
> -    new_block->host = file_ram_alloc(new_block, size, fd, readonly,
> +    new_block->host = file_ram_alloc(new_block, fd, readonly,
>                                       !file_size, offset, errp);
>      if (!new_block->host) {
>          g_free(new_block);
> -- 
> 2.37.3
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 09/21] ramblock: Add RAM_READONLY
  2023-01-17 22:09 ` [PATCH RFC 09/21] ramblock: Add RAM_READONLY Peter Xu
@ 2023-01-23 19:42   ` Dr. David Alan Gilbert
  2023-01-30  5:06   ` Juan Quintela
  1 sibling, 0 replies; 69+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-23 19:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> This allows RAM_READONLY to be set in ram_flags to show that
> this ramblock can only be read, not written.
> 
> We used to pass in readonly boolean along the way for allocating the
> ramblock, now let it live together with the rest of the ramblock flags.
> 
> The main purpose of this patch is not cleanup though, it's for caching
> mapping information of each ramblock so when we want to mmap() it again for
> whatever reason we can have all the information on hand.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  backends/hostmem-file.c |  3 ++-
>  include/exec/memory.h   |  4 ++--
>  include/exec/ram_addr.h |  5 ++---
>  softmmu/memory.c        |  8 +++-----
>  softmmu/physmem.c       | 16 +++++++---------
>  5 files changed, 16 insertions(+), 20 deletions(-)
> 
> diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
> index 25141283c4..1daf00d2da 100644
> --- a/backends/hostmem-file.c
> +++ b/backends/hostmem-file.c
> @@ -56,9 +56,10 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>      ram_flags = backend->share ? RAM_SHARED : 0;
>      ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
>      ram_flags |= fb->is_pmem ? RAM_PMEM : 0;
> +    ram_flags |= fb->readonly ? RAM_READONLY : 0;
>      memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name,
>                                       backend->size, fb->align, ram_flags,
> -                                     fb->mem_path, fb->readonly, errp);
> +                                     fb->mem_path, errp);
>      g_free(name);
>  #endif
>  }
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index c37ffdbcd1..006ba77ede 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -188,6 +188,8 @@ typedef struct IOMMUTLBEvent {
>  /* RAM is a persistent kind memory */
>  #define RAM_PMEM (1 << 5)
>  
> +/* RAM is read-only */
> +#define RAM_READONLY (1 << 6)
>  
>  /*
>   * UFFDIO_WRITEPROTECT is used on this RAMBlock to
> @@ -1292,7 +1294,6 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
>   * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
>   *             RAM_NORESERVE,
>   * @path: the path in which to allocate the RAM.
> - * @readonly: true to open @path for reading, false for read/write.
>   * @errp: pointer to Error*, to store an error if it happens.
>   *
>   * Note that this function does not do anything to cause the data in the
> @@ -1305,7 +1306,6 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
>                                        uint64_t align,
>                                        uint32_t ram_flags,
>                                        const char *path,
> -                                      bool readonly,
>                                        Error **errp);
>  
>  /**
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index f4fb6a2111..0bf9cfc659 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -110,7 +110,6 @@ long qemu_maxrampagesize(void);
>   *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
>   *              RAM_NORESERVE.
>   *  @mem_path or @fd: specify the backing file or device
> - *  @readonly: true to open @path for reading, false for read/write.
>   *  @errp: pointer to Error*, to store an error if it happens
>   *
>   * Return:
> @@ -119,10 +118,10 @@ long qemu_maxrampagesize(void);
>   */
>  RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
>                                     uint32_t ram_flags, const char *mem_path,
> -                                   bool readonly, Error **errp);
> +                                   Error **errp);
>  RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>                                   uint32_t ram_flags, int fd, off_t offset,
> -                                 bool readonly, Error **errp);
> +                                 Error **errp);
>  
>  RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
>                                    MemoryRegion *mr, Error **errp);
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index e05332d07f..2137028773 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -1601,18 +1601,16 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
>                                        uint64_t align,
>                                        uint32_t ram_flags,
>                                        const char *path,
> -                                      bool readonly,
>                                        Error **errp)
>  {
>      Error *err = NULL;
>      memory_region_init(mr, owner, name, size);
>      mr->ram = true;
> -    mr->readonly = readonly;
> +    mr->readonly = ram_flags & RAM_READONLY;
>      mr->terminates = true;
>      mr->destructor = memory_region_destructor_ram;
>      mr->align = align;
> -    mr->ram_block = qemu_ram_alloc_from_file(size, mr, ram_flags, path,
> -                                             readonly, &err);
> +    mr->ram_block = qemu_ram_alloc_from_file(size, mr, ram_flags, path, &err);
>      if (err) {
>          mr->size = int128_zero();
>          object_unparent(OBJECT(mr));
> @@ -1635,7 +1633,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
>      mr->terminates = true;
>      mr->destructor = memory_region_destructor_ram;
>      mr->ram_block = qemu_ram_alloc_from_fd(size, mr, ram_flags, fd, offset,
> -                                           false, &err);
> +                                           &err);
>      if (err) {
>          mr->size = int128_zero();
>          object_unparent(OBJECT(mr));
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index b5be02f1cb..6096eac286 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -1534,7 +1534,6 @@ static int file_ram_open(const char *path,
>  
>  static void *file_ram_alloc(RAMBlock *block,
>                              int fd,
> -                            bool readonly,
>                              bool truncate,
>                              off_t offset,
>                              Error **errp)
> @@ -1589,7 +1588,7 @@ static void *file_ram_alloc(RAMBlock *block,
>          perror("ftruncate");
>      }
>  
> -    qemu_map_flags = readonly ? QEMU_MAP_READONLY : 0;
> +    qemu_map_flags = (block->flags & RAM_READONLY) ? QEMU_MAP_READONLY : 0;
>      qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
>      qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
>      qemu_map_flags |= (block->flags & RAM_NORESERVE) ? QEMU_MAP_NORESERVE : 0;
> @@ -2057,7 +2056,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>  #ifdef CONFIG_POSIX
>  RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>                                   uint32_t ram_flags, int fd, off_t offset,
> -                                 bool readonly, Error **errp)
> +                                 Error **errp)
>  {
>      RAMBlock *new_block;
>      Error *local_err = NULL;
> @@ -2065,7 +2064,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>  
>      /* Just support these ram flags by now. */
>      assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE |
> -                          RAM_PROTECTED)) == 0);
> +                          RAM_PROTECTED | RAM_READONLY)) == 0);
>  
>      if (xen_enabled()) {
>          error_setg(errp, "-mem-path not supported with Xen");
> @@ -2100,8 +2099,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>      new_block->used_length = size;
>      new_block->max_length = size;
>      new_block->flags = ram_flags;
> -    new_block->host = file_ram_alloc(new_block, fd, readonly,
> -                                     !file_size, offset, errp);
> +    new_block->host = file_ram_alloc(new_block, fd, !file_size, offset, errp);
>      if (!new_block->host) {
>          g_free(new_block);
>          return NULL;
> @@ -2120,11 +2118,11 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>  
>  RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
>                                     uint32_t ram_flags, const char *mem_path,
> -                                   bool readonly, Error **errp)
> +                                   Error **errp)
>  {
>      int fd;
> -    bool created;
>      RAMBlock *block;
> +    bool created, readonly = ram_flags & RAM_READONLY;
>  
>      fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created,
>                         errp);
> @@ -2132,7 +2130,7 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
>          return NULL;
>      }
>  
> -    block = qemu_ram_alloc_from_fd(size, mr, ram_flags, fd, 0, readonly, errp);
> +    block = qemu_ram_alloc_from_fd(size, mr, ram_flags, fd, 0, errp);
>      if (!block) {
>          if (created) {
>              unlink(mem_path);
> -- 
> 2.37.3
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 10/21] ramblock: Add ramblock_file_map()
  2023-01-17 22:09 ` [PATCH RFC 10/21] ramblock: Add ramblock_file_map() Peter Xu
@ 2023-01-24 10:06   ` Dr. David Alan Gilbert
  2023-01-24 20:47     ` Peter Xu
  2023-01-30  5:09   ` Juan Quintela
  1 sibling, 1 reply; 69+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-24 10:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Add a helper to do mmap() for a ramblock based on the cached information.
> 
> A trivial thing to mention is we need to move ramblock->fd setup to be
> earlier, before the ramblock_file_map() call, because it'll need to
> reference the fd being mapped.  However that should not be a problem at
> all, mainly because the fd won't be freed if successful, and if it failed
> the fd will be freed (or to be explicit, close()ed) by the caller.
> 
> Export it - prepare to be used outside this file.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/exec/ram_addr.h |  1 +
>  softmmu/physmem.c       | 25 +++++++++++++++++--------
>  2 files changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index 0bf9cfc659..56db25009a 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -98,6 +98,7 @@ bool ramblock_is_pmem(RAMBlock *rb);
>  
>  long qemu_minrampagesize(void);
>  long qemu_maxrampagesize(void);
> +void *ramblock_file_map(RAMBlock *block);
>  
>  /**
>   * qemu_ram_alloc_from_file,
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 6096eac286..cdda7eaea5 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -1532,17 +1532,31 @@ static int file_ram_open(const char *path,
>      return fd;
>  }
>  
> +/* Do the mmap() for a ramblock based on information already setup */
> +void *ramblock_file_map(RAMBlock *block)
> +{
> +    uint32_t qemu_map_flags;
> +
> +    qemu_map_flags = (block->flags & RAM_READONLY) ? QEMU_MAP_READONLY : 0;
> +    qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
> +    qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
> +    qemu_map_flags |= (block->flags & RAM_NORESERVE) ? QEMU_MAP_NORESERVE : 0;
> +
> +    return qemu_ram_mmap(block->fd, block->mmap_length, block->mr->align,
> +                         qemu_map_flags, block->file_offset);
> +}
> +
>  static void *file_ram_alloc(RAMBlock *block,
>                              int fd,
>                              bool truncate,
>                              off_t offset,
>                              Error **errp)
>  {
> -    uint32_t qemu_map_flags;
>      void *area;
>  
>      /* Remember the offset just in case we'll need to map the range again */

Note that this comment is now wrong; you need to always set that for the
map call.

Other than that,


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

>      block->file_offset = offset;
> +    block->fd = fd;
>      block->page_size = qemu_fd_getpagesize(fd);
>      if (block->mr->align % block->page_size) {
>          error_setg(errp, "alignment 0x%" PRIx64
> @@ -1588,19 +1602,14 @@ static void *file_ram_alloc(RAMBlock *block,
>          perror("ftruncate");
>      }
>  
> -    qemu_map_flags = (block->flags & RAM_READONLY) ? QEMU_MAP_READONLY : 0;
> -    qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
> -    qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
> -    qemu_map_flags |= (block->flags & RAM_NORESERVE) ? QEMU_MAP_NORESERVE : 0;
> -    area = qemu_ram_mmap(fd, block->mmap_length, block->mr->align,
> -                         qemu_map_flags, offset);
> +    area = ramblock_file_map(block);
> +
>      if (area == MAP_FAILED) {
>          error_setg_errno(errp, errno,
>                           "unable to map backing store for guest RAM");
>          return NULL;
>      }
>  
> -    block->fd = fd;
>      return area;
>  }
>  #endif
> -- 
> 2.37.3
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 11/21] migration: Add hugetlb-doublemap cap
  2023-01-17 22:09 ` [PATCH RFC 11/21] migration: Add hugetlb-doublemap cap Peter Xu
@ 2023-01-24 12:45   ` Dr. David Alan Gilbert
  2023-01-24 21:15     ` Peter Xu
  2023-01-30  5:13   ` Juan Quintela
  1 sibling, 1 reply; 69+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-24 12:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Add a new cap to allow mapping hugetlbfs backed RAMs in small page sizes.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

although, I'm curious if the protocol actually changes - or whether
a doublemap-enabled destination would work with an unmodified source?
I guess potentially you can get away without the dirty clearing of the
partially sent hugepages that the source normally does.

Dave

> ---
>  migration/migration.c | 48 ++++++++++++++++++++++++++++++++++++++++++-
>  migration/migration.h |  1 +
>  qapi/migration.json   |  7 ++++++-
>  3 files changed, 54 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 64f74534e2..b174f2af92 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -17,6 +17,7 @@
>  #include "qemu/cutils.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
> +#include "qemu/madvise.h"
>  #include "migration/blocker.h"
>  #include "exec.h"
>  #include "fd.h"
> @@ -62,6 +63,7 @@
>  #include "sysemu/cpus.h"
>  #include "yank_functions.h"
>  #include "sysemu/qtest.h"
> +#include "exec/ramblock.h"
>  
>  #define MAX_THROTTLE  (128 << 20)      /* Migration transfer speed throttling */
>  
> @@ -1363,12 +1365,47 @@ static bool migrate_caps_check(bool *cap_list,
>                     "Zero copy only available for non-compressed non-TLS multifd migration");
>          return false;
>      }
> +
> +    if (cap_list[MIGRATION_CAPABILITY_HUGETLB_DOUBLEMAP]) {
> +        RAMBlock *rb;
> +
> +        /* Check whether the platform/binary supports the new madvise()s */
> +
> +#if QEMU_MADV_SPLIT == QEMU_MADV_INVALID
> +        error_setg(errp, "MADV_SPLIT is not supported by the QEMU binary");
> +        return false;
> +#endif
> +
> +#if QEMU_MADV_COLLAPSE == QEMU_MADV_INVALID
> +        error_setg(errp, "MADV_COLLAPSE is not supported by the QEMU binary");
> +        return false;
> +#endif
> +
> +        /*
> +         * Check against kernel support of MADV_SPLIT is not easy, delay
> +         * that until we have all the hugetlb mappings ready on dest node,
> +         * meanwhile do the best effort check here because doublemap
> +         * requires the hugetlb ramblocks to be shared first.
> +         */
> +        RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
> +            if (qemu_ram_is_hugetlb(rb) && !qemu_ram_is_shared(rb)) {
> +                error_setg(errp, "RAMBlock '%s' needs to be shared for doublemap",
> +                           rb->idstr);
> +                return false;
> +            }
> +        }
> +    }
>  #else
>      if (cap_list[MIGRATION_CAPABILITY_ZERO_COPY_SEND]) {
>          error_setg(errp,
>                     "Zero copy currently only available on Linux");
>          return false;
>      }
> +
> +    if (cap_list[MIGRATION_CAPABILITY_HUGETLB_DOUBLEMAP]) {
> +        error_setg(errp, "Hugetlb doublemap is only supported on Linux");
> +        return false;
> +    }
>  #endif
>  
>      if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_PREEMPT]) {
> @@ -2792,6 +2829,13 @@ bool migrate_postcopy_preempt(void)
>      return s->enabled_capabilities[MIGRATION_CAPABILITY_POSTCOPY_PREEMPT];
>  }
>  
> +bool migrate_hugetlb_doublemap(void)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_HUGETLB_DOUBLEMAP];
> +}
> +
>  /* migration thread support */
>  /*
>   * Something bad happened to the RP stream, mark an error
> @@ -4472,7 +4516,9 @@ static Property migration_properties[] = {
>      DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
>      DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
>      DEFINE_PROP_MIG_CAP("x-background-snapshot",
> -            MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT),
> +                        MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT),
> +    DEFINE_PROP_MIG_CAP("hugetlb-doublemap",
> +                        MIGRATION_CAPABILITY_HUGETLB_DOUBLEMAP),
>  #ifdef CONFIG_LINUX
>      DEFINE_PROP_MIG_CAP("x-zero-copy-send",
>              MIGRATION_CAPABILITY_ZERO_COPY_SEND),
> diff --git a/migration/migration.h b/migration/migration.h
> index 5674a13876..bbd610a2d5 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -447,6 +447,7 @@ bool migrate_use_events(void);
>  bool migrate_postcopy_blocktime(void);
>  bool migrate_background_snapshot(void);
>  bool migrate_postcopy_preempt(void);
> +bool migrate_hugetlb_doublemap(void);
>  
>  /* Sending on the return path - generic and then for each message type */
>  void migrate_send_rp_shut(MigrationIncomingState *mis,
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 88ecf86ac8..b23516e75e 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -477,6 +477,11 @@
>  #                    will be handled faster.  This is a performance feature and
>  #                    should not affect the correctness of postcopy migration.
>  #                    (since 7.1)
> +# @hugetlb-doublemap: If enabled, the migration process will allow postcopy
> +#                     to handle page faults based on small pages even if
> +#                     hugetlb is used.  This will drastically reduce page
> +#                     fault latencies when hugetlb is used as the guest RAM
> +#                     backends. (since 7.3)
>  #
>  # Features:
>  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
> @@ -492,7 +497,7 @@
>             'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
>             { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
>             'validate-uuid', 'background-snapshot',
> -           'zero-copy-send', 'postcopy-preempt'] }
> +           'zero-copy-send', 'postcopy-preempt', 'hugetlb-doublemap'] }
>  
>  ##
>  # @MigrationCapabilityStatus:
> -- 
> 2.37.3
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 12/21] migration: Introduce page size for-migration-only
  2023-01-17 22:09 ` [PATCH RFC 12/21] migration: Introduce page size for-migration-only Peter Xu
@ 2023-01-24 13:20   ` Dr. David Alan Gilbert
  2023-01-24 21:36     ` Peter Xu
  2023-01-30  5:17   ` Juan Quintela
  1 sibling, 1 reply; 69+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-24 13:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Migration may not want to recognize memory chunks only in the page size of
> the host; sometimes we may want to recognize the memory in smaller chunks,
> e.g. if they're doubly mapped as both huge and small.
> 
> In those cases we'll prefer to assume the memory page size is always mapped
> small (qemu_real_host_page_size) and we'll do things just like when the
> pages were only mapped small.
> 
> Let's do this to prepare for postcopy double-mapping of hugetlbfs.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.c    |  6 ++++--
>  migration/postcopy-ram.c | 16 +++++++++-------
>  migration/ram.c          | 29 ++++++++++++++++++++++-------
>  migration/ram.h          |  1 +
>  4 files changed, 36 insertions(+), 16 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index b174f2af92..f6fe474fc3 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -408,7 +408,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>  {
>      uint8_t bufc[12 + 1 + 255]; /* start (8), len (4), rbname up to 256 */
>      size_t msglen = 12; /* start + len */
> -    size_t len = qemu_ram_pagesize(rb);
> +    size_t len = migration_ram_pagesize(rb);
>      enum mig_rp_message_type msg_type;
>      const char *rbname;
>      int rbname_len;
> @@ -443,8 +443,10 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>  int migrate_send_rp_req_pages(MigrationIncomingState *mis,
>                                RAMBlock *rb, ram_addr_t start, uint64_t haddr)
>  {
> -    void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb));
>      bool received = false;
> +    void *aligned;
> +
> +    aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, migration_ram_pagesize(rb));
>  
>      WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
>          received = ramblock_recv_bitmap_test_byte_offset(rb, start);
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 2c86bfc091..acae1dc6ae 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -694,7 +694,7 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
>                           uint64_t client_addr,
>                           RAMBlock *rb)
>  {
> -    size_t pagesize = qemu_ram_pagesize(rb);
> +    size_t pagesize = migration_ram_pagesize(rb);
>      struct uffdio_range range;
>      int ret;
>      trace_postcopy_wake_shared(client_addr, qemu_ram_get_idstr(rb));
> @@ -712,7 +712,9 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
>  static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
>                                   ram_addr_t start, uint64_t haddr)
>  {
> -    void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb));
> +    void *aligned;
> +
> +    aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, migration_ram_pagesize(rb));
>  
>      /*
>       * Discarded pages (via RamDiscardManager) are never migrated. On unlikely
> @@ -722,7 +724,7 @@ static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
>       * Checking a single bit is sufficient to handle pagesize > TPS as either
>       * all relevant bits are set or not.
>       */
> -    assert(QEMU_IS_ALIGNED(start, qemu_ram_pagesize(rb)));
> +    assert(QEMU_IS_ALIGNED(start, migration_ram_pagesize(rb)));
>      if (ramblock_page_is_discarded(rb, start)) {
>          bool received = ramblock_recv_bitmap_test_byte_offset(rb, start);
>  
> @@ -740,7 +742,7 @@ static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
>  int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
>                                   uint64_t client_addr, uint64_t rb_offset)
>  {
> -    uint64_t aligned_rbo = ROUND_DOWN(rb_offset, qemu_ram_pagesize(rb));
> +    uint64_t aligned_rbo = ROUND_DOWN(rb_offset, migration_ram_pagesize(rb));
>      MigrationIncomingState *mis = migration_incoming_get_current();
>  
>      trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
> @@ -1020,7 +1022,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
>                  break;
>              }
>  
> -            rb_offset = ROUND_DOWN(rb_offset, qemu_ram_pagesize(rb));
> +            rb_offset = ROUND_DOWN(rb_offset, migration_ram_pagesize(rb));
>              trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
>                                                  qemu_ram_get_idstr(rb),
>                                                  rb_offset,
> @@ -1281,7 +1283,7 @@ int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset)
>  int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
>                          RAMBlock *rb)
>  {
> -    size_t pagesize = qemu_ram_pagesize(rb);
> +    size_t pagesize = migration_ram_pagesize(rb);
>  
>      /* copy also acks to the kernel waking the stalled thread up
>       * TODO: We can inhibit that ack and only do it if it was requested
> @@ -1308,7 +1310,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
>  int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
>                               RAMBlock *rb)
>  {
> -    size_t pagesize = qemu_ram_pagesize(rb);
> +    size_t pagesize = migration_ram_pagesize(rb);
>      trace_postcopy_place_page_zero(host);
>  
>      /* Normal RAMBlocks can zero a page using UFFDIO_ZEROPAGE
> diff --git a/migration/ram.c b/migration/ram.c
> index 334309f1c6..945c6477fd 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -121,6 +121,20 @@ static struct {
>      uint8_t *decoded_buf;
>  } XBZRLE;
>  
> +/* Get the page size we should use for migration purpose. */
> +size_t migration_ram_pagesize(RAMBlock *block)
> +{
> +    /*
> +     * When hugetlb doublemap is enabled, we should always use the smallest
> +     * page for migration.
> +     */
> +    if (migrate_hugetlb_doublemap()) {
> +        return qemu_real_host_page_size();
> +    }
> +
> +    return qemu_ram_pagesize(block);
> +}
> +
>  static void XBZRLE_cache_lock(void)
>  {
>      if (migrate_use_xbzrle()) {
> @@ -1049,7 +1063,7 @@ bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start)
>          MemoryRegionSection section = {
>              .mr = rb->mr,
>              .offset_within_region = start,
> -            .size = int128_make64(qemu_ram_pagesize(rb)),
> +            .size = int128_make64(migration_ram_pagesize(rb)),
>          };
>  
>          return !ram_discard_manager_is_populated(rdm, &section);
> @@ -2152,7 +2166,7 @@ int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
>       */
>      if (postcopy_preempt_active()) {
>          ram_addr_t page_start = start >> TARGET_PAGE_BITS;
> -        size_t page_size = qemu_ram_pagesize(ramblock);
> +        size_t page_size = migration_ram_pagesize(ramblock);
>          PageSearchStatus *pss = &ram_state->pss[RAM_CHANNEL_POSTCOPY];
>          int ret = 0;
>  
> @@ -2316,7 +2330,7 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
>  static void pss_host_page_prepare(PageSearchStatus *pss)
>  {
>      /* How many guest pages are there in one host page? */
> -    size_t guest_pfns = qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
> +    size_t guest_pfns = migration_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
>  
>      pss->host_page_sending = true;
>      pss->host_page_start = ROUND_DOWN(pss->page, guest_pfns);
> @@ -2425,7 +2439,7 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
>      bool page_dirty, preempt_active = postcopy_preempt_active();
>      int tmppages, pages = 0;
>      size_t pagesize_bits =
> -        qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
> +        migration_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
>      unsigned long start_page = pss->page;
>      int res;
>  
> @@ -3518,7 +3532,7 @@ static void *host_page_from_ram_block_offset(RAMBlock *block,
>  {
>      /* Note: Explicitly no check against offset_in_ramblock(). */
>      return (void *)QEMU_ALIGN_DOWN((uintptr_t)(block->host + offset),
> -                                   block->page_size);
> +                                   migration_ram_pagesize(block));
>  }
>  
>  static ram_addr_t host_page_offset_from_ram_block_offset(RAMBlock *block,
> @@ -3970,7 +3984,8 @@ int ram_load_postcopy(QEMUFile *f, int channel)
>                  break;
>              }
>              tmp_page->target_pages++;
> -            matches_target_page_size = block->page_size == TARGET_PAGE_SIZE;
> +            matches_target_page_size =
> +                migration_ram_pagesize(block) == TARGET_PAGE_SIZE;
>              /*
>               * Postcopy requires that we place whole host pages atomically;
>               * these may be huge pages for RAMBlocks that are backed by

Hmm do you really want this change?

Dave

> @@ -4005,7 +4020,7 @@ int ram_load_postcopy(QEMUFile *f, int channel)
>               * page
>               */
>              if (tmp_page->target_pages ==
> -                (block->page_size / TARGET_PAGE_SIZE)) {
> +                (migration_ram_pagesize(block) / TARGET_PAGE_SIZE)) {
>                  place_needed = true;
>              }
>              place_source = tmp_page->tmp_huge_page;
> diff --git a/migration/ram.h b/migration/ram.h
> index 81cbb0947c..162b3e7cb8 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -68,6 +68,7 @@ bool ramblock_is_ignored(RAMBlock *block);
>          if (!qemu_ram_is_migratable(block)) {} else
>  
>  int xbzrle_cache_resize(uint64_t new_size, Error **errp);
> +size_t migration_ram_pagesize(RAMBlock *block);
>  uint64_t ram_bytes_remaining(void);
>  uint64_t ram_bytes_total(void);
>  void mig_throttle_counter_reset(void);
> -- 
> 2.37.3
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 13/21] migration: Add migration_ram_pagesize_largest()
  2023-01-17 22:09 ` [PATCH RFC 13/21] migration: Add migration_ram_pagesize_largest() Peter Xu
@ 2023-01-24 17:34   ` Dr. David Alan Gilbert
  2023-01-30  5:19   ` Juan Quintela
  1 sibling, 0 replies; 69+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-24 17:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Let it replace the old qemu_ram_pagesize_largest() just to fetch the page
> sizes using migration_ram_pagesize(), because it'll start to consider the
> double-mapping effect in migrations.
> 
> Also don't account for the ignored ramblocks as they won't be migrated.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  include/exec/cpu-common.h |  1 -
>  migration/migration.c     |  2 +-
>  migration/ram.c           | 12 ++++++++++++
>  migration/ram.h           |  1 +
>  softmmu/physmem.c         | 13 -------------
>  5 files changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 94452aa17f..4c394ccdfc 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -96,7 +96,6 @@ int qemu_ram_get_fd(RAMBlock *rb);
>  
>  size_t qemu_ram_pagesize(RAMBlock *block);
>  bool qemu_ram_is_hugetlb(RAMBlock *rb);
> -size_t qemu_ram_pagesize_largest(void);
>  
>  /**
>   * cpu_address_space_init:
> diff --git a/migration/migration.c b/migration/migration.c
> index f6fe474fc3..7724e00c47 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -604,7 +604,7 @@ process_incoming_migration_co(void *opaque)
>  
>      assert(mis->from_src_file);
>      mis->migration_incoming_co = qemu_coroutine_self();
> -    mis->largest_page_size = qemu_ram_pagesize_largest();
> +    mis->largest_page_size = migration_ram_pagesize_largest();
>      postcopy_state_set(POSTCOPY_INCOMING_NONE);
>      migrate_set_state(&mis->state, MIGRATION_STATUS_NONE,
>                        MIGRATION_STATUS_ACTIVE);
> diff --git a/migration/ram.c b/migration/ram.c
> index 945c6477fd..2ebf414f5f 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -135,6 +135,18 @@ size_t migration_ram_pagesize(RAMBlock *block)
>      return qemu_ram_pagesize(block);
>  }
>  
> +size_t migration_ram_pagesize_largest(void)
> +{
> +    RAMBlock *block;
> +    size_t largest = 0;
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(block) {
> +        largest = MAX(largest, migration_ram_pagesize(block));
> +    }
> +
> +    return largest;
> +}
> +
>  static void XBZRLE_cache_lock(void)
>  {
>      if (migrate_use_xbzrle()) {
> diff --git a/migration/ram.h b/migration/ram.h
> index 162b3e7cb8..cefe166841 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -69,6 +69,7 @@ bool ramblock_is_ignored(RAMBlock *block);
>  
>  int xbzrle_cache_resize(uint64_t new_size, Error **errp);
>  size_t migration_ram_pagesize(RAMBlock *block);
> +size_t migration_ram_pagesize_largest(void);
>  uint64_t ram_bytes_remaining(void);
>  uint64_t ram_bytes_total(void);
>  void mig_throttle_counter_reset(void);
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index cdda7eaea5..536c204811 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -1813,19 +1813,6 @@ bool qemu_ram_is_hugetlb(RAMBlock *rb)
>      return rb->page_size > qemu_real_host_page_size();
>  }
>  
> -/* Returns the largest size of page in use */
> -size_t qemu_ram_pagesize_largest(void)
> -{
> -    RAMBlock *block;
> -    size_t largest = 0;
> -
> -    RAMBLOCK_FOREACH(block) {
> -        largest = MAX(largest, qemu_ram_pagesize(block));
> -    }
> -
> -    return largest;
> -}
> -
>  static int memory_try_enable_merging(void *addr, size_t len)
>  {
>      if (!machine_mem_merge(current_machine)) {
> -- 
> 2.37.3
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 08/21] ramblock: Cache the length to do file mmap() on ramblocks
  2023-01-23 18:51   ` Dr. David Alan Gilbert
@ 2023-01-24 20:28     ` Peter Xu
  0 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-24 20:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

On Mon, Jan 23, 2023 at 06:51:51PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > We do proper page size alignment for file backed mmap()s for ramblocks.
> > Even if it's as simple as that, cache the value because it'll be used in
> > multiple places.
> > 
> > While at it, drop size for file_ram_alloc() and just use max_length because
> > that's always true for file-backed ramblocks.
> 
> Having a length previously called 'memory' was a bit odd!

:-D

> 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Thanks,

-- 
Peter Xu




* Re: [PATCH RFC 10/21] ramblock: Add ramblock_file_map()
  2023-01-24 10:06   ` Dr. David Alan Gilbert
@ 2023-01-24 20:47     ` Peter Xu
  2023-01-25  9:24       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2023-01-24 20:47 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

On Tue, Jan 24, 2023 at 10:06:48AM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Add a helper to do mmap() for a ramblock based on the cached information.
> > 
> > A trivial thing to mention is that we need to move the ramblock->fd setup
> > earlier, before the ramblock_file_map() call, because it'll need to
> > reference the fd being mapped.  That should not be a problem at all,
> > mainly because the fd won't be freed if successful, and if it failed the
> > fd will be freed (or, to be explicit, close()ed) by the caller.
> > 
> > Export it - prepare for it to be used outside this file.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/exec/ram_addr.h |  1 +
> >  softmmu/physmem.c       | 25 +++++++++++++++++--------
> >  2 files changed, 18 insertions(+), 8 deletions(-)
> > 
> > diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> > index 0bf9cfc659..56db25009a 100644
> > --- a/include/exec/ram_addr.h
> > +++ b/include/exec/ram_addr.h
> > @@ -98,6 +98,7 @@ bool ramblock_is_pmem(RAMBlock *rb);
> >  
> >  long qemu_minrampagesize(void);
> >  long qemu_maxrampagesize(void);
> > +void *ramblock_file_map(RAMBlock *block);
> >  
> >  /**
> >   * qemu_ram_alloc_from_file,
> > diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> > index 6096eac286..cdda7eaea5 100644
> > --- a/softmmu/physmem.c
> > +++ b/softmmu/physmem.c
> > @@ -1532,17 +1532,31 @@ static int file_ram_open(const char *path,
> >      return fd;
> >  }
> >  
> > +/* Do the mmap() for a ramblock based on information already setup */
> > +void *ramblock_file_map(RAMBlock *block)
> > +{
> > +    uint32_t qemu_map_flags;
> > +
> > +    qemu_map_flags = (block->flags & RAM_READONLY) ? QEMU_MAP_READONLY : 0;
> > +    qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
> > +    qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
> > +    qemu_map_flags |= (block->flags & RAM_NORESERVE) ? QEMU_MAP_NORESERVE : 0;
> > +
> > +    return qemu_ram_mmap(block->fd, block->mmap_length, block->mr->align,
> > +                         qemu_map_flags, block->file_offset);
> > +}
> > +
> >  static void *file_ram_alloc(RAMBlock *block,
> >                              int fd,
> >                              bool truncate,
> >                              off_t offset,
> >                              Error **errp)
> >  {
> > -    uint32_t qemu_map_flags;
> >      void *area;
> >  
> >      /* Remember the offset just in case we'll need to map the range again */
> 
> Note that this comment is now wrong; the offset always needs to be set
> for the map call.

This line is added in patch 7.  After this patch, a ramblock should always
be mapped with ramblock_file_map(), so it stays true?

> 
> Other than that,
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Thanks,

-- 
Peter Xu




* Re: [PATCH RFC 11/21] migration: Add hugetlb-doublemap cap
  2023-01-24 12:45   ` Dr. David Alan Gilbert
@ 2023-01-24 21:15     ` Peter Xu
  0 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-24 21:15 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

On Tue, Jan 24, 2023 at 12:45:38PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Add a new cap to allow mapping hugetlbfs-backed RAM in small page sizes.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Thanks.

> 
> although, I'm curious if the protocol actually changes

Yes it does.

It differs not in the form of a changed header or any frame definitions,
but in how huge pages are sent.  The old binary can only send a huge page
by sending all the small pages sequentially from index 0 to index
N_HUGE-1, while the new binary can send the small pages of a huge page out
of order.  For the latter it's the same as when huge pages are not used.

> or whether a doublemap-enabled destination would work with an unmodified
> source?

This is an interesting question.

I would expect old -> new to work as usual, because the page frames are
not modified, so the dest node will just see pages being migrated in a
sequential manner.  Page request latency will be the same as with the old
binary though, because even if the dest host can handle small pages it
won't be able to get the pages it wants quickly - the src host decides
which page to send.

Meanwhile I think new -> old shouldn't work, as described above, because
the dest host would see weird things happening, e.g., a huge page sent
starting not from index 0 but from index X (0<X<N_HUGE-1).  It should
quickly bail out assuming something is wrong.
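
For reference, the dest side already bails out with a check roughly like
the below in ram_load_postcopy() (a simplified sketch; the real error
message carries more context):

    void *host_page = host_page_from_ram_block_offset(block, addr);

    if (tmp_page->target_pages == 1) {
        /* First small page of a host huge page: remember where it lands */
        tmp_page->host_addr = host_page;
    } else if (tmp_page->host_addr != host_page) {
        /* Small pages of one huge page arrived for different host pages */
        error_report("Non-same host page detected: %p vs %p",
                     tmp_page->host_addr, host_page);
        ret = -EINVAL;
        break;
    }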

> I guess potentially you can get away without the dirty clearing
> of the partially sent hugepages that the source normally does.

Good point.  It's actually more relevant to the later patch reworking the
discard logic.  I kept it as-is for two main reasons:

 1) It is still not 100% confirmed how MADV_DONTNEED should behave on
    HGM-enabled memory ranges where huge pages used to be mapped.  It's
    part of the upstream discussion on the kernel patchset.  I think it's
    settling, but in the current series I kept the logic in a form that
    works in all cases.

 2) Not dirtying the partially sent huge pages always reduces the number
    of small pages being migrated, but it can also change the content of
    the discard messages due to the frame format of
    MIG_CMD_POSTCOPY_RAM_DISCARD: we can get a lot more scattered ranges,
    so a lot more messaging can be needed.  With the existing logic, since
    we always re-dirty the partially sent pages, the ranges are more
    likely to be efficient.
    
        * CMD_POSTCOPY_RAM_DISCARD consist of:
        *      byte   version (0)
        *      byte   Length of name field (not including 0)
        *  n x byte   RAM block name
        *      byte   0 terminator (just for safety)
        *  n x        Byte ranges within the named RAMBlock
        *      be64   Start of the range
        *      be64   Length
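
Rough arithmetic on the layout above: each range costs 16 bytes (two be64
fields).  A partially sent 1G page that we re-dirty as a whole stays a
single range, while tracking it in 4K granularity leaves up to 131072
scattered ranges in the worst case (every other one of its 262144 small
pages dirty), i.e. ~2MB of discard messages for that one 1G page.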

I think 1) may not hold as the kernel series evolves, so it may not be true
anymore.  2) may still be true, but I think it's worth some testing
(especially on 1G pages) to see how it could interfere with the discard
procedure.  Maybe it won't be as bad as I think.  Even if it is, we can
evaluate the tradeoff between "slower discard sync" and "fewer pages to
send".  E.g., we can consider changing the frame layout by bumping
postcopy_ram_discard_version.

I'll take a note of this one and provide more updates in the next version.

-- 
Peter Xu




* Re: [PATCH RFC 12/21] migration: Introduce page size for-migration-only
  2023-01-24 13:20   ` Dr. David Alan Gilbert
@ 2023-01-24 21:36     ` Peter Xu
  2023-01-24 22:03       ` Peter Xu
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2023-01-24 21:36 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

On Tue, Jan 24, 2023 at 01:20:37PM +0000, Dr. David Alan Gilbert wrote:
> > @@ -3970,7 +3984,8 @@ int ram_load_postcopy(QEMUFile *f, int channel)
> >                  break;
> >              }
> >              tmp_page->target_pages++;
> > -            matches_target_page_size = block->page_size == TARGET_PAGE_SIZE;
> > +            matches_target_page_size =
> > +                migration_ram_pagesize(block) == TARGET_PAGE_SIZE;
> >              /*
> >               * Postcopy requires that we place whole host pages atomically;
> >               * these may be huge pages for RAMBlocks that are backed by
> 
> Hmm do you really want this change?

Yes that's intended.  I want to reuse the same logic here when receiving
small pages from huge pages, just like when we're receiving small pages on
non-hugetlb mappings.

matches_target_page_size mainly affects two things:

  1) For a small zero page, whether we want to pre-set the page_buffer, or
     simply use postcopy_place_page_zero():
  
        case RAM_SAVE_FLAG_ZERO:
            ch = qemu_get_byte(f);
            /*
             * Can skip to set page_buffer when
             * this is a zero page and (block->page_size == TARGET_PAGE_SIZE).
             */
            if (ch || !matches_target_page_size) {
                memset(page_buffer, ch, TARGET_PAGE_SIZE);
            }

  2) For normal page, whether we need to use a page buffer or we can
     directly reuse the page buffer in QEMUFile:

            if (!matches_target_page_size) {
                /* For huge pages, we always use temporary buffer */
                qemu_get_buffer(f, page_buffer, TARGET_PAGE_SIZE);
            } else {
                /*
                 * For small pages that matches target page size, we
                 * avoid the qemu_file copy.  Instead we directly use
                 * the buffer of QEMUFile to place the page.  Note: we
                 * cannot do any QEMUFile operation before using that
                 * buffer to make sure the buffer is valid when
                 * placing the page.
                 */
                qemu_get_buffer_in_place(f, (uint8_t **)&place_source,
                                         TARGET_PAGE_SIZE);
            }

Here:

I want 1) to reuse postcopy_place_page_zero().  For the doublemap case,
it'll reuse postcopy_tmp_zero_page() (because qemu_ram_is_uf_zeroable()
will return false for such a ramblock).

I want 2) to reuse qemu_get_buffer_in_place(), so we avoid a copy process
for the small page which is faster (even if it's hugetlb backed, now we can
reuse the qemufile buffer safely).
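
To make the effect concrete, a sketch (assuming a 4K host/target page size
and 2M hugetlb backing; names from the earlier patches):

    /* With doublemap on, hugetlb blocks are handled in small-page units */
    size_t psize = migration_ram_pagesize(block);   /* 4K, not 2M */
    bool matches = (psize == TARGET_PAGE_SIZE);     /* true on x86_64 */
    /*
     * matches == true means zero pages go through
     * postcopy_place_page_zero() and data pages can use
     * qemu_get_buffer_in_place(), skipping the temporary buffer.
     */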

Thanks,

-- 
Peter Xu




* Re: [PATCH RFC 12/21] migration: Introduce page size for-migration-only
  2023-01-24 21:36     ` Peter Xu
@ 2023-01-24 22:03       ` Peter Xu
  0 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-24 22:03 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

On Tue, Jan 24, 2023 at 04:36:20PM -0500, Peter Xu wrote:
> On Tue, Jan 24, 2023 at 01:20:37PM +0000, Dr. David Alan Gilbert wrote:
> > > @@ -3970,7 +3984,8 @@ int ram_load_postcopy(QEMUFile *f, int channel)
> > >                  break;
> > >              }
> > >              tmp_page->target_pages++;
> > > -            matches_target_page_size = block->page_size == TARGET_PAGE_SIZE;
> > > +            matches_target_page_size =
> > > +                migration_ram_pagesize(block) == TARGET_PAGE_SIZE;
> > >              /*
> > >               * Postcopy requires that we place whole host pages atomically;
> > >               * these may be huge pages for RAMBlocks that are backed by
> > 
> > Hmm do you really want this change?
> 
> Yes that's intended.  I want to reuse the same logic here when receiving
> small pages from huge pages, just like when we're receiving small pages on
> non-hugetlb mappings.
> 
> matches_target_page_size mainly affects two things:
> 
>   1) For a small zero page, whether we want to pre-set the page_buffer, or
>      simply use postcopy_place_page_zero():
>   
>         case RAM_SAVE_FLAG_ZERO:
>             ch = qemu_get_byte(f);
>             /*
>              * Can skip to set page_buffer when
>              * this is a zero page and (block->page_size == TARGET_PAGE_SIZE).
>              */
>             if (ch || !matches_target_page_size) {
>                 memset(page_buffer, ch, TARGET_PAGE_SIZE);
>             }
> 
>   2) For normal page, whether we need to use a page buffer or we can
>      directly reuse the page buffer in QEMUFile:
> 
>             if (!matches_target_page_size) {
>                 /* For huge pages, we always use temporary buffer */
>                 qemu_get_buffer(f, page_buffer, TARGET_PAGE_SIZE);
>             } else {
>                 /*
>                  * For small pages that matches target page size, we
>                  * avoid the qemu_file copy.  Instead we directly use
>                  * the buffer of QEMUFile to place the page.  Note: we
>                  * cannot do any QEMUFile operation before using that
>                  * buffer to make sure the buffer is valid when
>                  * placing the page.
>                  */
>                 qemu_get_buffer_in_place(f, (uint8_t **)&place_source,
>                                          TARGET_PAGE_SIZE);
>             }
> 
> Here:
> 
> I want 1) to reuse postcopy_place_page_zero().  For the doublemap case,
> it'll reuse postcopy_tmp_zero_page() (because qemu_ram_is_uf_zeroable()
> will return false for such a ramblock).
> 
> I want 2) to reuse qemu_get_buffer_in_place(), so we avoid a copy process
> for the small page which is faster (even if it's hugetlb backed, now we can
> reuse the qemufile buffer safely).

While at it, one more thing worth mentioning: I don't actually know
whether the original code is always correct when target and host small
psizes don't match.  This is the original line:

  matches_target_page_size = block->page_size == TARGET_PAGE_SIZE;

The problem is we're comparing the block page size against the target page
size; however, the block page size is in host page size granularity:

  RAMBlock *qemu_ram_alloc_internal()
  {
    new_block->page_size = qemu_real_host_page_size();

IOW, I am not sure whether postcopy will run at all in that case.  For
example, when we run an Alpha emulator on x86_64, we can have a target
psize of 8K while the host psize is 4K.

The migration protocol should be TARGET_PAGE_SIZE based.  That means, for
postcopy, when receiving a single page for an Alpha VM being migrated, we
may need to call UFFDIO_COPY (or UFFDIO_CONTINUE; it doesn't matter here)
twice because one guest page contains two host pages.

I'm not sure whether I've got all of this right.  If so, we have two options:

  a) Forbid postcopy as a whole when detecting qemu_real_host_page_size()
     != TARGET_PAGE_SIZE.

  b) Implement postcopy for that case

I'd go with a) even if it's an issue, because it means no one has migrated
such a setup with postcopy in the past N years, which suggests that maybe
b) isn't worth it.
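
For a), the check could be as trivial as something like below (a sketch
only - where exactly it should live is to be decided):

    /* Sketch: forbid postcopy when host and target page sizes differ */
    if (qemu_real_host_page_size() != TARGET_PAGE_SIZE) {
        error_setg(errp, "Postcopy requires the host page size to match "
                   "TARGET_PAGE_SIZE");
        return false;
    }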

-- 
Peter Xu




* Re: [PATCH RFC 02/21] util: Include osdep.h first in util/mmap-alloc.c
  2023-01-17 22:08 ` [PATCH RFC 02/21] util: Include osdep.h first in util/mmap-alloc.c Peter Xu
  2023-01-18 12:00   ` Dr. David Alan Gilbert
@ 2023-01-25  0:19   ` Philippe Mathieu-Daudé
  2023-01-30  4:57   ` Juan Quintela
  2 siblings, 0 replies; 69+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-01-25  0:19 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Leonardo Bras Soares Passos, James Houghton, Juan Quintela,
	Dr . David Alan Gilbert

On 17/1/23 23:08, Peter Xu wrote:
> Without it, we never have CONFIG_LINUX defined even on Linux, so
> linux/mman.h is never really included.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   util/mmap-alloc.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>




* Re: [PATCH RFC 10/21] ramblock: Add ramblock_file_map()
  2023-01-24 20:47     ` Peter Xu
@ 2023-01-25  9:24       ` Dr. David Alan Gilbert
  2023-01-25 14:46         ` Peter Xu
  0 siblings, 1 reply; 69+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-25  9:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Jan 24, 2023 at 10:06:48AM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > Add a helper to do mmap() for a ramblock based on the cached information.
> > > 
> > > A trivial thing to mention is that we need to move the ramblock->fd setup
> > > earlier, before the ramblock_file_map() call, because it'll need to
> > > reference the fd being mapped.  That should not be a problem at all,
> > > mainly because the fd won't be freed if successful, and if it failed the
> > > fd will be freed (or, to be explicit, close()ed) by the caller.
> > > 
> > > Export it - prepare for it to be used outside this file.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  include/exec/ram_addr.h |  1 +
> > >  softmmu/physmem.c       | 25 +++++++++++++++++--------
> > >  2 files changed, 18 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> > > index 0bf9cfc659..56db25009a 100644
> > > --- a/include/exec/ram_addr.h
> > > +++ b/include/exec/ram_addr.h
> > > @@ -98,6 +98,7 @@ bool ramblock_is_pmem(RAMBlock *rb);
> > >  
> > >  long qemu_minrampagesize(void);
> > >  long qemu_maxrampagesize(void);
> > > +void *ramblock_file_map(RAMBlock *block);
> > >  
> > >  /**
> > >   * qemu_ram_alloc_from_file,
> > > diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> > > index 6096eac286..cdda7eaea5 100644
> > > --- a/softmmu/physmem.c
> > > +++ b/softmmu/physmem.c
> > > @@ -1532,17 +1532,31 @@ static int file_ram_open(const char *path,
> > >      return fd;
> > >  }
> > >  
> > > +/* Do the mmap() for a ramblock based on information already setup */
> > > +void *ramblock_file_map(RAMBlock *block)
> > > +{
> > > +    uint32_t qemu_map_flags;
> > > +
> > > +    qemu_map_flags = (block->flags & RAM_READONLY) ? QEMU_MAP_READONLY : 0;
> > > +    qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
> > > +    qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
> > > +    qemu_map_flags |= (block->flags & RAM_NORESERVE) ? QEMU_MAP_NORESERVE : 0;
> > > +
> > > +    return qemu_ram_mmap(block->fd, block->mmap_length, block->mr->align,
> > > +                         qemu_map_flags, block->file_offset);
> > > +}
> > > +
> > >  static void *file_ram_alloc(RAMBlock *block,
> > >                              int fd,
> > >                              bool truncate,
> > >                              off_t offset,
> > >                              Error **errp)
> > >  {
> > > -    uint32_t qemu_map_flags;
> > >      void *area;
> > >  
> > >      /* Remember the offset just in case we'll need to map the range again */
> > 
> > Note that this comment is now wrong; the offset always needs to be set
> > for the map call.
> 
> This line is added in patch 7.  After this patch, a ramblock should always
> be mapped with ramblock_file_map(), so it stays true?

With ramblock_file_map() it's not a 'just in case' any more though, is
it?  This value always goes through block-> now?

Dave

> > 
> > Other than that,
> > 
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 14/21] migration: Map hugetlbfs ramblocks twice, and pre-allocate
  2023-01-17 22:09 ` [PATCH RFC 14/21] migration: Map hugetlbfs ramblocks twice, and pre-allocate Peter Xu
@ 2023-01-25 14:25   ` Dr. David Alan Gilbert
  2023-01-30  5:24   ` Juan Quintela
  1 sibling, 0 replies; 69+ messages in thread
From: Dr. David Alan Gilbert @ 2023-01-25 14:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Add a RAMBlock.host_mirror for all the hugetlbfs-backed guest memories.
> It'll be used to remap the same region twice, and to service page faults
> using UFFDIO_CONTINUE.
> 
> To make sure all accesses to these ranges generate minor page faults
> rather than missing page faults, we need to pre-allocate the files so
> that the page cache exists from the beginning.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  include/exec/ramblock.h |  7 +++++
>  migration/ram.c         | 59 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 66 insertions(+)
> 
> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> index 3f31ce1591..c76683c3c8 100644
> --- a/include/exec/ramblock.h
> +++ b/include/exec/ramblock.h
> @@ -28,6 +28,13 @@ struct RAMBlock {
>      struct rcu_head rcu;
>      struct MemoryRegion *mr;
>      uint8_t *host;
> +    /*
> +     * This is only used for hugetlbfs ramblocks where doublemap is
> +     * enabled.  The pointer is managed by dest host migration code, and
> +     * should be NULL when migration is finished.  On src host, it should
> +     * always be NULL.
> +     */
> +    uint8_t *host_mirror;
>      uint8_t *colo_cache; /* For colo, VM's ram cache */
>      ram_addr_t offset;
>      ram_addr_t used_length;
> diff --git a/migration/ram.c b/migration/ram.c
> index 2ebf414f5f..37d7b3553a 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3879,6 +3879,57 @@ void colo_release_ram_cache(void)
>      ram_state_cleanup(&ram_state);
>  }
>  
> +static int migrate_hugetlb_doublemap_init(void)
> +{
> +    RAMBlock *rb;
> +    void *addr;
> +    int ret;
> +
> +    if (!migrate_hugetlb_doublemap()) {
> +        return 0;
> +    }
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
> +        if (qemu_ram_is_hugetlb(rb)) {
> +            /*
> +             * Firstly, we remap the same ramblock into another range of
> +             * virtual address, so that we can write to the pages without
> +             * touching the page tables that directly mapped for the guest.
> +             */
> +            addr = ramblock_file_map(rb);
> +            if (addr == MAP_FAILED) {
> +                ret = -errno;
> +                error_report("%s: Duplicate mapping for hugetlb ramblock '%s'"
> +                             "failed: %s", __func__, qemu_ram_get_idstr(rb),
> +                             strerror(errno));
> +                return ret;
> +            }
> +            rb->host_mirror = addr;
> +
> +            /*
> +             * We need to make sure we pre-allocate the range with
> +             * hugetlbfs pages before hand, so that all the page fault will
> +             * be trapped as MINOR faults always, rather than MISSING
> +             * faults in userfaultfd.
> +             */
> +            ret = qemu_madvise(addr, rb->mmap_length, QEMU_MADV_POPULATE_WRITE);
> +            if (ret) {
> +                error_report("Failed to populate hugetlb ramblock '%s': "
> +                             "%s", qemu_ram_get_idstr(rb), strerror(-ret));
> +                return ret;
> +            }
> +        }
> +    }
> +
> +    /*
> +     * When reach here, it means we've setup the mirror mapping for all the
> +     * hugetlbfs pages.  Hence when page fault happens, we'll be able to
> +     * resolve page faults using UFFDIO_CONTINUE for hugetlbfs pages, but
> +     * we'll keep using UFFDIO_COPY for anonymous pages.
> +     */
> +    return 0;
> +}
> +
>  /**
>   * ram_load_setup: Setup RAM for migration incoming side
>   *
> @@ -3893,6 +3944,10 @@ static int ram_load_setup(QEMUFile *f, void *opaque)
>          return -1;
>      }
>  
> +    if (migrate_hugetlb_doublemap_init()) {
> +        return -1;
> +    }
> +
>      xbzrle_load_setup();
>      ramblock_recv_map_init();
>  
> @@ -3913,6 +3968,10 @@ static int ram_load_cleanup(void *opaque)
>      RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
>          g_free(rb->receivedmap);
>          rb->receivedmap = NULL;
> +        if (rb->host_mirror) {
> +            munmap(rb->host_mirror, rb->mmap_length);
> +            rb->host_mirror = NULL;
> +        }
>      }
>  
>      return 0;
> -- 
> 2.37.3
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 10/21] ramblock: Add ramblock_file_map()
  2023-01-25  9:24       ` Dr. David Alan Gilbert
@ 2023-01-25 14:46         ` Peter Xu
  0 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-25 14:46 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton, Juan Quintela

On Wed, Jan 25, 2023 at 09:24:24AM +0000, Dr. David Alan Gilbert wrote:
> > > >  static void *file_ram_alloc(RAMBlock *block,
> > > >                              int fd,
> > > >                              bool truncate,
> > > >                              off_t offset,
> > > >                              Error **errp)
> > > >  {
> > > > -    uint32_t qemu_map_flags;
> > > >      void *area;
> > > >  
> > > >      /* Remember the offset just in case we'll need to map the range again */
> > > 
> > > Note that this comment is now wrong; you need to always set that for the
> > > map call.
> > 
> > This line is added in patch 7.  After this patch, a ramblock should always
> > be mapped with ramblock_file_map(), so it stays true?
> 
> With ramblock_file_map() it's not a 'just in case' any more though, is
> it?  This value always goes through block-> now?

Ah yes.  Since the comment is not very informative, instead of changing it
I can drop it in the previous patch where it's introduced.

-- 
Peter Xu




* Re: [PATCH RFC 02/21] util: Include osdep.h first in util/mmap-alloc.c
  2023-01-17 22:08 ` [PATCH RFC 02/21] util: Include osdep.h first in util/mmap-alloc.c Peter Xu
  2023-01-18 12:00   ` Dr. David Alan Gilbert
  2023-01-25  0:19   ` Philippe Mathieu-Daudé
@ 2023-01-30  4:57   ` Juan Quintela
  2 siblings, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-01-30  4:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert, Markus Armbruster

Peter Xu <peterx@redhat.com> wrote:
> Without it, we never have CONFIG_LINUX defined even on Linux, so
> linux/mman.h is never really included.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

Markus is working on this right now - right, Markus?




* Re: [PATCH RFC 03/21] physmem: Add qemu_ram_is_hugetlb()
  2023-01-17 22:08 ` [PATCH RFC 03/21] physmem: Add qemu_ram_is_hugetlb() Peter Xu
  2023-01-18 12:02   ` Dr. David Alan Gilbert
@ 2023-01-30  5:00   ` Juan Quintela
  1 sibling, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-01-30  5:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> Returns true for a hugetlbfs mapping, false otherwise.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>




* Re: [PATCH RFC 04/21] madvise: Include linux/mman.h under linux-headers/
  2023-01-17 22:08 ` [PATCH RFC 04/21] madvise: Include linux/mman.h under linux-headers/ Peter Xu
  2023-01-18 12:08   ` Dr. David Alan Gilbert
@ 2023-01-30  5:01   ` Juan Quintela
  1 sibling, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-01-30  5:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> This will allow qemu/madvise.h to always include linux/mman.h from
> linux-headers/.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>




* Re: [PATCH RFC 05/21] madvise: Add QEMU_MADV_SPLIT
  2023-01-17 22:08 ` [PATCH RFC 05/21] madvise: Add QEMU_MADV_SPLIT Peter Xu
@ 2023-01-30  5:01   ` Juan Quintela
  0 siblings, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-01-30  5:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> MADV_SPLIT is a new madvise() on Linux.  Define QEMU_MADV_SPLIT.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

You can keep the Reviewed-by even if you collapse this with the next one,
as David suggests.




* Re: [PATCH RFC 06/21] madvise: Add QEMU_MADV_COLLAPSE
  2023-01-17 22:08 ` [PATCH RFC 06/21] madvise: Add QEMU_MADV_COLLAPSE Peter Xu
  2023-01-18 18:51   ` Dr. David Alan Gilbert
@ 2023-01-30  5:02   ` Juan Quintela
  1 sibling, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-01-30  5:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> MADV_COLLAPSE is a new madvise() on Linux.  Define it.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>




* Re: [PATCH RFC 07/21] ramblock: Cache file offset for file-backed ramblocks
  2023-01-17 22:09 ` [PATCH RFC 07/21] ramblock: Cache file offset for file-backed ramblocks Peter Xu
@ 2023-01-30  5:02   ` Juan Quintela
  0 siblings, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-01-30  5:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> This value was only used for mmap() when we want to map memory at a
> specific offset of the file.  To be prepared for the case where we do
> another map of the same range for whatever reason, cache the offset so we
> know how to map the same range again.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

A bit weird that we don't use it (yet) anywhere, but that is life.




* Re: [PATCH RFC 08/21] ramblock: Cache the length to do file mmap() on ramblocks
  2023-01-17 22:09 ` [PATCH RFC 08/21] ramblock: Cache the length to do file mmap() on ramblocks Peter Xu
  2023-01-23 18:51   ` Dr. David Alan Gilbert
@ 2023-01-30  5:05   ` Juan Quintela
  2023-01-30 22:07     ` Peter Xu
  1 sibling, 1 reply; 69+ messages in thread
From: Juan Quintela @ 2023-01-30  5:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> We do proper page size alignment for file backed mmap()s for ramblocks.
> Even if it's as simple as that, cache the value because it'll be used in
> multiple places.
>
> While at it, drop size for file_ram_alloc() and just use max_length because
> that's always true for file-backed ramblocks.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>


> @@ -2100,7 +2100,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>      new_block->used_length = size;
>      new_block->max_length = size;
>      new_block->flags = ram_flags;
> -    new_block->host = file_ram_alloc(new_block, size, fd, readonly,
> +    new_block->host = file_ram_alloc(new_block, fd, readonly,
>                                       !file_size, offset, errp);
>      if (!new_block->host) {
>          g_free(new_block);

Passing "size" in three places, not bad at all O:-)




* Re: [PATCH RFC 09/21] ramblock: Add RAM_READONLY
  2023-01-17 22:09 ` [PATCH RFC 09/21] ramblock: Add RAM_READONLY Peter Xu
  2023-01-23 19:42   ` Dr. David Alan Gilbert
@ 2023-01-30  5:06   ` Juan Quintela
  1 sibling, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-01-30  5:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> This allows RAM_READONLY to be set in ram_flags to show that this
> ramblock can only be read, not written.
>
> We used to pass a readonly boolean along the way when allocating the
> ramblock; now let it live together with the rest of the ramblock flags.
>
> The main purpose of this patch is not cleanup though; it's to cache the
> mapping information of each ramblock so that when we want to mmap() it
> again for whatever reason we have all the information on hand.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>




* Re: [PATCH RFC 10/21] ramblock: Add ramblock_file_map()
  2023-01-17 22:09 ` [PATCH RFC 10/21] ramblock: Add ramblock_file_map() Peter Xu
  2023-01-24 10:06   ` Dr. David Alan Gilbert
@ 2023-01-30  5:09   ` Juan Quintela
  1 sibling, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-01-30  5:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> Add a helper to do mmap() for a ramblock based on the cached information.
>
> A trivial thing to mention is that we need to move the ramblock->fd setup
> earlier, before the ramblock_file_map() call, because it'll need to
> reference the fd being mapped.  That should not be a problem at all,
> mainly because the fd won't be freed if successful, and if it failed the
> fd will be freed (or, to be explicit, close()ed) by the caller.
>
> Export it - prepare for it to be used outside this file.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>


> +void *ramblock_file_map(RAMBlock *block);

I would have called it:

void *qemu_ram_mmap_file(RAMBlock *block);

To make clear that it is 'like' qemu_ram_mmap(), but for a file.

But that is just a suggestion.  Whoever does the patch gets the right to
name the functions.




* Re: [PATCH RFC 11/21] migration: Add hugetlb-doublemap cap
  2023-01-17 22:09 ` [PATCH RFC 11/21] migration: Add hugetlb-doublemap cap Peter Xu
  2023-01-24 12:45   ` Dr. David Alan Gilbert
@ 2023-01-30  5:13   ` Juan Quintela
  1 sibling, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-01-30  5:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> Add a new cap to allow mapping hugetlbfs-backed RAM in small page sizes.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

> +bool migrate_hugetlb_doublemap(void)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_HUGETLB_DOUBLEMAP];
> +}

I think it was not our finest moment when we decided to name the
functions that query capabilities without a verb, but well, everything
else uses "this" convention, so ....



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 12/21] migration: Introduce page size for-migration-only
  2023-01-17 22:09 ` [PATCH RFC 12/21] migration: Introduce page size for-migration-only Peter Xu
  2023-01-24 13:20   ` Dr. David Alan Gilbert
@ 2023-01-30  5:17   ` Juan Quintela
  1 sibling, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-01-30  5:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> Migration may not always want to recognize memory chunks in the host
> page size only; sometimes we may want to recognize the memory in smaller
> chunks, e.g. when it's doubly mapped as both huge and small.
>
> In those cases we'll prefer to assume the memory is always mapped in
> small pages (qemu_real_host_page_size) and do things just as if the
> pages were only mapped small.
>
> Let's do this to be prepared for postcopy double-mapping of hugetlbfs.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>


Reviewed-by: Juan Quintela <quintela@redhat.com>


> ---
>  migration/migration.c    |  6 ++++--
>  migration/postcopy-ram.c | 16 +++++++++-------
>  migration/ram.c          | 29 ++++++++++++++++++++++-------
>  migration/ram.h          |  1 +
>  4 files changed, 36 insertions(+), 16 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index b174f2af92..f6fe474fc3 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -408,7 +408,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>  {
>      uint8_t bufc[12 + 1 + 255]; /* start (8), len (4), rbname up to 256 */
>      size_t msglen = 12; /* start + len */
> -    size_t len = qemu_ram_pagesize(rb);
> +    size_t len = migration_ram_pagesize(rb);
>      enum mig_rp_message_type msg_type;
>      const char *rbname;
>      int rbname_len;
> @@ -443,8 +443,10 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>  int migrate_send_rp_req_pages(MigrationIncomingState *mis,
>                                RAMBlock *rb, ram_addr_t start, uint64_t haddr)
>  {
> -    void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb));
>      bool received = false;
> +    void *aligned;
> +
> +    aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, migration_ram_pagesize(rb));

I am trying to get all new code to declare variables at first use, and
this goes in the wrong direction.  As this happens more than once in this
patch, can we change the macro (or create another one) so that it also
does the cast?
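
For illustration, such a macro could look roughly like this (the name
MIGRATION_RAM_ALIGN is hypothetical; a sketch only):

    /* Round 'haddr' down to this ramblock's migration page size */
    #define MIGRATION_RAM_ALIGN(rb, haddr) \
        ((void *)(uintptr_t)ROUND_DOWN((haddr), migration_ram_pagesize(rb)))

so that the call site becomes a one-liner again:

    void *aligned = MIGRATION_RAM_ALIGN(rb, haddr);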



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 13/21] migration: Add migration_ram_pagesize_largest()
  2023-01-17 22:09 ` [PATCH RFC 13/21] migration: Add migration_ram_pagesize_largest() Peter Xu
  2023-01-24 17:34   ` Dr. David Alan Gilbert
@ 2023-01-30  5:19   ` Juan Quintela
  1 sibling, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-01-30  5:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> Let it replace the old qemu_ram_pagesize_largest(), fetching the page
> sizes using migration_ram_pagesize() instead, because it'll start to
> take the double-mapping effect in migrations into account.
>
> Also, don't count the ignored ramblocks, as they won't be migrated.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 14/21] migration: Map hugetlbfs ramblocks twice, and pre-allocate
  2023-01-17 22:09 ` [PATCH RFC 14/21] migration: Map hugetlbfs ramblocks twice, and pre-allocate Peter Xu
  2023-01-25 14:25   ` Dr. David Alan Gilbert
@ 2023-01-30  5:24   ` Juan Quintela
  2023-01-30 22:35     ` Peter Xu
  1 sibling, 1 reply; 69+ messages in thread
From: Juan Quintela @ 2023-01-30  5:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> Add a RAMBlock.host_mirror for all the hugetlbfs backed guest memories.
> It'll be used to remap the same region twice, and to service page faults
> using UFFDIO_CONTINUE.
>
> To make sure all accesses to these ranges generate minor page faults
> rather than missing page faults, we need to pre-allocate the files so
> that the page cache exists from the beginning.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

but what about this change:

> ---
>  include/exec/ramblock.h |  7 +++++
>  migration/ram.c         | 59 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 66 insertions(+)
>
> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> index 3f31ce1591..c76683c3c8 100644
> --- a/include/exec/ramblock.h
> +++ b/include/exec/ramblock.h
> @@ -28,6 +28,13 @@ struct RAMBlock {
>      struct rcu_head rcu;
>      struct MemoryRegion *mr;
>      uint8_t *host;
> +    /*
> +     * This is only used for hugetlbfs ramblocks where doublemap is
> +     * enabled.  The pointer is managed by dest host migration code, and
> +     * should be NULL when migration is finished.  On src host, it should
> +     * always be NULL.
> +     */
> +    uint8_t *host_mirror;

I would consider here:

    uint8_t *host_doublemap;

as I don't have a short name that means
    uint8_t *host_map_smaller_size_pages;

which would explain why we need it.


>      uint8_t *colo_cache; /* For colo, VM's ram cache */
>      ram_addr_t offset;
>      ram_addr_t used_length;
> diff --git a/migration/ram.c b/migration/ram.c
> index 2ebf414f5f..37d7b3553a 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3879,6 +3879,57 @@ void colo_release_ram_cache(void)
>      ram_state_cleanup(&ram_state);
>  }
>  
> +static int migrate_hugetlb_doublemap_init(void)
> +{
> +    RAMBlock *rb;
> +    void *addr;
> +    int ret;

Not initialized variables, remove the last two.

> +    if (!migrate_hugetlb_doublemap()) {
> +        return 0;
> +    }
> +

I would move the declaration of the RAMBlock here.

> +    RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
> +        if (qemu_ram_is_hugetlb(rb)) {
> +            /*
> +             * Firstly, we remap the same ramblock into another range of
> +             * virtual address, so that we can write to the pages without
> +             * touching the page tables that directly mapped for the guest.
> +             */
> +            addr = ramblock_file_map(rb);

               void *addr = ramblock_file_map(rb);

> +            if (addr == MAP_FAILED) {
> +                ret = -errno;
                   int ret = -errno;
> +                error_report("%s: Duplicate mapping for hugetlb ramblock '%s'"
> +                             "failed: %s", __func__, qemu_ram_get_idstr(rb),
> +                             strerror(errno));
> +                return ret;
> +            }
> +            rb->host_mirror = addr;
> +
> +            /*
> +             * We need to make sure we pre-allocate the range with
> +             * hugetlbfs pages before hand, so that all the page fault will
> +             * be trapped as MINOR faults always, rather than MISSING
> +             * faults in userfaultfd.
> +             */
> +            ret = qemu_madvise(addr, rb->mmap_length, QEMU_MADV_POPULATE_WRITE);

               int ret = qemu_madvise(addr, rb->mmap_length, QEMU_MADV_POPULATE_WRITE);

> +            if (ret) {
> +                error_report("Failed to populate hugetlb ramblock '%s': "
> +                             "%s", qemu_ram_get_idstr(rb), strerror(-ret));
> +                return ret;
> +            }
> +        }
> +    }
> +
> +    /*
> +     * When we reach here, it means we've set up the mirror mapping for
> +     * all the hugetlbfs pages.  Hence when a page fault happens, we'll
> +     * be able to resolve it using UFFDIO_CONTINUE for hugetlbfs pages,
> +     * but we'll keep using UFFDIO_COPY for anonymous pages.
> +     */
> +    return 0;
> +}



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 15/21] migration: Teach qemu about minor faults and doublemap
  2023-01-17 22:09 ` [PATCH RFC 15/21] migration: Teach qemu about minor faults and doublemap Peter Xu
@ 2023-01-30  5:45   ` Juan Quintela
  2023-01-30 22:50     ` Peter Xu
  0 siblings, 1 reply; 69+ messages in thread
From: Juan Quintela @ 2023-01-30  5:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> When a ramblock is backed by hugetlbfs and the user specified the
> double-map feature, we trap the faults on these regions using minor mode.
> Teach QEMU about that.
>
> Add some sanity check on the fault flags when receiving a uffd message.
> For minor fault trapped ranges, we should always see the MINOR flag set,
> while when using generic missing faults we should never see it.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>



> -    if (!(reg_struct.ioctls & ((__u64)1 << _UFFDIO_COPY))) {

Does qemu have a macro to do this bitmap handling?

>  {
>      MigrationIncomingState *mis = opaque;
>      struct uffd_msg msg;
> +    uint64_t address;
>      int ret;
>      size_t index;
>      RAMBlock *rb = NULL;
> @@ -945,6 +980,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
>      }
>  
>      while (true) {
> +        bool use_minor_fault, minor_flag;

I think that something along the lines of:
           bool src_minor_fault, dst_minor_fault;

will make things simpler.  While reviewing, I have to go back to the
definition site to know which is which.

>          ram_addr_t rb_offset;
>          int poll_result;
>  
> @@ -1022,22 +1058,37 @@ static void *postcopy_ram_fault_thread(void *opaque)
>                  break;
>              }
>  
> -            rb_offset = ROUND_DOWN(rb_offset, migration_ram_pagesize(rb));
> -            trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
> -                                                qemu_ram_get_idstr(rb),
> -                                                rb_offset,
> -                                                msg.arg.pagefault.feat.ptid);
> -            mark_postcopy_blocktime_begin(
> -                    (uintptr_t)(msg.arg.pagefault.address),
> -                                msg.arg.pagefault.feat.ptid, rb);
> +            address = ROUND_DOWN(msg.arg.pagefault.address,
> +                                 migration_ram_pagesize(rb));
> +            use_minor_fault = postcopy_use_minor_fault(rb);
> +            minor_flag = !!(msg.arg.pagefault.flags &
> +                            UFFD_PAGEFAULT_FLAG_MINOR);
>  
> +            /*
> +             * Do sanity check on the message flags to make sure this is
> +             * the one we expect to receive.  When using minor fault on
> +             * this ramblock, it should _always_ be set; when not using
> +             * minor fault, it should _never_ be set.
> +             */
> +            if (use_minor_fault ^ minor_flag) {
> +                error_report("%s: Unexpected page fault flags (0x%"PRIx64") "
> +                             "for address 0x%"PRIx64" (mode=%s)", __func__,
> +                             (uint64_t)msg.arg.pagefault.flags,
> +                             (uint64_t)msg.arg.pagefault.address,
> +                             use_minor_fault ? "MINOR" : "MISSING");
> +            }
> +
> +            trace_postcopy_ram_fault_thread_request(
> +                address, qemu_ram_get_idstr(rb), rb_offset,
> +                msg.arg.pagefault.feat.ptid);
> +            mark_postcopy_blocktime_begin(
> +                    (uintptr_t)(address), msg.arg.pagefault.feat.ptid, rb);
>  retry:
>              /*
>               * Send the request to the source - we want to request one
>               * of our host page sizes (which is >= TPS)
>               */
> -            ret = postcopy_request_page(mis, rb, rb_offset,
> -                                        msg.arg.pagefault.address);
> +            ret = postcopy_request_page(mis, rb, rb_offset, address);

This is the only change that I find 'problematic'.
In the old code, rb_offset has been ROUND_DOWN'ed; in the new code it is not.
In the old code we pass msg.arg.pagefault.address; now we use
ROUND_DOWN(msg.arg.pagefault.address, migration_ram_pagesize(rb)).

>              if (ret) {
>                  /* May be network failure, try to wait for recovery */
>                  postcopy_pause_fault_thread(mis);
> @@ -1694,3 +1745,13 @@ void *postcopy_preempt_thread(void *opaque)
>  
>      return NULL;
>  }
> +
> +/*
> + * Whether we should use MINOR fault to trap page faults?  It will be used
> + * when doublemap is enabled on hugetlbfs.  The default value will be
> + * false, which means we'll keep using the legacy MISSING faults.
> + */
> +bool postcopy_use_minor_fault(RAMBlock *rb)
> +{
> +    return migrate_hugetlb_doublemap() && qemu_ram_is_hugetlb(rb);
> +}

Are you planning to use this function outside postcopy-ram.c?  Otherwise,
if you move up its definition you can make it static and drop the header
change.

Later, Juan.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 08/21] ramblock: Cache the length to do file mmap() on ramblocks
  2023-01-30  5:05   ` Juan Quintela
@ 2023-01-30 22:07     ` Peter Xu
  0 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2023-01-30 22:07 UTC (permalink / raw)
  To: Juan Quintela
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

On Mon, Jan 30, 2023 at 06:05:47AM +0100, Juan Quintela wrote:
> Peter Xu <peterx@redhat.com> wrote:
> > We do proper page size alignment for file backed mmap()s for ramblocks.
> > Even if it's as simple as that, cache the value because it'll be used in
> > multiple places.
> >
> > While at it, drop size for file_ram_alloc() and just use max_length because
> > that's always true for file-backed ramblocks.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Reviewed-by: Juan Quintela <quintela@redhat.com>

Thanks for reviewing the set!

> 
> > @@ -2100,7 +2100,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
> >      new_block->used_length = size;
> >      new_block->max_length = size;
> >      new_block->flags = ram_flags;
> > -    new_block->host = file_ram_alloc(new_block, size, fd, readonly,
> > +    new_block->host = file_ram_alloc(new_block, fd, readonly,
> >                                       !file_size, offset, errp);
> >      if (!new_block->host) {
> >          g_free(new_block);
> 
> Passing "size" in three places, not bad at all O:-)

Yes it's a bit unfortunate. :(

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 14/21] migration: Map hugetlbfs ramblocks twice, and pre-allocate
  2023-01-30  5:24   ` Juan Quintela
@ 2023-01-30 22:35     ` Peter Xu
  2023-02-01 18:53       ` Juan Quintela
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2023-01-30 22:35 UTC (permalink / raw)
  To: Juan Quintela
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

On Mon, Jan 30, 2023 at 06:24:20AM +0100, Juan Quintela wrote:
> Peter Xu <peterx@redhat.com> wrote:
> > Add a RAMBlock.host_mirror for all the hugetlbfs backed guest memories.
> > It'll be used to remap the same region twice, and to service page faults
> > using UFFDIO_CONTINUE.
> >
> > To make sure all accesses to these ranges generate minor page faults
> > rather than missing page faults, we need to pre-allocate the files so
> > that the page cache exists from the beginning.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> 
> but what about this change
> 
> > ---
> >  include/exec/ramblock.h |  7 +++++
> >  migration/ram.c         | 59 +++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 66 insertions(+)
> >
> > diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> > index 3f31ce1591..c76683c3c8 100644
> > --- a/include/exec/ramblock.h
> > +++ b/include/exec/ramblock.h
> > @@ -28,6 +28,13 @@ struct RAMBlock {
> >      struct rcu_head rcu;
> >      struct MemoryRegion *mr;
> >      uint8_t *host;
> > +    /*
> > +     * This is only used for hugetlbfs ramblocks where doublemap is
> > +     * enabled.  The pointer is managed by dest host migration code, and
> > +     * should be NULL when migration is finished.  On src host, it should
> > +     * always be NULL.
> > +     */
> > +    uint8_t *host_mirror;
> 
> I would consider here:
> 
>     uint8_t *host_doublemap;
> 
> as I don't have a short name that means
>     uint8_t *host_map_smaller_size_pages;
> 
> which would explain why we need it.

Sure, I can rename this one if it helps.

One thing worth mentioning is that it's not mapping things in small page
size here with host_doublemap, but in huge page size only.

It's just that UFFDIO_CONTINUE needs another mapping to resolve the page
faults. It'll be the guest hugetlb ramblocks that will be mapped in small
pages during postcopy.
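
To illustrate the layout being described (a sketch, not from the patch
itself):

    rb->host           --> hugetlbfs file <--  rb->host_mirror
    (uffd-registered;                          (not registered;
     traps minor faults)                        written via memcpy())

Both ranges map the same file offsets: a memcpy() into rb->host_mirror
populates the shared page cache, and UFFDIO_CONTINUE then installs the
pgtables for the faulting addresses in rb->host.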

> 
> 
> >      uint8_t *colo_cache; /* For colo, VM's ram cache */
> >      ram_addr_t offset;
> >      ram_addr_t used_length;
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 2ebf414f5f..37d7b3553a 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -3879,6 +3879,57 @@ void colo_release_ram_cache(void)
> >      ram_state_cleanup(&ram_state);
> >  }
> >  
> > +static int migrate_hugetlb_doublemap_init(void)
> > +{
> > +    RAMBlock *rb;
> > +    void *addr;
> > +    int ret;
> 
> Not initialized variables, remove the last two.

I can do this.

> 
> > +    if (!migrate_hugetlb_doublemap()) {
> > +        return 0;
> > +    }
> > +
> 
> I would move the declaration of the RAMBlock here.

But doesn't QEMU in most cases declare variables at the start of a code
block, rather than in the middle of code segments?  IIRC some compilers
would start to fail with that, even though not the modern ones.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 15/21] migration: Teach qemu about minor faults and doublemap
  2023-01-30  5:45   ` Juan Quintela
@ 2023-01-30 22:50     ` Peter Xu
  2023-02-01 18:55       ` Juan Quintela
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2023-01-30 22:50 UTC (permalink / raw)
  To: Juan Quintela
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

On Mon, Jan 30, 2023 at 06:45:20AM +0100, Juan Quintela wrote:
> Peter Xu <peterx@redhat.com> wrote:
> > When a ramblock is backed by hugetlbfs and the user specified the
> > double-map feature, we trap the faults on these regions using minor mode.
> > Teach QEMU about that.
> >
> > Add some sanity check on the fault flags when receiving a uffd message.
> > For minor fault trapped ranges, we should always see the MINOR flag set,
> > while when using generic missing faults we should never see it.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> 
> 
> 
> > -    if (!(reg_struct.ioctls & ((__u64)1 << _UFFDIO_COPY))) {
> 
> Does qemu have a macro to do this bitmap handling?

Nothing yet that's suitable.  It's open-coded like this in many places in
postcopy.  One thing close enough is bitmap_test_and_clear(), but that's
too heavy.

> 
> >  {
> >      MigrationIncomingState *mis = opaque;
> >      struct uffd_msg msg;
> > +    uint64_t address;
> >      int ret;
> >      size_t index;
> >      RAMBlock *rb = NULL;
> > @@ -945,6 +980,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
> >      }
> >  
> >      while (true) {
> > +        bool use_minor_fault, minor_flag;
> 
> I think that something along the lines of:
>            bool src_minor_fault, dst_minor_fault;
> 
> will make things simpler.  While reviewing, I have to go back to the
> definition site to know which is which.

These two values represent "what we expect" and "what we got from the
message", so I'm just not sure whether src/dst fits best here.

How about "expect_minor_fault" and "has_minor_fault" instead?

> 
> >          ram_addr_t rb_offset;
> >          int poll_result;
> >  
> > @@ -1022,22 +1058,37 @@ static void *postcopy_ram_fault_thread(void *opaque)
> >                  break;
> >              }
> >  
> > -            rb_offset = ROUND_DOWN(rb_offset, migration_ram_pagesize(rb));
> > -            trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
> > -                                                qemu_ram_get_idstr(rb),
> > -                                                rb_offset,
> > -                                                msg.arg.pagefault.feat.ptid);
> > -            mark_postcopy_blocktime_begin(
> > -                    (uintptr_t)(msg.arg.pagefault.address),
> > -                                msg.arg.pagefault.feat.ptid, rb);
> > +            address = ROUND_DOWN(msg.arg.pagefault.address,
> > +                                 migration_ram_pagesize(rb));
> > +            use_minor_fault = postcopy_use_minor_fault(rb);
> > +            minor_flag = !!(msg.arg.pagefault.flags &
> > +                            UFFD_PAGEFAULT_FLAG_MINOR);
> >  
> > +            /*
> > +             * Do sanity check on the message flags to make sure this is
> > +             * the one we expect to receive.  When using minor fault on
> > +             * this ramblock, it should _always_ be set; when not using
> > +             * minor fault, it should _never_ be set.
> > +             */
> > +            if (use_minor_fault ^ minor_flag) {
> > +                error_report("%s: Unexpected page fault flags (0x%"PRIx64") "
> > +                             "for address 0x%"PRIx64" (mode=%s)", __func__,
> > +                             (uint64_t)msg.arg.pagefault.flags,
> > +                             (uint64_t)msg.arg.pagefault.address,
> > +                             use_minor_fault ? "MINOR" : "MISSING");
> > +            }
> > +
> > +            trace_postcopy_ram_fault_thread_request(
> > +                address, qemu_ram_get_idstr(rb), rb_offset,
> > +                msg.arg.pagefault.feat.ptid);
> > +            mark_postcopy_blocktime_begin(
> > +                    (uintptr_t)(address), msg.arg.pagefault.feat.ptid, rb);
> >  retry:
> >              /*
> >               * Send the request to the source - we want to request one
> >               * of our host page sizes (which is >= TPS)
> >               */
> > -            ret = postcopy_request_page(mis, rb, rb_offset,
> > -                                        msg.arg.pagefault.address);
> > +            ret = postcopy_request_page(mis, rb, rb_offset, address);
> 
> This is the only change that I find 'problematic'.
> In the old code, rb_offset has been ROUND_DOWN'ed; in the new code it is not.
> In the old code we pass msg.arg.pagefault.address; now we use
> ROUND_DOWN(msg.arg.pagefault.address, migration_ram_pagesize(rb)).

Thanks for spotting such a detail even for an RFC series. :)

It's actually rounded down to the target psize; since we're in postcopy we
should require that the target psize equals the host psize (or I bet it
won't really work at all).  So the relevant rounddown was actually done here:

            rb = qemu_ram_block_from_host(
                     (void *)(uintptr_t)msg.arg.pagefault.address,
                     true, &rb_offset);

In which there's:

    *offset = (host - block->host);
    if (round_offset) {
        *offset &= TARGET_PAGE_MASK;
    }

So when I reworked that chunk of code I directly dropped the ROUND_DOWN(),
because I found it duplicated.

> 
> >              if (ret) {
> >                  /* May be network failure, try to wait for recovery */
> >                  postcopy_pause_fault_thread(mis);
> > @@ -1694,3 +1745,13 @@ void *postcopy_preempt_thread(void *opaque)
> >  
> >      return NULL;
> >  }
> > +
> > +/*
> > + * Whether we should use MINOR fault to trap page faults?  It will be used
> > + * when doublemap is enabled on hugetlbfs.  The default value will be
> > + * false, which means we'll keep using the legacy MISSING faults.
> > + */
> > +bool postcopy_use_minor_fault(RAMBlock *rb)
> > +{
> > +    return migrate_hugetlb_doublemap() && qemu_ram_is_hugetlb(rb);
> > +}
> 
> Are you planning to use this function outside postcopy-ram.c?  Otherwise,
> if you move up its definition you can make it static and drop the header
> change.

Yes, it'll be further used in ram.c later in the patch "migration: Rework
ram discard logic for hugetlb double-map" right below.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 14/21] migration: Map hugetlbfs ramblocks twice, and pre-allocate
  2023-01-30 22:35     ` Peter Xu
@ 2023-02-01 18:53       ` Juan Quintela
  2023-02-06 21:40         ` Peter Xu
  0 siblings, 1 reply; 69+ messages in thread
From: Juan Quintela @ 2023-02-01 18:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> On Mon, Jan 30, 2023 at 06:24:20AM +0100, Juan Quintela wrote:
>> I would consider here:
>> 
>>     uint8_t *host_doublemap;
>> 
>> as I have not a small name that means
>>     uint8_t *host_map_smaller_size_pages;
>> 
>> That explains why we need it.
>
> Sure, I can rename this one if it helps.
>
> One thing worth mentioning is that it's not mapping things in small page
> size here with host_doublemap, but in huge page size only.

Thanks.


> It's just that UFFDIO_CONTINUE needs another mapping to resolve the page
> faults. It'll be the guest hugetlb ramblocks that will be mapped in small
> pages during postcopy.

ok
>> Not initialized variables, remove the last two.
>
> I can do this.
>
>> 
>> > +    if (!migrate_hugetlb_doublemap()) {
>> > +        return 0;
>> > +    }
>> > +
>> 
>> I would move the declaration of the RAMBlock here.
>
> But doesn't QEMU in most cases declare variables at the start of a code
> block, rather than in the middle of code segments?  IIRC some compilers
> would start to fail with that, even though not the modern ones.

We can declare variables since c99.  Only 24 years have passed O:-)

Anyways:

Exhibit A: We already have that kind of code

static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
{
    MultiFDPages_t *pages = p->pages;

    for (int i = 0; i < p->normal_num; i++) {
        p->iov[p->iovs_num].iov_base = pages->block->host + p->normal[i];
        p->iov[p->iovs_num].iov_len = p->page_size;
        p->iovs_num++;
    }

    p->next_packet_size = p->normal_num * p->page_size;
    p->flags |= MULTIFD_FLAG_NOCOMP;
    return 0;
}

Exhibit B:

from configure:

#if defined(__clang_major__) && defined(__clang_minor__)
# ifdef __apple_build_version__
#  if __clang_major__ < 10 || (__clang_major__ == 10 && __clang_minor__ < 0)
#   error You need at least XCode Clang v10.0 to compile QEMU
#  endif
# else
#  if __clang_major__ < 6 || (__clang_major__ == 6 && __clang_minor__ < 0)
#   error You need at least Clang v6.0 to compile QEMU
#  endif
# endif
#elif defined(__GNUC__) && defined(__GNUC_MINOR__)
# if __GNUC__ < 7 || (__GNUC__ == 7 && __GNUC_MINOR__ < 4)
#  error You need at least GCC v7.4.0 to compile QEMU
# endif
#else
# error You either need GCC or Clang to compiler QEMU
#endif
int main (void) { return 0; }
EOF

gcc-7.4.0: supports C11, so we are good here
https://gcc.gnu.org/onlinedocs/gcc-7.4.0/gcc/Standards.html#C-Language

clang 6.0: supports c11 and c17 standard
https://releases.llvm.org/6.0.0/tools/clang/docs/ReleaseNotes.html


So as far as I can see, we are good here.

Later, Juan.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 15/21] migration: Teach qemu about minor faults and doublemap
  2023-01-30 22:50     ` Peter Xu
@ 2023-02-01 18:55       ` Juan Quintela
  0 siblings, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-02-01 18:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> On Mon, Jan 30, 2023 at 06:45:20AM +0100, Juan Quintela wrote:
>> Peter Xu <peterx@redhat.com> wrote:
>> > When a ramblock is backed by hugetlbfs and the user specified the
>> > double-map feature, we trap the faults on these regions using minor mode.
>> > Teach QEMU about that.
>> >
>> > Add some sanity check on the fault flags when receiving a uffd message.
>> > For minor fault trapped ranges, we should always see the MINOR flag set,
>> > while when using generic missing faults we should never see it.
>> >
>> > Signed-off-by: Peter Xu <peterx@redhat.com>
>> 
>> Reviewed-by: Juan Quintela <quintela@redhat.com>
>> 
>> 
>> 
>> > -    if (!(reg_struct.ioctls & ((__u64)1 << _UFFDIO_COPY))) {
>> 
>> Does qemu have a macro to do this bitmap handling?
>
> Nothing yet that's suitable.  It's open-coded like this in many places in
> postcopy.  One thing close enough is bitmap_test_and_clear(), but that's
> too heavy.
>
>> 
>> >  {
>> >      MigrationIncomingState *mis = opaque;
>> >      struct uffd_msg msg;
>> > +    uint64_t address;
>> >      int ret;
>> >      size_t index;
>> >      RAMBlock *rb = NULL;
>> > @@ -945,6 +980,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
>> >      }
>> >  
>> >      while (true) {
>> > +        bool use_minor_fault, minor_flag;
>> 
>> I think that something along the lines of:
>>            bool src_minor_fault, dst_minor_fault;
>> 
>> will make things simpler.  While reviewing, I have to go back to the
>> definition site to know which is which.
>
> These two values represent "what we expect" and "what we got from the
> message", so I'm just not sure whether src/dst fits best here.
>
> How about "expect_minor_fault" and "has_minor_fault" instead?

Perfect with me.

>> >              /*
>> >               * Send the request to the source - we want to request one
>> >               * of our host page sizes (which is >= TPS)
>> >               */
>> > -            ret = postcopy_request_page(mis, rb, rb_offset,
>> > -                                        msg.arg.pagefault.address);
>> > +            ret = postcopy_request_page(mis, rb, rb_offset, address);
>> 
>> This is the only change that I find 'problematic'.
>> In the old code, rb_offset has been ROUND_DOWN'ed; in the new code it is not.
>> In the old code we pass msg.arg.pagefault.address; now we use
>> ROUND_DOWN(msg.arg.pagefault.address, migration_ram_pagesize(rb)).
>
> Thanks for spotting such a detail even for an RFC series. :)
>
> It's actually rounded down to the target psize; since we're in postcopy we
> should require that the target psize equals the host psize (or I bet it
> won't really work at all).  So the relevant rounddown was actually done here:
>
>             rb = qemu_ram_block_from_host(
>                      (void *)(uintptr_t)msg.arg.pagefault.address,
>                      true, &rb_offset);
>
> In which there's:
>
>     *offset = (host - block->host);
>     if (round_offset) {
>         *offset &= TARGET_PAGE_MASK;
>     }
>
> So when I reworked that chunk of code I directly dropped the ROUND_DOWN(),
> because I found it duplicated.

Ok.

>
>> 
>> >              if (ret) {
>> >                  /* May be network failure, try to wait for recovery */
>> >                  postcopy_pause_fault_thread(mis);
>> > @@ -1694,3 +1745,13 @@ void *postcopy_preempt_thread(void *opaque)
>> >  
>> >      return NULL;
>> >  }
>> > +
>> > +/*
>> > + * Whether we should use MINOR fault to trap page faults?  It will be used
>> > + * when doublemap is enabled on hugetlbfs.  The default value will be
>> > + * false, which means we'll keep using the legacy MISSING faults.
>> > + */
>> > +bool postcopy_use_minor_fault(RAMBlock *rb)
>> > +{
>> > +    return migrate_hugetlb_doublemap() && qemu_ram_is_hugetlb(rb);
>> > +}
>> 
>> Are you planning to use this function outside postcopy-ram.c?  Otherwise,
>> if you move up its definition you can make it static and drop the header
>> change.
>
> Yes, it'll be further used in ram.c later in the patch "migration: Rework
> ram discard logic for hugetlb double-map" right below.

Aha.

Thanks.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 16/21] migration: Enable doublemap with MADV_SPLIT
  2023-01-17 22:09 ` [PATCH RFC 16/21] migration: Enable doublemap with MADV_SPLIT Peter Xu
@ 2023-02-01 18:59   ` Juan Quintela
  0 siblings, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-02-01 18:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> MADV_SPLIT enables doublemap on hugetlb.  Do that if doublemap=true is
> specified for the migration.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/postcopy-ram.c | 16 ++++++++++++++++
>  migration/ram.c          | 18 ++++++++++++++++++
>  2 files changed, 34 insertions(+)

Reviewed-by: Juan Quintela <quintela@redhat.com>


>
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 86ff73c2c0..dbc7e54e4a 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -694,6 +694,22 @@ static int ram_block_enable_notify(RAMBlock *rb, void *opaque)
>       */
>      reg_struct.mode = UFFDIO_REGISTER_MODE_MISSING;
>      if (minor_fault) {
> +        /*
> +         * MADV_SPLIT implicitly enables doublemap mode for hugetlb.  If
> +         * that fails (e.g. on old kernels) we need to fail the migration.
> +         *
> +         * It's a bit late to fail here as we could have migrated lots of
> +         * pages in precopy, but early failure will require us to allocate
> +         * hugetlb pages secretly in QEMU which is not friendly to admins
> +         * and it may affect the global hugetlb pool.  Considering it is
> +         * normally always limited, keep the failure late but tolerable.
> +         */
> +        if (qemu_madvise(qemu_ram_get_host_addr(rb), rb->postcopy_length,
> +                         QEMU_MADV_SPLIT)) {
> +            error_report("%s: madvise(MADV_SPLIT) failed (ret=%d) but "
> +                         "required for doublemap.", __func__, -errno);

Here you write errno

> +            return -1;
> +        }
>          reg_struct.mode |= UFFDIO_REGISTER_MODE_MINOR;
>      }
>  
> diff --git a/migration/ram.c b/migration/ram.c
> index 37d7b3553a..4d786f4b97 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3891,6 +3891,19 @@ static int migrate_hugetlb_doublemap_init(void)
>  
>      RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
>          if (qemu_ram_is_hugetlb(rb)) {
> +            /*
> +             * MADV_SPLIT implicitly enables doublemap mode for hugetlb on
> +             * the guest mapped ranges.  If that fails (e.g. on old
> +             * kernels) we need to fail the migration.  Note, the
> +             * host_mirror mapping below can be kept as hugely mapped.
> +             */
> +            if (qemu_madvise(qemu_ram_get_host_addr(rb), rb->mmap_length,
> +                             QEMU_MADV_SPLIT)) {
> +                error_report("%s: madvise(MADV_SPLIT) required for doublemap",
> +                             __func__);

Here you don't.

So I think you could change it.

I was thinking about creating a function for this, but as the comments
are different I think it is overkill.
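
For reference, the helper I was thinking of would have been something
like this (a hypothetical sketch only):

    static int ramblock_madv_split(RAMBlock *rb, size_t length)
    {
        /* MADV_SPLIT implicitly enables doublemap mode for hugetlb */
        if (qemu_madvise(qemu_ram_get_host_addr(rb), length,
                         QEMU_MADV_SPLIT)) {
            error_report("%s: madvise(MADV_SPLIT) failed (errno=%d) but "
                         "required for doublemap", __func__, errno);
            return -1;
        }
        return 0;
    }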

> +                return -1;
> +            }
> +
>              /*
>               * Firstly, we remap the same ramblock into another range of
>               * virtual address, so that we can write to the pages without
> @@ -3898,6 +3911,11 @@ static int migrate_hugetlb_doublemap_init(void)
>               */
>              addr = ramblock_file_map(rb);
>              if (addr == MAP_FAILED) {
> +                /*
> +                 * No need to undo MADV_SPLIT because this is dest node and
> +                 * we're going to bail out anyway.  Leave that for mm exit
> +                 * to clean things up.
> +                 */
>                  ret = -errno;
>                  error_report("%s: Duplicate mapping for hugetlb ramblock '%s'"
>                               "failed: %s", __func__, qemu_ram_get_idstr(rb),



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 17/21] migration: Rework ram discard logic for hugetlb double-map
  2023-01-17 22:09 ` [PATCH RFC 17/21] migration: Rework ram discard logic for hugetlb double-map Peter Xu
@ 2023-02-01 19:03   ` Juan Quintela
  0 siblings, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-02-01 19:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> Hugetlb double map will make the ram discard logic different.
>
> The whole idea will still be the same: we need to do a bitmap sync between
> src/dst before we switch to postcopy.
>
> When discarding a range, we only erase the pgtables that used to be
> mapped for the guest, leveraging the semantics of MADV_DONTNEED on Linux.
> This guarantees that when a guest access triggers we'll receive a MINOR
> fault message rather than a MISSING fault message.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>
>                       length >> qemu_target_page_bits());
>      }
>  
> -    return ram_block_discard_range(rb, start, length);
> +    if (postcopy_use_minor_fault(rb)) {
> +        /*
> +         * We need to keep the page cache exist, so as to trigger MINOR
> +         * faults for every future page accesses on old pages.
> +         */
> +        return ram_block_zap_range(rb, start, length);
> +    } else {
> +        return ram_block_discard_range(rb, start, length);
> +    }

This is a question of style, so take it or leave it, as you wish.

You can change:

if (foo) {
    return X;
} else {
    return Y;
}

into:

if (foo) {
    return X;
}
return Y;

It is one less line of code and, in my eyes, makes it easier to see that
one exits in all cases.  But as said, this is a question of taste, and
that is as personal as it gets.

Later, Juan.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 18/21] migration: Allow postcopy_register_shared_ufd() to fail
  2023-01-17 22:09 ` [PATCH RFC 18/21] migration: Allow postcopy_register_shared_ufd() to fail Peter Xu
@ 2023-02-01 19:09   ` Juan Quintela
  0 siblings, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-02-01 19:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> Let's fail double-map for vhost-user and any potential users that can have
> a remote userfaultfd for now.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

But

> -void postcopy_register_shared_ufd(struct PostCopyFD *pcfd)
> +int postcopy_register_shared_ufd(struct PostCopyFD *pcfd)
>  {
>      MigrationIncomingState *mis = migration_incoming_get_current();
>  
> +    if (migrate_hugetlb_doublemap()) {
> +        return -EINVAL;

I am not sure that -EINVAL is the best answer here.
There is not a problem with the value.  The problem is that both
features together don't work.

As an alternative:

ENOSYS 38 Function not implemented

But I am not sure that this is much better :-(

Later, Juan.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 19/21] migration: Add postcopy_mark_received()
  2023-01-17 22:09 ` [PATCH RFC 19/21] migration: Add postcopy_mark_received() Peter Xu
@ 2023-02-01 19:10   ` Juan Quintela
  0 siblings, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-02-01 19:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> We have a bit of maintenance work to do after we UFFDIO_[ZERO]COPY a
> page, e.g. on the requested list of pages or when measuring page
> latencies.
>
> Move those steps into a separate function so that it can be easily reused
> when we're going to support UFFDIO_CONTINUE.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 20/21] migration: Handle page faults using UFFDIO_CONTINUE
  2023-01-17 22:09 ` [PATCH RFC 20/21] migration: Handle page faults using UFFDIO_CONTINUE Peter Xu
@ 2023-02-01 19:24   ` Juan Quintela
  2023-02-01 19:52     ` Juan Quintela
  0 siblings, 1 reply; 69+ messages in thread
From: Juan Quintela @ 2023-02-01 19:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> Teach QEMU to be able to handle page faults using UFFDIO_CONTINUE for
> hugetlbfs double mapped ranges.
>
> To copy the data, we need to use the mirror buffer created per ramblock,
> doing a raw memcpy(); then we can kick the faulted threads using
> UFFDIO_CONTINUE, by installing the pgtables.
>
> Move trace_postcopy_place_page(host) up so that it'll dump something for
> either UFFDIO_COPY or UFFDIO_CONTINUE.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

> ---
>  migration/postcopy-ram.c | 55 ++++++++++++++++++++++++++++++++++++++--
>  migration/trace-events   |  4 +--
>  2 files changed, 55 insertions(+), 4 deletions(-)
>
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 8a2259581e..c4bd338e22 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -1350,6 +1350,43 @@ int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset)
>      return 0;
>  }
>  
> +/* Returns the mirror_host addr for a specific host address in ramblock */
> +static inline void *migration_ram_get_mirror_addr(RAMBlock *rb, void *host)
> +{
> +    return (void *)((__u64)rb->host_mirror + ((__u64)host - (__u64)rb->host));

This is gross :-(
I hate this C mis-feature.

What about:
    return (char *)rb->host_mirror + ((char *)host - (char *)rb->host);

But I don't know if that is (much) clearer.  And no, I don't remember if
we ever need more parens.

gcc used to do "the right" thing on void * arithmetic, but it is not in
the standard, and I don't know what is worse.
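
For example (a sketch, with 'off' standing for some byte offset): under
gcc's extension, sizeof(void) is taken as 1, so

    void *p = rb->host_mirror;
    p += off;    /* GNU extension: byte-granularity pointer arithmetic */

compiles with gcc but is not standard C, while the same arithmetic on
char * (or uint8_t *) is.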

> +}
> +
> +static int
> +qemu_uffd_continue(MigrationIncomingState *mis, RAMBlock *rb, void *host,
> +                   void *from)
> +{
> +    void *mirror_addr = migration_ram_get_mirror_addr(rb, host);
> +    /* Doublemap uses small host page size */
> +    uint64_t psize = qemu_real_host_page_size();
> +    struct uffdio_continue req;
> +
> +    /*
> +     * Copy data first into the mirror host pointer; we can't directly copy
> +     * data into rb->host because otherwise our thread will get trapped too.
> +     */
> +    memcpy(mirror_addr, from, psize);
> +
> +    /* Kick off the faluted threads to fetch data from the page cache
                       ^^^^^^^
> */

Faulted

> +    req.range.start = (__u64)host;
> +    req.range.len = psize;
> +    req.mode = 0;
> +	if (ioctl(mis->userfault_fd, UFFDIO_CONTINUE, &req)) {
> +        error_report("%s: UFFDIO_CONTINUE failed for start=%p"
> +                     " len=0x%"PRIx64": %s\n", __func__, host,
> +                     psize, strerror(-req.mapped));
> +        return req.mapped;
> +    }
> +
> +    postcopy_mark_received(mis, rb, host, psize / qemu_target_page_size());
> +
> +    return 0;
> +}
> +
>  /*
>   * Place a host page (from) at (host) atomically
>   * returns 0 on success
> @@ -1359,6 +1396,18 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
>  {
>      size_t pagesize = migration_ram_pagesize(rb);
>  
> +    trace_postcopy_place_page(rb->idstr, (uint8_t *)host - rb->host, host);
> +
> +    if (postcopy_use_minor_fault(rb)) {
> +        /*
> +         * If minor fault used, we use UFFDIO_CONTINUE instead.
> +         *
> +         * TODO: support shared uffds (e.g. vhost-user). Currently we're
> +         * skipping them.
> +         */
> +        return qemu_uffd_continue(mis, rb, host, from);
> +    }
> +
>      /* copy also acks to the kernel waking the stalled thread up
>       * TODO: We can inhibit that ack and only do it if it was requested
>       * which would be slightly cheaper, but we'd have to be careful
> @@ -1372,7 +1421,6 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
>          return -e;
>      }
>  
> -    trace_postcopy_place_page(host);
>      return postcopy_notify_shared_wake(rb,
>                                         qemu_ram_block_host_offset(rb, host));
>  }
> @@ -1385,10 +1433,13 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
>                               RAMBlock *rb)
>  {
>      size_t pagesize = migration_ram_pagesize(rb);
> -    trace_postcopy_place_page_zero(host);
> +    trace_postcopy_place_page_zero(rb->idstr, (uint8_t *)host - rb->host, host);

Is it me, or, to be standard compliant, do you also need to cast rb->host?


>      /* Normal RAMBlocks can zero a page using UFFDIO_ZEROPAGE
>       * but it's not available for everything (e.g. hugetlbpages)
> +     *
> +     * NOTE: when hugetlb double-map enabled, then this ramblock will never
> +     * have RAM_UF_ZEROPAGE, so it'll always go to postcopy_place_page().
>       */
>      if (qemu_ram_is_uf_zeroable(rb)) {
>          if (qemu_ufd_copy_ioctl(mis, host, NULL, pagesize, rb)) {
> diff --git a/migration/trace-events b/migration/trace-events
> index 6b418a0e9e..7baf235d22 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -265,8 +265,8 @@ postcopy_discard_send_range(const char *ramblock, unsigned long start, unsigned
>  postcopy_cleanup_range(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=0x%zx length=0x%zx"
>  postcopy_init_range(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=0x%zx length=0x%zx"
>  postcopy_nhp_range(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=0x%zx length=0x%zx"
> -postcopy_place_page(void *host_addr) "host=%p"
> -postcopy_place_page_zero(void *host_addr) "host=%p"
> +postcopy_place_page(const char *id, size_t offset, void *host_addr) "id=%s offset=0x%zx host=%p"
> +postcopy_place_page_zero(const char *id, size_t offset, void *host_addr) "id=%s offset=0x%zx host=%p"
>  postcopy_ram_enable_notify(void) ""
>  mark_postcopy_blocktime_begin(uint64_t addr, void *dd, uint32_t time, int cpu, int received) "addr: 0x%" PRIx64 ", dd: %p, time: %u, cpu: %d, already_received: %d"
>  mark_postcopy_blocktime_end(uint64_t addr, void *dd, uint32_t time, int affected_cpu) "addr: 0x%" PRIx64 ", dd: %p, time: %u, affected_cpu: %d"

I think that you could split out the part of the patch that changes the
traces.  But again, it is up to you.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 21/21] migration: Collapse huge pages again after postcopy finished
  2023-01-17 22:09 ` [PATCH RFC 21/21] migration: Collapse huge pages again after postcopy finished Peter Xu
@ 2023-02-01 19:49   ` Juan Quintela
  0 siblings, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-02-01 19:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> When hugetlb-doublemap is enabled, the pages will be migrated in small page
> sizes during postcopy.  When the migration finishes, the pgtables need to
> be rebuilt explicitly for these ranges to have huge pages mapped again.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/ram.c        | 31 +++++++++++++++++++++++++++++++
>  migration/trace-events |  1 +
>  2 files changed, 32 insertions(+)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 4da56d925c..178739f8c3 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3986,6 +3986,31 @@ static int ram_load_setup(QEMUFile *f, void *opaque)
>      return 0;
>  }
>  
> +#define  MADV_COLLAPSE_CHUNK_SIZE  (1UL << 30) /* 1G */
> +
> +static void ramblock_rebuild_huge_mappings(RAMBlock *rb)
> +{
> +    unsigned long addr, size;

This makes my head explode.

We have:

unsigned long
__u64
uint64_t

Used and mixed all around.

> +    assert(qemu_ram_is_hugetlb(rb));
> +
> +    addr = (unsigned long)qemu_ram_get_host_addr(rb);

Shouldn't this cast be uintptr_t?
At least on win64 it should fail, no?

> +    size = rb->mmap_length;

this is ram_addr_t.  It is uint64_t except with xen.
So it should fail on any 32 bit host.
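
For illustration, a type-consistent variant could look like this (a
sketch only, assuming the collapse is done in MADV_COLLAPSE_CHUNK_SIZE
chunks as the macro above suggests):

    uintptr_t addr = (uintptr_t)qemu_ram_get_host_addr(rb);
    ram_addr_t size = rb->mmap_length;

    while (size) {
        /* Collapse in 1G chunks to bound each madvise() call */
        size_t chunk = MIN(size, MADV_COLLAPSE_CHUNK_SIZE);

        if (qemu_madvise((void *)addr, chunk, QEMU_MADV_COLLAPSE)) {
            break;
        }
        addr += chunk;
        size -= chunk;
    }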

Later, Juan.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 20/21] migration: Handle page faults using UFFDIO_CONTINUE
  2023-02-01 19:24   ` Juan Quintela
@ 2023-02-01 19:52     ` Juan Quintela
  0 siblings, 0 replies; 69+ messages in thread
From: Juan Quintela @ 2023-02-01 19:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

Juan Quintela <quintela@redhat.com> wrote:
> Peter Xu <peterx@redhat.com> wrote:
>> Teach QEMU to be able to handle page faults using UFFDIO_CONTINUE for
>> hugetlbfs double mapped ranges.
>>
>> To copy the data, we need to use the mirror buffer created per ramblock,
>> doing a raw memcpy(); then we can kick the faulted threads using
>> UFFDIO_CONTINUE, by installing the pgtables.
>>
>> Move trace_postcopy_place_page(host) up so that it'll dump something for
>> either UFFDIO_COPY or UFFDIO_CONTINUE.
>>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
>
>> ---
>>  migration/postcopy-ram.c | 55 ++++++++++++++++++++++++++++++++++++++--
>>  migration/trace-events   |  4 +--
>>  2 files changed, 55 insertions(+), 4 deletions(-)
>>
>> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>> index 8a2259581e..c4bd338e22 100644
>> --- a/migration/postcopy-ram.c
>> +++ b/migration/postcopy-ram.c
>> @@ -1350,6 +1350,43 @@ int postcopy_notify_shared_wake(RAMBlock *rb, uint64_t offset)
>>      return 0;
>>  }
>>  
>> +/* Returns the mirror_host addr for a specific host address in ramblock */
>> +static inline void *migration_ram_get_mirror_addr(RAMBlock *rb, void *host)
>> +{
>> +    return (void *)((__u64)rb->host_mirror + ((__u64)host - (__u64)rb->host));
>
> This is gross :-(
> I hate this C mis-feature.
>
> What about:
>     return (char *)rb->host_mirror + ((char *)host - (char *)rb->host);

This was a generic suggestion.  But that was before looking at ramblock.h
and realizing that rb->host is not void*, so:

    return rb->host_mirror + ((uint8_t *)host - rb->host);

Sorry for looking at it too late.

BTW, while we're here, why is the type of host_mirror different from that
of host?  I don't know which is more confusing anymore.

Later, Juan.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH RFC 14/21] migration: Map hugetlbfs ramblocks twice, and pre-allocate
  2023-02-01 18:53       ` Juan Quintela
@ 2023-02-06 21:40         ` Peter Xu
  0 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2023-02-06 21:40 UTC (permalink / raw)
  To: Juan Quintela
  Cc: qemu-devel, Leonardo Bras Soares Passos, James Houghton,
	Dr . David Alan Gilbert

On Wed, Feb 01, 2023 at 07:53:28PM +0100, Juan Quintela wrote:
> Peter Xu <peterx@redhat.com> wrote:
> > On Mon, Jan 30, 2023 at 06:24:20AM +0100, Juan Quintela wrote:
> >> I would consider here:
> >> 
> >>     uint8_t *host_doublemap;
> >> 
> >> as I don't have a short name that means
> >>     uint8_t *host_map_smaller_size_pages;
> >> 
> >> which would explain why we need it.
> >
> > Sure, I can rename this one if it helps.
> >
> > One thing worth mentioning is that it's not mapping things in small page
> > size here with host_doublemap, but in huge page size only.
> 
> Thanks.
> 
> 
> > It's just that UFFDIO_CONTINUE needs another mapping to resolve the page
> > faults. It'll be the guest hugetlb ramblocks that will be mapped in small
> > pages during postcopy.
> 
> ok
> >> Not initialized variables, remove the last two.
> >
> > I can do this.
> >
> >> 
> >> > +    if (!migrate_hugetlb_doublemap()) {
> >> > +        return 0;
> >> > +    }
> >> > +
> >> 
> >> I would move the declaration of the RAMBlock here.
> >
> > But doesn't QEMU in most cases declare variables at the start of a code
> > block, rather than in the middle of code segments?  IIRC some compilers
> > would start to fail with that, even though not the modern ones.
> 
> We can declare variables since c99.  Only 24 years have passed O:-)

Oh OK :)

> 
> Anyways:
> 
> Exhibit A: We already have that kind of code
> 
> static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
> {
>     MultiFDPages_t *pages = p->pages;
> 
>     for (int i = 0; i < p->normal_num; i++) {
>         p->iov[p->iovs_num].iov_base = pages->block->host + p->normal[i];
>         p->iov[p->iovs_num].iov_len = p->page_size;
>         p->iovs_num++;
>     }
> 
>     p->next_packet_size = p->normal_num * p->page_size;
>     p->flags |= MULTIFD_FLAG_NOCOMP;
>     return 0;
> }
> 
> Exhibit B:
> 
> from configure:
> 
> #if defined(__clang_major__) && defined(__clang_minor__)
> # ifdef __apple_build_version__
> #  if __clang_major__ < 10 || (__clang_major__ == 10 && __clang_minor__ < 0)
> #   error You need at least XCode Clang v10.0 to compile QEMU
> #  endif
> # else
> #  if __clang_major__ < 6 || (__clang_major__ == 6 && __clang_minor__ < 0)
> #   error You need at least Clang v6.0 to compile QEMU
> #  endif
> # endif
> #elif defined(__GNUC__) && defined(__GNUC_MINOR__)
> # if __GNUC__ < 7 || (__GNUC__ == 7 && __GNUC_MINOR__ < 4)
> #  error You need at least GCC v7.4.0 to compile QEMU
> # endif
> #else
> # error You either need GCC or Clang to compiler QEMU
> #endif
> int main (void) { return 0; }
> EOF
> 
> gcc-7.4.0: supports C11, so we are good here
> https://gcc.gnu.org/onlinedocs/gcc-7.4.0/gcc/Standards.html#C-Language
> 
> clang 6.0: supports c11 and c17 standard
> https://releases.llvm.org/6.0.0/tools/clang/docs/ReleaseNotes.html
> 
> 
> So as far as I can see, we are good here.

Thanks, I'll switch over.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread

Thread overview: 69+ messages
2023-01-17 22:08 [PATCH RFC 00/21] migration: Support hugetlb doublemaps Peter Xu
2023-01-17 22:08 ` [PATCH RFC 01/21] update linux headers Peter Xu
2023-01-17 22:08 ` [PATCH RFC 02/21] util: Include osdep.h first in util/mmap-alloc.c Peter Xu
2023-01-18 12:00   ` Dr. David Alan Gilbert
2023-01-25  0:19   ` Philippe Mathieu-Daudé
2023-01-30  4:57   ` Juan Quintela
2023-01-17 22:08 ` [PATCH RFC 03/21] physmem: Add qemu_ram_is_hugetlb() Peter Xu
2023-01-18 12:02   ` Dr. David Alan Gilbert
2023-01-30  5:00   ` Juan Quintela
2023-01-17 22:08 ` [PATCH RFC 04/21] madvise: Include linux/mman.h under linux-headers/ Peter Xu
2023-01-18 12:08   ` Dr. David Alan Gilbert
2023-01-30  5:01   ` Juan Quintela
2023-01-17 22:08 ` [PATCH RFC 05/21] madvise: Add QEMU_MADV_SPLIT Peter Xu
2023-01-30  5:01   ` Juan Quintela
2023-01-17 22:08 ` [PATCH RFC 06/21] madvise: Add QEMU_MADV_COLLAPSE Peter Xu
2023-01-18 18:51   ` Dr. David Alan Gilbert
2023-01-18 20:21     ` Peter Xu
2023-01-30  5:02   ` Juan Quintela
2023-01-17 22:09 ` [PATCH RFC 07/21] ramblock: Cache file offset for file-backed ramblocks Peter Xu
2023-01-30  5:02   ` Juan Quintela
2023-01-17 22:09 ` [PATCH RFC 08/21] ramblock: Cache the length to do file mmap() on ramblocks Peter Xu
2023-01-23 18:51   ` Dr. David Alan Gilbert
2023-01-24 20:28     ` Peter Xu
2023-01-30  5:05   ` Juan Quintela
2023-01-30 22:07     ` Peter Xu
2023-01-17 22:09 ` [PATCH RFC 09/21] ramblock: Add RAM_READONLY Peter Xu
2023-01-23 19:42   ` Dr. David Alan Gilbert
2023-01-30  5:06   ` Juan Quintela
2023-01-17 22:09 ` [PATCH RFC 10/21] ramblock: Add ramblock_file_map() Peter Xu
2023-01-24 10:06   ` Dr. David Alan Gilbert
2023-01-24 20:47     ` Peter Xu
2023-01-25  9:24       ` Dr. David Alan Gilbert
2023-01-25 14:46         ` Peter Xu
2023-01-30  5:09   ` Juan Quintela
2023-01-17 22:09 ` [PATCH RFC 11/21] migration: Add hugetlb-doublemap cap Peter Xu
2023-01-24 12:45   ` Dr. David Alan Gilbert
2023-01-24 21:15     ` Peter Xu
2023-01-30  5:13   ` Juan Quintela
2023-01-17 22:09 ` [PATCH RFC 12/21] migration: Introduce page size for-migration-only Peter Xu
2023-01-24 13:20   ` Dr. David Alan Gilbert
2023-01-24 21:36     ` Peter Xu
2023-01-24 22:03       ` Peter Xu
2023-01-30  5:17   ` Juan Quintela
2023-01-17 22:09 ` [PATCH RFC 13/21] migration: Add migration_ram_pagesize_largest() Peter Xu
2023-01-24 17:34   ` Dr. David Alan Gilbert
2023-01-30  5:19   ` Juan Quintela
2023-01-17 22:09 ` [PATCH RFC 14/21] migration: Map hugetlbfs ramblocks twice, and pre-allocate Peter Xu
2023-01-25 14:25   ` Dr. David Alan Gilbert
2023-01-30  5:24   ` Juan Quintela
2023-01-30 22:35     ` Peter Xu
2023-02-01 18:53       ` Juan Quintela
2023-02-06 21:40         ` Peter Xu
2023-01-17 22:09 ` [PATCH RFC 15/21] migration: Teach qemu about minor faults and doublemap Peter Xu
2023-01-30  5:45   ` Juan Quintela
2023-01-30 22:50     ` Peter Xu
2023-02-01 18:55       ` Juan Quintela
2023-01-17 22:09 ` [PATCH RFC 16/21] migration: Enable doublemap with MADV_SPLIT Peter Xu
2023-02-01 18:59   ` Juan Quintela
2023-01-17 22:09 ` [PATCH RFC 17/21] migration: Rework ram discard logic for hugetlb double-map Peter Xu
2023-02-01 19:03   ` Juan Quintela
2023-01-17 22:09 ` [PATCH RFC 18/21] migration: Allow postcopy_register_shared_ufd() to fail Peter Xu
2023-02-01 19:09   ` Juan Quintela
2023-01-17 22:09 ` [PATCH RFC 19/21] migration: Add postcopy_mark_received() Peter Xu
2023-02-01 19:10   ` Juan Quintela
2023-01-17 22:09 ` [PATCH RFC 20/21] migration: Handle page faults using UFFDIO_CONTINUE Peter Xu
2023-02-01 19:24   ` Juan Quintela
2023-02-01 19:52     ` Juan Quintela
2023-01-17 22:09 ` [PATCH RFC 21/21] migration: Collapse huge pages again after postcopy finished Peter Xu
2023-02-01 19:49   ` Juan Quintela
